Another, harder-to-explain example

This program was sent to me by Ilya Kriveshko and shows a somewhat more complicated performance pattern.  Here is his explanation:
I saw your page on double performance problems in HotSpot runtime, and (in disbelief) tried to run some more tests. To my amazement, I found that this may not even be an 8-byte boundary alignment problem. I decided not to use recursion the way you did in your Fibonacci calculation (to take that out of the equation), but use simple iteration instead. This would ensure that in the same test batch the same function would be called with the same stack frame alignment.

I use iteration to run several tests to ensure that separate test batches are run with sliding stack alignment. (Pardon my wording - I'm sure you'll understand what I mean when you look at the code.)

What I ran into showed me that the alignment problem may not be as simple as your conclusion tells. It appears that several (i.e. more than two) different levels of performance may be achieved depending on alignment on single, double, triple, etc. word boundary. Please, compile and run the attached program (based on yours, but uses iteration and computes a nonsense number) and look at the pattern of varying execution times. It appears that every 2nd test takes 2-3 times longer. But on top of that, every 16th test takes 5-6 times longer. ... Actually, I just pulled the test data into a spreadsheet and made a bar chart out of it.  You can see for yourself. There is a visible harmonic pattern to it.

Also, running my program with -server also exhibits a similar pattern, although the peaks are not nearly as pronounced. Overall my test program runs significantly faster with -server. I attached the chart for the -server run as well.

I just thought you might like to know this, since you were interested in the problem in the first place.

Also, I was hoping that you could put my test program on your website and add a comment on Sun's site referring to it, since I don't have web space readily available to me.

Thanks,
--
Ilya A. Kriveshko
And here are the program DoubleTest.java and charts of the results for the client JVM and the server JVM.
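
For readers who only want the shape of the test, here is a rough sketch of the kind of benchmark the letter describes. It is not Ilya's DoubleTest.java: the class name, method names and constants are all made up for illustration, and the way it slides the stack position (nesting the call under a varying number of small frames) is my assumption about how such a shift could be produced. Whether each extra frame really moves the doubles by one word depends on the JVM.

```java
// Hypothetical sketch, NOT the original DoubleTest.java.
class SlidingAlignmentSketch {

    // Plain iteration over local doubles. The result is a meaningless number,
    // returned only so that the JIT cannot throw the loop away.
    static double work(int iterations) {
        double a = 1.0;
        double b = 0.5;
        for (int i = 0; i < iterations; i++) {
            a = a * 0.999999 + b;
            b = b * 0.5 + 0.25;
        }
        return a + b;
    }

    // Assumption: every extra level of nesting pushes one more small frame,
    // which may shift where work()'s locals land on the stack.
    static double nested(int depth, int iterations) {
        if (depth == 0) {
            return work(iterations);
        }
        return nested(depth - 1, iterations);
    }

    // One batch: the same function called repeatedly at the same depth,
    // so every call in the batch should see the same stack position.
    static long timeBatchNanos(int depth, int repeats, int iterations) {
        double sink = 0.0;
        long start = System.nanoTime();
        for (int r = 0; r < repeats; r++) {
            sink += nested(depth, iterations);
        }
        long elapsed = System.nanoTime() - start;
        if (Double.isNaN(sink)) {
            // Never expected; keeps 'sink' live.
            System.out.println("unexpected NaN");
        }
        return elapsed;
    }

    public static void main(String[] args) {
        // One timing line per batch; look for a repeating pattern by depth.
        for (int depth = 0; depth < 32; depth++) {
            long nanos = timeBatchNanos(depth, 20, 100_000);
            System.out.println(depth + "\t" + (nanos / 1_000_000.0) + " ms");
        }
    }
}
```

Running it with and without -server and plotting time against depth gives the same kind of chart as the ones mentioned above, though the numbers on a current JVM need not show any pattern at all.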

A possible explanation?

My own best guess as to why there are complicated patterns in the results is that the penalty for a misaligned double varies depending on whether the double also crosses other boundaries, such as a cache-line boundary or a page boundary.  This might explain the complex patterns in these results, but it doesn't explain why the server JVM shows variations when other tests indicated that it didn't suffer from the double misalignment problem.  I suspect it would take a detailed understanding of HotSpot and x86 behaviour to fully explain these results.  In my experiments I was able to reduce the performance variations somewhat, but not eliminate them, by inserting unused extra local int variables.  Perhaps one of the HotSpot engineers will eventually enlighten us.
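
To make the boundary-crossing guess concrete, here is a small arithmetic sketch. It only classifies addresses; it measures nothing. The 4-byte stack slot, the 64-byte cache line and the 4096-byte page are assumptions (the real sizes depend on the processor), and the class and method names are invented for this example.

```java
// Hypothetical illustration of the boundary-crossing guess; all sizes are assumed.
class BoundaryCrossingSketch {
    static final int WORD = 4;         // assumed stack slot size (32-bit x86)
    static final int DOUBLE_SIZE = 8;  // a Java double is 8 bytes
    static final int CACHE_LINE = 64;  // assumed cache line size
    static final int PAGE = 4096;      // assumed page size

    // True if the 8 bytes starting at 'address' span two blocks of size 'boundary'.
    static boolean crosses(long address, int boundary) {
        return address / boundary != (address + DOUBLE_SIZE - 1) / boundary;
    }

    // Classifies a double stored at 'address', worst case first.
    static String classify(long address) {
        if (address % DOUBLE_SIZE == 0) {
            return "aligned";
        }
        if (crosses(address, PAGE)) {
            return "misaligned, crosses page";
        }
        if (crosses(address, CACHE_LINE)) {
            return "misaligned, crosses cache line";
        }
        return "misaligned";
    }

    public static void main(String[] args) {
        // Slide the double one word at a time, as a shifting stack frame might.
        for (int slot = 0; slot < 32; slot++) {
            long address = (long) slot * WORD;
            System.out.println(slot + "\t" + address + "\t" + classify(address));
        }
    }
}
```

Under these assumed sizes every second slot is misaligned and every sixteenth slot also straddles a cache line, which has the same shape as the 2 and 16 periods reported in the letter. That is only consistent with the guess; it does not confirm it, and it says nothing about the server JVM results.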