hill climbing #191
Conversation
```csharp
double amount = (hitRateChange >= 0) ? stepSize : -stepSize;

double nextStepSize = (Math.Abs(hitRateChange) >= HillClimberRestartThreshold)
    ? HillClimberStepPercent * (amount >= 0 ? 1 : -1)
```
is this supposed to be multiplied by the maximum?
Without max, amount is computed as a percent change to the ratio of window to main - since it's a percentage, I just add it directly to mainRatio (also a percentage) and pass that into the ComputeQueueCapacity() function I already had. I should probably clean this up - I was excited to see it working.
In your code I think amount is computed as the actual number of slots, and setAdjustment() adds/subtracts this number from the main and window queue capacity (I couldn't find where this is implemented when searching your code, but it seems like the amount unit must be the number of slots).
Do you have any clamping to prevent drift to an invalid state?
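A minimal sketch of the ratio-based adjustment described above; the ComputeQueueCapacity signature and the capacity field names are assumptions, not the PR's actual code:

```csharp
// 'amount' is a signed percentage step from the hill climber, applied directly to the
// window/main split; the segment sizes are then recomputed from the new ratio.
private void ApplyAdjustment(double amount)
{
    mainRatio += amount;
    (windowCapacity, mainCapacity) = ComputeQueueCapacity(capacity, mainRatio);
}
```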
The increase/decrease methods check that it doesn’t go beyond a bound. Sounds like it should have the same result?
Congrats on getting it working! Have you tried the stress test scenario yet (corda & loop)?
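A sketch of the kind of bound check being described here, with hypothetical field names (windowCapacity, mainCapacity, maxWindow):

```csharp
// Grow the window by up to 'amount' slots, but never past maxWindow; the matching
// Decrease method guards the lower bound the same way, so the split cannot drift
// into an invalid state.
void IncreaseWindow(int amount)
{
    int quota = Math.Min(amount, maxWindow - windowCapacity);
    windowCapacity += quota;
    mainCapacity -= quota;
}
```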
Oh, in mine it's a number of units, where an entry may take multiple (e.g. memory bound). Not sure if that adds a wrinkle or works fine in your approach.
I haven't tested exhaustively yet, but I think it is working well.
Uniform weight certainly makes it much simpler - I was puzzled for a while by all the different cases handled in your evictFromMain method, then I realized that weighting produces more complicated queue configurations that are unreachable for me.
I will try combining corda and loop - I made a unit test called WhenHitRateFluctuatesWindowIsAdapted that does a sanity check by just manipulating the hit rate, and it works as expected. Since I copy-pasted all your carefully tuned parameters and the core logic, I think it will produce a very similar, if not identical, result on the same traces.
It is really an ingenious addition - probably 5 or 6 lines of code that increase hit rate by > 5%.
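A rough sketch of what that sanity check could look like; the partition type, its members, and the xUnit usage are illustrative assumptions, not the PR's actual test:

```csharp
[Fact]
public void WhenHitRateFluctuatesWindowIsAdapted()
{
    // Hypothetical climber/partition type for illustration only.
    var partition = new AdaptivePartition(capacity: 100);
    int initialWindow = partition.WindowCapacity;

    // Feed the climber a swinging hit rate; it should respond by shifting
    // capacity between the window and main segments.
    partition.OnHitRateSample(0.9);
    partition.OnHitRateSample(0.2);
    partition.OnHitRateSample(0.1);

    Assert.NotEqual(initialWindow, partition.WindowCapacity);
}
```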
wow, that's crazy. I wonder what AMD could be doing???
It's weird. I read in Agner Fog's architecture manual that Zen rapidly adjusts the clock speed, which can make it hard to measure performance. It will be under constant load during the benchmark, so it seems unlikely it would clock down, and anyway it always affects the same data size - most likely it's related to the cache somehow. It's quite a big fluctuation.
I tried running with affinity to stick to a single core, and did a hack to make sure the array is always at a 4-byte-aligned address. Neither made any difference.
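A sketch of that kind of affinity/alignment setup, assuming a recent .NET runtime; the core mask, the alignment constant, and the array size are illustrative, not the actual benchmark code:

```csharp
using System;
using System.Diagnostics;
using System.Runtime.InteropServices;

class BenchSetup
{
    const int Alignment = 4; // bytes; could be raised to 64 to align to a cache line

    static void Main()
    {
        // Pin the benchmark process to a single core so it cannot migrate mid-run
        // (supported on Windows and Linux).
        Process.GetCurrentProcess().ProcessorAffinity = (IntPtr)0x1;

        // Over-allocate, pin, and skip forward to the first aligned element so the
        // data starts at the same address alignment on every run.
        int count = 1 << 20;
        int[] raw = new int[count + Alignment / sizeof(int) + 1];
        GCHandle handle = GCHandle.Alloc(raw, GCHandleType.Pinned);
        long address = handle.AddrOfPinnedObject().ToInt64();
        int skip = (int)(((Alignment - (address % Alignment)) % Alignment) / sizeof(int));
        Span<int> data = raw.AsSpan(skip, count);

        Console.WriteLine($"data[0] starts {Alignment}-byte aligned (skipped {skip} elements)");

        // ... run the benchmark loop against 'data' here ...

        handle.Free();
    }
}
```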
The Zen 1 can do two memory read operations or one read and one write operation, but not
two write operations, in the same clock cycle. The Zen 2 can do two memory read and one
write operation per clock cycle.
Maybe it is not realizing that the writes are to the same cache line and is creating a store buffer data dependency? Ideally it would coalesce the writes into one memory operation, but if not then I suppose it would be slower. How was the frequency performance?
That's a good point - frequency is better with block on AMD and matches expectations. The strange part is that it varies run to run - hence forcing alignment etc. to try to reduce variables.
I pulled all the data for the eviction test, and block is equal or better in all cases:
I wondered if the difference you saw in your CI results was due to the difference between the AMD and Intel architectures, but my result is not quite the same, since you saw similar degradation in both frequency and increment.
How does the read throughput compare? An increment is the common case, but strangely that was faster in my CI runs too even though it claimed the sketch was slower (sequenced steps on the same machine).
Very confusing, but I'm glad the end-result benchmarks all show our ideas turned out well... still 🤷‍♂️
I haven't tested pure read. For read+write, I tested size=500 and it's similar to eviction. From memory it was about 9.5 million ops/sec for flat, ~10.5 million for flat AVX and block, and about 11.5 million for block AVX. Block is consistently better, and the AVX version gives a larger benefit. I will do a proper comparison with all sizes when I get a chance. I can't argue with the results, but I would like to understand the cause of the fluctuations. I will run the tests on the current Azure offerings; I think the AMD SKUs are based on Zen 2.
Use hill climbing to optimize hit rate by adapting the size of the window and main segments based on changes in the hit rate at run time.
This significantly improves the hit rate for the ARC OLTP trace; see below.
ARC OLTP: [hit rate results chart]
Previous result without hill climbing: [hit rate results chart]
ARC Database: [hit rate results chart]
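Finally, a minimal sketch of the adaptation step the PR describes, loosely following the snippet quoted in the conversation above; the constant values, member names, and clamping range are illustrative assumptions, not the PR's exact code:

```csharp
using System;

// Illustrative adaptive partition; not the PR's actual type.
public class AdaptivePartitionSketch
{
    private const double HillClimberRestartThreshold = 0.05;
    private const double HillClimberStepPercent = 0.0625;
    private const double HillClimberStepDecayRate = 0.98;

    private double previousHitRate;
    private double stepSize = HillClimberStepPercent;

    // Fraction of total capacity given to the main segment; the window gets the rest.
    public double MainRatio { get; private set; } = 0.99;

    public void Climb(double hitRate)
    {
        double hitRateChange = hitRate - previousHitRate;
        previousHitRate = hitRate;

        // Keep moving in the same direction while the hit rate improves; reverse otherwise.
        double amount = (hitRateChange >= 0) ? stepSize : -stepSize;

        // A large swing in hit rate restarts the climber at the full step size;
        // otherwise the step decays so the window/main split converges.
        stepSize = (Math.Abs(hitRateChange) >= HillClimberRestartThreshold)
            ? HillClimberStepPercent * (amount >= 0 ? 1 : -1)
            : HillClimberStepDecayRate * amount;

        // 'amount' is a percentage change applied directly to the main ratio, clamped so
        // the split cannot drift to an invalid state; the window and main queue capacities
        // are then recomputed from the new ratio (e.g. via ComputeQueueCapacity in the PR).
        MainRatio = Math.Clamp(MainRatio + amount, 0.2, 0.999);
    }
}
```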