Refactor of ActiveDefrag to reduce latencies #1242
base: unstable
Conversation
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## unstable #1242 +/- ##
============================================
+ Coverage 70.62% 70.80% +0.17%
============================================
Files 117 117
Lines 63313 63358 +45
============================================
+ Hits 44716 44859 +143
+ Misses 18597 18499 -98
Force-pushed from ffdf32a to 4054b1f
@zvi-code Would you mind also helping take a look at this PR?
@JimB123, I skimmed through the code. At a very high level, it looks good (and the results you posted are very promising!). I will follow up with a more detailed review of the code.
@zvi-code The Valkey event loop supports creation of various timer events. For defrag, there are two things that need to happen. First, we need to decide if we should begin to defrag; it makes sense to decide this using the main `serverCron`. In the old code, if we target 10% of the CPU, that was done by letting defrag run for (a continuous) 10ms every time the 100ms `serverCron` fired. In the new code, with the same 10% CPU target, we run the defrag job for only 500us, but schedule it on its own dedicated timer so that it can run more often. The code modulates the frequency rather than the duration. (In extreme cases, anti-starvation must still modulate the duration, since starvation impacts the frequency.)
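To make the frequency-over-duration idea concrete, here is a minimal sketch of the arithmetic; the constant and the helper name (`defragTimerPeriodUs`) are illustrative, not taken from the PR:

```c
/* The duty cycle is held constant at ~500us; the timer period is
 * derived from the configured CPU percentage instead. */
#define DEFRAG_DUTY_CYCLE_US 500

/* Given a target CPU percentage (e.g. 10), return the period in
 * microseconds between duty cycles. */
static long long defragTimerPeriodUs(int targetCpuPercent) {
    /* Running 500us of work every N us consumes (500/N)*100 percent
     * of the CPU, so N = 500 * 100 / targetCpuPercent. */
    return (long long)DEFRAG_DUTY_CYCLE_US * 100 / targetCpuPercent;
}
/* Example: a 10% target yields 500us of defrag every 5000us (5ms),
 * i.e. ~200 short cycles/sec instead of one 10ms burst per 100ms. */
```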
I haven't looked at the lazy expiration before, but looking at it now, that's exactly what I'm saying. We shouldn't be doing ANYTHING in
OK, wow, this is another thing I haven't looked at before. IMO, this
Mostly looks good, much more readable than before! Some minor comments.
Replied to some of @madolson's comments. Will publish an update.
Force-pushed from 3f7d0ad to 825da4f
LGTM. @zvi-code would still like you to take a look if you have time.
@valkey-io/core-team there is a small but major change here in the form of a config that determines how much time we spend on active defrag. Please take a look and provide 👍 or 👎. You can see the definition here:
Other than that, this change just better amortizes the active-defrag work to reduce latency spikes.
@JimB123, added several comments; will continue the review tomorrow.
Force-pushed from 825da4f to a943d5e
Just cleaning up merge conflicts.
A few more merge conflict resolutions.
I think this looks good; just waiting for Zvi's review and TSC consensus. I pinged them on Slack as well to take a look here.
This config is similar to how we control active rehashing and active expire. I just want to put it into context to confirm this is the config we want. In general, we have `hz` to drive this kind of background work. Binbin recently adjusted active rehashing to do work based on a time budget. For active defrag, we already have `active-defrag-cycle-min` and `active-defrag-cycle-max`.
The hz mechanism can get very out of sync under high load, and it seems more straightforward to decouple the active-defrag work from the main server cron, control it separately, and let it precisely control its usage. Active defrag does a lot more work than the rehashing cron, so it was causing higher latency spikes. Anything else you want to add @JimB123?
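For context on why hz pacing drifts, here is a simplified sketch of the `run_with_period()`-style pattern used by serverCron (illustrative, not the actual code):

```c
/* Sketch of hz-style pacing (modelled on the run_with_period()
 * pattern in serverCron; heavily simplified).  Work is keyed to cron
 * iterations, so if a slow command stalls the event loop, iterations
 * are simply lost and time-based work silently falls behind. */
static long long cronloops = 0;
static int hz = 10; /* serverCron fires every 1000/hz ms when not stalled */

static void serverCronSketch(void) {
    cronloops++;
    /* "every 100ms" really means "every (100 / (1000/hz)) iterations";
     * a 2-second stall costs ~20 iterations that are never made up. */
    if (cronloops % (100 / (1000 / hz)) == 0) {
        /* ... do the paced work (e.g. the old-style defrag slice) ... */
    }
}
```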
What's the relation between `hz` and these defrag cycle configs?
The configs limit each other then. If you set … Even if you set … Can't we just use …?
This PR decouples the cycle time from serverCron, so it is not hit by the hz limits. It independently calculates how much time it should spend doing defrag and schedules its own timer when active defrag is needed.
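As a sketch of what such a dedicated timer could look like using Valkey's ae event-loop API (the `defragShouldContinue`, `runDefragDutyCycle`, and `nextDelayMs` helpers are hypothetical stand-ins, not the PR's actual functions):

```c
#include "ae.h" /* Valkey's event loop; provides aeCreateTimeEvent() and AE_NOMORE */

/* Hypothetical helpers, for illustration only. */
static int defragShouldContinue(void);   /* fragmentation still above threshold? */
static void runDefragDutyCycle(void);    /* perform ~500us of defrag work */
static long long nextDelayMs(void);      /* delay derived from the CPU target */

/* An ae timer proc returns the delay in ms until its next firing,
 * or AE_NOMORE to deregister the timer. */
static int activeDefragTimeProc(struct aeEventLoop *eventLoop, long long id, void *clientData) {
    (void)eventLoop; (void)id; (void)clientData;
    if (!defragShouldContinue()) return AE_NOMORE; /* defrag done: stop the timer */
    runDefragDutyCycle();
    return nextDelayMs(); /* modulate frequency, not duration */
}

/* serverCron still owns the start decision; when defrag should begin,
 * it registers the dedicated timer (fire immediately, then self-pace). */
void maybeStartDefragTimer(struct aeEventLoop *el) {
    aeCreateTimeEvent(el, 0, activeDefragTimeProc, NULL, NULL);
}
```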
I still have some questions about the risk that this change will actually increase tail latencies in some scenarios. Added specific comments/questions in the relevant code.
* Note that dutyCycleUs addresses starvation. If the wait time was long, we will compensate
* with a proportionately long duty-cycle. This won't significantly affect perceived
* latency, because clients are already being impacted by the long cycle time which caused
* the starvation of the timer. */
While it is correct, this is not good behaviour, especially from a tail-latency perspective. A better behaviour would be to increase the "measure" time interval so defrag keeps running smoothly even when there is a periodic spike. As it is now, a sporadic long-running command/cron-job will cause defrag to run in sporadic long-running spikes as well, isn't it so?
`dutyCycleUs` is the upper bound of the time we can spend doing defrag at a given time, not the lower bound. We bias against latency spikes, especially when the system is busy.
@madolson that's not quite right. If the timer is getting starved, we will increase the duty-cycle to combat the starvation. This won't occur unless there is a delay to the timer.
@zvi, I don't agree with this being bad behavior though. And the tail latency is unimportant in this case. Defrag is possibly the most critical maintenance task in the system. With today's defrag, `serverCron` is subject to starvation. A customer generating a series of slow commands will severely impede defrag. There is an attempt at anti-starvation code (which I removed) in `whileBlockedCron`. Reference: https://github.com/valkey-io/valkey/pull/1242/files#diff-1abc5651133d108c0c420d9411925373c711133e7748d9e4f4c97d5fb543fdd9L1617-L1626
(Note - in re-reviewing that code, I think we still need a hook there - as timers aren't run while blocked. I'll add a separate comment on that.)
Consider a command that takes 2 seconds to execute and delays all of the timers. In order to get 10% of the CPU (for example), we need to run for 200ms to compensate for the starvation. HOWEVER, that 200ms is small compared to the 2 seconds that the original command took. Tail latencies are dominated by the slow command, not by active defrag. Likewise, if the event loop is slowed due to overload, the tail latencies are dominated by the overload, not the additional fraction for defrag.
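A minimal sketch of the compensation logic described above, with illustrative names (`computeDutyCycleUs` is not necessarily the PR's function):

```c
/* If the timer fired late, scale the duty cycle up in proportion to
 * how long we actually waited, so defrag still receives its
 * configured share of CPU despite the starvation. */
static long long computeDutyCycleUs(long long waitedUs, int targetCpuPercent,
                                    long long normalDutyUs, long long maxDutyUs) {
    /* targetCpuPercent of the elapsed wall time belongs to defrag. */
    long long dutyUs = waitedUs * targetCpuPercent / 100;
    if (dutyUs < normalDutyUs) dutyUs = normalDutyUs; /* at least one normal cycle */
    if (dutyUs > maxDutyUs) dutyUs = maxDutyUs;       /* cap the compensation */
    return dutyUs;
}
/* Example from the discussion above: a 2s stall with a 10% target
 * yields a 200ms duty cycle, small next to the 2s command itself. */
```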
@JimB123, I think we should implement a simple mechanism that would smooth usage peaks when possible, but it's not a blocker for this PR and can be suggested as a follow-up improvement.
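One possible shape for that follow-up smoothing, sketched with an exponential moving average (the 1/8 weight and all names are illustrative, not part of this PR):

```c
/* Feed the measured wait times through an exponential moving average
 * so a single stall raises the duty cycle gradually over several
 * cycles instead of in one compensating spike. */
static long long smoothedWaitUs = 0;

static long long smoothWait(long long waitedUs) {
    /* EWMA with alpha = 1/8: new = old + (sample - old) / 8 */
    smoothedWaitUs += (waitedUs - smoothedWaitUs) / 8;
    return smoothedWaitUs;
}
```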
That explains it. Thanks!
Thanks for this feature. I only reviewed the configs.
# Minimal effort for defrag in CPU percentage, to be used when the lower
# threshold is reached.
# Note: this is not actually a cycle time, but is an overall CPU percentage
# active-defrag-cycle-min 1
The first sentence already says it's in CPU percentage. No need to mention the same thing twice. Edit the first sentence to make it clearer instead, if needed.
If we want to keep the old names "cycle-min" and "cycle-max" (as requested by Maddy), I think the note is a necessary clarification.
Keep the note, but rewrite the previous sentence and/or the note so it doesn't repeat itself.
# Maximal effort for defrag in CPU percentage, to be used when the upper
# threshold is reached.
# Note: this is not actually a cycle time, but is an overall CPU percentage
# active-defrag-cycle-max 25
Ditto
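For readers of these configs, a simplified sketch of how the two percentages are typically combined, loosely modelled on the existing computeDefragCycles() logic (names and types are illustrative):

```c
/* The effective CPU percentage scales linearly between cycle-min and
 * cycle-max as fragmentation moves between the lower and upper
 * thresholds; outside that band it is clamped to the endpoints. */
static int effectiveCpuPercent(float fragPct, float lowerPct, float upperPct,
                               int cycleMin, int cycleMax) {
    if (fragPct <= lowerPct) return cycleMin;
    if (fragPct >= upperPct) return cycleMax;
    float t = (fragPct - lowerPct) / (upperPct - lowerPct);
    return cycleMin + (int)(t * (cycleMax - cycleMin));
}
```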
Force-pushed from 0ed8c4b to 6cf4898
At this point, I have rebased and either addressed or responded to all comments.
Force-pushed from 6cf4898 to d110d12
Signed-off-by: Jim Brunner <[email protected]>
Force-pushed from d110d12 to 4d2777b
LGTM
Refer to: #1141
This update refactors the defrag code to:
With this update, the defrag cycle time is reduced to 500us, with more frequent cycles. This results in much more predictable latencies, with a dramatic reduction in tail latencies.
(See #1141 for more complete details.)
This update is focused mostly on the high-level processing, and does NOT address lower-level functions which aren't currently timebound (e.g. `activeDefragSdsDict()` and `moduleDefragGlobals()`). These are out of scope for this update and left for a future update.

I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME single shard.

During unit tests, the following max latencies were measured (in verbose mode):
In addition, the following test was run on both old and new versions of the software:
Configuration was set so that both old/new clusters would use 10% defrag CPU.
Defrag time OLD: 120 sec
Defrag time NEW: 105 sec
Times were based on a 5-second poll. (I didn't have logs running.)
The improvement in run time is believed to be due to the unskewed nature of the new code which provides a more accurate 10% of the CPU.
This is the OLD distribution for the GET benchmark while defragging:
This is the equivalent NEW distribution:
You can see in the new distribution that there is a very slight increase in latencies <= 99.951%, in exchange for a huge reduction in tail latencies.
A CONTROL test WITHOUT Active Defrag running shows a very similar distribution with a variation of roughly 500us in the tail latencies (as expected):