
Refactor of ActiveDefrag to reduce latencies #1242

Merged: 2 commits into valkey-io:unstable from active-defrag on Dec 3, 2024

Conversation

@JimB123 (Contributor) commented Oct 31, 2024

Refer to: #1141

This update refactors the defrag code to:

  • Make the overall code more readable and maintainable
  • Reduce latencies incurred during defrag processing

With this update, the defrag cycle time is reduced to 500us, with more frequent cycles. This results in much more predictable latencies, with a dramatic reduction in tail latencies.

(See #1141 for more complete details.)

This update focuses mostly on the high-level processing and does NOT address lower-level functions which aren't currently time-bound (e.g. `activeDefragSdsDict()` and `moduleDefragGlobals()`). These are out of scope for this update and left for a future update.

I fixed `kvstoreDictLUTDefrag` because it was using up to 7ms on a CME single shard.

During unit tests, the following max latencies were measured (in verbose mode):

| Unit test name | Old latency (ms) | New latency (ms) |
|---|---:|---:|
| Active defrag main dictionary: cluster | 8 | 0 |
| Active defrag big keys: cluster | 8 | 0 |
| Active defrag main dictionary: standalone | 8 | 0 |
| Active defrag - AOF loading | 20 | 0 |
| Active defrag big keys: standalone | 8 | 0 |
| Active defrag big list: standalone | 8 | 0 |

In addition, the following test was run on both old and new versions of the software:

```shell
# Create fragmented host
./src/valkey-benchmark -r 10000000 -n 10000000 -d 3 -t set
./src/valkey-benchmark -r 9000000 -n 10000000 -d 11 -t set
./src/valkey-benchmark -r 8000000 -n 10000000 -d 19 -t set
./src/valkey-benchmark -r 7000000 -n 10000000 -d 27 -t set

# Enable defrag while running some traffic
./src/valkey-cli config set activedefrag yes; ./src/valkey-benchmark -r 7000000 -c 1 -n 1000000 -l -t get
```

Configuration was set so that both old/new clusters would use 10% defrag CPU.
Defrag time OLD: 120 sec
Defrag time NEW: 105 sec
Times were based on a 5-second poll. (I didn't have logs running.)

The improvement in run time is believed to be due to the unskewed nature of the new code which provides a more accurate 10% of the CPU.
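To make the "unskewed" point concrete, here is a rough back-of-the-envelope sketch (an editorial illustration, not code from the PR; it assumes the old timer's delay was measured from the end of each defrag run, while the new scheduler compensates for its own run time):

```python
# Why a fixed delay measured from the END of each run skews the duty cycle:
# the true period becomes (run + delay), so the CPU share falls short of
# the target. The new scheduler picks the gap so the share is exact.

def effective_duty_cycle(run_ms: float, delay_ms: float) -> float:
    """CPU share when delay_ms is counted from the end of each run."""
    return run_ms / (run_ms + delay_ms)

old = effective_duty_cycle(10.0, 100.0)  # 10ms of work per 100ms cron delay
new = effective_duty_cycle(0.5, 4.5)     # 500us of work, 4.5ms gap (5ms period)

print(f"old effective share: {old:.1%}")  # 9.1%, short of the 10% target
print(f"new effective share: {new:.1%}")  # 10.0%
```

This small per-run skew, accumulated over a long defrag job, is one plausible explanation for the 120s-vs-105s run times above.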

This is the OLD distribution for the GET benchmark while defragging:

Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 5)
50.000% <= 0.031 milliseconds (cumulative count 953602)
96.875% <= 0.039 milliseconds (cumulative count 980927)
98.438% <= 0.047 milliseconds (cumulative count 990566)
99.219% <= 0.055 milliseconds (cumulative count 993349)
99.609% <= 0.351 milliseconds (cumulative count 996097)
99.805% <= 0.535 milliseconds (cumulative count 998320)
99.902% <= 0.559 milliseconds (cumulative count 999088)
99.951% <= 0.759 milliseconds (cumulative count 999516)
99.976% <= 10.191 milliseconds (cumulative count 999765)
99.988% <= 10.231 milliseconds (cumulative count 999889)
99.994% <= 10.263 milliseconds (cumulative count 999941)
99.997% <= 10.335 milliseconds (cumulative count 999971)
99.998% <= 10.687 milliseconds (cumulative count 999985)
99.999% <= 10.759 milliseconds (cumulative count 999994)
100.000% <= 10.823 milliseconds (cumulative count 999997)
100.000% <= 11.167 milliseconds (cumulative count 999999)
100.000% <= 11.175 milliseconds (cumulative count 1000000)
100.000% <= 11.175 milliseconds (cumulative count 1000000)

This is the equivalent NEW distribution:

Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 8)
50.000% <= 0.031 milliseconds (cumulative count 934278)
93.750% <= 0.039 milliseconds (cumulative count 969396)
98.438% <= 0.055 milliseconds (cumulative count 985215)
99.219% <= 0.535 milliseconds (cumulative count 993656)
99.609% <= 0.543 milliseconds (cumulative count 997369)
99.805% <= 0.551 milliseconds (cumulative count 998365)
99.902% <= 0.583 milliseconds (cumulative count 999070)
99.951% <= 0.783 milliseconds (cumulative count 999515)
99.976% <= 1.023 milliseconds (cumulative count 999775)
99.988% <= 1.047 milliseconds (cumulative count 999903)
99.994% <= 1.071 milliseconds (cumulative count 999945)
99.997% <= 1.175 milliseconds (cumulative count 999970)
99.998% <= 1.295 milliseconds (cumulative count 999985)
99.999% <= 1.527 milliseconds (cumulative count 999994)
100.000% <= 1.543 milliseconds (cumulative count 999997)
100.000% <= 1.551 milliseconds (cumulative count 999999)
100.000% <= 1.735 milliseconds (cumulative count 1000000)
100.000% <= 1.735 milliseconds (cumulative count 1000000)

You can see in the new distribution a very slight increase in latencies at percentiles up to 99.951%, in exchange for a huge reduction in tail latencies.

A CONTROL test WITHOUT Active Defrag running shows a very similar distribution with a variation of roughly 500us in the tail latencies (as expected):

Latency by percentile distribution:
0.000% <= 0.023 milliseconds (cumulative count 7)
50.000% <= 0.031 milliseconds (cumulative count 958519)
96.875% <= 0.039 milliseconds (cumulative count 983380)
98.438% <= 0.047 milliseconds (cumulative count 991683)
99.219% <= 0.055 milliseconds (cumulative count 993548)
99.609% <= 0.319 milliseconds (cumulative count 996120)
99.805% <= 0.527 milliseconds (cumulative count 998137)
99.902% <= 0.543 milliseconds (cumulative count 999134)
99.951% <= 0.567 milliseconds (cumulative count 999575)
99.976% <= 0.615 milliseconds (cumulative count 999759)
99.988% <= 0.767 milliseconds (cumulative count 999879)
99.994% <= 0.911 milliseconds (cumulative count 999940)
99.997% <= 1.015 milliseconds (cumulative count 999975)
99.998% <= 1.031 milliseconds (cumulative count 999991)
99.999% <= 1.039 milliseconds (cumulative count 999995)
100.000% <= 1.055 milliseconds (cumulative count 999998)
100.000% <= 1.063 milliseconds (cumulative count 999999)
100.000% <= 1.071 milliseconds (cumulative count 1000000)
100.000% <= 1.071 milliseconds (cumulative count 1000000)


codecov bot commented Oct 31, 2024

Codecov Report

Attention: Patch coverage is 94.22492% with 19 lines in your changes missing coverage. Please review.

Project coverage is 70.78%. Comparing base (469d41f) to head (2b00b2c).
Report is 14 commits behind head on unstable.

Files with missing lines Patch % Lines
src/defrag.c 94.02% 19 Missing ⚠️
Additional details and impacted files
```
@@             Coverage Diff              @@
##           unstable    #1242      +/-   ##
============================================
+ Coverage     70.62%   70.78%   +0.15%
============================================
  Files           117      118       +1
  Lines         63313    63430     +117
============================================
+ Hits          44716    44896     +180
+ Misses        18597    18534      -63
```
Files with missing lines Coverage Δ
src/ae.c 75.00% <100.00%> (ø)
src/config.c 77.63% <ø> (-1.21%) ⬇️
src/dict.c 97.11% <100.00%> (-0.26%) ⬇️
src/dict.h 100.00% <ø> (ø)
src/kvstore.c 96.32% <100.00%> (+0.03%) ⬆️
src/server.c 87.42% <100.00%> (-0.23%) ⬇️
src/server.h 100.00% <ø> (ø)
src/defrag.c 91.83% <94.02%> (+5.55%) ⬆️

... and 18 files with indirect coverage changes

@JimB123 force-pushed the active-defrag branch 4 times, most recently from ffdf32a to 4054b1f on November 5, 2024 18:44
@madolson (Member) commented Nov 5, 2024

@zvi-code Would you mind also helping take a look at this PR?

@zvi-code (Contributor) commented Nov 7, 2024

@JimB123, I skimmed through the code. At a very high level it looks good (and the results you posted are very promising!). I will follow up with a more detailed review of the code.
Can you please provide an overview of how the new running time is calculated? Specifically, you removed server.hz from the timelimit calculation. Are you saying we should stop using it in other places (like in expiry)? Is there a tradeoff? Could there be imbalance between cron jobs that are scheduled based on different mechanisms? Can you benchmark your change with a large number of clients? [I think dynamic_hz is set to true by default.]

@JimB123 (Contributor, Author) commented Nov 7, 2024

@zvi-code The Valkey event loop supports creation of various timer events. server.hz is used to control the rate for 1 specific timer event, the serverCron timer. This timer handles various periodic maintenance events and is, by default, executed every 100ms (10 times/sec, as indicated by the server.hz value). It makes sense to have a single timer for these periodic maintenance events, as we don't need a separate timer for every little periodic action.

But this serverCron is not the only timer. Evictions, for example, use a timer with a 0ms delay - it fires continuously (every event loop cycle) if we are over maxmemory and critical eviction processing is required. Modules have the ability to create various timers. (At Amazon, we have other internal timers.)

For defrag, there are 2 things that need to happen. First, we need to make a decision if we should begin to defrag. It makes sense to decide this using the main serverCron. We don't need a special timer just to check if we should perform defrag. However, once that decision is made, and we begin the defrag process, it does not make sense to perform the defragmentation using the 100ms server cron. A dedicated timer is warranted - just like is done with eviction.

In the old code, if we target 10% of the CPU, that was done by letting defrag run for (a continuous) 10ms every time the 100ms serverCron fired. That's just a bad idea!

In the new code, with the same 10% CPU target, we run the defrag job for only 500us, but schedule it on its own dedicated timer so that it can run more often. The code modulates the frequency rather than the duration. (In extreme cases, anti-starvation must still modulate the duration, as starvation impacts the frequency.)
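The frequency-modulation idea above can be sketched in a few lines (a hypothetical illustration; the function name and structure are not from the Valkey source):

```python
# Modulating frequency instead of duration: keep each defrag cycle fixed
# (e.g. 500us) and compute the timer delay that yields the requested CPU
# percentage: cycle / (cycle + delay) == target / 100.

def next_timer_delay_us(cycle_us: int, cpu_target_pct: float) -> float:
    """Delay before the next defrag cycle so that cycle_us of work per
    full period equals cpu_target_pct of the CPU."""
    assert 0 < cpu_target_pct < 100
    period_us = cycle_us * 100.0 / cpu_target_pct  # full period for one cycle
    return period_us - cycle_us                    # the idle portion

# 500us of work at a 10% CPU target -> a 5ms period, i.e. a 4.5ms delay
print(next_timer_delay_us(500, 10.0))  # 4500.0
```

With this shape, lowering the CPU target stretches the gap between cycles rather than shrinking the (already short) cycle itself.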

> are you saying we should stop using [server.hz] in other places (like in expiry)?

I haven't looked at the lazy expiration before, but looking at it now, that's exactly what I'm saying. We shouldn't be doing ANYTHING in serverCron that attempts to use a measured percentage of the CPU. The server cron should remain a fast and predictable timer.

> I think dynamic_hz is set to true by default

OK, wow, this is another thing I haven't looked at before. IMO, dynamic_hz is misguided. We have this nice periodic job (serverCron) that runs every 100ms. But one of the things it does is to perform certain actions that iterate over a variable list of clients. It doesn't want to create latency - so it only iterates over a few clients each cycle. BUT - now it might take a very long time to iterate over the list of clients, given the 100ms frequency. So we've decided to perform serverCron more often. This means that we are performing a lot of miscellaneous tasks (like checking if we need to defrag) more often - just because there are more clients? That doesn't make sense. A better solution would be to have a separate timer for a clients-cron. And THAT timer could be run at a higher rate (and dynamically adjusted) based on the number of clients.

This dynamic_hz creates the very weird behavior that, with the OLD defrag, if there are more clients, latency caused by defrag would be reduced. That's a super strange behavior. In the new defrag code, the number of clients doesn't affect defrag latencies; they're now consistently low. In fact, the default 500us setting will result in latencies that are half of the lowest possible (1%) setting on the old defrag (at the standard HZ rate).
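The "half of the lowest possible setting" claim checks out arithmetically (an editorial check, assuming the old scheme ran its whole CPU allotment as one burst per 100ms cron tick):

```python
# Old scheme at its gentlest setting: 1% CPU over a 100ms cron period means
# a single 1ms burst per tick, so the defrag-induced latency floor is ~1ms.
# The new scheme's default burst is a fixed 500us, half of that floor.

old_min_burst_us = 0.01 * 100_000  # 1% of a 100ms cron period, in us
new_burst_us = 500                  # new default active-defrag cycle

print(old_min_burst_us, new_burst_us)  # 1000.0 500
assert new_burst_us == old_min_burst_us / 2
```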

@madolson madolson self-requested a review November 7, 2024 20:26
@madolson (Member) left a comment:

Mostly looks good, much more readable than before! Some minor comments.

@JimB123 (Contributor, Author) left a comment:

Replied to some of @madolson's comments. Will publish an update.

@JimB123 force-pushed the active-defrag branch 2 times, most recently from 3f7d0ad to 825da4f on November 8, 2024 21:51
@madolson (Member) left a comment:

LGTM. @zvi-code would still like you to take a look if you have time.

@madolson (Member) commented Nov 12, 2024

@valkey-io/core-team there is a small major change here in the form of a config that determines how much time we spend on active Defrag. Please take a look and provide 👍 or 👎 . You can see the definition here:

```
# The time spent (in microseconds) of the periodic active defrag process.  This
# affects the latency impact of active defrag on client commands.  Smaller numbers
# will result in less latency impact at the cost of increased defrag overhead.
# active-defrag-cycle-us 500
```

Other than that, this change just better amortizes the active-defrag work to reduce latency spikes.

@madolson madolson added release-notes This issue should get a line item in the release notes major-decision-pending Major decision pending by TSC team labels Nov 12, 2024
@zvi-code (Contributor) commented

@JimB123 , added several comments, will continue to complete the review tomorrow

@madolson (Member) left a comment:

Just cleaning up merge conflicts.

@madolson (Member) left a comment:

A few more merge conflict resolutions.

@madolson (Member) left a comment:

I think this looks good; just waiting for Zvi's review and TSC consensus. I pinged them on Slack as well to take a look here.

@zuiderkwast (Contributor) commented

> @valkey-io/core-team there is a small major change here in the form of a config that determines how much time we spend on active Defrag. Please take a look and provide 👍 or 👎 . You can see the definition here:
>
> ```
> # The time spent (in microseconds) of the periodic active defrag process.  This
> # affects the latency impact of active defrag on client commands.  Smaller numbers
> # will result in less latency impact at the cost of increased defrag overhead.
> # active-defrag-cycle-us 500
> ```
>
> Other than that, this change just better amortizes the active-defrag work to reduce latency spikes.

This config is similar to how we control active rehashing and active expire. I just want to put it into context to confirm this is the config we want.

In general, we have hz to control how often cron runs. Smaller hz means less overhead but longer latency spikes.

Binbin recently adjusted active rehashing to do work based on hz in #977. The higher the hz, the shorter time active rehashing runs each time. Active rehashing only has a yes/no config, and the time spent is fixed to 1% of the CPU time after that PR.

For active defrag, we already have active-defrag-cycle-max, which controls the max percentage of CPU time spent on active defrag. It seems to me that the time spent on each cron cycle can be computed from active-defrag-cycle-max and hz. Then, what's the purpose of active-defrag-cycle-us?

@madolson madolson removed the major-decision-pending Major decision pending by TSC team label Nov 22, 2024
@zvi-code (Contributor) left a comment:

I still have some questions about the risk that this change will actually increase tail latencies in some scenarios. Added specific comments/questions in the relevant code.

@zuiderkwast (Contributor) commented

> If you set active-defrag-cycle-us to 500 and hz to 10, you effectively don't let defrag run more than 5ms per second, which is 0.5% of the CPU time.

> This PR decouples the cycle time from serverCron, so it is not hit by the hz limits. It independently calculates how much time it should be spending doing defrag and schedules its own timer when active defrag is needed.

That explains it. Thanks!
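The arithmetic in that exchange is worth pinning down (an editorial check of the quoted numbers, not code from the PR):

```python
# If the 500us cycle were still tied to serverCron at hz=10, defrag would
# get at most one cycle per tick: 10 * 500us = 5ms of work per second,
# i.e. 0.5% CPU. Decoupling to a dedicated timer removes this hz cap.

hz = 10
cycle_us = 500
work_us_per_sec = hz * cycle_us
cpu_pct = work_us_per_sec / 1_000_000 * 100

print(work_us_per_sec, cpu_pct)  # 5000 0.5
```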

@zuiderkwast (Contributor) left a comment:

Thanks for this feature. I only reviewed the configs.

@JimB123 (Contributor, Author) commented Nov 26, 2024

At this point, I have rebased and either addressed or responded to all comments.

@zvi-code (Contributor) left a comment:

LGTM

@madolson madolson requested a review from zuiderkwast November 29, 2024 16:34
@zuiderkwast zuiderkwast removed their request for review December 2, 2024 18:48
@zuiderkwast (Contributor) left a comment:

Not a full review. I trust the other reviewers.

@madolson madolson merged commit 397201c into valkey-io:unstable Dec 3, 2024
48 checks passed
@JimB123 JimB123 deleted the active-defrag branch December 3, 2024 17:32
hpatro pushed a commit that referenced this pull request Dec 3, 2024
The recent PR (#1242) converted
Active Defrag to use `monotime`. In that change, a conversion was
performed to continue to use `ustime()` as part of the module interface.
Since this time is only used internally, and never actually exposed to
the module, we can convert this to use `monotime` directly.

Signed-off-by: Jim Brunner <[email protected]>
hpatro pushed a commit that referenced this pull request Dec 12, 2024
This update addresses several issues in defrag:
1. In the defrag redesign
(#1242), a bug was introduced
where `server.cronloops` was no longer being incremented in the
`whileBlockedCron()`. This resulted in some memory statistics not being
updated while blocked.
2. In the test case for AOF loading, we were seeing errors due to defrag
latencies. However, running the math, the latencies are justified given
the extremely high CPU target of the testcase. Adjusted the expected
latency check to allow longer latencies for this case where defrag is
undergoing starvation while AOF loading is in progress.
3. A "stage" is passed a "target". For the main dictionary and expires,
we were passing in a `kvstore*`. However, on flushall or swapdb, the
pointer may change. It's safer and more stable to use an index for the
DB (a DBID). Then if the pointer changes, we can detect the change, and
simply abort the stage. (If there's still fragmentation to deal with,
we'll pick it up again on the next cycle.)
4. We always start a new stage on a new defrag cycle. This gives the new
stage time to run, and prevents latency issues for certain stages which
don't operate incrementally. However, often several stages will require
almost no work, and this will leave a chunk of our CPU allotment unused.
This is mainly an issue in starvation situations (like AOF loading or
LUA script) - where defrag is running infrequently, with a large
duty-cycle. This change allows a new stage to be initiated if we still
have a standard duty-cycle remaining. (This can happen during starvation
situations where the planned duty cycle is larger than the standard
cycle. Most likely this isn't a concern for real scenarios, but it was
observed in testing.)
5. Minor comment correction in `server.h`

Signed-off-by: Jim Brunner <[email protected]>
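Point 3 in the commit message above (identifying a stage's target by a stable DB index rather than a raw pointer) can be sketched as follows. This is a hypothetical illustration in Python, not the actual C implementation; the class and field names are invented:

```python
# Hypothetical sketch: a defrag stage remembers a DB index (DBID) plus a
# snapshot of the kvstore object it started on. If FLUSHALL/SWAPDB has
# replaced the kvstore at that index, the stage detects it and aborts;
# remaining fragmentation is picked up on the next defrag cycle.

class DefragStage:
    def __init__(self, dbid: int, databases: list):
        self.dbid = dbid
        self.databases = databases
        self.start_kvstore = databases[dbid]  # identity snapshot

    def step(self) -> bool:
        """Return False (abort stage) if the target kvstore was replaced."""
        current = self.databases[self.dbid]
        if current is not self.start_kvstore:
            return False  # pointer changed under us: abort safely
        # ... defrag a slice of `current` here ...
        return True

dbs = [object(), object()]
stage = DefragStage(0, dbs)
assert stage.step()       # unchanged: keep defragging
dbs[0] = object()         # simulate FLUSHALL swapping in a new kvstore
assert not stage.step()   # change detected: stage aborts
```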
Labels
major-decision-approved Major decision approved by TSC team release-notes This issue should get a line item in the release notes
6 participants