
[NEW] Active Defrag - duty cycle and high level refactoring #1141

Open
JimB123 opened this issue Oct 9, 2024 · 1 comment

JimB123 (Contributor) commented Oct 9, 2024

Overview

In Valkey, fragmentation can accumulate in the blocks of memory managed by jemalloc. To recover this fragmented memory, we have a process (Active Defrag) that scans all of the data structures in Valkey and decides whether any individual allocations need to be relocated. Jemalloc provides hints that help us identify which allocations are worth relocating. After allocations are moved off lesser-used memory pages, jemalloc is able to reuse those pages or return them to the OS.
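
The relocation itself is conceptually a move-and-free: copy the allocation into a fresh block (which the allocator will place on a better-utilized page) and free the original so its page can drain. A minimal sketch of that pattern, where allocatorShouldDefragSketch stands in for the jemalloc hint (names and helpers here are assumptions, not the actual Valkey code):

```c
#include <stdlib.h>
#include <string.h>
#include <stdbool.h>

/* Stand-in for the jemalloc hint that tells us whether this allocation sits
 * on a sparsely used page and is worth moving (placeholder, not the real
 * hint API). */
static bool allocatorShouldDefragSketch(void *ptr) { (void)ptr; return true; }

/* Relocate one allocation: copy it into a fresh block and free the original,
 * so the original page can eventually be reused or returned to the OS.
 * Returns the new pointer, or NULL if the allocation is left in place; the
 * caller must update its reference to the moved data. */
static void *defragAllocSketch(void *ptr, size_t size) {
    if (!allocatorShouldDefragSketch(ptr)) return NULL;
    void *newptr = malloc(size);
    if (newptr == NULL) return NULL;
    memcpy(newptr, ptr, size);
    free(ptr);
    return newptr;
}
```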

Issue/Problems

Duty Cycle: Active Defrag has configuration values which determine the intended percentage of CPU to use, based on a gradient of the fragmentation percentage. However, Active Defrag performs its work on the 100ms serverCron timer: it computes a duty cycle and then performs a single long burst of work (see the sketch after this list). For example, if the intended CPU is computed to be 10%, Active Defrag will perform 10ms of work on each 100ms cron tick.

  • This type of cycle introduces large latencies on the client (up to 25ms with default configurations)
  • This mechanism is subject to starvation when slow commands delay the serverCron
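
To make the problem concrete, here is a rough sketch of how the current duty cycle turns a CPU percentage into one long burst per 100ms tick (names, helpers, and the per-step granularity are illustrative, not the actual serverCron code):

```c
#include <time.h>

/* Illustrative helpers, not Valkey symbols. */
static long long monotonicMsSketch(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000LL + ts.tv_nsec / 1000000LL;
}
static void defragStepSketch(void) { /* one small unit of defrag work */ }

#define SERVER_CRON_PERIOD_MS 100

/* Called once per 100ms serverCron tick. The entire CPU budget for the tick
 * is spent as one contiguous burst, which is where the client latency and
 * the sensitivity to serverCron starvation come from. */
static void defragOnCronSketch(int target_cpu_pct) {
    long long budget_ms = SERVER_CRON_PERIOD_MS * target_cpu_pct / 100;
    long long start = monotonicMsSketch();
    while (monotonicMsSketch() - start < budget_ms)
        defragStepSketch();
}
```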

Maintainability: The current Active Defrag code is difficult to read and maintain. Refactoring the high-level control mechanisms and functions will let us adapt more easily to new defragmentation needs. Specific examples include:

  • A single function (activeDefragCycle) contains both the logic to start/stop/modify the defragmentation and the code that performs one "step" of it. These should be separated so that the actual defrag activity can run on an independent timer (see duty cycle above).
  • The code is focused on kvstores, with other actions just thrown in at the end (defragOtherGlobals). There's no mechanism to break this up to reduce latencies.
  • For the main dictionary (only), there is a mechanism to set aside large keys to be processed in a later step. However, this code creates a separate list in each kvstore (main dict or not), leaking internal defrag logic into kvstore. Only one list is needed, owned by the defrag code itself; this logic should be contained to the main key store.
  • The structure is not well suited to other non-main-dictionary items. For example, pub-sub and pub-sub-shard were added, but in such a way that in CMD mode, with multiple DBs, we will defrag pub-sub repeatedly, once after each DB.
  • There is a set of dictDefragFunctions passed into the process, and the same defrag functions are used for keys, expires, and pubsub. There's a bug where we would try to keep track of expiration entries for non-keys kvstores. However, this is offset by a corresponding bug in dict.c which only invokes those callbacks if the dictionary uses embedded entries (currently unique to the keys dicts). This is contrary to the documentation and the expected operation (and is likely to cause future issues).

Description of the feature

Primarily, this feature will split activeDefragCycle into 2 functions.

  1. One function will be called from serverCron to determine if a defrag cycle (a complete scan) needs to be started. It will also determine if the CPU expenditure needs to be adjusted.
  2. The 2nd function will be a timer proc dedicated to performing defrag. It will be invoked independently of serverCron (a rough sketch follows this list).
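
A possible shape for the split (all names, fields, and the timer-return convention here are assumptions for illustration, not a final API; the stubs only exist to keep the sketch self-contained):

```c
#include <stdbool.h>

/* Illustrative state and stub helpers; names are assumptions, not the final API. */
static bool      defrag_running     = false;
static long long defrag_run_us      = 500;   /* per-invocation work budget */
static long long defrag_interval_ms = 10;    /* delay until the next invocation */

static long long monotonicUsSketch(void)         { return 0; }     /* monotonic clock (us) */
static double    currentFragPctSketch(void)      { return 0; }     /* from allocator stats */
static void      armDefragTimerSketch(void)      { }               /* schedule the timer proc */
static void      updateDefragCpuTargetSketch(void) { }             /* recompute effort from frag % */
static bool      defragDoStepSketch(void)        { return false; } /* one bounded unit of work */

/* 1) Still called from serverCron: only decides whether a new cycle (a
 *    complete scan) should start and whether the CPU target needs adjusting. */
void monitorActiveDefragSketch(double start_threshold_pct) {
    if (!defrag_running && currentFragPctSketch() > start_threshold_pct) {
        defrag_running = true;
        armDefragTimerSketch();
    } else if (defrag_running) {
        updateDefragCpuTargetSketch();
    }
}

/* 2) Dedicated timer proc, invoked independently of serverCron: do defrag
 *    work in a short, bounded slice and return the delay (in ms) until the
 *    next invocation. */
long long activeDefragTimerProcSketch(void) {
    long long deadline = monotonicUsSketch() + defrag_run_us;
    while (monotonicUsSketch() < deadline) {
        if (!defragDoStepSketch()) {    /* the full scan has completed */
            defrag_running = false;
            return -1;                  /* e.g. disarm the timer */
        }
    }
    return defrag_interval_ms;
}
```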

Once the functions are split, there is more control over the latency created by the defrag process. A new configuration will determine the running time of the defrag timer proc; its default will be 500us (one-half of the current minimum time). The timer's period will then be adjusted to achieve the desired CPU. As an example, a 5% CPU target will run the defrag process for 500us every 10ms, which is much better than running for 5ms every 100ms.
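
The arithmetic is simply that, with a fixed per-invocation run time, the timer period determines the effective CPU percentage. A tiny illustration (hypothetical helper name):

```c
/* period_us such that:  cpu_pct = 100 * run_us / period_us  */
static long long defragTimerPeriodUsSketch(long long run_us, int cpu_pct) {
    return 100 * run_us / cpu_pct;
}
/* defragTimerPeriodUsSketch(500, 5)  == 10000  -> 500us of work every 10ms
 * defragTimerPeriodUsSketch(5000, 5) == 100000 -> the old 5ms-every-100ms shape */
```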

The timer function will also adjust to compensate for starvation. If a slow command delays the timer, the process will run proportionately longer to ensure that the configured CPU is achieved. Given the presence of slow commands, the proportional extra time is insignificant compared to the latency those commands already cause. This also addresses the overload case: at 100% CPU, if the event loop slows, defrag will run proportionately longer to achieve the configured CPU utilization.
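
A sketch of what that compensation could look like (hypothetical helper, assuming the timer tracks when it last ran):

```c
/* Scale this invocation's work budget by how late the timer actually fired,
 * so the configured CPU is still achieved even when the event loop stalls. */
static long long defragRunBudgetUsSketch(long long now_us, long long last_run_us,
                                         long long period_us, long long run_us) {
    long long elapsed = now_us - last_run_us;
    if (elapsed <= period_us) return run_us;   /* on time: normal budget */
    return run_us * elapsed / period_us;       /* late: proportionally longer */
}
/* Example: a 500us budget on a 10ms period, but a slow command delayed the
 * timer by 50ms: 500 * 50000 / 10000 = 2500us, preserving the ~5% CPU target. */
```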

Optionally, in low-CPU situations, there would be little impact from using more than the configured CPU. We could allow the timer to pop more often (even with a 0ms delay), and the (tail) latency impact would not change.

Addressing maintainability:

  • The basic code structure can more clearly be organized around a "cycle".
  • Have clear begin/end functions and a set of "stages" to be executed (one possible shape is sketched after this list).
  • Rather than stages being limited to "kvstore" type data, a cycle should be more flexible, incorporating the ability to incrementally perform arbitrary work. This will likely be necessary in the future for certain module types. It can be used today to address oddballs like defragOtherGlobals.
  • We can reduce some of the globals, and reduce some of the coupling. defrag_later should be removed from serverDb.
  • We should fix the improper use of dictDefragFunctions (and correct the compensating bug in dict.c)
  • Each stage should begin on a fresh cycle, so that non-time-bounded operations like kvstoreDictLUTDefrag are less likely to introduce additional latency.
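
One possible shape for the stage-based cycle described above, assuming each stage is a small function that does a bounded slice of work against its own private data (all names here are illustrative, not a proposed API). Kvstore scans, oddballs like defragOtherGlobals, and future module-provided work could all fit this mold as additional stages:

```c
#include <stddef.h>
#include <stdbool.h>

extern long long monotonicUsSketch(void);   /* assumed monotonic clock helper (us) */

/* A stage returns true while it still has more work to do. */
typedef bool (*defragStageFnSketch)(void *privdata, long long deadline_us);

typedef struct {
    defragStageFnSketch fn;
    void *privdata;          /* e.g. which kvstore / db this stage covers */
} defragStageSketch;

static size_t current_stage = 0;   /* progress within the current cycle */

/* Run the current stage until the time budget for this timer invocation is
 * spent. When a stage finishes, the next stage begins on a fresh invocation
 * (last bullet above). Returns true while the cycle still has stages left. */
static bool defragRunStagesSketch(defragStageSketch *stages, size_t nstages,
                                  long long deadline_us) {
    while (current_stage < nstages && monotonicUsSketch() < deadline_us) {
        if (!stages[current_stage].fn(stages[current_stage].privdata, deadline_us)) {
            current_stage++;
            break;   /* next stage starts on the next timer invocation */
        }
    }
    return current_stage < nstages;
}
```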

Status

@JimB123 is currently prototyping and will provide a PR.

PingXie (Member) commented Nov 19, 2024

Thanks @JimB123! I really like your proposal.
