kvserver: add roachtests for raft memory pressure #111259
Labels
A-kv-replication
Relating to Raft, consensus, and coordination.
C-enhancement
Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv
KV Team
A CRDB node can host thousands, or even tens of thousands, of Raft instances that operate semi-independently. Each instance has a bounded footprint: for example, it can pull only up to a certain amount of data into memory and keep only a certain number of messages in flight. There is no node-wide resource limit for Raft, however, so under certain circumstances (#73376, #102840, #105338) the combined usage overloads the node and it OOMs. There is a work track to add memory limits and prevent these overflows.
In the meantime, a number of tests (such as #110764) are susceptible to this issue and fail occasionally. We should have a roachtest that reproduces such high-memory-usage scenarios more reliably, and use it to measure and guide improvements.
Jira issue: CRDB-31840
Epic: CRDB-39898