kvserver: raft receive queue may OOM under overload #71805
cc @cockroachdb/bulkio
n5 is OOM-killed. Prior to that, it was complaining about several ranges with >10s handle raft readies, and claiming it has some unavailable ranges. Also seeing:
Curious, what's this? Agree that this looks like an instance of #71050 (which was accidentally closed - the backport is still open).
@shermanCRL extended blathers so it can be configured to explicitly @-mention a GitHub team when that team's T- label is applied, so that members of that team receive notifications. |
roachtest.restore2TB/nodes=6/cpus=8/pd-volume=2500GB failed with artifacts on release-21.2 @ 9b06fffa0e5ffed8aa92cd0d860380dafe993bd2:
roachtest.restore2TB/nodes=6/cpus=8/pd-volume=2500GB failed with artifacts on release-21.2 @ a5a88a4db163f0915ae65649aa0264c7a913fbfb:
Doing a deep sleuth on the tsdump.

[graph: disk read MB/s and IOPS]

[graph: n3 memory] Yellow and red are Go alloc and Go total, respectively, so that roughly doubled from 3.5 to 7 GB.

A big part of the problem might be that n3 seems to be ruining its L0: [graph] Other nodes don't show anything like this. There's maybe a transient spike to like 6-9, but n3 is way above that towards the end of the graph. So one way to read this is:

Btw, all of the graphs that I've looked at indicate there isn't a load imbalance in this cluster. CPU, RAM, number of replicas and leaseholders, etc. are all pretty balanced before things go sideways on n3.
Concretely, n3 is slow. It's receiving lots of AddSSTable log entries from other nodes (addressing lots of different replicas on n3). They all get put into the scheduler here: `cockroach/pkg/kv/kvserver/store_raft.go`, lines 148 to 151 @ 98a66b5.
Note that there's no backpressure. Even if the scheduler isn't making any progress, we're going to stuff the request into a slice: `cockroach/pkg/kv/kvserver/store_raft.go`, lines 182 to 185 @ 98a66b5.
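To make the pattern concrete, here is a minimal sketch of that enqueue path, using simplified stand-in names (`raftReceiveQueue`, `raftRequestInfo`) rather than the actual `store_raft.go` code:

```go
package example

import "sync"

// raftRequestInfo stands in for an incoming Raft message, e.g. an
// AddSSTable log entry that can be ~10 MB on its own.
type raftRequestInfo struct {
	payload []byte
}

// raftReceiveQueue is a simplified per-replica receive queue.
type raftReceiveQueue struct {
	mu    sync.Mutex
	infos []raftRequestInfo
}

// Append stuffs the request into the slice regardless of whether the
// scheduler is keeping up; nothing bounds the aggregate byte size, so a
// slow node simply keeps accumulating entries in memory.
func (q *raftReceiveQueue) Append(req raftRequestInfo) {
	q.mu.Lock()
	defer q.mu.Unlock()
	q.infos = append(q.infos, req)
}
```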
Now we have a dozen or so scheduler threads. If Pebble is slow, they're going to spend "lots of time" handling each request, so really we're just consuming from a firehose and pulling into memory until we explode. The quota pool isn't really going to help here. For one, the quota pool is per range, so we're still going to accumulate up to num_ranges*quota_pool_size in memory, which can easily be too much. Second, the quota pool isn't always active for all followers, for example one that is catching up. So there isn't good protection here.

We need to talk about backpressuring the incoming raft message stream. If the in-flight scheduler staged request size is too large, we need to stop consuming the firehose. The question is whether it's better to keep pulling and drop events, or whether we want to stop pulling.
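One way to realize the "keep pulling and drop events" option is to track the staged byte size and refuse messages once a budget is exceeded, relying on Raft's retransmission to re-deliver them later. A rough sketch with hypothetical names and an illustrative 32 MB budget, not the actual CockroachDB implementation:

```go
package example

import "sync"

// maxQueuedBytes is an illustrative per-queue budget.
const maxQueuedBytes = 32 << 20 // 32 MB

// boundedReceiveQueue is a hypothetical receive queue that enforces a byte
// budget on staged messages.
type boundedReceiveQueue struct {
	mu          sync.Mutex
	queuedBytes int64
	infos       [][]byte
}

// TryAppend admits the message only while the staged bytes stay under budget;
// otherwise it reports false and the caller drops the message, counting on
// Raft to retransmit it once the follower catches up.
func (q *boundedReceiveQueue) TryAppend(msg []byte) bool {
	q.mu.Lock()
	defer q.mu.Unlock()
	if q.queuedBytes+int64(len(msg)) > maxQueuedBytes {
		return false
	}
	q.queuedBytes += int64(len(msg))
	q.infos = append(q.infos, msg)
	return true
}

// Drain hands the staged messages to a scheduler worker and releases their
// byte accounting.
func (q *boundedReceiveQueue) Drain(process func(msg []byte)) {
	q.mu.Lock()
	staged := q.infos
	q.infos, q.queuedBytes = nil, 0
	q.mu.Unlock()
	for _, msg := range staged {
		process(msg)
	}
}
```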
101437: base: reduce `RaftMaxInflightBytes` to 32 MB r=erikgrinaker a=erikgrinaker

This patch reduces the default `RaftMaxInflightBytes` from 256 MB to 32 MB, to reduce the out-of-memory incidence during bulk operations like `RESTORE` on clusters with overloaded disks.

`RaftMaxInflightBytes` specifies the maximum aggregate byte size of Raft log entries that a leader will send to a follower without hearing responses. As such, it also bounds the amount of replication data buffered in memory on the receiver. Individual messages can still exceed this limit (consider the default command size limit at 64 MB).

Normally, `RaftMaxInflightMsgs` * `RaftMaxSizePerMsg` will bound this at 4 MB (128 messages at 32 KB each). However, individual messages are allowed to exceed the 32 KB limit, typically large AddSSTable commands that can be around 10 MB each. To prevent followers running out of memory, we place an additional total byte limit of 32 MB, which is 8 times more than normal.

A survey of CC clusters over the past 30 days showed that, excluding a single outlier cluster, the total outstanding `raft.rcvd.queued_bytes` of any individual node never exceeded 500 MB, and was roughly 0 across all clusters for the majority of time.

Touches #71805. Resolves #100341. Resolves #100804. Resolves #100983. Resolves #101426.

Epic: none

Release note (ops change): the amount of replication traffic in flight from a single Raft leader to a follower has been reduced from 256 MB to 32 MB, in order to reduce the chance of running out of memory during bulk write operations. This can be controlled via the environment variable `COCKROACH_RAFT_MAX_INFLIGHT_BYTES`.

Co-authored-by: Erik Grinaker <[email protected]>
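The arithmetic in that description can be checked directly. The constants below are taken from the PR text; the identifiers only loosely mirror the actual settings:

```go
package main

import "fmt"

func main() {
	const (
		raftMaxInflightMsgs  = 128       // default per the PR text
		raftMaxSizePerMsg    = 32 << 10  // 32 KB per message
		raftMaxInflightBytes = 32 << 20  // 32 MB (reduced from 256 MB)
	)
	// With well-behaved message sizes, in-flight data is capped at
	// 128 * 32 KB = 4 MB; the byte ceiling only matters when oversized
	// entries (e.g. ~10 MB AddSSTables) blow past the per-message limit.
	normalBound := raftMaxInflightMsgs * raftMaxSizePerMsg
	fmt.Printf("normal in-flight bound: %d MB\n", normalBound>>20)          // 4 MB
	fmt.Printf("hard in-flight ceiling: %d MB\n", raftMaxInflightBytes>>20) // 32 MB
	fmt.Printf("ceiling is %dx the normal bound\n",
		raftMaxInflightBytes/normalBound) // 8x, matching "8 times more than normal"
}
```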
roachtest.restore2TB/nodes=6/cpus=8/pd-volume=2500GB failed with artifacts on release-21.2 @ c5a6b266917ee3846dbd7ae1126c6a5d55cf439b:
Reproduce
See: roachtest README
Jira issue: CRDB-10779
Epic CRDB-39898