kvserver: excessive memory usage when receiving snapshots #73313
This one is proving difficult to track down. I've traced all uses of the returned
I wonder if this is simply because the memory isn't being GCed fast enough -- I'm not sure if the heap dump includes memory that's eligible for GC but not yet GCed. The heap profile above was taken at 09:22:52.207. We have
That is a good thought -- I hadn't considered the amount of RAM these machines have, which will make the GC much more lenient. Given how much we've investigated already without turning up an alternative explanation, I'm not sure there's much to do here until we see this on another cluster with less RAM, the theory being that we will never see it there due to more frequent GC pacing.
Yeah, I'm going to run a couple of GC/pprof experiments to see how it behaves.
Played around a bit with the GC. When reallocating 1 GB of memory inside a loop with 10 iterations, and verifying that no GC has taken place in the meanwhile, the profile only shows 1 GB of allocations even though we've allocated 10 GB across all iterations, and this memory is confirmed not to have been GCed.

This seems to indicate that we do in fact allocate and hold on to the snapshot memory allocations for longer than expected, although I do not understand where that happens since I've carefully audited the code. Perhaps there are some subtle details about memory profiles and GC that I'm missing, or I just didn't audit carefully enough.

Anyway, I think we've spent enough time on this already. I'm going to leave this until we see another instance of it on a different/smaller cluster.
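For reference, a minimal sketch of this kind of experiment (not the actual code used in this thread): disable automatic GC, reallocate ~1 GB per iteration across 10 iterations, confirm via runtime.MemStats that no GC cycle completed, and then write a heap profile to compare against the totals. The file name and sizes are illustrative.

```go
package main

import (
	"fmt"
	"os"
	"runtime"
	"runtime/debug"
	"runtime/pprof"
)

func main() {
	debug.SetGCPercent(-1) // disable automatic GC for the duration of the test

	var before runtime.MemStats
	runtime.ReadMemStats(&before)

	var buf []byte
	for i := 0; i < 10; i++ {
		// Reallocate ~1 GB each iteration; with GC disabled, the previous
		// iteration's slice becomes unreachable garbage but is never collected.
		buf = make([]byte, 1<<30)
		buf[0] = byte(i)
	}

	var after runtime.MemStats
	runtime.ReadMemStats(&after)
	fmt.Printf("GC cycles during loop: %d\n", after.NumGC-before.NumGC) // expect 0
	fmt.Printf("HeapAlloc=%d MB TotalAlloc=%d MB\n",
		after.HeapAlloc>>20, after.TotalAlloc>>20)

	// Note: the heap profile reports statistics as of the most recently
	// completed GC cycle, which is part of what makes these comparisons subtle.
	f, err := os.Create("heap.pprof")
	if err != nil {
		panic(err)
	}
	defer f.Close()
	if err := pprof.Lookup("heap").WriteTo(f, 0); err != nil {
		panic(err)
	}
	runtime.KeepAlive(buf)
}
```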
Thanks @dankinder -- the 1 hour snapshot timeout seems bad, and I think it indicates some other underlying problem. Could you collect a debug.zip and submit it to https://upload.cockroachlabs.com/u/grinaker?
Yup, I just uploaded it @erikgrinaker
Thanks, I'll have a look tomorrow!
Awesome. Btw, I realized that my problem probably is not the same as this issue, since the issue you were investigating seemed to be memory use within the Go heap, not cgo/jemalloc. I've been taking heap profiles when I see the memory spike from 6 GB up to 60 GB+ and I don't see the heap memory really increase. The metrics also indicate it's cgo memory. If you want I can create a separate issue. The only ways I'm seeing to capture a jemalloc profile are either to
I had no idea about admission control; I just found out about it from reading #76282, but I'll be trying that out as well to see if it mitigates this issue.
Cool, worth a shot. I'm afraid I won't have a chance to look into this today, hope to get around to it tomorrow. I believe only Pebble uses cgo allocations (but I could be wrong), so it might be something at the storage layer then.
Sorry for the delay, the last few days have been hectic. I've looked at a bunch of heap profiles from around the time when memory usage is high, but the heaps don't show anything drastic. They all have a fair amount of snapshot memory usage, both when memory usage is high and when it's normal, but it only accounts for something like 200 MB, which seems reasonable.

I do see a fair bit of snapshot activity though, much of which seems to be load-based rebalancing. You're running with snapshot transfer rates of 8 MB/s, which is fairly low. As long as your disks/network can handle it without interfering with the workload, I'd recommend bumping these to at least 32 MB/s, which is the current default (many users run with 256 MB/s).
We've also seen a few instances where the rebalancer was too aggressive in moving replicas around; you can try increasing this to e.g. 0.1 and see if it has any effect:
We have an ongoing investigation into excessive Raft memory usage. Currently we're looking into command application rather than snapshots, and making some progress, but we should have a closer look at snapshots while we're at it (cc @tbg).
This was misleading: #79424
Awesome, thanks for taking a look
@dankinder I don't know if this is directly related to the problems you're seeing, but you may want to follow along with #79752 and #79755. We've particularly seen this cause problems when there are a lot of SSTables flying around (due to bulk writes such as schema changes or imports) and stores struggle to keep up with the load, developing read amp that starts throttling the incoming SSTables long enough that they start piling up in the queues.
Just to throw an idea out here: I wonder if the unexpected memory accumulation described in the original comment could be attributed to something like the common bug described in https://go.dev/blog/generic-slice-functions (slices can reference memory from the slots between the slice's length and its capacity).
We've had (and fixed) a few issues like that previously. Definitely plausible. |
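For illustration, a small self-contained example of the class of bug that blog post describes, with arbitrary sizes chosen only to make the effect visible: truncating a slice of pointers leaves the dropped element reachable through the backing-array slot between the slice's length and capacity, so the GC cannot reclaim it until that slot is cleared.

```go
package main

import (
	"fmt"
	"runtime"
)

type payload struct{ buf [1 << 20]byte } // ~1 MB per element

func main() {
	s := []*payload{new(payload), new(payload), new(payload)}

	// Truncate the slice. The third *payload is no longer visible through s,
	// but the backing-array slot between len(s) and cap(s) still holds the
	// pointer, so the ~1 MB object remains reachable.
	s = s[:2]

	var m runtime.MemStats
	runtime.GC()
	runtime.ReadMemStats(&m)
	fmt.Printf("len=%d cap=%d heap=%d MB\n", len(s), cap(s), m.HeapAlloc>>20)

	// Clearing the dropped slot (similar to what Go 1.22's slices.Delete now
	// does) releases the reference so the GC can reclaim the element.
	s[:3][2] = nil
	runtime.GC()
	runtime.ReadMemStats(&m)
	fmt.Printf("after clearing: heap=%d MB\n", m.HeapAlloc>>20)
}
```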
In a recent support escalation related to OOMs (see internal Slack thread), we noticed that an excessive amount of memory was being used when receiving Raft snapshots (snapshot-memory.pprof.gz):
This 21.2.0 cluster was running 72 nodes with 8 stores each, and there was constant rebalancing churn due to a suspected issue with the allocator, so as many as 8 in-flight snapshots with many more queued up. As shown in the profile, we're allocating a large amount of memory when receiving the snapshot gRPC messages, and apparently holding onto it elsewhere in the system for longer than we should. (The backup memory usage in that profile is being investigated separately, and can be ignored.)
We do not expect to see this amount of memory usage here. On the sender side, we are sending batches of KV pairs of around 256 KB, which is corroborated by the heap profile (buckets are max 304 KB):
cockroach/pkg/kv/kvserver/store_snapshot.go
Lines 348 to 351 in eec9dc3
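As a rough illustration of the batching pattern described above (not the actual store_snapshot.go code; kvPair, streamKVs, sendBatch, and the encoding are hypothetical stand-ins), the sender accumulates encoded KV pairs and flushes whenever the batch reaches roughly 256 KiB:

```go
// Package snapbatch is a hypothetical sketch of flush-at-threshold batching.
package snapbatch

import "encoding/binary"

const batchSize = 256 << 10 // flush once the encoded batch reaches ~256 KiB

type kvPair struct{ key, value []byte }

// streamKVs encodes KV pairs into a batch buffer and hands it to sendBatch
// whenever the buffer grows past batchSize.
func streamKVs(kvs []kvPair, sendBatch func([]byte) error) error {
	batch := make([]byte, 0, batchSize)
	for _, kv := range kvs {
		batch = appendLengthPrefixed(batch, kv.key)
		batch = appendLengthPrefixed(batch, kv.value)
		if len(batch) >= batchSize {
			if err := sendBatch(batch); err != nil {
				return err
			}
			// Allocate a fresh buffer rather than reusing the old one, in case
			// the transport retains a reference to the batch it was handed.
			batch = make([]byte, 0, batchSize)
		}
	}
	if len(batch) > 0 {
		return sendBatch(batch)
	}
	return nil
}

func appendLengthPrefixed(b, v []byte) []byte {
	var tmp [binary.MaxVarintLen64]byte
	n := binary.PutUvarint(tmp[:], uint64(len(v)))
	b = append(b, tmp[:n]...)
	return append(b, v...)
}
```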
On the receiver side, we limit the concurrency of inbound snapshots to 1 per store:
cockroach/pkg/kv/kvserver/store_snapshot.go
Line 582 in eec9dc3
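The limit being referred to is effectively a per-store semaphore of size one. A minimal sketch of that idea, with Store and acquireSnapshotSem as illustrative stand-ins rather than the actual implementation at the referenced line:

```go
// Package snapsem sketches the "one inbound snapshot per store" limit as a
// buffered-channel semaphore.
package snapsem

import "context"

type Store struct {
	snapshotSem chan struct{} // capacity 1: at most one snapshot in flight
}

func NewStore() *Store {
	return &Store{snapshotSem: make(chan struct{}, 1)}
}

// acquireSnapshotSem blocks until the store's single snapshot slot is free,
// or until the sender gives up via context cancellation. The returned release
// function must be called once the snapshot has been applied or abandoned.
func (s *Store) acquireSnapshotSem(ctx context.Context) (release func(), err error) {
	select {
	case s.snapshotSem <- struct{}{}:
		return func() { <-s.snapshotSem }, nil
	case <-ctx.Done():
		return nil, ctx.Err()
	}
}
```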
We then use a multiSSTWriter to spool snapshots to disk, with a chunk size of 512 KB:
cockroach/pkg/kv/kvserver/store_snapshot.go
Line 238 in eec9dc3
And then simply put the KV pair into the SST writer:
cockroach/pkg/kv/kvserver/store_snapshot.go
Line 268 in eec9dc3
Most likely, something inside or adjacent to Pebble ends up holding onto the key and/or value we're passing into Put here. However, attempts to reproduce this by simply feeding batches into Store.receiveSnapshot() have proven fruitless so far.
Jira issue: CRDB-11535
Epic CRDB-39898