Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: excessive memory usage when receiving snapshots #73313

Open
erikgrinaker opened this issue Nov 30, 2021 · 17 comments
Open

kvserver: excessive memory usage when receiving snapshots #73313

erikgrinaker opened this issue Nov 30, 2021 · 17 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team

Comments

@erikgrinaker
Copy link
Contributor

erikgrinaker commented Nov 30, 2021

In a recent support escalation related to OOMs (see internal Slack thread), we noticed that an excessive amount of memory was being used by receiving Raft snapshots (snapshot-memory.pprof.gz):

Screenshot 2021-11-30 at 16 53 24

This 21.2.0 cluster was running 72 nodes with 8 stores each, and there was constant rebalancing churn due to a suspected issue with the allocator, so as many as 8 in-flight snapshots with many more queued up. As shown in the profile, we're allocating a large amount of memory when receiving the snapshot gRPC messages, and apparently holding onto it elsewhere in the system for longer than we should. (The backup memory usage in that profile is being investigated separately, and can be ignored.)

We do not expect to see this amount of memory usage here. On the sender size, we are sending batches of KV pairs of around 256 KB, which is corroborated by the heap profile (buckets are max 304 KB):

if bLen := int64(b.Len()); bLen >= kvSS.batchSize {
if err := kvSS.sendBatch(ctx, stream, b); err != nil {
return 0, err
}

On the receiver side, we limit the concurrency of inbound snapshots to 1 per store:

cleanup, err := s.reserveSnapshot(ctx, header)

We then use a multiSSTWriter to spool snapshots to disk, with a chunk size of 512 KB:

msstw, err := newMultiSSTWriter(ctx, kvSS.scratch, keyRanges, kvSS.sstChunkSize)

And then simply put the KV pair into the SST writer:

if err := msstw.Put(ctx, key, batchReader.Value()); err != nil {

Most likely, something inside or adjacent to Pebble ends up holding onto the key and/or value we're passing into Put here. However, attempts to reproduce this by simply feeding batches into Store.receiveSnapshot() have proven fruitless so far.

Jira issue: CRDB-11535

Epic CRDB-39898

@erikgrinaker erikgrinaker added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. T-kv-replication labels Nov 30, 2021
@erikgrinaker erikgrinaker added the C-investigation Further steps needed to qualify. C-label will change. label Nov 30, 2021
@erikgrinaker
Copy link
Contributor Author

This one is proving difficult to track down. I've traced all uses of the returned SnapshotRequest, and in particular, of the KVBatch byte slice and the key/value sub-slices. However, the key/value slices all end up being copied before they're stored anywhere, and the request itself should be eligible for garbage collection on the next kvBatchSnapshotStrategy.Receive loop iteration.

@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Nov 30, 2021

I wonder if this is simply because the memory isn't being GCed fast enough -- I'm not sure if the heap dump includes memory that's eligible for GC but not yet GCed. The heap profile above was taken at 09:22:52.207. We have MemStats dumps taken at 09:22:32.478 and 09:22:53.508:

NumGC here shows 31518 and 31520 respectively, so 2 GC cycles in 20 seconds. The LastGC fields are consistent with this. There's also about 55 GB heap memory currently allocated (these machines have 128 GB memory), so allocating ~10GB heap memory isn't necessarily that big of a deal. It isn't entirely unthinkable that we receive and process snapshots taking up ~10 GB memory in 10-20 seconds, considering the high rebalancing rate, multiple stores, network buffers, etc.

@tbg
Copy link
Member

tbg commented Dec 1, 2021

That is a good thought, I hadn't considered the amount of RAM these machines have which will cause GC to be much more lenient. Given how much we've investigated already without turning up an alternative explanation, I'm not sure there's much to do here until we see this on another cluster at smaller RAM size, the theory being that we will never see it there due to more frequent GC pacing.

@erikgrinaker
Copy link
Contributor Author

Yeah, I'm going to run a couple of GC/pprof experiments to see how it behaves.

@erikgrinaker
Copy link
Contributor Author

Played around a bit with the GC, it appears like pprof heap profiles only include reachable allocations (i.e. not allocations eligible for GC). Script: heap_test.go.txt.

When reallocating 1 GB of memory inside a loop with 10 iterations, and verifying that no GC has taken place in the meanwhile, the profile only shows 1 GB of allocations even though we've allocated 10 GB across all iterations and this is confirmed to not be GCed by NumGC and HeapAlloc -- the latter shows 10 GB allocations (9 GB of which are unreachable and pending GC), but only 1 GB turns up in the profile. These results hold even when I disable GC entirely with GOGC=off.

This seems to indicate that we do in fact allocate and hold on to the snapshot memory allocations for longer than expected, although I do not understand where that happens since I've carefully audited the code. Perhaps there are some subtle details about memory profiles and GC that I'm missing, or I just didn't audit carefully enough.

Anyway, I think we've spent enough time on this already. I'm going to leave this until we see another instance of this on a different/smaller cluster.

@dankinder
Copy link

dankinder commented Mar 29, 2022

Hi, I may be observing the same thing on one of our clusters where we've had trouble maintaining stability. This is a 16-node v21.2.7 cluster and each node has 64 vCPU, 64GB RAM, and 6 stores (900GB SSD). --cache=0.05 --max-sql-memory=0.1

The heap dumps I've captured don't show anything real interesting, but the metrics show a huge spike in CGo mem which OOMs the node:
Screen Shot 2022-03-29 at 3 12 53 PM
Screen Shot 2022-03-29 at 3 14 05 PM

It's hard to prove it is snapshot-related but I do see network activity and lease transfers happening before things go nuts:
Screen Shot 2022-03-29 at 3 14 27 PM
Screen Shot 2022-03-29 at 3 15 53 PM
Screen Shot 2022-03-29 at 3 15 19 PM

Here's what I see in the logs right around 18:11 when the lease transfers start. The "skipping snapshot" stuff is happening well before and after this, lots of lines like that. Note that another node (not this one which crashed) was recently re-added to the cluster and was receiving ranges from other nodes, so I expect that's why the send-snapshot timed out here.

I220329 18:11:05.069463 479699351 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603790  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:05.169173 479699052 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603791  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):49LEARNER
I220329 18:11:05.269695 479699354 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603792  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER
I220329 18:11:05.348228 479699213 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s84,r209766/55:‹/Table/82/9/-59{84112…-79271…}›] 603793  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s3):59LEARNER
I220329 18:11:05.569182 479697919 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s83,r504360/1:‹/Table/82/9/536{02296…-35683…}›] 603794  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):4LEARNER
I220329 18:11:05.569279 479700194 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s80,r466279/23:‹/Table/82/9/-28{41780…-36626…}›] 603795  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):28LEARNER
I220329 18:11:06.069476 479701407 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603796  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:06.167561 477562849 kv/kvserver/replica_command.go:1564 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603797  could not successfully add and upreplicate LEARNER replica(s) on [n1,s4], rolling back: ‹operation "send-snapshot" timed out after 1h0m0s›: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
I220329 18:11:06.168297 477562849 kv/kvserver/replica_command.go:2126 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603798  change replicas (add [] remove [(n1,s4):49LEARNER]): existing descriptor r195757:‹/Table/75/11/-646{7818859760271177/3793/583000273-6004820332880581/3793/834655436}› [(n5,s28):39, (n12,s71):48, (n14,s81):46, (n1,s4):49LEARNER, next=50, gen=1507]
I220329 18:11:06.169731 479701484 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603799  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):49LEARNER
I220329 18:11:06.179260 477562849 kv/kvserver/replica_raft.go:301 ⋮ [n14,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603800  proposing SIMPLE(r49) [(n1,s4):49LEARNER]: after=[(n5,s28):39 (n12,s71):48 (n14,s81):46] next=50
I220329 18:11:06.181561 477562849 kv/kvserver/replica_command.go:1830 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603801  rolled back LEARNER n1,s4 in r195757:‹/Table/75/11/-646{7818859760271177/3793/583000273-6004820332880581/3793/834655436}› [(n5,s28):39, (n12,s71):48, (n14,s81):46, (n1,s4):49LEARNER, next=50, gen=1507]
I220329 18:11:06.181849 477562849 kv/kvserver/replicate_queue.go:312 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603802  ‹operation "send-snapshot" timed out after 1h0m0s›: ‹rpc error: code = DeadlineExceeded desc = context deadline exceeded›
E220329 18:11:06.182047 477562849 kv/kvserver/queue.go:1088 ⋮ [n14,replicate,s81,r195757/46:‹/Table/75/11/-646{7818…-6004…}›] 603803  ‹operation "replicate queue process replica 195757" timed out after 13m0s›: failed to replicate after 5 retries
I220329 18:11:06.270684 479703075 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603804  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER
I220329 18:11:06.348154 479703479 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s84,r209766/55:‹/Table/82/9/-59{84112…-79271…}›] 603805  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s3):59LEARNER
I220329 18:11:06.569683 479703779 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s80,r466279/23:‹/Table/82/9/-28{41780…-36626…}›] 603806  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):28LEARNER
I220329 18:11:06.569670 479702721 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s83,r504360/1:‹/Table/82/9/536{02296…-35683…}›] 603807  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):4LEARNER
I220329 18:11:07.069202 479705812 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603808  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:07.269738 479705798 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603809  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER
I220329 18:11:07.348126 479706144 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s84,r209766/55:‹/Table/82/9/-59{84112…-79271…}›] 603810  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s3):59LEARNER
I220329 18:11:07.469558 479705111 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s80,r466279/23:‹/Table/82/9/-28{41780…-36626…}›] 603811  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):28LEARNER
I220329 18:11:07.570257 479706291 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s83,r504360/1:‹/Table/82/9/536{02296…-35683…}›] 603812  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):4LEARNER
I220329 18:11:08.070123 479706190 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s79,r122389/48:‹/Table/82/9/-749{8969…-5412…}›] 603813  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):64LEARNER
I220329 18:11:08.269451 479707005 kv/kvserver/raft_snapshot_queue.go:128 ⋮ [n14,raftsnapshot,s82,r453738/20:‹/Table/75/11/20{78193…-80654…}›] 603814  skipping snapshot; replica is likely a LEARNER in the process of being added: (n1,s4):25LEARNER

We've been fighting this for a little while now, seeing OOMs or just latency spikes, and we thought it was because of contention in our workload since we see a small percentage of query restarts happen when things go bad. But in finding this issue I'm wondering if all of that was just a symptom/red herring of some unusual range/snapshot activity.

This cluster has 13k ranges, with most of the data in 2 tables that each are pretty simple and use range min/max 256MB/1GB.

@erikgrinaker
Copy link
Contributor Author

Thanks @dankinder -- the 1 hour snapshot timeout seems bad, and I think indicates some other underlying problem. Could you collect a debug.zip and submit it to https://upload.cockroachlabs.com/u/grinaker?

@dankinder
Copy link

dankinder commented Mar 30, 2022

Yup I just uploaded it @erikgrinaker

@erikgrinaker
Copy link
Contributor Author

Thanks, I'll have a look tomorrow!

@dankinder
Copy link

Awesome. Btw I realized that my problem probably is not the same as this issue, since the issue you were investigating seemed to be memory use within the Go heap, not cgo/jemalloc. I've been taking heap profiles when I see the memory spike from 6GB up to 60GB+ and I don't see the heap mem really increase. The metrics also indicate it's cgo mem. If you want I can create a separate issue.

The only ways I'm seeing to capture a jemalloc profile are either to export MALLOC_CONF=prof:true" or ln -s prof:true /etc/malloc.conf and hope it gets one during a memory spike... wish I could get one on-demand. If you know of a better way, or anything else I should investigate, let me know...

@dankinder
Copy link

I had no idea about admission control, I just found out about it from reading #76282 but I will be trying that out as well to see if it mitigates this issue.

@erikgrinaker
Copy link
Contributor Author

Cool, worth a shot. I'm afraid I won't have a chance to look into this today, hope to get around to it tomorrow.

I believe only Pebble uses cgo allocations (but I could be wrong), so it might be something at the storage layer then.

@erikgrinaker
Copy link
Contributor Author

Sorry for the delay, last few days have been hectic.

I've looked at a bunch of heap profiles of around the time when memory usage is high, but the heaps don't show anything drastic. They all have a fair amount of snapshot memory usage though, both when memory usage is high and when it's normal, but it only accounts for something like 200 MB which seems reasonable. E.g.:

Screenshot 2022-04-05 at 15 15 45

I do see a fair bit of snapshot activity though, much of which seems to be load-based rebalancing. You're running with snapshot transfer rates of 8 MB/s, which is fairly low. As long as your disks/network can handle it without interfering with the workload, I'd recommend bumping these to at least 32 MB/s, which is the current default (many users run with 256 MB/s).

  • kv.snapshot_rebalance.max_rate
  • kv.snapshot_recovery.max_rate

We've also seen a few instances where the rebalancer was too aggressive in moving replicas around, you can try increasing this to e.g. 0.1 and see if it has any effect:

  • kv.allocator.range_rebalance_threshold

We have an ongoing investigation into excessive Raft memory usage. Currently we're looking into command application rather than snapshots, and making some progress, but we should have a closer look at snapshots while we're at it (cc @tbg).

the 1 hour snapshot timeout seems bad, and I think indicates some other underlying problem

This was misleading: #79424

@dankinder
Copy link

Awesome, thanks for taking a look

@erikgrinaker
Copy link
Contributor Author

@dankinder I don't know if this is directly related to the problems you're seeing, but you may want to follow along with #79752 and #79755. We've particularly seen this cause problems when there are a lot of SSTables flying around (due to bulk writes such as schema changes or imports) and stores struggle to keep up with the load, developing read amp that starts throttling the incoming SSTables long enough that they start piling up in the queues.

@erikgrinaker erikgrinaker added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Oct 25, 2023
@pav-kv pav-kv added the P-2 Issues/test failures with a fix SLA of 3 months label Jan 23, 2024
@pav-kv
Copy link
Collaborator

pav-kv commented Mar 21, 2024

Just to throw an idea here, I wonder if the unexpected memory accumulation described in the original comment could be attributed to something like a common bug described in https://go.dev/blog/generic-slice-functions (slices can reference memory from the slots between len(slice) and cap(slice)). Sounds unlikely, but worth to keep in mind.

@erikgrinaker
Copy link
Contributor Author

We've had (and fixed) a few issues like that previously. Definitely plausible.

@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-investigation Further steps needed to qualify. C-label will change. C-performance Perf of queries or internals. Solution not expected to change functional behavior. O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs P-2 Issues/test failures with a fix SLA of 3 months T-kv KV Team
Projects
No open projects
Status: Incoming
Development

No branches or pull requests

4 participants