roachtest: tpccbench/nodes=9/cpu=4/multi-region failed [sst raft oom] #71802
This seems related, but not the same. I downloaded this file and uploaded it to https://share.polarsignals.com/3cfb86e/. The big icicle in the middle is memory allocated when sending SSTables in sideloaded raft proposals to other nodes as part of catching them up on the raft log. There are various limits (per replica) at play here, all of which are bundled up in cockroach/pkg/kv/kvserver/store.go, lines 223 to 240 in 0e0082d.
In particular, each individual append should be limited to 32 KiB (though this is a "target size", i.e. it can be overshot; lines 132 to 135 in 0a040d6), and we allow at most 128 such messages in flight per follower (lines 143 to 147 in 0a040d6).
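To make the overshoot concrete, here is a minimal illustrative sketch (not the actual etcd/raft or CockroachDB code; all names are invented): a batcher that stops once a running total reaches a target still admits the entry that crosses it, so a single 32 MB SST entry sails far past a 32 KiB target.

```go
// Illustrative only: a "target size" batcher includes the entry that
// crosses the threshold, so one huge entry overshoots the target by
// roughly its own size.
package main

import "fmt"

type entry struct {
	payload []byte
}

// batchEntries collects entries until the running total reaches targetBytes.
// The entry that crosses the target is still included, which is why the
// target can be overshot by up to one entry's size.
func batchEntries(entries []entry, targetBytes int) []entry {
	var batch []entry
	var size int
	for _, e := range entries {
		batch = append(batch, e)
		size += len(e.payload)
		if size >= targetBytes {
			break
		}
	}
	return batch
}

func main() {
	// One 32 MB sideloaded SST entry versus a 32 KiB target.
	entries := []entry{{payload: make([]byte, 32<<20)}}
	batch := batchEntries(entries, 32<<10)
	fmt.Printf("batched %d bytes against a %d byte target\n",
		len(batch[0].payload), 32<<10)
}
```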
Naively, you would think this means that the most we can allocate on the leader for each follower is 128 × 32 KiB, i.e. 4 MiB. However, say every entry is a 32 MB SST: then in effect we can get up to 128 × 32 MB, i.e. 4 GB, which can certainly give us the problems we see here. (And don't forget, it could happen for multiple followers too, giving us another factor of the follower count.) The way messages are sent is that they are handed to an "outgoing queue", where they are put on a (large) buffered channel: cockroach/pkg/kv/kvserver/raft_transport.go, lines 566 to 572 in 54e004a,
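As context for why the buffered channel does not help, here is a hedged sketch (invented names, not the raft_transport.go implementation) of a queue bounded by message count rather than by bytes: 128 messages of 32 MB each fit easily into a buffer with thousands of slots, so the channel capacity provides no real memory limit.

```go
// Sketch only, with invented names; not the actual raft_transport.go code.
package outqueue

// raftMessage stands in for a raft message whose payload may embed an
// entire sideloaded SSTable.
type raftMessage struct {
	payload []byte
}

// outgoingQueue bounds the number of queued messages, but not their total
// byte size: 128 queued 32 MB payloads (4 GB) occupy only 128 slots.
type outgoingQueue struct {
	msgs chan raftMessage
}

func newOutgoingQueue(slots int) *outgoingQueue {
	return &outgoingQueue{msgs: make(chan raftMessage, slots)}
}

// enqueue hands the message to the sender goroutine, dropping it if the
// buffer is full. Nothing here accounts for bytes already buffered.
func (q *outgoingQueue) enqueue(m raftMessage) bool {
	select {
	case q.msgs <- m:
		return true
	default:
		return false
	}
}
```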
so it is possible that these 4 GB are in memory all at once. On the other end of this channel we indeed have the problem fixed by #71748: we may hold on to the SSTs for even longer. But even with that PR merged, and even if that PR avoids this particular crash, there is a problem here: pulling this much data into memory is an issue in itself. Ideally we would have a way to bound how much of it is held in memory at once; a sketch of what such a bound could look like follows below.

What I don't understand is why we "suddenly" have test coverage for these issues through backup/restore. @dt, do you have any idea why we're seeing these kinds of issues now? Perhaps some change in how the SSTs used by IMPORT/RESTORE are sized, or in the concurrency with which they are distributed, or some blocking that has been removed? |
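A minimal sketch of the kind of bound alluded to above (hypothetical; not an existing CockroachDB API): a byte budget that the sender acquires before handing an SST-carrying message to the outgoing queue and releases once the message has been written to the wire, capping the bytes held in memory regardless of how large individual entries are.

```go
// Hypothetical sketch of a byte budget for outgoing raft messages; this is
// not an existing CockroachDB API.
package bytebudget

import "sync"

// byteBudget caps the total payload bytes in flight. acquire blocks until
// enough budget is free; release returns budget once a message is sent.
// (A single request larger than the limit would block forever in this
// sketch; a real implementation would need to handle that case.)
type byteBudget struct {
	mu    sync.Mutex
	cond  *sync.Cond
	avail int64
}

func newByteBudget(limit int64) *byteBudget {
	b := &byteBudget{avail: limit}
	b.cond = sync.NewCond(&b.mu)
	return b
}

// acquire blocks the caller until n bytes of budget are available.
func (b *byteBudget) acquire(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	for b.avail < n {
		b.cond.Wait()
	}
	b.avail -= n
}

// release returns n bytes to the budget and wakes any blocked senders.
func (b *byteBudget) release(n int64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.avail += n
	b.cond.Broadcast()
}
```

With something like this, enqueueing a 32 MB sideloaded SST would call acquire(32 << 20) first, so at most `limit` bytes of SST data could sit in the outgoing queue at any one time.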
cc @adityamaru |
Not blocking rc3 any more, since #71748 merged. |
@dt we were just revisiting this and the question came up again of whether something changed on the Bulk I/O side in terms of usage of AddSSTable. Are there any changes you suspect of having changed the access pattern? |
I don't know of anything that changed in IMPORT or RESTORE SST sizes. 21.1 saw a large number of fixes to improve the work distribution during RESTORE, since we previously saw cases where we were blocking on splits or downloading and rewriting SSTables for periods during which we were not sending SSTs to KV, so there was a lot of work done to improve splitting throughput, pipeline downloads with sending, etc., basically all focused on keeping the sending of ingest SSTs better saturated. But most of that happened in 21.1 or early in 21.2, so I don't know if anything changed more recently. |
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ c754d101ccd7541b0f597dac2f37809c1a859bf2:
Same failure on other branches
|
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ 2c4bb88e5318fe187e1bf6cb134b31bd63f63528:
Same failure on other branches
|
Nothing new:
|
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ e9fd200d3567aa542da6cd1f255e4d2971cbdd9e:
Same failure on other branches
|
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ 6133ffd5459ae01d79e3dfd98528e557bb868eca:
Same failure on other branches
|
We have marked this test failure issue as stale because it has been |
Using #80155 as the main tracking issue. |
roachtest.tpccbench/nodes=9/cpu=4/multi-region failed with artifacts on release-21.2 @ c5a6b266917ee3846dbd7ae1126c6a5d55cf439b:
Reproduce
See: roachtest README
This test on roachdash
Jira issue: CRDB-10778