kvserver: use ClearRawRange to truncate very large Raft logs #75793
Conversation
Force-pushed from f2c460d to d3f4ce7.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):
```go
// trigger it, unless the average log entry is <= 160 bytes. The key size is
// ~16 bytes, so Pebble point deletion batches will be bounded at ~1.6MB.
raftLogTruncationClearRangeThreshold = 100000
```
ClearRangeWithHeuristic and MVCCClearTimeRange both use a threshold of 64 keys. The comments there indicate those thresholds were pulled out of the air. Should this threshold be different? We are doing Raft log truncation much more frequently than clearing ranges due to rebalancing. 100k feels much too large to me. I think there is a useful bit of benchmarking and analysis to do to figure out a better value here, though this value is fine for the purposes of this PR. Cc @sumeerbhola for thoughts and getting the Storage team to track this work.
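For readers unfamiliar with the heuristic being referenced: the idea is simply to switch from per-key point deletes to a single range tombstone once the number of keys crosses a threshold. A minimal, self-contained sketch of that pattern (the `deleter` interface and function names here are illustrative stand-ins, not CockroachDB's actual storage API):

```go
package example

// deleter is an illustrative stand-in for a Pebble-backed writer that supports
// both point tombstones and range tombstones.
type deleter interface {
    ClearPoint(key []byte) error        // one point tombstone per key
    ClearRange(start, end []byte) error // one range tombstone over [start, end)
}

// clearKeysWithHeuristic deletes each key individually unless there are more
// than threshold of them, in which case it lays down a single range tombstone
// covering [keys[0], end). This mirrors the trade-off discussed above: point
// deletes for small truncations, one range tombstone for very large ones.
func clearKeysWithHeuristic(d deleter, keys [][]byte, end []byte, threshold int) error {
    if len(keys) == 0 {
        return nil
    }
    if len(keys) > threshold {
        return d.ClearRange(keys[0], end)
    }
    for _, k := range keys {
        if err := d.ClearPoint(k); err != nil {
            return err
        }
    }
    return nil
}
```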
pkg/roachpb/data.go, line 178 at r1 (raw file):
```go
// Copy returns a copy of the key.
func (k Key) Copy() Key {
```
Nit: we use the verb Clone to refer to this operation inside of Pebble (on Pebble InternalKeys).
Force-pushed from d3f4ce7 to 66fd9f4.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @sumeerbhola, and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
ClearRangeWithHeuristic and MVCCClearTimeRange both use a threshold of 64 keys. The comments there indicate those thresholds were pulled out of the air. Should this threshold be different? We are doing Raft log truncation much more frequently than clearing ranges due to rebalancing. 100k feels much too large to me. I think there is a useful bit of benchmarking and analysis to do to figure out a better value here, though this value is fine for the purposes of this PR. Cc @sumeerbhola for thoughts and getting the Storage team to track this work.
Yeah, I don't have a good sense of the number of range tombstones this could result in under various conditions and whether that would be problematic, so I was very conservative to keep this backportable. To take an unrealistic worst-case scenario: in a 3-node cluster with 100k ranges that are all seeing writes, if one node fails, a lower value could drop 100k range tombstones when all ranges hit RaftLogTruncationThreshold (and possibly continue to do so at regular intervals).
I think a value of e.g. 1000 would be more reasonable, but I'd want additional testing to validate that.
pkg/roachpb/data.go, line 178 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: we use the verb Clone to refer to this operation inside of Pebble (on Pebble InternalKeys).
Ah yes, I see that's used elsewhere in this package too. Updated, thanks.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Yeah, I don't have a good sense of the number of range tombstones this could result in under various conditions and whether that would be problematic, so I was very conservative to keep this backportable. To take an unrealistic worst-case scenario: in a 3-node cluster with 100k ranges that are all seeing writes, if one node fails, a lower value could drop 100k range tombstones when all ranges hit RaftLogTruncationThreshold (and possibly continue to do so at regular intervals).
I think a value of e.g. 1000 would be more reasonable, but I'd want additional testing to validate that.
@cockroachdb/storage
In that 3-node example the alternative could be 1.6MB * 100K = 160GB of write batches.
We can add this discussion to cockroachdb/pebble#1295
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):
Previously, sumeerbhola wrote…
@cockroachdb/storage
In that 3-node example the alternative could be 1.6MB * 100K = 160GB of write batches.
We can add this discussion to cockroachdb/pebble#1295
It makes sense to me to set this threshold high for backports.
Lots of range tombstones aren't a problem by themselves. Processing of range tombstones is a bit more expensive than point tombstones, but if the choice is between 100k range tombstones or hundreds of millions of point tombstones I suspect the range tombstones will perform better. It should definitely be possible to simulate this scenario.
I seem to recall that someone (@nvanbenschoten?) experimented with using range tombstones for truncating the Raft log at some point and found a performance win, but we were scared to turn them on. I can't seem to find where that was done, though.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It makes sense to me to set this threshold high for backports.
Lots of range tombstones aren't a problem by themselves. Processing of range tombstones is a bit more expensive than point tombstones, but if the choice is between 100k range tombstones or hundreds of millions of point tombstones I suspect the range tombstones will perform better. It should definitely be possible to simulate this scenario.
I seem to recall that someone (@nvanbenschoten?) experimented with using range tombstones for truncating the Raft log at some point and found a performance win, but we were scared to turn them on. I can't seem to find where that was done, though.
Here is the previous discussion of using ClearRange for truncating the Raft log: #28126 (comment).
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Here is the previous discussion of using ClearRange for truncating the Raft log: #28126 (comment).
Thanks, good points. I'll write up an issue (after a meeting block) to track this. There are also some related changes we should explore -- for example, the current RaftLogTruncationThreshold of 16 MB seems tuned for the old 64 MB range size, and way too aggressive for 512 MB ranges (it's certainly better to send 30 MB of log entries than a 500 MB snapshot). There are likely other settings that should be reviewed too, and it'd make sense to tune them together.
Tobi beat me to it: #75802. Added a note to also tune
The code here looks good. Thanks for getting it out so fast. There are more discussions to be had about tuning these thresholds, but it looks like we have separate GitHub issues for those.
Reviewed 1 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):
```go
// trigger it, unless the average log entry is <= 160 bytes. The key size is
// ~16 bytes, so Pebble point deletion batches will be bounded at ~1.6MB.
raftLogTruncationClearRangeThreshold = 100000
```
Should we set this to a metamorphic value so that it changes in tests? And should we then wrap that with a cluster setting, at least for the backport?
pkg/kv/kvserver/replica_raft.go, line 1949 at r2 (raw file):
```go
end := prefixBuf.RaftLogKey(suggestedTruncatedState.Index + 1).Clone() // end is exclusive
if err := readWriter.ClearRawRange(start, end); err != nil {
    return false, errors.Wrapf(err, "unable to clear truncated Raft entries for %+v at index %d-%d",
```
minor nit: s/at index/between indexes/
pkg/roachpb/data.go, line 178 at r2 (raw file):
```go
// Clone returns a copy of the key.
func (k Key) Clone() Key {
```
I suspect we're going to quickly wonder how we ever lived without this.
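The quoted diff cuts off at the opening brace. For completeness, a byte-slice clone of this shape is typically just an allocate-and-copy; the body below is an assumption for illustration, not necessarily the exact code in the PR:

```go
package example

// Key mirrors roachpb.Key for the purposes of this sketch.
type Key []byte

// Clone returns a copy of the key. Preserving nil lets callers distinguish
// "no key" from "empty key". (Assumed body; the quoted diff does not show it.)
func (k Key) Clone() Key {
    if k == nil {
        return nil
    }
    c := make(Key, len(k))
    copy(c, k)
    return c
}
```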
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):
How am I only now learning about ConstantWithMetamorphicTestRange? This is great!
And should we then wrap that with a cluster setting, at least for the backport?
I'm not sure. I'm sort of hesitant to add a cluster setting for every little thing, but maybe there's no good reason to be stingy with them? They're certainly great to have when they're needed.
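For context, a metamorphic constant is one whose value is randomized within a range under test builds so that both truncation paths get exercised, while production builds keep the default. The sketch below only illustrates the concept; CockroachDB's real helper is util.ConstantWithMetamorphicTestRange, whose exact signature and wiring are not shown here and may differ:

```go
package example

import "math/rand"

// metamorphicEnabled stands in for the build-tag/env-var gate that CockroachDB
// uses to turn metamorphic testing on; here it is just an illustrative flag.
var metamorphicEnabled = false

// constantWithMetamorphicRange returns defaultValue in normal builds, or a
// random value in [min, max) when metamorphic testing is enabled, so tests hit
// both the point-delete and ClearRawRange code paths.
func constantWithMetamorphicRange(defaultValue, min, max int) int {
    if metamorphicEnabled {
        return min + rand.Intn(max-min)
    }
    return defaultValue
}

// Hypothetical wiring for the threshold discussed in this thread.
var raftLogTruncationClearRangeThreshold = uint64(constantWithMetamorphicRange(100000, 1, 1000000))
```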
Force-pushed from 66fd9f4 to e559539.
Force-pushed from e559539 to a93d974.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)
pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
How am I only now learning about ConstantWithMetamorphicTestRange? This is great!
And should we then wrap that with a cluster setting, at least for the backport?
I'm not sure. I'm sort of hesitant to add a cluster setting for every little thing, but maybe there's no good reason to be stingy with them? They're certainly great to have when they're needed.
We discussed this briefly in the repl team, and will drop the cluster setting. While they're a nice panic button and convenient for tuning/benchmarking, exposing user-facing knobs also has costs in terms of maintenance, test surface, and user error.
I could be convinced otherwise, but we can address that in a follow-up PR if necessary.
TFTRs! bors r=nvanbenschoten
Forgot to handle
bors r-
Canceled.
Force-pushed from a93d974 to 2167dfc.
bors r=nvanbenschoten
Build succeeded:
Encountered an error creating backports. Some common things that can go wrong:
You might need to create your backport manually using the backport tool.
error creating merge commit from 2167dfc to blathers/backport-release-21.1-75793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.1.x failed. See errors above.
error creating merge commit from 2167dfc to blathers/backport-release-21.2-75793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict [] you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 21.2.x failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Raft log truncation was done using point deletes in a single Pebble
batch. If the number of entries to truncate was very large, this could
result in overflowing the Pebble batch, causing the node to panic. This
has been seen to happen e.g. when the snapshot rate was set very low,
effectively stalling snapshot transfers which in turn held up log
truncations for extended periods of time.
This patch uses a Pebble range tombstone if the number of entries to
truncate is very large (>100k). In most common cases, point deletes are
still used, to avoid writing too many range tombstones to Pebble.
Release note (bug fix): Fixed a bug which could cause nodes to crash
when truncating abnormally large Raft logs.