
kvserver: use ClearRawRange to truncate very large Raft logs #75793

Merged

Conversation

Contributor

@erikgrinaker erikgrinaker commented Feb 1, 2022

Raft log truncation was done using point deletes in a single Pebble
batch. If the number of entries to truncate was very large, this could
result in overflowing the Pebble batch, causing the node to panic. This
has been seen to happen e.g. when the snapshot rate was set very low,
effectively stalling snapshot transfers which in turn held up log
truncations for extended periods of time.

This patch uses a Pebble range tombstone if the number of entries to
truncate is very large (>100k). In most common cases, point deletes are
still used, to avoid writing too many range tombstones to Pebble.

Release note (bug fix): Fixed a bug which could cause nodes to crash
when truncating abnormally large Raft logs.
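
For orientation, here is a minimal, self-contained Go sketch of the strategy the description outlines: point deletes up to a threshold, a single range tombstone beyond it. The `writer` interface, the `raftLogKey` encoding, and the names below are simplified stand-ins for illustration; the actual patch goes through the storage engine's ClearRawRange, as the review comments further down show.

```go
package raftlogsketch

import "fmt"

// writer is a stand-in for the storage batch that the truncation writes into;
// it is not the real Pebble or storage.Engine interface.
type writer interface {
	clearPoint(key string) error        // one point tombstone per log entry
	clearRange(start, end string) error // one range tombstone covering [start, end)
}

// clearRangeThreshold mirrors raftLogTruncationClearRangeThreshold from the PR:
// above this many entries, a single range tombstone replaces per-key deletes.
const clearRangeThreshold = 100000

// truncate deletes Raft log entries in [first, last), choosing the deletion
// strategy by entry count as described in the PR text above.
func truncate(w writer, rangeID int64, first, last uint64) error {
	if last <= first {
		return nil
	}
	if last-first > clearRangeThreshold {
		// Very large truncation: one range tombstone keeps the batch small.
		return w.clearRange(raftLogKey(rangeID, first), raftLogKey(rangeID, last))
	}
	// Common case: point deletes, so Pebble doesn't accumulate range tombstones.
	for idx := first; idx < last; idx++ {
		if err := w.clearPoint(raftLogKey(rangeID, idx)); err != nil {
			return err
		}
	}
	return nil
}

// raftLogKey is a simplified, made-up key encoding for this sketch.
func raftLogKey(rangeID int64, index uint64) string {
	return fmt.Sprintf("/rangeid/%d/raftlog/%d", rangeID, index)
}
```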

@erikgrinaker erikgrinaker requested review from tbg and a team February 1, 2022 13:31
@erikgrinaker erikgrinaker requested a review from a team as a code owner February 1, 2022 13:31
@erikgrinaker erikgrinaker self-assigned this Feb 1, 2022

@erikgrinaker erikgrinaker force-pushed the raft-log-truncate-clearrange branch from f2c460d to d3f4ce7 on February 1, 2022 13:37
Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

	// trigger it, unless the average log entry is <= 160 bytes. The key size is
	// ~16 bytes, so Pebble point deletion batches will be bounded at ~1.6MB.
	raftLogTruncationClearRangeThreshold = 100000

ClearRangeWithHeuristic and MVCCClearTimeRange both use a threshold of 64 keys. The comments there indicate those thresholds were pulled out of the air. Should this threshold be different? We are doing Raft log truncation much more frequently than clearing ranges due to rebalancing. 100k feels much too large to me. I think there is a useful bit of benchmarking and analysis to do to figure out a better value here, though this value is fine for the purposes of this PR. Cc @sumeerbhola for thoughts and getting the Storage team to track this work.


pkg/roachpb/data.go, line 178 at r1 (raw file):

// Copy returns a copy of the key.
func (k Key) Copy() Key {

Nit: we use the verb Clone to refer to this operation inside of Pebble (on Pebble InternalKeys).

@erikgrinaker erikgrinaker force-pushed the raft-log-truncate-clearrange branch from d3f4ce7 to 66fd9f4 on February 1, 2022 14:09
Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

ClearRangeWithHeuristic and MVCCClearTimeRange both use a threshold of 64 keys. The comments there indicate those thresholds were pulled out of the air. Should this threshold be different? We are doing Raft log truncation much more frequently than clearing ranges due to rebalancing. 100k feels much too large to me. I think there is a useful bit of benchmarking and analysis to do to figure out a better value here, though this value is fine for the purposes of this PR. Cc @sumeerbhola for thoughts and getting the Storage team to track this work.

Yeah, I don't have a good sense of the number of range tombstones this could result in under various conditions and whether that would be problematic, so I was very conservative to keep this backportable. To take an unrealistic worst-case scenario: in a 3-node cluster with 100k ranges that are all seeing writes, if one node fails, a lower value could drop 100k range tombstones when all ranges hit RaftLogTruncationThreshold (and possibly continue to do so at regular intervals).

I think a value of e.g. 1000 would be more reasonable, but I'd want additional testing to validate that.


pkg/roachpb/data.go, line 178 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: we use the verb Clone to refer to this operation inside of Pebble (on Pebble InternalKeys).

Ah yes, I see that's used elsewhere in this package too. Updated, thanks.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, I don't have a good sense of the number of range tombstones this could result in under various conditions and whether that would be problematic, so I was very conservative to keep this backportable. To take an unrealistic worst-case scenario: in a 3-node cluster with 100k ranges that are all seeing writes, if one node fails, a lower value could drop 100k range tombstones when all ranges hit RaftLogTruncationThreshold (and possibly continue to do so at regular intervals).

I think a value of e.g. 1000 would be more reasonable, but I'd want additional testing to validate that.

@cockroachdb/storage
In that 3-node example the alternative could be 1.6MB * 100K = 160GB of write batches.
We can add this discussion to cockroachdb/pebble#1295
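
The ~16-byte key size and the 100k threshold come from the code comment quoted at the top of this thread; spelling the arithmetic out:

$$
100{,}000 \;\text{keys} \times \sim\!16 \;\text{B/key} \approx 1.6 \;\text{MB per truncation batch}, \qquad 1.6 \;\text{MB} \times 100{,}000 \;\text{ranges} \approx 160 \;\text{GB in total}.
$$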

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, sumeerbhola wrote…

@cockroachdb/storage
In that 3-node example the alternative could be 1.6MB * 100K = 160GB of write batches.
We can add this discussion to cockroachdb/pebble#1295

It makes sense to me to set this threshold high for backports.

Lots of range tombstones aren't a problem by themselves. Processing of range tombstones is a bit more expensive than point tombstones, but if the choice is between 100k range tombstones or hundreds of millions of point tombstones I suspect the range tombstones will perform better. It should definitely be possible to simulate this scenario.

I seem to recall that someone (@nvanbenschoten?) experimented with using range tombstones for truncating the Raft log at some point and found a performance win, but we were scared to turn them on. I can't seem to find where that was done, though.

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It makes sense to me to set this threshold high for backports.

Lots of range tombstones aren't a problem by themselves. Processing of range tombstones is a bit more expensive than point tombstones, but if the choice is between 100k range tombstones or hundreds of millions of point tombstones I suspect the range tombstones will perform better. It should definitely be possible to simulate this scenario.

I seem to recall that someone (@nvanbenschoten?) experimented with using range tombstones for truncating the Raft log at some point and found a performance win, but we were scared to turn them on. I can't seem to find where that was done, though.

Here is the previous discussion of using ClearRange for truncating the Raft log: #28126 (comment).

Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Here is the previous discussion of using ClearRange for truncating the Raft log: #28126 (comment).

Thanks, good points. I'll write up an issue (after a meeting block) to track this. There are also some related changes we should explore -- for example, the current RaftLogTruncationThreshold of 16 MB seems tuned for the old 64 MB range size, and way too aggressive for 512 MB ranges (it's certainly better to send 30 MB of log entries than a 500 MB snapshot). There are likely other settings that should be reviewed too, and it'd make sense to tune them together.

@erikgrinaker
Contributor Author

Tobi beat me to it: #75802. Added a note to also tune raftLogTruncationClearRangeThreshold.

Member

@nvanbenschoten nvanbenschoten left a comment

The code here looks good. Thanks for getting it out so fast. There are more discussions to be had about tuning these thresholds, but it looks like we have separate github issues for those.

Reviewed 1 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):

	// trigger it, unless the average log entry is <= 160 bytes. The key size is
	// ~16 bytes, so Pebble point deletion batches will be bounded at ~1.6MB.
	raftLogTruncationClearRangeThreshold = 100000

Should we set this to a metamorphic value so that it changes in tests? And should we then wrap that with a cluster setting, at least for the backport?


pkg/kv/kvserver/replica_raft.go, line 1949 at r2 (raw file):

		end := prefixBuf.RaftLogKey(suggestedTruncatedState.Index + 1).Clone() // end is exclusive
		if err := readWriter.ClearRawRange(start, end); err != nil {
			return false, errors.Wrapf(err, "unable to clear truncated Raft entries for %+v at index %d-%d",

minor nit: s/at index/between indexes/
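
An aside on the `.Clone()` in the quoted line: presumably `prefixBuf` reuses one underlying byte buffer across `RaftLogKey` calls (an assumption, not stated in this thread), so a key that must outlive the next call has to be copied first. A tiny standalone Go illustration of that aliasing hazard:

```go
package main

import "fmt"

func main() {
	// A reusable buffer, standing in for a key prefix buffer.
	buf := make([]byte, 0, 16)

	first := append(buf, "key-100"...)         // aliases buf's backing array
	firstCopy := append([]byte(nil), first...) // a real copy, like Key.Clone

	second := append(buf, "key-200"...) // reuses the same backing array

	fmt.Println(string(first))     // prints "key-200": clobbered by the second append
	fmt.Println(string(firstCopy)) // prints "key-100": survived because it was copied
	fmt.Println(string(second))    // prints "key-200"
}
```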


pkg/roachpb/data.go, line 178 at r2 (raw file):

// Clone returns a copy of the key.
func (k Key) Clone() Key {

I suspect we're going to quickly wonder how we ever lived without this.

Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):
How am I only now learning about ConstantWithMetamorphicTestRange? This is great!

And should we then wrap that with a cluster setting, at least for the backport?

I'm not sure. I'm sort of hesitant to add a cluster setting for every little thing, but maybe there's no good reason to be stingy with them? They're certainly great to have when they're needed.
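
To make the metamorphic-value suggestion concrete, here is a rough sketch of the idea using a hypothetical helper; the real ConstantWithMetamorphicTestRange lives in CockroachDB's util code and its exact signature may differ:

```go
package metamorphicsketch

import "math/rand"

// metamorphicTesting stands in for however the real framework detects a
// metamorphic test build (build tag, env var, ...); it is hypothetical here.
var metamorphicTesting = false

// constantWithTestRange is a simplified, hypothetical version of the helper:
// production builds get defaultValue, while metamorphic test builds get a
// random value in [min, max) so that tests exercise many different thresholds.
func constantWithTestRange(defaultValue, min, max int) int {
	if metamorphicTesting {
		return min + rand.Intn(max-min)
	}
	return defaultValue
}

// The truncation threshold could then be initialized like this
// (the bounds are illustrative, not taken from the PR).
var exampleClearRangeThreshold = constantWithTestRange(100000, 1, 1000000)
```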

Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

How am I only now learning about ConstantWithMetamorphicTestRange? This is great!

And should we then wrap that with a cluster setting, at least for the backport?

I'm not sure. I'm sort of hesitant to add a cluster setting for every little thing, but maybe there's no good reason to be stingy with them? They're certainly great to have when they're needed.

We discussed this briefly in the repl team, and will drop the cluster setting. While they're a nice panic button and convenient for tuning/benchmarking, exposing user-facing knobs also has costs in terms of maintenance, test surface, and user error.

I could be convinced otherwise, but we can address that in a follow-up PR if necessary.

@erikgrinaker
Contributor Author

TFTRs!

bors r=nvanbenschoten

@erikgrinaker
Contributor Author

Forgot to handle nil in Key.Clone().

bors r-
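
For reference, a minimal sketch of the nil-safe Clone implied by the note above; roachpb.Key is a byte slice, but this is not necessarily the exact code that was merged:

```go
package roachpbsketch

// Key mirrors roachpb.Key's underlying []byte type for this sketch.
type Key []byte

// Clone returns a copy of the key, preserving nil as nil.
func (k Key) Clone() Key {
	if k == nil {
		return nil
	}
	c := make(Key, len(k))
	copy(c, k)
	return c
}
```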

@craig
Contributor

craig bot commented Feb 2, 2022

Canceled.

@erikgrinaker erikgrinaker force-pushed the raft-log-truncate-clearrange branch from a93d974 to 2167dfc on February 2, 2022 11:25
@erikgrinaker
Contributor Author

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Feb 2, 2022

Build succeeded:

@craig craig bot merged commit 0e5eb08 into cockroachdb:master Feb 2, 2022
@blathers-crl

blathers-crl bot commented Feb 2, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2167dfc to blathers/backport-release-21.1-75793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


error creating merge commit from 2167dfc to blathers/backport-release-21.2-75793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Labels
O-postmortem Originated from a Postmortem action item.