
kvserver: use ClearRawRange to truncate very large Raft logs #75793

Merged

Conversation

Contributor

@erikgrinaker erikgrinaker commented Feb 1, 2022

Raft log truncation was done using point deletes in a single Pebble
batch. If the number of entries to truncate was very large, this could
result in overflowing the Pebble batch, causing the node to panic. This
has been seen to happen e.g. when the snapshot rate was set very low,
effectively stalling snapshot transfers which in turn held up log
truncations for extended periods of time.

This patch uses a Pebble range tombstone if the number of entries to
truncate is very large (>100k). In most common cases, point deletes are
still used, to avoid writing too many range tombstones to Pebble.

Release note (bug fix): Fixed a bug which could cause nodes to crash
when truncating abnormally large Raft logs.
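
For orientation, here is a minimal, self-contained Go sketch of the strategy the description outlines: point deletes up to a threshold, a single range tombstone beyond it. The `writer` interface, the `raftLogKey` encoding, and the names below are simplified stand-ins for illustration; the actual patch goes through the storage engine's ClearRawRange, as the review comments further down show.

```go
package raftlogsketch

import "fmt"

// writer is a stand-in for the storage batch that the truncation writes into;
// it is not the real Pebble or storage.Engine interface.
type writer interface {
	clearPoint(key string) error        // one point tombstone per log entry
	clearRange(start, end string) error // one range tombstone covering [start, end)
}

// clearRangeThreshold mirrors raftLogTruncationClearRangeThreshold from the PR:
// above this many entries, a single range tombstone replaces per-key deletes.
const clearRangeThreshold = 100000

// truncate deletes Raft log entries in [first, last), choosing the deletion
// strategy by entry count as described in the PR text above.
func truncate(w writer, rangeID int64, first, last uint64) error {
	if last <= first {
		return nil
	}
	if last-first > clearRangeThreshold {
		// Very large truncation: one range tombstone keeps the batch small.
		return w.clearRange(raftLogKey(rangeID, first), raftLogKey(rangeID, last))
	}
	// Common case: point deletes, so Pebble doesn't accumulate range tombstones.
	for idx := first; idx < last; idx++ {
		if err := w.clearPoint(raftLogKey(rangeID, idx)); err != nil {
			return err
		}
	}
	return nil
}

// raftLogKey is a simplified, made-up key encoding for this sketch.
func raftLogKey(rangeID int64, index uint64) string {
	return fmt.Sprintf("/rangeid/%d/raftlog/%d", rangeID, index)
}
```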

@erikgrinaker erikgrinaker requested review from tbg and a team February 1, 2022 13:31
@erikgrinaker erikgrinaker requested a review from a team as a code owner February 1, 2022 13:31
@erikgrinaker erikgrinaker self-assigned this Feb 1, 2022

@erikgrinaker erikgrinaker force-pushed the raft-log-truncate-clearrange branch from f2c460d to d3f4ce7 on February 1, 2022 13:37
Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

	// trigger it, unless the average log entry is <= 160 bytes. The key size is
	// ~16 bytes, so Pebble point deletion batches will be bounded at ~1.6MB.
	raftLogTruncationClearRangeThreshold = 100000

ClearRangeWithHeuristic and MVCCClearTimeRange both use a threshold of 64 keys. The comments there indicate those thresholds were pulled out of the air. Should this threshold be different? We are doing Raft log truncation much more frequently than clearing ranges due to rebalancing. 100k feels much too large to me. I think there is a useful bit of benchmarking and analysis to do to figure out a better value here, though this value is fine for the purposes of this PR. Cc @sumeerbhola for thoughts and getting the Storage team to track this work.


pkg/roachpb/data.go, line 178 at r1 (raw file):

// Copy returns a copy of the key.
func (k Key) Copy() Key {

Nit: we use the verb Clone to refer to this operation inside of Pebble (on Pebble InternalKeys).

@erikgrinaker erikgrinaker force-pushed the raft-log-truncate-clearrange branch from d3f4ce7 to 66fd9f4 on February 1, 2022 14:09
Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis, @sumeerbhola, and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

ClearRangeWithHeuristic and MVCCClearTimeRange both use a threshold of 64 keys. The comments there indicate those thresholds were pulled out of the air. Should this threshold be different? We are doing Raft log truncation much more frequently than clearing ranges due to rebalancing. 100k feels much too large to me. I think there is a useful bit of benchmarking and analysis to do to figure out a better value here, though this value is fine for the purposes of this PR. Cc @sumeerbhola for thoughts and getting the Storage team to track this work.

Yeah, I don't have a good sense of the number of range tombstones this could result in under various conditions and whether that would be problematic, so I was very conservative to keep this backportable. To take an unrealistic worst-case scenario: in a 3-node cluster with 100k ranges that are all seeing writes, if one node fails, a lower value could drop 100k range tombstones when all ranges hit RaftLogTruncationThreshold (and possibly continue to do so at regular intervals).

I think a value of e.g. 1000 would be more reasonable, but I'd want additional testing to validate that.


pkg/roachpb/data.go, line 178 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: we use the verb Clone to refer to this operation inside of Pebble (on Pebble InternalKeys).

Ah yes, I see that's used elsewhere in this package too. Updated, thanks.

Collaborator

@sumeerbhola sumeerbhola left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

Yeah, I don't have a good sense of the number of range tombstones this could result in under various conditions and whether that would be problematic, so I was very conservative to keep this backportable. To take an unrealistic worst-case scenario: in a 3-node cluster with 100k ranges that are all seeing writes, if one node fails, a lower value could drop 100k range tombstones when all ranges hit RaftLogTruncationThreshold (and possibly continue to do so at regular intervals).

I think a value of e.g. 1000 would be more reasonable, but I'd want additional testing to validate that.

@cockroachdb/storage
In that 3-node example the alternative could be 1.6MB * 100K = 160GB of write batches.
We can add this discussion to cockroachdb/pebble#1295
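
The ~16-byte key size and the 100k threshold come from the code comment quoted at the top of this thread; spelling the arithmetic out:

$$
100{,}000 \;\text{keys} \times \sim\!16 \;\text{B/key} \approx 1.6 \;\text{MB per truncation batch}, \qquad 1.6 \;\text{MB} \times 100{,}000 \;\text{ranges} \approx 160 \;\text{GB in total}.
$$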

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten, @petermattis, and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, sumeerbhola wrote…

@cockroachdb/storage
In that 3-node example the alternative could be 1.6MB * 100K = 160GB of write batches.
We can add this discussion to cockroachdb/pebble#1295

It makes sense to me to set this threshold high for backports.

Lots of range tombstones aren't a problem by themselves. Processing of range tombstones is a bit more expensive than point tombstones, but if the choice is between 100k range tombstones or hundreds of millions of point tombstones I suspect the range tombstones will perform better. It should definitely be possible to simulate this scenario.

I seem to recall that someone (@nvanbenschoten?) experimented with using range tombstones for truncating the Raft log at some point and found a performance win, but we were scared to turn them on. I can't seem to find where that was done, though.

Collaborator

@petermattis petermattis left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It makes sense to me to set this threshold high for backports.

Lots of range tombstones aren't a problem by themselves. Processing of range tombstones is a bit more expensive than point tombstones, but if the choice is between 100k range tombstones or hundreds of millions of point tombstones I suspect the range tombstones will perform better. It should definitely be possible to simulate this scenario.

I seem to recall that someone (@nvanbenschoten?) experimented with using range tombstones for truncating the Raft log at some point and found a performance win, but we were scared to turn them on. I can't seem to find where that was done, though.

Here is the previous discussion of using ClearRange for truncating the Raft log: #28126 (comment).

Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Here is the previous discussion of using ClearRange for truncating the Raft log: #28126 (comment).

Thanks, good points. I'll write up an issue (after a meeting block) to track this. There are also some related changes we should explore -- for example, the current RaftLogTruncationThreshold of 16 MB seems tuned for the old 64 MB range size, and way too aggressive for 512 MB ranges (it's certainly better to send 30 MB of log entries than a 500 MB snapshot). There are likely other settings that should be reviewed too, and it'd make sense to tune them together.

@erikgrinaker
Contributor Author

Tobi beat me to it: #75802. Added a note to also tune raftLogTruncationClearRangeThreshold.

Member

@nvanbenschoten nvanbenschoten left a comment

The code here looks good. Thanks for getting it out so fast. There are more discussions to be had about tuning these thresholds, but it looks like we have separate github issues for those.

Reviewed 1 of 4 files at r1, 3 of 3 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker, @nvanbenschoten, and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):

	// trigger it, unless the average log entry is <= 160 bytes. The key size is
	// ~16 bytes, so Pebble point deletion batches will be bounded at ~1.6MB.
	raftLogTruncationClearRangeThreshold = 100000

Should we set this to a metamorphic value so that it changes in tests? And should we then wrap that with a cluster setting, at least for the backport?


pkg/kv/kvserver/replica_raft.go, line 1949 at r2 (raw file):

		end := prefixBuf.RaftLogKey(suggestedTruncatedState.Index + 1).Clone() // end is exclusive
		if err := readWriter.ClearRawRange(start, end); err != nil {
			return false, errors.Wrapf(err, "unable to clear truncated Raft entries for %+v at index %d-%d",

minor nit: s/at index/between indexes/
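
An aside on the `.Clone()` in the quoted line: presumably `prefixBuf` reuses one underlying byte buffer across `RaftLogKey` calls (an assumption, not stated in this thread), so a key that must outlive the next call has to be copied first. A tiny standalone Go illustration of that aliasing hazard:

```go
package main

import "fmt"

func main() {
	// A reusable buffer, standing in for a key prefix buffer.
	buf := make([]byte, 0, 16)

	first := append(buf, "key-100"...)         // aliases buf's backing array
	firstCopy := append([]byte(nil), first...) // a real copy, like Key.Clone

	second := append(buf, "key-200"...) // reuses the same backing array

	fmt.Println(string(first))     // prints "key-200": clobbered by the second append
	fmt.Println(string(firstCopy)) // prints "key-100": survived because it was copied
	fmt.Println(string(second))    // prints "key-200"
}
```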


pkg/roachpb/data.go, line 178 at r2 (raw file):

// Clone returns a copy of the key.
func (k Key) Clone() Key {

I suspect we're going to quickly wonder how we ever lived without this.

Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):
How am I only now learning about ConstantWithMetamorphicTestRange? This is great!

And should we then wrap that with a cluster setting, at least for the backport?

I'm not sure. I'm sort of hesitant to add a cluster setting for every little thing, but maybe there's no good reason to be stingy with them? They're certainly great to have when they're needed.
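
To make the metamorphic-value suggestion concrete, here is a rough sketch of the idea using a hypothetical helper; the real ConstantWithMetamorphicTestRange lives in CockroachDB's util code and its exact signature may differ:

```go
package metamorphicsketch

import "math/rand"

// metamorphicTesting stands in for however the real framework detects a
// metamorphic test build (build tag, env var, ...); it is hypothetical here.
var metamorphicTesting = false

// constantWithTestRange is a simplified, hypothetical version of the helper:
// production builds get defaultValue, while metamorphic test builds get a
// random value in [min, max) so that tests exercise many different thresholds.
func constantWithTestRange(defaultValue, min, max int) int {
	if metamorphicTesting {
		return min + rand.Intn(max-min)
	}
	return defaultValue
}

// The truncation threshold could then be initialized like this
// (the bounds are illustrative, not taken from the PR).
var exampleClearRangeThreshold = constantWithTestRange(100000, 1, 1000000)
```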

Contributor Author

@erikgrinaker erikgrinaker left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @tbg)


pkg/kv/kvserver/replica_raft.go, line 62 at r2 (raw file):

Previously, erikgrinaker (Erik Grinaker) wrote…

How am I only now learning about ConstantWithMetamorphicTestRange? This is great!

And should we then wrap that with a cluster setting, at least for the backport?

I'm not sure. I'm sort of hesitant to add a cluster setting for every little thing, but maybe there's no good reason to be stingy with them? They're certainly great to have when they're needed.

We discussed this briefly in the repl team, and will drop the cluster setting. While they're a nice panic button and convenient for tuning/benchmarking, exposing user-facing knobs also has costs in terms of maintenance, test surface, and user error.

I could be convinced otherwise, but we can address that in a follow-up PR if necessary.

@erikgrinaker
Contributor Author

TFTRs!

bors r=nvanbenschoten

@erikgrinaker
Contributor Author

Forgot to handle nil in Key.Clone().

bors r-
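
For reference, a minimal sketch of the nil-safe Clone implied by the note above; roachpb.Key is a byte slice, but this is not necessarily the exact code that was merged:

```go
package roachpbsketch

// Key mirrors roachpb.Key's underlying []byte type for this sketch.
type Key []byte

// Clone returns a copy of the key, preserving nil as nil.
func (k Key) Clone() Key {
	if k == nil {
		return nil
	}
	c := make(Key, len(k))
	copy(c, k)
	return c
}
```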

@craig
Contributor

craig bot commented Feb 2, 2022

Canceled.

@erikgrinaker erikgrinaker force-pushed the raft-log-truncate-clearrange branch from a93d974 to 2167dfc on February 2, 2022 11:25
@erikgrinaker
Contributor Author

bors r=nvanbenschoten

@craig
Contributor

craig bot commented Feb 2, 2022

Build succeeded:

@craig craig bot merged commit 0e5eb08 into cockroachdb:master Feb 2, 2022
@blathers-crl

blathers-crl bot commented Feb 2, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from 2167dfc to blathers/backport-release-21.1-75793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.1.x failed. See errors above.


error creating merge commit from 2167dfc to blathers/backport-release-21.2-75793: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 21.2.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Labels
O-postmortem Originated from a Postmortem action item.