kvserver: fix delaying of splits with uninitialized followers #64060
Conversation
Force-pushed from d6b73e0 to ed47b60
Reviewed 5 of 5 files at r1, 1 of 1 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/client_split_burst_test.go, line 104 at r1 (raw file):
})
var returned bool
defer func() {

nit: You can use `t.Cleanup()` to avoid the `defer` acrobatics here and in the test.
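For illustration, a minimal Go sketch of the suggestion (hypothetical test setup, not the actual code in client_split_burst_test.go): t.Cleanup registers the teardown when the test state is created, so the test body needs no deferred closure or `returned` flag.

```go
package example

import "testing"

// setupTestState is a hypothetical helper mirroring the pattern under
// discussion: instead of
//
//	var returned bool
//	defer func() { if !returned { ... } }()
//
// the cleanup is registered up front and runs automatically once the test
// (and all of its subtests) has finished, however it returns.
func setupTestState(t *testing.T) *int {
	state := new(int)
	t.Cleanup(func() {
		*state = 0 // teardown
	})
	return state
}

func TestWithCleanup(t *testing.T) {
	s := setupTestState(t)
	*s = 42
	// No defer bookkeeping needed here.
}
```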
pkg/kv/kvserver/client_split_burst_test.go, line 171 at r1 (raw file):
// We should have applied all but the last split on the slow node.
// If we didn't, that indicates a failure to delay the splits
// accordingly.
This comment doesn't match the assertion (which says that the slow follower hasn't seen any splits).
pkg/kv/kvserver/split_delay_helper.go, line 169 at r1 (raw file):
// See TestSplitBurstWithSlowFollower for end-to-end verification of this
// mechanism.
if state := raftStatus.RaftState; state != raft.StateLeader {

nit: the `state` assignment doesn't seem necessary here.
pkg/kv/kvserver/split_delay_helper.go, line 96 at r2 (raw file):
}

func maybeDelaySplitToAvoidSnapshot(ctx context.Context, sdh splitDelayHelperI) (_s string) {

nit: Why the `_s`?
pkg/kv/kvserver/split_delay_helper.go, line 225 at r2 (raw file):
}
if len(lastProblems) != 0 {
	msg = fmt.Sprintf("; delayed by %.1fs to resolve: %s", slept.Seconds(), strings.Join(lastProblems, "; "))

This is going to generate the message when we didn't actually delay as well, e.g. if the "replica is raft follower" problem occurs on the first iteration. Not necessarily an issue, but worth considering only including the "delayed by" bit if `slept > 0`.
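For illustration, the reviewer's suggestion might look roughly like the sketch below. `buildDelayMessage` is a hypothetical stand-in for the message construction quoted above; the only change is the `slept > 0` guard.

```go
package main

import (
	"fmt"
	"strings"
	"time"
)

// buildDelayMessage builds the suffix appended to the split-delay log line.
// Guarding on slept > 0 means no "delayed by" text is emitted when the loop
// bailed out on its first iteration without actually sleeping.
func buildDelayMessage(slept time.Duration, lastProblems []string) string {
	var msg string
	if slept > 0 && len(lastProblems) != 0 {
		msg = fmt.Sprintf("; delayed by %.1fs to resolve: %s",
			slept.Seconds(), strings.Join(lastProblems, "; "))
	}
	return msg
}

func main() {
	// No delay happened: empty message.
	fmt.Printf("%q\n", buildDelayMessage(0, []string{"replica is raft follower"}))
	// An actual delay: the suffix is included.
	fmt.Printf("%q\n", buildDelayMessage(1200*time.Millisecond, []string{"replica r2/3 not caught up"}))
}
```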
Force-pushed from ed47b60 to bec56cf
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/client_split_burst_test.go, line 171 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
// We should have applied all but the last split on the slow node.
// If we didn't, that indicates a failure to delay the splits
// accordingly.
This comment doesn't match the assertion (which says that the slow follower hasn't seen any splits).
Done.
pkg/kv/kvserver/split_delay_helper.go, line 96 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: Why the `_s`?
Had some debugging here that I hadn't backed out, done.
pkg/kv/kvserver/split_delay_helper.go, line 225 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This is going to generate the message when we didn't actually delay as well, e.g. if the "replica is raft follower" problem occurs on the first iteration. Not necessarily an issue, but worth considering only including the "delayed by" bit if `slept > 0`.
We'll print "delayed by 0.00s" which seems reasonable enough, given that we don't expect to hit that branch much. Sadly I expect us to grep for this message at some point, so it's helpful if it doesn't have too many bells and whistles. I'll leave as is.
"When executing a split, it's surprisingly tricky to learn what the resulting left-hand and right-hand side is. This is because"

...because what?
Reviewed 3 of 3 files at r3, 1 of 1 files at r4, 1 of 1 files at r5.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
Force-pushed from bec56cf to 9e8378b
Oops, didn't mean to push this one. There's some weirdness I still haven't worked my way through. Removed.
Force-pushed from 9e8378b to 7e238c5
Ha, that was fun. I had a data race when constructing the split key, which really messed things up in the /random flavor of the test and led to wildly corrupted range descriptors being written to meta2.
Heh oops! 😅
Reviewed 3 of 3 files at r6.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker)
Force-pushed from 7e238c5 to c0e9428
When executing a split, it's surprisingly tricky to learn what the resulting left-hand and right-hand sides are. This is because when you retrieve them "after the fact", other operations may have changed the split already (for example, the split could have been merged, or additional splits added), and while you would get descriptors back, they wouldn't be meaningfully connected to the split any more in all cases.

Really one would want to return the descriptors from the split txn itself, but AdminSplit is a no-op when the split already exists, and so we would need special logic that performs a range lookup on the left neighbor. It could all be done, but does not seem worth it.

There's still a nice simplification here that lets us remove the ad-hoc code, and we'll just accept that when there are concurrent splits the return values may not exactly line up with the split.

This came out of cockroachdb#64060 (comment).

Release note: None
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot, where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits, such as occurs in RESTORE and IMPORT operations, is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place.

Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon.

With all this in place, you would've expected us to have all bases covered, but it turns out that we were still running into issues prior to this PR.

Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem. While the raft group is still clamoring for leadership, the delay is upheld. Once it has collapsed into a definite state, the existing logic already did the right thing, as it waited for the right-hand side to be initialized.

Closes #61396.

Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
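A heavily simplified sketch of the adjusted behavior described above (the types and the maybeDelaySplit helper below are hypothetical stand-ins for illustration, not the actual splitDelayHelperI interface or maybeDelaySplitToAvoidSnapshot): while the raft group is still campaigning, the delay is upheld; once it has settled into a definite state, the split only proceeds when no follower is waiting to be initialized.

```go
package main

import (
	"fmt"
	"time"
)

// RaftState is a stand-in for the raft library's StateType.
type RaftState int

const (
	StateFollower RaftState = iota
	StatePreCandidate
	StateCandidate
	StateLeader
)

// splitDelayer is a hypothetical, simplified interface exposing just enough
// to illustrate the fix.
type splitDelayer interface {
	RaftState() RaftState        // current state of the LHS replica's raft group
	UninitializedFollowers() int // followers that have not caught up to prior splits
	Sleep(d time.Duration)
}

// maybeDelaySplit keeps delaying while the replica is still clamoring for
// leadership (pre-candidate / candidate), instead of treating "not leader" as
// "nothing to wait for", which is the bug described in the commit message.
// Once the state has settled, it waits for followers to be initialized, up to
// a budget of maxTicks iterations.
func maybeDelaySplit(sdh splitDelayer, tick time.Duration, maxTicks int) time.Duration {
	var slept time.Duration
	for i := 0; i < maxTicks; i++ {
		switch sdh.RaftState() {
		case StatePreCandidate, StateCandidate:
			// Still campaigning: uphold the delay and check again later.
		case StateLeader:
			if sdh.UninitializedFollowers() == 0 {
				return slept // safe to split now
			}
		default:
			// A follower won't be executing the split; nothing to delay for.
			return slept
		}
		sdh.Sleep(tick)
		slept += tick
	}
	return slept
}

// fakeDelayer simulates a group that campaigns for two ticks, then becomes
// leader with one follower that initializes after one more tick.
type fakeDelayer struct{ ticks int }

func (f *fakeDelayer) RaftState() RaftState {
	if f.ticks < 2 {
		return StateCandidate
	}
	return StateLeader
}

func (f *fakeDelayer) UninitializedFollowers() int {
	if f.ticks < 3 {
		return 1
	}
	return 0
}

func (f *fakeDelayer) Sleep(time.Duration) { f.ticks++ }

func main() {
	slept := maybeDelaySplit(&fakeDelayer{}, 100*time.Millisecond, 10)
	fmt.Printf("delayed split by %s\n", slept) // 300ms in this simulation
}
```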
I noticed that TestSplitBurstWithSlowFollower would average only <10 splits per second even when no lagging replica is introduced (i.e. with the `time.Sleep` commented out). Investigating the raft chatter suggested that after campaigning, the leaseholder of the right-hand side would not be processed by its Store for a `Ready` handling cycle until ~50ms (the coalesced heartbeat timeout in the test) had passed, and a similar delay was observed when it was sending out its votes. Adding a call to `enqueueRaftUpdateCheck` fixes both; we end up at ~100 splits per second.

Release note: None
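The ~50ms stall described above is the generic cost of waiting for the next tick of a periodic processing loop; the fix is to nudge the scheduler explicitly (via `enqueueRaftUpdateCheck` in the actual code). The sketch below is only a generic illustration of that latency difference, assuming a toy tick-driven loop rather than CockroachDB's raft scheduler or the real `enqueueRaftUpdateCheck` signature.

```go
package main

import (
	"fmt"
	"time"
)

// processLoop drains the work channel either when the ticker fires (waiting
// for the next periodic pass) or when nudge is signaled (an explicit
// "process this replica now" wake-up).
func processLoop(work <-chan time.Time, nudge <-chan struct{}, done chan<- time.Duration) {
	ticker := time.NewTicker(50 * time.Millisecond) // stands in for the ~50ms coalesced heartbeat interval
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
		case <-nudge:
		}
		select {
		case enqueued := <-work:
			done <- time.Since(enqueued)
			return
		default:
		}
	}
}

// measure enqueues one unit of work and reports how long it sat before being
// processed, with and without the explicit nudge.
func measure(useNudge bool) time.Duration {
	work := make(chan time.Time, 1)
	nudge := make(chan struct{}, 1)
	done := make(chan time.Duration)
	go processLoop(work, nudge, done)
	work <- time.Now()
	if useNudge {
		nudge <- struct{}{} // explicit wake-up, analogous to enqueueRaftUpdateCheck
	}
	return <-done
}

func main() {
	fmt.Printf("without nudge: ~%s\n", measure(false).Round(time.Millisecond)) // up to one full tick (~50ms)
	fmt.Printf("with nudge:    ~%s\n", measure(true).Round(time.Millisecond))  // effectively immediate
}
```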
Force-pushed from c0e9428 to 9282a53
bors r=erikgrinaker
Build succeeded.
63471: backupccl: rework split and scatter mechanism r=adityamaru a=pbardea

**backupccl: don't include empty import entries during restore**

This commit filters out importSpans that have no associated data. File spans' start keys are set to be the .Next() of the EndKey of the previous file span. This was at least the case on the backup that is currently being run against restore2TB. This led to a lot of importSpan entries that contained no files and would cover an empty key range.

The presence of these import spans did not affect the performance of restores that much, but they did cause unnecessary splits and scatters, which further contributed to the elevated NLHEs that were seen. This commit generally calms down the leaseholder errors, but does not eliminate them entirely. Future investigation is still needed. It also does not seem to have any impact on RESTORE performance (in terms of the speed of the job).

Release note: None

**backupccl: rework split and scatter mechanism**

This change reworks how RESTORE splits and scatters in 2 ways:

1. This commit updates the split and scatter mechanism that restore uses to split at the key of the next chunk/entry rather than at the current chunk/entry, resolving a long-standing TODO. Previously, we would split at the start key of the span and then scatter that span. Consider a chunk with entries that we want to split and scatter: if we split at the start of each entry and scatter the entry we just split off (the right-hand side of the split), we'll be continuously scattering the suffix keyspace of the chunk. It's preferable to split out a prefix first (by splitting at the key of the next chunk) and scatter the prefix. We additionally refactor the split and scatter processor interface and stop scattering the individual entries, instead just splitting them (see the sketch after this message).

2. Individual entries are no longer scattered. They used to be scattered with the randomizeLeases option set to false, but there was no indication that this was beneficial to performance, so it was removed for simplicity.

Release note: None

64155: server: simplify SplitRangeWithExpiration r=erikgrinaker a=tbg

When executing a split, it's surprisingly tricky to learn what the resulting left-hand and right-hand sides are. This is because when you retrieve them "after the fact", other operations may have changed the split already (for example, the split could have been merged, or additional splits added), and while you would get descriptors back, they wouldn't be meaningfully connected to the split any more in all cases. Really one would want to return the descriptors from the split txn itself, but AdminSplit is a no-op when the split already exists, and so we would need special logic that performs a range lookup on the left neighbor. It could all be done, but does not seem worth it. There's still a nice simplification here that lets us remove the ad-hoc code, and we'll just accept that when there are concurrent splits the return values may not exactly line up with the split.

This came out of #64060 (comment).

cc @cockroachdb/kv

Release note: None

Co-authored-by: Paul Bardea <[email protected]>
Co-authored-by: Tobias Grieger <[email protected]>
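For illustration, a toy sketch of the reordering described in item 1 above (all types and callbacks here are hypothetical, not the backupccl split-and-scatter processor API): splitting at the *next* chunk's start key carves the current chunk out as a finished prefix, which is then scattered, while entries are only split.

```go
package main

import "fmt"

// chunk is a hypothetical group of import spans; startKey is where it begins
// and entries holds the start keys of the individual entries in the chunk.
type chunk struct {
	startKey string
	entries  []string
}

// splitAndScatter illustrates the reworked ordering: for chunk i we split at
// the start key of chunk i+1, which bounds chunk i on the right, and then
// scatter the range containing chunk i (a prefix). Entries inside the chunk
// are split at the NEXT entry's key and are not scattered at all. The old
// approach split at each entry's own start key and scattered the right-hand
// side, i.e. repeatedly scattered the remaining suffix of the keyspace.
func splitAndScatter(chunks []chunk, split func(key string), scatter func(key string)) {
	for i, c := range chunks {
		if i+1 < len(chunks) {
			next := chunks[i+1].startKey
			split(next)        // split at the key of the next chunk
			scatter(c.startKey) // scatter the prefix range containing this chunk
		}
		for j := range c.entries {
			if j+1 < len(c.entries) {
				split(c.entries[j+1]) // entries are only split, not scattered
			}
		}
	}
}

func main() {
	chunks := []chunk{
		{startKey: "a", entries: []string{"a", "c"}},
		{startKey: "f", entries: []string{"f", "h"}},
	}
	splitAndScatter(chunks,
		func(k string) { fmt.Println("split at", k) },
		func(k string) { fmt.Println("scatter the range containing", k) },
	)
}
```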