storage: smarten logic to wait for split trigger #32782
Conversation
Force-pushed from 54fac73 to c3828b0
Force-pushed from c3828b0 to a1d3052
Very happy to see this! With this extra information, we'll be able to make much more informed decisions on the follower side.
Reviewed 1 of 7 files at r1, 6 of 6 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r2 (raw file):
_ = maybeDropMsgApp
if status := r.raftStatusRLocked(); status != nil {
	if pr, ok := status.Progress[msg.To]; ok && pr.State == raft.ProgressStateProbe {
How often do we expect a follower to be in this state? I'd hate for this to blow up Raft message sizes in cases where we're not even going to use the key.
This does get me wondering about whether this is the right place to pass this information. I'm out of the loop with these changes so I'm probably off, but what happens if we add similar logic to ignore/reject snapshots? Is the idea that once we reject a MsgApp (and enter ProgressStateSnapshot) it's hard to get back to the right state without an eventual snapshot?
pkg/storage/replica.go, line 5218 at r2 (raw file):
log.Fatalf(ctx, "msg.Context nonempty for MsgApp: %v", msg.Context)
}
msg.Context = r.descRLocked().StartKey
Should we make the Context a structured message so we can extend it later without compatibility concerns?
pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):
func maybeDropMsgApp(
	ctx context.Context, r interface {
This is pretty fragile. I would pull it into a named interface.
pkg/storage/split_trigger_helper.go, line 86 at r2 (raw file):
// replica):
//
// sender (leader) [a--lhs--b)[b---rhs----c)
Nice diagram!
pkg/storage/split_trigger_helper.go, line 124 at r2 (raw file):
// message, resulting in a snapshot, as intended.
verbose := verboseRaftLoggingEnabled()
nit: this all seems a little verbose (unintended pun). You're only using eventf twice below, which doesn't really justify the cost of creating it.
pkg/storage/split_trigger_helper.go, line 145 at r2 (raw file):
// the above analysis) in which a split trigger just isn't coming. If
// there are, the idea is that we notice this log message and improve
// the heuristics.
👍
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
// below for more context:
_ = maybeDropMsgApp
if status := r.raftStatusRLocked(); status != nil {
@nvanbenschoten I'm unhappy about the allocation here and I know you've disliked this kind of thing as well. On master I'm planning to solve this problem with a change in etcd/raft, though that likely won't be backportable (and I'd like to at least be able to consider this change for a backport, for we've backported its predecessor). How bad is this allocation in the meantime (and on 2.1)?
edit: noticed I hadn't actually sent this comment earlier.
pkg/storage/replica.go, line 5214 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
How often do we expect a follower to be in this state? I'd hate for this to blow up Raft message sizes in cases where we're not even going to use the key.
This does get me wondering about whether this is the right place to pass this information. I'm out of the loop with these changes so I'm probably off, but what happens if we add similar logic to ignore/reject snapshots? Is the idea that once we reject a MsgApp (and enter ProgressStateSnapshot) it's hard to get back to the right state without an eventual snapshot?
Yeah, we don't want to enter the snapshot queue erroneously. It may be possible to get Raft out of that state cleanly (at least after appropriate upstream changes), but I want raftsnapshot.pending to be a flat line at some point. We'll send at most one MsgApp per heartbeat interval to a follower that is being probed, so perf shouldn't be a concern here. (In steady state, followers are never expected to be in that state; this only occurs around cluster restarts, replication changes, and such.)
I morally also want to send something along in MsgAppResp instead, but it's tricky. The uninitialized follower is really dumb and doesn't know anything about itself and the world, so in particular it has no idea where in the key space it should be. The leader (who is the right-hand side), on the other hand, doesn't know about the left-hand side's bounds (and in particular I've often seen its corresponding left-hand-side replica not be the leader any more in a rebalancing unit test I have lying around in a WIP that attempted to go this route).
Let me know if you see something I missed. As far as I can tell, we'll need information to go from the leader to the follower and back (the "back" being the rejection in this PR; we could also send an explicit context if we find a reason for it).
pkg/storage/replica.go, line 5218 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Should we make the Context a structured message so we can extend it later without compatibility concerns?
That sounds like a good idea.
Force-pushed from df9ac85 to 5bcdb4b
Updated, PTAL.
Dismissed @nvanbenschoten from 2 discussions.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5218 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
That sounds like a good idea.
Done, sort of. I actually took a step back and reconsidered where I was putting this information. The reason I chose the Context field is because it was there and so there wasn't any migration headache. However, it's a bit of a wart, especially since it gets passed into Raft and we'd have to put protos inside protos. So what I've done instead is put it on RaftMessageRequest (as StartKey). Old versions receiving the hint simply won't use it, and since RaftMessageRequest doesn't inform any downstream-of-Raft logic, there shouldn't be a problem.
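For orientation, here is a rough sketch of the sender-side shape this implies. The RangeStartKey field name follows the rename adopted later in this review, and the helper signatures are assumptions for illustration, not the actual diff.

```go
// Sketch only: attach the start-key hint on the envelope (RaftMessageRequest)
// rather than inside the raft message. Treat field and helper names as
// assumptions drawn from the review discussion.
func (r *Replica) maybeAttachSplitHint(req *RaftMessageRequest, msg raftpb.Message) {
	if msg.Type != raftpb.MsgApp {
		return
	}
	// Only pay for the hint when the follower is being probed, i.e. when the
	// leader would otherwise conclude that a snapshot is needed.
	if status := r.raftStatusRLocked(); status != nil {
		if pr, ok := status.Progress[msg.To]; ok && pr.State == raft.ProgressStateProbe {
			// Since the hint rides on the envelope and never feeds
			// downstream-of-Raft logic, old versions can simply ignore it.
			req.RangeStartKey = r.descRLocked().StartKey
		}
	}
}
```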
pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is pretty fragile. I would pull it into a named interface.
I don't understand why it's fragile, but I understand that you didn't like the looks of it. Done.
pkg/storage/split_trigger_helper.go, line 124 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: this all seems a little verbose (unintended pun). You're only using eventf twice below, which doesn't really justify the cost of creating it.
Yeah, there were more at some point. Done.
Gave it another pass. One question I'm left with is how you expect this to interact with repeated chains of splits and merges. I think it should all be fine, but it's worth exploring that and writing down a comment somewhere.
Reviewed 2 of 6 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Yeah, we don't want to enter the snapshot queue erroneously. It may be possible to get Raft out of that state cleanly (at least after appropriate upstream changes), but I want raftsnapshot.pending to be a flat line at some point. We'll send at most one MsgApp per heartbeat interval to a follower that is being probed, so perf shouldn't be a concern here. (In steady state, followers are never expected to be in that state; this only occurs around cluster restarts, replication changes, and such.) I morally also want to send something along in MsgAppResp instead, but it's tricky. The uninitialized follower is really dumb and doesn't know anything about itself and the world, so in particular it has no idea where in the key space it should be. The leader (who is the right-hand side), on the other hand, doesn't know about the left-hand side's bounds (and in particular I've often seen its corresponding left-hand-side replica not be the leader any more in a rebalancing unit test I have lying around in a WIP that attempted to go this route).
Let me know if you see something I missed. As far as I can tell, we'll need information to go from the leader to the follower and back (the "back" being the rejection in this PR; we could also send an explicit context if we find a reason for it).
Thanks for the explanation. I'd leave a comment pointing at https://github.com/etcd-io/etcd/blob/7f450bf6967638673dd88fd4e730b01d1303d5ff/raft/progress.go#L41, which backs up what you said about a single message per heartbeat interval.
pkg/storage/replica.go, line 5218 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Done, sort of. I actually took a step back and reconsidered where I was putting this information. The reason I chose the Context field is because it was there and so there wasn't any migration headache. However, it's a bit of a wart, especially since it gets passed into Raft and we'd have to put protos inside protos. So what I've done instead is put it on RaftMessageRequest (as StartKey). Old versions receiving the hint simply won't use it, and since RaftMessageRequest doesn't inform any downstream-of-Raft logic, there shouldn't be a problem.
👍 that's much better.
pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I don't understand why it's fragile, but I understand that you didn't like the looks of it. Done.
Fragile in that we can't perform static implementation assertions and can't "talk" about the type outside of using it.
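For readers less familiar with the Go idiom being referenced here: a named interface can be documented, reused in tests, and checked against at compile time. A minimal, self-contained illustration (all names below are hypothetical, not the ones used in this PR):

```go
package main

import "fmt"

// A named interface we can "talk about" outside of the call site.
type splitHintCarrier interface {
	StartKey() []byte
}

type fakeReplica struct{ start []byte }

func (r fakeReplica) StartKey() []byte { return r.start }

// Static implementation assertion: compilation fails if fakeReplica ever
// stops satisfying splitHintCarrier.
var _ splitHintCarrier = fakeReplica{}

func main() {
	var c splitHintCarrier = fakeReplica{start: []byte("b")}
	fmt.Printf("start key: %q\n", c.StartKey())
}
```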
pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):
// Run the cheapest check first. If the leader doesn't think this replica is
// probing, it won't set msg.Context (the common case).
if msg.Type != raftpb.MsgApp || startKey == nil {
nit: len(startKey) == 0 feels a little safer since we're pulling this out of a proto.
Will add the comment tomorrow. I think I managed to address the rest.
Dismissed @nvanbenschoten from a discussion.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
@nvanbenschoten I'm unhappy about the allocation here and I know you've disliked this kind of thing as well. On master I'm planning to solve this problem with a change in etcd/raft, though that likely won't be backportable (and I'd like to at least be able to consider this change for a backport, for we've backported its predecessor). How bad is this allocation in the meantime (and on 2.1)?
edit: noticed I hadn't actually sent this comment earlier.
Ping @nvanbenschoten
pkg/storage/replica.go, line 5214 at r2 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
Thanks for the explanation. I'd leave a comment pointing at https://github.com/etcd-io/etcd/blob/7f450bf6967638673dd88fd4e730b01d1303d5ff/raft/progress.go#L41, which backs up what you said about a single message per heartbeat interval.
Done.
pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit: len(startKey) == 0 feels a little safer since we're pulling this out of a proto.
But if this is happening to the first range, the start key will intentionally be empty. Does TestProtoZeroNilSlice (added in this revision) address your concern?
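As a side note on the nil-versus-empty distinction being relied on here, a generic, self-contained illustration (not a claim about what TestProtoZeroNilSlice checks):

```go
package main

import "fmt"

func main() {
	var nilKey []byte    // hint not set at all
	emptyKey := []byte{} // hint set, but legitimately empty (the first range)

	// Both have length zero, so a len(...) == 0 check cannot tell them apart,
	// while a == nil check can.
	fmt.Println(len(nilKey) == 0, nilKey == nil)     // true true
	fmt.Println(len(emptyKey) == 0, emptyKey == nil) // true false
}
```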
Force-pushed from 10c7f79 to 210a61a
Ok, all set. PTAL.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ping @nvanbenschoten
I was thinking that we would only hit the allocation when we need to add the start key, but you're right that we'll hit it for every Raft message. That seems like a pretty big issue. RawNode.Status allocates at least three times and we regularly see it come up on allocation profiles. Calling it this frequently is likely to have a measurable performance impact. We would either need to prove that this is not the case or make that upstream change (RawNode.UnsafeStatus()?) before merging this one. I'd strongly lean towards the latter.
I'm not sure about the backport problem. Seems like the work we're doing here is much more tricky than whatever accessor you add to etcd/raft, so I don't expect to see this backported. If it is though, I'd push to backport that change as well.
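For what it's worth, "prove that this is not the case" can be checked cheaply with testing.AllocsPerRun. The sketch below is generic and self-contained; the status construction is a stand-in that merely mimics building a struct plus a map per call, not etcd/raft's actual Status code.

```go
package main

import (
	"fmt"
	"testing"
)

// Stand-in for a Status-like value whose construction allocates on every call.
type fakeStatus struct {
	progress map[uint64]uint64
}

func buildStatus() *fakeStatus {
	return &fakeStatus{progress: map[uint64]uint64{1: 10, 2: 10, 3: 9}}
}

func main() {
	// AllocsPerRun reports the average number of heap allocations per call.
	allocs := testing.AllocsPerRun(10000, func() {
		_ = buildStatus()
	})
	fmt.Printf("allocations per call: %.1f\n", allocs)
}
```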
pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
But if this is happening to the first range, the start key will intentionally be empty. Does TestProtoZeroNilSlice (added in this revision) address your concern?
Yep, thanks!
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/replica.go, line 5214 at r1 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
I was thinking that we would only hit the allocation when we need to add the start key, but you're right that we'll hit it for every Raft message. That seems like a pretty big issue. RawNode.Status allocates at least three times and we regularly see it come up on allocation profiles. Calling it this frequently is likely to have a measurable performance impact. We would either need to prove that this is not the case or make that upstream change (RawNode.UnsafeStatus()?) before merging this one. I'd strongly lean towards the latter.
I'm not sure about the backport problem. Seems like the work we're doing here is much more tricky than whatever accessor you add to etcd/raft, so I don't expect to see this backported. If it is though, I'd push to backport that change as well.
Ok, I sent a PR upstream: etcd-io/etcd#10309
Once this (or whatever its final form is) is in, I'll bump the dep and update this PR.
Force-pushed from 210a61a to 71e038b
Is this ready for another look?
@nvanbenschoten yep
Thanks for following through with this. Two minor nits, but otherwise ready to merge.
Reviewed 6 of 6 files at r4, 1 of 1 files at r5, 1 of 1 files at r6.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/raft.proto, line 73 at r4 (raw file):
repeated RaftHeartbeat heartbeat_resps = 7 [(gogoproto.nullable) = false];

// Optionally, the start key of the sending replica. This is only populated
nit (take it or leave it) now that I'm reading over this again: name this range_start_key and move it right below range_id.
pkg/storage/replica.go, line 315 at r6 (raw file):
// aren't held up by this latch, which could result in reads on the RHS
// executed through the LHS after this is valid. For more detail, see:
// https://github.com/cockroachdb/cockroach/issues/32583
nit: period at end.
We previously "waited" for the split trigger quite literally, by dropping an outgoing rejecting MsgAppResp for a few seconds. The downside of doing so was that it would also artificially delay Raft snapshots that were actually required. With the introduction of a mechanism that delays splits that would cause Raft snapshots this is a lesser problem than before, but it's still ugly.

The revamped mechanism in this commit goes the extra mile to make an informed decision on whether a split trigger is expected to apply:

1. when sending MsgApp to a follower that is being probed, send the start key along with the message
2. when receiving said MsgApp on an uninitialized replica, check whether the store has a replica that currently owns the start key. This replica is the one in charge of applying the split trigger,
3. unless it's waiting for GC (so trigger a check for that).

There's also a time-based escape hatch to avoid delaying snapshots indefinitely should there be a flaw in the above logic.

Release note: None
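To make steps 1-3 of the commit message concrete, here is a rough sketch of the receive-side decision. The helper names (the replica lookup, the GC check, the deadline handling) are assumptions for illustration and this is not the actual maybeDropMsgApp code.

```go
// Sketch of the drop/allow decision for an incoming MsgApp arriving at an
// uninitialized replica. All helpers here are hypothetical simplifications.
func shouldDropMsgApp(
	msg raftpb.Message, startKey roachpb.RKey, store *Store, deadline time.Time,
) bool {
	// (1) The leader only attaches startKey to MsgApps sent to a follower it
	// is probing; without the hint there is nothing to wait for.
	if msg.Type != raftpb.MsgApp || startKey == nil {
		return false
	}
	// Time-based escape hatch: never hold up a genuinely needed snapshot
	// indefinitely.
	if timeutil.Now().After(deadline) {
		return false
	}
	// (2) Look for a local replica that currently owns startKey; if it exists,
	// it is the left-hand side that will execute the split trigger and
	// initialize this replica.
	lhs := store.LookupReplica(startKey) // hypothetical lookup by key
	if lhs == nil {
		return false
	}
	// (3) ...unless that replica is only waiting for GC, in which case the
	// split trigger will never come; kick off a GC check (not shown) and let
	// the message through so a snapshot can be requested.
	if lhsAwaitingGC(lhs) { // hypothetical check
		return false
	}
	// Otherwise, drop the MsgApp and wait for the split trigger instead of
	// sending a rejection that would lead to a Raft snapshot.
	return true
}
```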
It exposes a bug (likely in range merges), cockroachdb#32784. Release note: None
Release note: None
Force-pushed from 7329f96 to 2e3af6b
TFTR! Going to wait for TC to green this before I merge.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/storage/raft.proto, line 73 at r4 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
nit (take it or leave it) now that I'm reading over this again: name this range_start_key and move it right below range_id.
No bother at all, done.
bors r=nvanbenschoten
32782: storage: smarten logic to wait for split trigger r=nvanbenschoten a=tbg

We previously "waited" for the split trigger quite literally, by dropping an outgoing rejecting MsgAppResp for a few seconds. The downside of doing so was that it would also artificially delay Raft snapshots that were actually required. With the introduction of a mechanism that delays splits that would cause Raft snapshots this is a lesser problem than before, but it's still ugly.

The revamped mechanism in this commit goes the extra mile to make an informed decision on whether a split trigger is expected to apply:

1. when sending MsgApp to a follower that is being probed, send the start key along with the message
2. when receiving said MsgApp on an uninitialized replica, check whether the store has a replica that currently owns the start key. This replica is the one in charge of applying the split trigger,
3. unless it's waiting for GC (so trigger a check for that).

There's also a time-based escape hatch to avoid delaying snapshots indefinitely should there be a flaw in the above logic.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
Build succeeded
The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (e.g. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (cockroachdb#32594) to delay the admin split if there is another snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s, resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate, to about 50s.

Note that this scenario is not likely to occur now that cockroachdb#32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1, where that patch was not applied, the scenario is more likely.

Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.
33573: roachtest: add MinVersion to scrub tests r=lucy-zhang a=lucy-zhang

The `scrub` tests have been passing on master, but failing on 2.1 because there was a bug (fixed in #32908).

Release note: None

33582: storage: bump RaftDelaySplitToSuppressSnapshotTicks r=knz a=knz

The admin split path must accommodate a scenario where a range with not-yet-replicated followers is being manually split multiple times (e.g. IMPORT during TPCC test fixtures). This scenario results in a bunch of replicas that all need to be populated with snapshots. To avoid backing up the raft snapshot queue, a heuristic was put in place (#32594) to delay the admin split if there is another snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch between the configured max delay for this heuristic (20s) and the actual duration of the snapshot application - the latter is limited by the max bandwidth for snapshots, 2 MB/s, resulting in 32s applications in the worst case. We (Tobias and I) suspect that the heuristic thus fails to wait long enough to have the protective effect it was designed to provide. The current patch increases this delay to exceed this snapshot application duration estimate, to about 50s.

Note that this scenario is not likely to occur now that #32782 has been merged (this reduces the need for raft snapshots during splits); however in release-2.1, where that patch was not applied, the scenario is more likely.

Release note (bug fix): resolve a cluster degradation scenario that could occur during IMPORT/RESTORE operations, manifested through a high number of pending Raft snapshots.

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot, where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (cockroachdb#32594) with the desired effect of ensuring that all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in cockroachdb#32782 and will now only do this if a local neighboring replica is expected to perform the split soon.

With all this in place, you would've expected us to have all bases covered, but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.
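(To put rough, purely illustrative numbers on the compounding, using only the ~500ms average figure above: a chain of n splits accumulates on the order of n x 500ms of delay before its tail is initialized, so a burst of 40 splits is already roughly 20s of accumulated delay. This is back-of-the-envelope arithmetic, not a measurement from this change.)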
64060: kvserver: fix delaying of splits with uninitialized followers r=erikgrinaker a=tbg

Bursts of splits (i.e. a sequence of splits for which each split splits the right-hand side of a prior split) can cause issues. This is because splitting a range in which a replica needs a snapshot results in two ranges in which a replica needs a snapshot, where additionally there needs to be a sequencing between the two snapshots (one cannot apply a snapshot for the post-split replica until the pre-split replica has moved out of the way). The result of a long burst of splits such as occurring in RESTORE and IMPORT operations is then an overload of the snapshot queue with lots of wasted work, unavailable followers with operations hanging on them, and general mayhem in the logs. Since bulk operations also write a lot of data to the raft logs, log truncations then create an additional snapshot burden; in short, everything will be unhappy for a few hours and the cluster may effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018". When we first started to understand the issue, we introduced a mechanism that would delay splits (#32594) with the desired effect of ensuring that all followers had caught up to ~all of the previous splits. This helped, but didn't explain why we were seeing snapshots in the first place. Investigating more, we realized that snapshots were sometimes spuriously requested by an uninitialized replica on the right-hand side which was contacted by another member of the right-hand side that had already been initialized by the split executing on the left-hand side; this snapshot was almost always unnecessary since the local left-hand side would usually initialize the right-hand side moments later. To address this, in #32594 we started unconditionally dropping the first ~seconds worth of requests to an uninitialized range, and the mechanism was improved in #32782 and will now only do this if a local neighboring replica is expected to perform the split soon.

With all this in place, you would've expected us to have all bases covered, but it turns out that we are still running into issues prior to this PR. Concretely, whenever the aforementioned mechanism drops a message from the leader (a MsgApp), the leader will only contact the replica every second until it responds. It responds when it has been initialized via its left neighbor's splits and the leader reaches out again, i.e. an average of ~500ms after being initialized. However, by that time, it is itself already at the bottom of a long chain of splits, and the 500ms delay is delaying how long it takes for the rest of the chain to get initialized. Since the delay compounds on each link of the chain, the depth of the chain effectively determines the total delay experienced at the end. This would eventually exceed the patience of the mechanism that would suppress the snapshots, and snapshots would be requested. We would descend into madness similar to that experienced in the absence of the mechanism in the first place.

The mechanism in #32594 could have helped here, but unfortunately it did not, as it routinely missed the fact that followers were not initialized yet. This is because during a split burst, the replica orchestrating the split was typically only created an instant before, and its raft group hadn't properly transitioned to leader status yet. This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem. While clamoring for leadership, the delay is upheld. Once collapsed into a definite state, the existing logic pretty much did the right thing, as it waited for the right-hand side to be initialized.

Closes #61396.

cc @cockroachdb/kv

Release note (bug fix): Fixed a scenario in which a rapid sequence of splits could trigger a storm of Raft snapshots. This would be accompanied by log messages of the form "would have dropped incoming MsgApp, but allowing due to ..." and tended to occur as part of RESTORE/IMPORT operations.

Co-authored-by: Tobias Grieger <[email protected]>
We previously "waited" for the split trigger quite literally, by
dropping an outgoing rejecting MsgAppResp for a few seconds. The
downside of doing so was that it would also artificially delay Raft
snapshots that were actually required. With the introduction of a
mechanism that delays splits that would cause Raft snapshots this is a
lesser problem than before, but it's still ugly.

The revamped mechanism in this commit goes the extra mile to make
an informed decision on whether a split trigger is expected to apply:

1. when sending MsgApp to a follower that is being probed, send the
   start key along with the message
2. when receiving said MsgApp on an uninitialized replica, check whether
   the store has a replica that currently owns the start key. This replica
   is the one in charge of applying the split trigger,
3. unless it's waiting for GC (so trigger a check for that).

There's also a time-based escape hatch to avoid delaying snapshots
indefinitely should there be a flaw in the above logic.

Release note: None