
storage: smarten logic to wait for split trigger #32782

Merged: 3 commits merged into cockroachdb:master from the fix/split-trigger-wait branch on Dec 14, 2018

Conversation

tbg
Member

@tbg tbg commented Dec 3, 2018

We previously "waited" for the split trigger quite literally, by
dropping outgoing rejecting MsgAppResps for a few seconds. The
downside of doing so was that it would also artificially delay Raft
snapshots that were actually required. With the introduction of a
mechanism that delays splits that would cause Raft snapshots, this is
less of a problem than before, but it's still ugly.

The revamped mechanism in this commit goes the extra mile to make
an informed decision on whether a split trigger is expected to apply:

1. when sending MsgApp to a follower that is being probed, send the
   start key along with the message,
2. when receiving said MsgApp on an uninitialized replica, check whether
   the store has a replica that currently owns the start key. This replica
   is the one in charge of applying the split trigger,
3. unless it's waiting for GC (so trigger a check for that).

There's also a time-based escape hatch to avoid delaying snapshots
indefinitely should there be a flaw in the above logic.

Release note: None
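
For concreteness, here is a minimal sketch of the receiver-side decision described above (steps 2 and 3 plus the escape hatch). All type and method names are stand-ins for illustration; the actual logic lives in maybeDropMsgApp in pkg/storage/split_trigger_helper.go.

    package main

    import (
        "fmt"
        "time"
    )

    // splitTriggerSource stands in for the store-side lookups the real helper
    // needs; the method names here are hypothetical.
    type splitTriggerSource interface {
        // ownerOfStartKey reports whether an initialized local replica currently
        // contains startKey, and whether that replica is merely waiting to be
        // garbage collected.
        ownerOfStartKey(startKey []byte) (found, pendingGC bool)
        // queueGCCheck asks the replica GC queue to look at that replica soon.
        queueGCCheck(startKey []byte)
    }

    // shouldDropMsgApp decides whether to drop an incoming MsgApp addressed to
    // an uninitialized replica, on the expectation that a local left-hand-side
    // neighbor will apply the split trigger and initialize it shortly.
    func shouldDropMsgApp(s splitTriggerSource, startKey []byte, waited, maxWait time.Duration) bool {
        if startKey == nil {
            return false // no hint attached; the leader isn't probing this replica
        }
        if waited > maxWait {
            return false // escape hatch: never delay a needed snapshot indefinitely
        }
        found, pendingGC := s.ownerOfStartKey(startKey)
        if !found {
            return false // nobody local will apply the split trigger; allow the snapshot path
        }
        if pendingGC {
            s.queueGCCheck(startKey) // stale predecessor: nudge GC, let the message through
            return false
        }
        return true // the local LHS will initialize this replica via its split trigger
    }

    type fakeStore struct{ hasOwner, pendingGC bool }

    func (f fakeStore) ownerOfStartKey([]byte) (found, pendingGC bool) { return f.hasOwner, f.pendingGC }
    func (f fakeStore) queueGCCheck([]byte)                            {}

    func main() {
        drop := shouldDropMsgApp(fakeStore{hasOwner: true}, []byte("a"), time.Second, 4*time.Second)
        fmt.Println(drop) // true: wait for the local split trigger instead of requesting a snapshot
    }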

@tbg tbg requested a review from a team December 3, 2018 12:19
@cockroach-teamcity
Member

This change is Reviewable

@tbg tbg force-pushed the fix/split-trigger-wait branch from 54fac73 to c3828b0 on December 3, 2018 12:27
@tbg tbg requested a review from nvanbenschoten December 3, 2018 14:21
@tbg tbg force-pushed the fix/split-trigger-wait branch from c3828b0 to a1d3052 on December 3, 2018 14:22
Member

@nvanbenschoten nvanbenschoten left a comment


Very happy to see this! With this extra information, we'll be able to make much more informed decisions on the follower side.

Reviewed 1 of 7 files at r1, 6 of 6 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5214 at r2 (raw file):

		_ = maybeDropMsgApp
		if status := r.raftStatusRLocked(); status != nil {
			if pr, ok := status.Progress[msg.To]; ok && pr.State == raft.ProgressStateProbe {

How often do we expect a follower to be in this state? I'd hate for this to blow up Raft message sizes in cases where we're not even going to use the key.

This does get me wondering about whether this is the right place to pass this information. I'm out of the loop with these changes so I'm probably off, but what happens if we add similar logic to ignore/reject snapshots? Is the idea that once we reject a MsgApp (and enter ProgressStateSnapshot) it's hard to get back to the right state without an eventual snapshot?


pkg/storage/replica.go, line 5218 at r2 (raw file):

					log.Fatalf(ctx, "msg.Context nonempty for MsgApp: %v", msg.Context)
				}
				msg.Context = r.descRLocked().StartKey

Should we make the Context a structured message so we can extend it later without compatibility concerns?


pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):

func maybeDropMsgApp(
	ctx context.Context,
	r interface {

This is pretty fragile. I would pull it into a named interface.


pkg/storage/split_trigger_helper.go, line 86 at r2 (raw file):

	// replica):
	//
	// sender  (leader)    [a--lhs--b)[b---rhs----c)

Nice diagram!


pkg/storage/split_trigger_helper.go, line 124 at r2 (raw file):

	// message, resulting in a snapshot, as intended.

	verbose := verboseRaftLoggingEnabled()

nit: this all seems a little verbose (unintended pun). You're only using eventf twice below, which doesn't really justify the cost of creating it.


pkg/storage/split_trigger_helper.go, line 145 at r2 (raw file):

		// the above analysis) in which a split trigger just isn't coming. If
		// there are, the idea is that we notice this log message and improve
		// the heuristics.

👍

Member Author

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5214 at r1 (raw file):

		// below for more context:
		_ = maybeDropMsgApp
		if status := r.raftStatusRLocked(); status != nil {

@nvanbenschoten I'm unhappy about the allocation here and I know you've disliked this kind of thing as well. On master I'm planning to solve this problem with a change in etcd/raft, though that likely won't be backportable (and I'd like to at least be able to consider this change for a backport, for we've backported its predecessor). How bad is this allocation in the meantime (and on 2.1)?

edit: noticed I hadn't actually sent this comment earlier.


pkg/storage/replica.go, line 5214 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How often do we expect a follower to be in this state? I'd hate for this to blow up Raft message sizes in cases where we're not even going to use the key.

This does get me wondering about whether this is the right place to pass this information. I'm out of the loop with these changes so I'm probably off, but what happens if we add similar logic to ignore/reject snapshots? Is the idea that once we reject a MsgApp (and enter ProgressStateSnapshot) it's hard to get back to the right state without an eventual snapshot?

Yeah, we don't want to enter the snapshot queue erroneously. It may be possible to get Raft out of that state cleanly (at least after appropriate upstream changes), but I want raftsnapshot.pending to be a flat line at some point. We'll send at most one MsgApp per heartbeat interval to a follower that is being probed, so perf shouldn't be a concern here. (in steady state, followers are never expected to be in that state, this only occurs around cluster restarts and replication changes and such).

I morally also want to send something along in MsgAppResp instead, but it's tricky. The uninitialized follower is really dumb and doesn't know anything about itself and the world, so in particular it has no idea where in the key space it should be. The leader (who is the right-hand-side) on the other hand doesn't know about the left-hand-side's bound (and in particular I've often seen its corresponding left-hand-side replica not be leader any more in a rebalancing unit test I have laying around in a WIP that attempted to go this route).

Let me know if you see something I missed. As far as I can tell, we'll need information to go from the leader to the follower and back (the "back" being the rejection in this PR, we could also send an explicit context if we find a reason for it).


pkg/storage/replica.go, line 5218 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we make the Context a structured message so we can extend it later without compatibility concerns?

That sounds like a good idea.

@tbg tbg force-pushed the fix/split-trigger-wait branch 2 times, most recently from df9ac85 to 5bcdb4b on December 4, 2018 14:48
Member Author

@tbg tbg left a comment


Updated, PTAL.

Dismissed @nvanbenschoten from 2 discussions.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5218 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

That sounds like a good idea.

Done, sort of. I actually took a step back and reconsidered where I was putting this information. The reason I chose the Context field is because it was there and so there wasn't any migration headache. However, it's a bit of a wart especially since it gets passed into Raft and we'd have to put protos inside protos. So what I've done instead is put it on RaftMessageRequest (as StartKey). Old versions receiving the hint simply won't use it, and since RaftMessageRequest doesn't inform any downstream-of-Raft-logic, there shouldn't be a problem.
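
Sketching the sender side as finally shaped in this thread (field and type names below are approximations, not the exact CockroachDB definitions): the hint rides on the Raft transport envelope and is only attached while the recipient is being probed, so at most roughly one extra key per follower per heartbeat interval crosses the wire.

    package splithint

    // Stand-ins for etcd/raft's progress state and CockroachDB's
    // RaftMessageRequest; only the shape of the check matters here.
    type progressState int

    const (
        stateProbe progressState = iota
        stateReplicate
        stateSnapshot
    )

    type raftMessageRequest struct {
        To       uint64
        IsMsgApp bool
        // RangeStartKey carries the sender's start key. Old versions simply
        // ignore the field, so there is no migration concern, and it never
        // feeds into downstream-of-Raft logic.
        RangeStartKey []byte
    }

    // maybeAttachStartKey mirrors the sender-side gating: attach the hint only
    // to MsgApps bound for a follower that is currently being probed.
    func maybeAttachStartKey(req *raftMessageRequest, progress map[uint64]progressState, startKey []byte) {
        if !req.IsMsgApp {
            return
        }
        if state, ok := progress[req.To]; ok && state == stateProbe {
            req.RangeStartKey = startKey
        }
    }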


pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is pretty fragile. I would pull it into a named interface.

I don't understand why it's fragile, but I understand that you didn't like the looks of it. Done.


pkg/storage/split_trigger_helper.go, line 124 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this all seems a little verbose (unintended pun). You're only using eventf twice below, which doesn't really justify the cost of creating it.

Yeah, there were more at some point. Done.

Member

@nvanbenschoten nvanbenschoten left a comment


Gave it another pass. One question I'm left with is how you expect this to interact with repeated chains of splits and merges. I think it should all be fine, but it's worth exploring that and writing down a comment somewhere.

Reviewed 2 of 6 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5214 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Yeah, we don't want to enter the snapshot queue erroneously. It may be possible to get Raft out of that state cleanly (at least after appropriate upstream changes), but I want raftsnapshot.pending to be a flat line at some point. We'll send at most one MsgApp per heartbeat interval to a follower that is being probed, so perf shouldn't be a concern here. (in steady state, followers are never expected to be in that state, this only occurs around cluster restarts and replication changes and such).

I morally also want to send something along in MsgAppResp instead, but it's tricky. The uninitialized follower is really dumb and doesn't know anything about itself and the world, so in particular it has no idea where in the key space it should be. The leader (who is the right-hand-side) on the other hand doesn't know about the left-hand-side's bound (and in particular I've often seen its corresponding left-hand-side replica not be leader any more in a rebalancing unit test I have laying around in a WIP that attempted to go this route).

Let me know if you see something I missed. As far as I can tell, we'll need information to go from the leader to the follower and back (the "back" being the rejection in this PR, we could also send an explicit context if we find a reason for it).

Thanks for the explanation. I'd leave a comment pointing at https://github.com/etcd-io/etcd/blob/7f450bf6967638673dd88fd4e730b01d1303d5ff/raft/progress.go#L41, which backs up what you said about a single message per heartbeat interval.


pkg/storage/replica.go, line 5218 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Done, sort of. I actually took a step back and reconsidered where I was putting this information. The reason I chose the Context field is because it was there and so there wasn't any migration headache. However, it's a bit of a wart especially since it gets passed into Raft and we'd have to put protos inside protos. So what I've done instead is put it on RaftMessageRequest (as StartKey). Old versions receiving the hint simply won't use it, and since RaftMessageRequest doesn't inform any downstream-of-Raft-logic, there shouldn't be a problem.

👍 that's much better.


pkg/storage/split_trigger_helper.go, line 58 at r2 (raw file):

Previously, tbg (Tobias Grieger) wrote…

I don't understand why it's fragile, but I understand that you didn't like the looks of it. Done.

Fragile in that we can't perform static implementation assertions and can't "talk" about the type outside of using it.
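
For illustration, this is roughly what a named interface buys here (the name and method set are illustrative, not necessarily what the PR landed on):

    package storage

    // msgAppDropper names the dependency that maybeDropMsgApp takes, instead
    // of an inline anonymous interface literal. Methods below are illustrative.
    type msgAppDropper interface {
        // IsInitialized reports whether the recipient replica already has a
        // nonempty descriptor (and thus no split trigger to wait for).
        IsInitialized() bool
        // StoreHasReplicaContaining reports whether some local replica
        // currently owns startKey and is expected to apply the split trigger.
        StoreHasReplicaContaining(startKey []byte) bool
    }

With a named type, a static implementation assertion such as `var _ msgAppDropper = (*Replica)(nil)` can live next to the implementing type, which is exactly the kind of "talking about the type" that an anonymous interface literal rules out.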


pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):

	// Run the cheapest check first. If the leader doesn't think this replica is
	// probing, it won't set msg.Context (the common case).
	if msg.Type != raftpb.MsgApp || startKey == nil {

nit: len(startKey) == 0 feels a little safer since we're pulling this out of a proto.

Member Author

@tbg tbg left a comment


Will add the comment tomorrow. I think I managed to address the rest.

Dismissed @nvanbenschoten from a discussion.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5214 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

@nvanbenschoten I'm unhappy about the allocation here and I know you've disliked this kind of thing as well. On master I'm planning to solve this problem with a change in etcd/raft, though that likely won't be backportable (and I'd like to at least be able to consider this change for a backport, for we've backported its predecessor). How bad is this allocation in the meantime (and on 2.1)?

edit: noticed I hadn't actually sent this comment earlier.

Ping @nvanbenschoten


pkg/storage/replica.go, line 5214 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Thanks for the explanation. I'd leave a comment pointing at https://github.com/etcd-io/etcd/blob/7f450bf6967638673dd88fd4e730b01d1303d5ff/raft/progress.go#L41, which backs up what you said about a single message per heartbeat interval.

Done.


pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: len(startKey) == 0 feels a little safer since we're pulling this out of a proto.

But if this is happening to the first range, the start key will intentionally be empty. Does TestProtoZeroNilSlice (added in this revision) address your concern?
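
A tiny illustration of the distinction the nil check relies on (assuming, as the new TestProtoZeroNilSlice presumably verifies, that an unset proto field unmarshals as a nil slice rather than an empty non-nil one):

    package main

    import "fmt"

    func main() {
        var noHint []byte          // field not set by the sender
        firstRangeHint := []byte{} // field set to r1's start key, which is empty

        // len() == 0 cannot tell these two cases apart; == nil can.
        fmt.Println(noHint == nil, len(noHint) == 0)                 // true true
        fmt.Println(firstRangeHint == nil, len(firstRangeHint) == 0) // false true
    }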

@tbg tbg force-pushed the fix/split-trigger-wait branch 2 times, most recently from 10c7f79 to 210a61a on December 5, 2018 09:49
@tbg
Member Author

tbg commented Dec 5, 2018

Ok, all set. PTAL.

Member

@nvanbenschoten nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5214 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

Ping @nvanbenschoten

I was thinking that we would only hit the allocation when we need to add the start key, but you're right that we'll hit it for every Raft message. That seems like a pretty big issue. RawNode.Status allocates at least three times and we regularly see it come up on allocation profiles. Calling it this frequently is likely to have a measurable performance impact. We would either need to prove that this is not the case or make that upstream change (RawNode.UnsafeStatus()?) before merging this one. I'd strongly lean towards the latter.

I'm not sure about the backport problem. Seems like the work we're doing here is much trickier than whatever accessor you add to etcd/raft, so I don't expect to see this backported. If it is though, I'd push to backport that change as well.


pkg/storage/split_trigger_helper.go, line 66 at r3 (raw file):

Previously, tbg (Tobias Grieger) wrote…

But if this is happening to the first range, the start key will intentionally be empty. Does TestProtoZeroNilSlice (added in this revision) address your concern?

Yep, thanks!

Member Author

@tbg tbg left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 5214 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I was thinking that we would only hit the allocation when we need to add the start key, but you're right that we'll hit it for every Raft message. That seems like a pretty big issue. RawNode.Status allocates at least three times and we regularly see it come up on allocation profiles. Calling it this frequently is likely to have a measurable performance impact. We would either need to prove that this is not the case or make that upstream change (RawNode.UnsafeStatus()?) before merging this one. I'd strongly lean towards the latter.

I'm not sure about the backport problem. Seems like the work we're doing here is much trickier than whatever accessor you add to etcd/raft, so I don't expect to see this backported. If it is though, I'd push to backport that change as well.

Ok, I sent a PR upstream: etcd-io/etcd#10309

Once this (or whatever its final form is) is in, I'll bump the dep and update this PR.

@nvanbenschoten
Member

Is this ready for another look?

@tbg
Member Author

tbg commented Dec 13, 2018

@nvanbenschoten yep

Member

@nvanbenschoten nvanbenschoten left a comment


:lgtm: thanks for following through with this. Two minor nits but otherwise ready to merge.

Reviewed 6 of 6 files at r4, 1 of 1 files at r5, 1 of 1 files at r6.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/raft.proto, line 73 at r4 (raw file):

  repeated RaftHeartbeat heartbeat_resps = 7 [(gogoproto.nullable) = false];

  // Optionally, the start key of the sending replica. This is only populated

nit (take it or leave it) now that I'm reading over this again: name this range_start_key and move it right below range_id.


pkg/storage/replica.go, line 315 at r6 (raw file):

	// aren't held up by this latch, which could result in reads on the RHS
	// executed through the LHS after this is valid. For more detail, see:
	// https://github.com/cockroachdb/cockroach/issues/32583

nit: period at end.

tbg added 3 commits December 14, 2018 16:58
It exposes a bug (likely in range merges), cockroachdb#32784.

Release note: None
@tbg tbg force-pushed the fix/split-trigger-wait branch from 7329f96 to 2e3af6b on December 14, 2018 16:01
Member Author

@tbg tbg left a comment


TFTR! Going to wait for TC to green this before I merge.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/raft.proto, line 73 at r4 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit (take it or leave it) now that I'm reading over this again: name this range_start_key and move it right below range_id.

No bother at all, done.

@tbg
Member Author

tbg commented Dec 14, 2018

bors r=nvanbenschoten

craig bot pushed a commit that referenced this pull request Dec 14, 2018
32782: storage: smarten logic to wait for split trigger r=nvanbenschoten a=tbg


Co-authored-by: Tobias Schottdorf <[email protected]>
@craig
Contributor

craig bot commented Dec 14, 2018

Build succeeded

@craig craig bot merged commit 2e3af6b into cockroachdb:master Dec 14, 2018
knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019
The admin split path must accommodate a scenario where a range with
not-yet-replicated followers is being manually split multiple
times (e.g. IMPORT during TPCC test fixtures). This scenario results in
a bunch of replicas that all need to be populated with
snapshots. To avoid backing up the raft snapshot queue, a heuristic
was put in place (cockroachdb#32594) to delay the admin split if there is another
snapshot being applied already.

As shown by investigation in a failing test, there is a mismatch
between the configured max delay for this heuristic (20s) and the
actual duration of the snapshot application: the latter is limited by
the max bandwidth for snapshots, 2 MB/s, resulting in 32s applications
in the worst case. We (Tobias and I) suspect that the heuristic thus
fails to wait long enough to have the protective effect it was
designed to provide.

The current patch increases this delay to about 50s, which exceeds this
snapshot application duration estimate.

Note that this scenario is not likely to occur now that cockroachdb#32782 has
been merged (this reduces the need for raft snapshots during splits);
however in release-2.1 where that patch was not applied, the scenario
is more likely.

Release note (bug fix): resolve a cluster degradation scenario that
could occur during IMPORT/RESTORE operations, manifested through a
high number of pending Raft snapshots.
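
For a rough sense of where the 32s estimate comes from (assuming a full range at the then-default size of roughly 64 MB):

    64 MB ÷ 2 MB/s = 32 s per snapshot application

so the previous 20s maximum delay could expire while a single snapshot was still being streamed.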
knz added a commit to knz/cockroach that referenced this pull request Jan 9, 2019
craig bot pushed a commit that referenced this pull request Jan 9, 2019
33573: roachtest: add MinVersion to scrub tests r=lucy-zhang a=lucy-zhang

The `scrub` tests have been passing on master, but failing on 2.1 because there
was a bug (fixed in #32908).

Release note: None

33582: storage: bump RaftDelaySplitToSuppressSnapshotTicks r=knz a=knz


Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@tbg tbg deleted the fix/split-trigger-wait branch March 13, 2019 11:51
tbg added a commit to tbg/cockroach that referenced this pull request Apr 22, 2021
Bursts of splits (i.e. a sequence of splits for which each split splits
the right-hand side of a prior split) can cause issues. This is because
splitting a range in which a replica needs a snapshot results in two
ranges in which a replica needs a snapshot where additionally there
needs to be a sequencing between the two snapshots (one cannot apply
a snapshot for the post-split replica until the pre-split replica has
moved out of the way). The result of a long burst of splits such as
occurring in RESTORE and IMPORT operations is then an overload of the
snapshot queue with lots of wasted work, unavailable followers with
operations hanging on them, and general mayhem in the logs. Since
bulk operations also write a lot of data to the raft logs, log
truncations then create an additional snapshot burden; in short,
everything will be unhappy for a few hours and the cluster may
effectively be unavailable.

This isn't news to us and in fact was a big problem "back in 2018".
When we first started to understand the issue, we introduced a mechanism
that would delay splits (cockroachdb#32594) with the desired effect of ensuring
that all followers had caught up to ~all of the previous splits.
This helped, but didn't explain why we were seeing snapshots in the
first place.

Investigating more, we realized that snapshots were sometimes spuriously
requested by an uninitialized replica on the right-hand side which was
contacted by another member of the right-hand side that had already been
initialized by the split executing on the left-hand side; this snapshot
was almost always unnecessary since the local left-hand side would
usually initialize the right-hand side moments later.  To address this,
in cockroachdb#32594 we started unconditionally dropping the first ~seconds worth
of requests to an uninitialized range, and the mechanism was improved in
 cockroachdb#32782 and will now only do this if a local neighboring replica is
expected to perform the split soon.

With all this in place, you would've expected us to have all bases
covered but it turns out that we are still running into issues prior
to this PR.

Concretely, whenever the aforementioned mechanism drops a message from
the leader (a MsgApp), the leader will only contact the replica every
second until it responds. It responds when it has been initialized via
its left neighbor's splits and the leader reaches out again, i.e.  an
average of ~500ms after being initialized. However, by that time, it is
itself already at the bottom of a long chain of splits, and the 500ms
delay is delaying how long it takes for the rest of the chain to get
initialized.  Since the delay compounds on each link of the chain, the
depth of the chain effectively determines the total delay experienced at
the end. This would eventually exceed the patience of the mechanism that
would suppress the snapshots, and snapshots would be requested. We would
descend into madness similar to that experienced in the absence of the
mechanism in the first place.

The mechanism in cockroachdb#32594 could have helped here, but unfortunately it
did not, as it routinely missed the fact that followers were not
initialized yet. This is because during a split burst, the replica
orchestrating the split was typically only created an instant before,
and its raft group hadn't properly transitioned to leader status yet.
This meant that in effect it wasn't delaying the splits at all.

This commit adjusts the logic to delay splits to avoid this problem.
While the replica is still clamoring for leadership, the delay is upheld.
Once it has collapsed into a definite state, the existing logic pretty much
did the right thing, as it waited for the right-hand side to be initialized.

Release note (bug fix): Fixed a scenario in which a rapid sequence
of splits could trigger a storm of Raft snapshots. This would be
accompanied by log messages of the form "would have dropped incoming
MsgApp, but allowing due to ..." and tended to occur as part of
RESTORE/IMPORT operations.
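
To make the compounding concrete with purely illustrative numbers: if each link in the chain is only re-contacted an average of ~500ms after it becomes initialized, a burst of, say, 100 chained splits leaves the replica at the tail waiting on the order of 50 seconds before the leader reaches it again, which is plenty to exhaust whatever patience the snapshot-suppression mechanism has.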
tbg added a commit to tbg/cockroach that referenced this pull request Apr 23, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Apr 23, 2021
tbg added a commit to tbg/cockroach that referenced this pull request Apr 24, 2021
craig bot pushed a commit that referenced this pull request Apr 24, 2021
64060: kvserver: fix delaying of splits with uninitialized followers r=erikgrinaker a=tbg

Closes #61396.

cc @cockroachdb/kv

Co-authored-by: Tobias Grieger <[email protected]>
tbg added a commit to tbg/cockroach that referenced this pull request Apr 26, 2021
tbg added a commit to tbg/cockroach that referenced this pull request May 20, 2021