
storage: quiesce ranges which have non-live replicas #26911

Merged: 1 commit merged into cockroachdb:master from quiesce-non-live on Jul 13, 2018

Conversation

@spencerkimball (Member)

Previously all replicas had to be completely up to date in order to
quiesce ranges. This made the loss of a node in a cluster with many
ranges an expensive proposition, as a significant number of ranges
could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the NodeLiveness
object on every Raft ticker loop and then passes that to
Replica.tick() to allow the leader to disregard non-live nodes
when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters
with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires #26908 to function properly

Fixes #9446
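As a rough illustration of the approach described above (this is a minimal sketch, not the actual CockroachDB code; the types and names are simplified stand-ins), the leader's quiesce decision ignores lagging followers whose nodes are known to be non-live, and falls back to the old conservative behavior when no liveness data is available:

```go
package main

import "fmt"

type NodeID int
type ReplicaID int

type replicaDesc struct {
	NodeID    NodeID
	ReplicaID ReplicaID
}

// shouldQuiesce returns true if every *live* follower is caught up to
// the leader's applied index. A nil livenessMap means liveness data is
// unavailable, so any lagging follower blocks quiescence (the old,
// conservative behavior).
func shouldQuiesce(descs []replicaDesc, match map[ReplicaID]uint64,
	applied uint64, livenessMap map[NodeID]bool) bool {
	for _, rep := range descs {
		m, ok := match[rep.ReplicaID]
		if !ok || m != applied {
			// The follower is behind; ignore it only if its node is non-live.
			if livenessMap != nil && !livenessMap[rep.NodeID] {
				continue
			}
			return false
		}
	}
	return true
}

func main() {
	descs := []replicaDesc{{1, 1}, {2, 2}, {3, 3}}
	match := map[ReplicaID]uint64{1: 10, 2: 10, 3: 7} // replica 3 is behind
	live := map[NodeID]bool{1: true, 2: true, 3: false}

	fmt.Println(shouldQuiesce(descs, match, 10, live)) // true: the lagging replica's node is dead
	fmt.Println(shouldQuiesce(descs, match, 10, nil))  // false: without liveness data, stay awake
}
```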

@cockroach-teamcity (Member)

This change is Reviewable

@bdarnell (Contributor)

LGTM once #26908 is resolved.


Reviewed 3 of 3 files at r1.
Review status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 3998 at r1 (raw file):

// correctness issues.
//
// TODO(peter): When a node goes down, any range which has a replica on the

Replace this comment with a discussion of the new feature and the livenessMap argument.


pkg/storage/replica.go, line 4101 at r1 (raw file):

		return nil, false
	}
	var foundSelf bool

I don't think the new code that matches the range descriptor's replica map against the raft status can replace the foundSelf check (I can't quite remember what that was for, but it's weird enough that it must have fixed a real bug at some point)


Comments from Reviewable

@spencerkimball force-pushed the quiesce-non-live branch 2 times, most recently from e0cabfe to 10ee80c on July 11, 2018 22:13
@spencerkimball (Member Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/replica.go, line 3998 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Replace this comment with a discussion of the new feature and the livenessMap argument.

Done.


pkg/storage/replica.go, line 4101 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I don't think the new code that matches the range descriptor's replica map against the raft status can replace the foundSelf check (I can't quite remember what that was for, but it's weird enough that it must have fixed a real bug at some point)

I was assuming that because we iterate over the range descriptor's replicas, our own ID must be present in the raft status, but that was a shaky assumption, so I've restored the foundSelf check. Changed.
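For context, a tiny illustrative sketch (a hypothetical helper, not the PR's code) of what the restored foundSelf check guards against: the leader should only quiesce if its own replica ID actually shows up in the Raft status' progress map.

```go
package main

import "fmt"

type ReplicaID int

// progressContainsSelf sketches the intent of the foundSelf check: the
// leader only considers quiescing if its own replica ID appears in the
// Raft status' progress map.
func progressContainsSelf(progress map[ReplicaID]uint64, self ReplicaID) bool {
	_, ok := progress[self]
	return ok
}

func main() {
	progress := map[ReplicaID]uint64{1: 10, 2: 10}
	fmt.Println(progressContainsSelf(progress, 1)) // true: safe to consider quiescing
	fmt.Println(progressContainsSelf(progress, 3)) // false: decline to quiesce
}
```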

@bdarnell (Contributor) left a comment

:lgtm:

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@nvanbenschoten (Member) left a comment

:lgtm:

Reviewed 3 of 3 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)


pkg/storage/replica.go, line 4184 at r2 (raw file):

// node liveness instance invokes a callback which causes all nodes to
// wake up any ranges containing replicas owned by the newly-live
// node, allowing the out-of-date replicas to be brought back up to date.

Mention what it means for the map to be nil.


pkg/storage/replica.go, line 4311 at r2 (raw file):

		}
		if progress, ok := status.Progress[uint64(rep.ReplicaID)]; !ok {
			if log.V(4) {

Add a comment describing when this can happen.


pkg/storage/store.go, line 3679 at r2 (raw file):

}

// nodeIsLiveCallback is invoked when a node transitions from non-live

This is naturally going to be racy, right? It's possible that a node comes back to life right after the liveness map has been .Stored but before the corresponding range has processed the tick. This could create scenarios where the node is already unquiesced, this callback is called and is a no-op, and then replica.maybeQuiesceLocked is called and quiesces even with the newly live node out of date.

This should all be fine I think because the newly live node will eventually force the range to unquiesce, but I would like to see the race explicitly called out and rationalized here.

@spencerkimball (Member Author) left a comment

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/storage/replica.go, line 4184 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Mention what it means for the map to be nil.

Done.


pkg/storage/replica.go, line 4311 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment describing when this can happen.

I don't know of any reason this could happen (@bdarnell?). I took this step because while these two objects should mirror each other, they are separately maintained.


pkg/storage/store.go, line 3679 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is naturally going to be racy, right? It's possible that a node comes back to life right after the liveness map has been .Stored but before the corresponding range has processed the tick. This could create scenarios where the node is already unquiesced, this callback is called and is a no-op, and then replica.maybeQuiesceLocked is called and quiesces even with the newly live node out of date.

This should all be fine I think because the newly live node will eventually force the range to unquiesce, but I would like to see the race explicitly called out and rationalized here.

Done, and I've added code to make that race significantly less likely by updating the liveness map first thing in the callback.
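For illustration only (the real Store uses different names and plumbing), a toy version of the ordering described above: the callback stores the fresh liveness map before waking ranges, so a tick running concurrently already sees the node as live and declines to quiesce with the newly-live replica still behind.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

type NodeID int

type replica struct{ quiesced bool }

func (r *replica) unquiesce() { r.quiesced = false }

// store is a toy stand-in for the real Store type.
type store struct {
	livenessMap atomic.Value // holds a map[NodeID]bool, read by the tick loop
	mu          sync.Mutex
	byNode      map[NodeID][]*replica // replicas indexed by the node hosting one of their followers
}

// nodeIsLiveCallback mirrors the ordering described above: store the
// fresh liveness map *before* unquiescing, shrinking the window in
// which a concurrent tick could still see the node as dead.
func (s *store) nodeIsLiveCallback(nodeID NodeID, fresh map[NodeID]bool) {
	s.livenessMap.Store(fresh) // update the map first thing in the callback

	s.mu.Lock()
	defer s.mu.Unlock()
	for _, r := range s.byNode[nodeID] {
		r.unquiesce()
	}
}

func main() {
	s := &store{byNode: map[NodeID][]*replica{3: {{quiesced: true}}}}
	s.nodeIsLiveCallback(3, map[NodeID]bool{1: true, 2: true, 3: true})
	fmt.Println(s.byNode[3][0].quiesced) // false: the range was woken up
}
```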

@bdarnell (Contributor) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica.go, line 4311 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I don't know of any reason this could happen (@bdarnell?). I took this step because while these two objects should mirror each other, they are separately maintained.

It should never happen (except for the race between the two updates, but that should never happen concurrently with this code). Still good to be defensive here.

@spencerkimball (Member Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/replica.go, line 4311 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

It should never happen (except for the race between the two updates, but that should never happen concurrently with this code). Still good to be defensive here.

OK

// node, allowing the out-of-date replicas to be brought back up to date.
// If livenessMap is nil, liveness data will not be used, meaning no range
// will quiesce if any replicas are behind, whether or not they are live.
//
A reviewer (Member) commented:

Should mention that when the map is not nil, absence of an entry is taken as non-live (?).

The author (Member Author) replied:

Done.

	} else if progress.Match != status.Applied {
		// Skip any node in the descriptor which is not live.
		if livenessMap != nil && !livenessMap[rep.NodeID] {
			log.Infof(ctx, "skipping node %d because not live. Progress=%+v",
A reviewer (Member) commented:

Maybe I'm missing something, but wouldn't you also want to not send a message to that follower?

The author (Member Author) replied:

I don't understand what you're suggesting. What message are you referring to that we shouldn't send to the follower, and what does this bit of code have to do with the message?

But thanks for pointing to this snippet of code, because I realize the log.Infof needs to be inside an if log.V(4) block.
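For reference, a hedged sketch of that fix, assuming CockroachDB's util/log package (`log.V` / `log.Infof`); the helper and package names here are hypothetical, and the actual change simply wraps the existing call site in Replica.maybeQuiesceLocked:

```go
package sketch

import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/util/log"
)

// logSkippedNonLiveNode gates the message behind verbosity level 4 so
// it only appears when verbose logging is requested.
func logSkippedNonLiveNode(ctx context.Context, nodeID int32, progress interface{}) {
	if log.V(4) {
		log.Infof(ctx, "skipping node %d because not live. Progress=%+v", nodeID, progress)
	}
}
```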

@spencerkimball (Member Author)

bors r+

craig bot pushed a commit that referenced this pull request Jul 13, 2018
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball

Co-authored-by: Spencer Kimball <[email protected]>
@craig (Contributor)

craig bot commented Jul 13, 2018

Build succeeded

craig bot merged commit c47ace3 into cockroachdb:master on Jul 13, 2018
@spencerkimball deleted the quiesce-non-live branch on July 13, 2018 18:53
tbg added a commit to tbg/cockroach that referenced this pull request Jul 16, 2018
PR cockroachdb#26911 added a mechanism which during quiescence decisions ignored
nodes which (at some recent point in time) seemed non-live.
Unfortunately, the corresponding mechanism introduced to unquiesce when
the node becomes live is quite racy, at least when tormented with the
wild timings we throw at it in unit testing.  The basic discrepancy is
that unquiescing is triggered by an explicit liveness event (i.e. the kv
record being overwritten), whereas a node may seem non-live for mere
reasons of timing in between. In effect this means that quiescence is
more aggressive than unquiescence.

I explored various avenues to address this (going as far as introducing
an unquiesce message to the Raft scheduler and serializing unquiescence
through the store Raft ticker goroutine), but it's hopeless. It's an
equally idle diversion to try to adjust the test liveness durations.

So, in this commit, the liveness check becomes more lenient: even when a
node seems non-live, we count it as live if there's a healthy connection
established to it. This effectively deals with issues in tests but also
seems like a good addition for real world workloads: when the liveness
range breaks down, we don't want to aggressively quiesce.

Fixes cockroachdb#27545.
Fixes cockroachdb#27607.

Release note: None
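A minimal sketch of the more lenient check described in this commit (names are illustrative; the real code consults connection health from the RPC layer): a node counts as live for quiescence purposes if the liveness map says so or if a healthy connection to it still exists.

```go
package main

import "fmt"

type NodeID int

// nodeCountsAsLive treats a node as live if liveness says so *or* if
// we still have a healthy connection to it, so a stale liveness record
// alone does not let the range quiesce away from a reachable follower.
func nodeCountsAsLive(id NodeID, live map[NodeID]bool, connHealthy func(NodeID) bool) bool {
	return live[id] || connHealthy(id)
}

func main() {
	live := map[NodeID]bool{1: true, 2: false}         // n2's liveness record looks expired
	healthy := func(id NodeID) bool { return id == 2 } // ...but the connection to n2 is fine

	fmt.Println(nodeCountsAsLive(2, live, healthy))                            // true: not treated as dead
	fmt.Println(nodeCountsAsLive(3, live, func(NodeID) bool { return false })) // false: truly unreachable
}
```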
craig bot pushed a commit that referenced this pull request Jul 16, 2018
27617: storage: quiesce less aggressively r=nvanbenschoten,bdarnell a=tschottdorf

Co-authored-by: Tobias Schottdorf <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Jul 24, 2020
Mitigates the runtime regression in cockroachdb#50865.
Implements the proposal from cockroachdb#50865 (comment).
Based on cockroachdb#51816.

In cockroachdb#26911, we introduced the ability to quiesce ranges with dead replicas
that were behind on the Raft log. This was a big win, as it prevented a
node failure from stopping any range with a replica on that node from
quiescing, even if the range was dormant. However, in order to ensure
that the replica was caught up quickly once its node being brought back
on-line, we began aggressively unquiescing when nodes moved from not-live
to live.

This turns out to be a destabilizing behavior. In scenarios like cockroachdb#50865
where nodes contain large numbers of replicas (i.e. O(100k)), suddenly
unquiescing all replicas on a node due to a transient liveness blip can
bring a cluster to its knees. Like cockroachdb#51888 and cockroachdb#45013, this is another
case where we add work to the system when it gets sufficiently loaded,
causing stability to be divergent instead of convergent and resulting in
an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date
replicas on nodes that undergo liveness changes. It does so by socializing
the liveness state of lagging replicas during Range quiescence. In
dong so through a new `lagging_quiescence` set attached to quiescence
requests, it allows all quiesced replicas to agree on which node liveness
events would need to result in the range being unquiesced and which ones
can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see cockroachdb#50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence
as it exists today. Before this change, we did not provide this
guarantee in the case that a leader was behind on its node liveness
information, broadcasted a quiescence notification, and then crashed. In
that case, a node that was brought back online might not be caught up
for minutes because the range wouldn't unquiesce until the lagging
replica reached out to it. Interestingly, this may be a semi-common
occurrence during rolling restart scenarios.

This commit deliberately does not worry about the migration of this
change for clarity. The next commit adds in the migration.

Release note (performance improvement): transient node liveness blips
no longer cause up-to-date Ranges to unquiesce, which makes these events
less destabilizing.
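A rough sketch of the mechanism this commit message describes (the type and field names are illustrative, not the actual proto definitions): the quiesce request carries the set of followers the leader believed were lagging, so every quiesced replica can decide locally whether a later liveness event requires waking the range.

```go
package main

import "fmt"

type NodeID int
type ReplicaID int

// quiesceRequest is an illustrative stand-in for the quiescence message;
// laggingFollowers records which followers the leader believed were
// behind (and dead) when it decided to quiesce.
type quiesceRequest struct {
	laggingFollowers map[ReplicaID]NodeID
}

// needsUnquiesce returns true only when the newly-live node hosts a
// follower that was recorded as lagging; liveness blips on nodes whose
// replicas were up to date are ignored.
func needsUnquiesce(req quiesceRequest, newlyLive NodeID) bool {
	for _, nodeID := range req.laggingFollowers {
		if nodeID == newlyLive {
			return true
		}
	}
	return false
}

func main() {
	req := quiesceRequest{laggingFollowers: map[ReplicaID]NodeID{3: 3}}
	fmt.Println(needsUnquiesce(req, 3)) // true: the lagging follower's node came back
	fmt.Println(needsUnquiesce(req, 2)) // false: n2 was caught up, ignore the blip
}
```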
nvanbenschoten added revised versions of this commit to nvanbenschoten/cockroach on Aug 7, Aug 18, and Aug 20, 2020 (commit message as above, later renaming the set to `lagging_followers_on_quiesce`).
craig bot pushed a commit that referenced this pull request Aug 20, 2020
51894: kvserver: don't unquiesce on liveness changes of up-to-date replicas r=nvanbenschoten a=nvanbenschoten

Mitigates the runtime regression in #50865.
Implements the proposal from #50865 (comment).

In #26911, we introduced the ability to quiesce ranges with dead replicas that were behind on the Raft log. This was a big win, as it prevented a node failure from stopping any range with a replica on that node from quiescing, even if the range was dormant. However, in order to ensure that the replica was caught up quickly once its node was brought back online, we began aggressively unquiescing when nodes moved from not-live to live.

This turns out to be a destabilizing behavior. In scenarios like #50865 where nodes contain large numbers of replicas (i.e. O(100k)), suddenly unquiescing all replicas on a node due to a transient liveness blip can bring a cluster to its knees. Like #51888 and #45013, this is another case where we add work to the system when it gets sufficiently loaded, causing stability to be divergent instead of convergent and resulting in an unstable system.

This fix removes the need to unquiesce ranges that have up-to-date replicas on nodes that undergo liveness changes. It does so by socializing the liveness state of lagging replicas during Range quiescence. In doing so through a new `lagging_quiescence` set attached to quiescence requests, it allows all quiesced replicas to agree on which node liveness events would need to result in the range being unquiesced and which ones can be ignored.

This allows us to continue to provide the guarantee that:

> If at least one up-to-date replica in a Raft group is alive, the
> Raft group will catch up any lagging replicas that are also alive.

For a pseudo-proof of this guarantee, see #50865 (comment).

A secondary effect of this change is that it fixes a bug in quiescence as it exists today. Before this change, we did not provide this guarantee in the case that a leader was behind on its node liveness information, broadcasted a quiescence notification, and then crashed. In that case, a node that was brought back online might not be caught up for minutes because the range wouldn't unquiesce until the lagging replica reached out to it. Interestingly, this may be a semi-common occurrence during rolling restart scenarios.

This is a large improvement on top of the existing behavior because it reduces the set of ranges with a replica on a given dead node that unquiesce when that node becomes live to only those ranges that had write activity after the node went down. This is typically only a small fraction of the total number of ranges on each node (operating on the usual assumption that in an average large cluster, most data is write-cold), especially when the outage is short. However, if all ranges had write activity and subsequently quiesced before the outage resolved, we would still have to unquiesce all ranges upon the node coming back up. For this reason, the change is primarily targeted towards reducing the impact of node liveness blips and not reducing the impact of bringing nodes back up after sustained node outages. This is intentional, as the former category of outage is the one more likely to be caused by overload due to unquiescence traffic itself (as we see in #50865), so it is the category we choose to focus on here.

Release note (performance improvement): transient node liveness blips no longer cause up-to-date Ranges to unquiesce, which makes these events less destabilizing.

@petermattis I'm adding you as a reviewer because you've done more thinking about Range quiescence than anyone else. But I can find a new second reviewer if you don't have time to take a look at this.

Co-authored-by: Nathan VanBenschoten <[email protected]>