
storage: allow quiescing in the presence of dead nodes #9446

Closed
petermattis opened this issue Sep 17, 2016 · 18 comments

Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

@petermattis
Collaborator

Per a TODO comment:

// TODO(peter): When a node goes down, any range which has a replica on the
// down node will not quiesce. This could be a significant performance
// impact. Additionally, when the node comes back up we want to bring any
// replicas it contains back up to date. Right now this will be handled because
// those ranges never quiesce. One thought for handling both these scenarios is
// to hook into the StorePool and its notion of "down" nodes. But that might
// not be sensitive enough.

Cc @cockroachdb/stability

@bdarnell
Contributor

The StorePool currently takes 5 minutes to declare a node "down". We could lower this, although making it too low would result in more frequent repair operations for temporary blips. Maybe the StorePool should recognize different levels of "down":

  1. After any disruption at all, remove the node from the pool of rebalance targets
  2. After some fairly short time, allow ranges on this node to quiesce even though it isn't responding
  3. After enough time has passed that we're sure the node isn't coming back soon, schedule its ranges for repair.

If the downed node comes back while in state 2 (and we're assuming this will be relatively common; otherwise we should go straight to state 3), it will try to start elections, which will be disruptive.

No matter how quickly we declare a node down, though, there will be some time during which we are unable to quiesce and heartbeats will be flowing at full volume. The risk of these load spikes remains a major concern with quiescence as an alternative to coalescing heartbeats.
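
For illustration, a minimal Go sketch of the tiered classification described above; the thresholds, type, and constant names are made up, not the actual StorePool code:

```go
package storepool

import "time"

// nodeStatus is a hypothetical tiered view of node health, mirroring the three
// levels above. Thresholds and names are illustrative.
type nodeStatus int

const (
	statusLive       nodeStatus = iota
	statusSuspect               // any disruption: stop rebalancing onto the node
	statusQuiescible            // down long enough that its ranges may quiesce
	statusDead                  // down long enough to schedule its ranges for repair
)

const (
	quiesceAfter = 30 * time.Second // "fairly short time" (made up)
	repairAfter  = 5 * time.Minute  // matches the current 5-minute dead threshold
)

// classify maps the time since the node's last successful heartbeat to a status.
func classify(sinceLastHeartbeat time.Duration) nodeStatus {
	switch {
	case sinceLastHeartbeat >= repairAfter:
		return statusDead
	case sinceLastHeartbeat >= quiesceAfter:
		return statusQuiescible
	case sinceLastHeartbeat > 0:
		return statusSuspect
	default:
		return statusLive
	}
}
```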

@petermattis
Collaborator Author

Yeah, using StorePool for this purpose feels awkward.

When a node goes down, all traffic to that node should start failing fast. It feels like there is something we can notice there, perhaps at the RaftTransport level. Something I haven't worked out is how to determine when the node has come back up in the absence of other traffic to the node.

@bdarnell
Contributor

Networks can fail in many mysterious ways; not all of them result in something that will promptly trigger a fail-fast in the raft transport.

If the node was down for at least the StoreGossipInterval, it will tell you via gossip when it comes back up. If it's down for a shorter period of time, it will tell you that it's back up by calling raft elections, although that's disruptive and we want to avoid that outcome. We could have it open all its raft streams to all peer stores at startup instead of waiting to do them lazily; we could then un-quiesce in RaftMessage before we go into the Recv loop.
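
A rough sketch of that idea, with stand-in types rather than the real RaftTransport/Store APIs: because a restarted peer would open its streams eagerly, the arrival of a stream is itself the signal to un-quiesce, before the Recv loop starts.

```go
package transport

// Illustrative stand-ins; none of these are the real CockroachDB types.
type raftMessage struct{}

type raftMessageStream interface {
	RemoteStoreID() int64
	Recv() (raftMessage, error)
}

type store interface {
	// UnquiesceReplicasWithPeer wakes any quiesced ranges that have a
	// replica on the given store.
	UnquiesceReplicasWithPeer(storeID int64)
}

type raftTransport struct {
	store store
}

// handleRaftStream sketches the idea above: a restarted peer opens its Raft
// streams eagerly at startup, so the stream's arrival is the "I'm back"
// signal, and we un-quiesce before entering the Recv loop.
func (t *raftTransport) handleRaftStream(s raftMessageStream) error {
	t.store.UnquiesceReplicasWithPeer(s.RemoteStoreID())
	for {
		if _, err := s.Recv(); err != nil {
			return err
		}
		// Normal Raft message handling would go here.
	}
}
```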

@spencerkimball
Member

The Raft transport employs its own circuit breaker for failing fast. We could export a method that checks the breaker state to decide whether to exclude a node from the healthy set for quiescence purposes. The problem is that the signal only works if it's periodically primed by trying to send traffic to the node. But circuit breakers do have an events callback mechanism, so it would be pretty simple to give the Raft transport the ability to accept registrations and invoke a callback when it receives a BreakerReady event.
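
A hedged sketch of that registration mechanism; the event names and callback plumbing here are illustrative and may not match the circuit-breaker library's actual API:

```go
package transport

// breakerEvent is a stand-in for the circuit breaker's event type.
type breakerEvent int

const (
	breakerTripped breakerEvent = iota
	breakerReady // the breaker is willing to let a trial request through again
)

type raftTransport struct {
	// onBreakerReady holds callbacks to invoke when a node's breaker
	// transitions back to "ready", e.g. to let quiescence logic re-evaluate.
	onBreakerReady []func(nodeID int64)
}

// RegisterBreakerReadyCallback lets other components learn when a previously
// failing node looks reachable again.
func (t *raftTransport) RegisterBreakerReadyCallback(fn func(nodeID int64)) {
	t.onBreakerReady = append(t.onBreakerReady, fn)
}

// handleBreakerEvent would be hooked into the breaker's event stream.
func (t *raftTransport) handleBreakerEvent(nodeID int64, ev breakerEvent) {
	if ev == breakerReady {
		for _, fn := range t.onBreakerReady {
			fn(nodeID)
		}
	}
}
```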


@petermattis
Collaborator Author

True about fail-fast not being foolproof. If a node is partitioned away, we should stop receiving any traffic from it. Feels like there must be some reliable signal there.

Lazy Raft loading prevents a node from calling Raft elections when it comes back up.

@bdarnell
Contributor

Lazy Raft loading prevents a node from calling Raft elections when it comes back up.

If the process restarted. If it was a network-level problem, the process will still be there with its unquiesced raft groups calling for elections.

@petermattis petermattis added this to the Later milestone Feb 22, 2017
@a-robinson a-robinson modified the milestones: Later, 2.1 Feb 26, 2018
@a-robinson a-robinson removed their assignment Feb 26, 2018
@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 24, 2018
@nvanbenschoten nvanbenschoten assigned tbg and unassigned benesch May 29, 2018
@tbg
Member

tbg commented Jun 7, 2018

In looking at this I first wanted to replicate the problem in a workload.

The following does it (typed from memory, so may need some adjustments):

```
roachprod create tobias-quiesce -n 4
roachprod put tobias-quiesce:1-3 cockroach
roachprod put tobias-quiesce:4 bin/workload
roachprod start tobias-quiesce:1-3
roachprod ssh tobias-quiesce:4 "./workload run kv $(roachprod pgurl tobias-quiesce:1) --seed $(date +%N) --max-ops 1 --splits 10000 --init"
roachprod stop tobias-quiesce:1-3
roachprod start tobias-quiesce:1-3
roachprod ssh tobias-quiesce:4 "./workload run kv $(roachprod pgurl tobias-quiesce:1) --seed $(date +%N) --concurrency 64"
# While kv is running:
roachprod stop tobias-quiesce:3
# QPS plummets from the 1.8k range to below 100.
roachprod start tobias-quiesce:3
# QPS recovers over the next few minutes.
```

@tbg
Member

tbg commented Jun 7, 2018

As an experiment, I have a change that allows quiescence whenever the leaseholder and some other replica are up to date (i.e. with 3x replication it allows one replica to remain behind, which is something we can get away with for this workload).

The below graphs show the effect of shutting down node3. (To be fair, I did it with kill -9 in the "before", and gracefully in "after", though it almost seems like it doesn't make a huge difference):

Before my change:

[graphs not included]

After my change:

[graphs not included]

QPS are even higher than before (since the replicas now quiesce more aggressively).
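
For concreteness, a minimal sketch of the relaxed quiescence condition used in this experiment, assuming a per-replica progress map; the names and types are illustrative, not the actual change:

```go
package storage

// Illustrative types; not the real CockroachDB structures.
type replicaID int

type progress struct {
	match         uint64 // highest log index known to be replicated to this replica
	isLeaseholder bool
}

// canQuiesceRelaxed reports whether a range may quiesce even though some
// followers lag: it requires the leaseholder plus at least one other replica
// to be caught up to lastIndex, instead of requiring all of them.
func canQuiesceRelaxed(lastIndex uint64, prs map[replicaID]progress) bool {
	leaseholderCaughtUp := false
	othersCaughtUp := 0
	for _, pr := range prs {
		if pr.match < lastIndex {
			continue
		}
		if pr.isLeaseholder {
			leaseholderCaughtUp = true
		} else {
			othersCaughtUp++
		}
	}
	return leaseholderCaughtUp && othersCaughtUp >= 1
}
```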

@tbg
Member

tbg commented Jun 7, 2018

@nvanbenschoten this tells me that we 100% have to address this issue to hit any kind of key result with TPCC under chaos, unless TPCC runs with very few ranges. The graphs above have 10k ranges.

@petermattis
Collaborator Author

@tschottdorf These results are in line with what I expected. I agree that fixing this issue will likely be required to have good TPCC perf under chaos if we run with a reasonable number of warehouses. I recall TPCC-10k had tens of thousands of ranges, but the precise number eludes me.

@a-robinson
Contributor

TPC-C 10k has ~56k ranges IIRC. This will be very important.

tbg added a commit to tbg/cockroach that referenced this issue Jun 8, 2018
This test currently fails, at least in remote mode (curiously in local
mode it works, at least on my laptop).

```
--- FAIL: kv/quiescence/nodes=3 [unstable] (1258.49s)
        kv.go:160: QPS dropped by more than 80%: from 2669.32 to 1582.14
```

Fixing this test is the goal for cockroachdb#9446.

Release note: None
craig bot pushed a commit that referenced this issue Jun 8, 2018
26542: roachtest: quiescence with dead node r=petermattis a=tschottdorf

Co-authored-by: Tobias Schottdorf <[email protected]>
@nvanbenschoten
Member

@tschottdorf thanks for exploring this and creating a roachtest to allow easy reproduction. It certainly seems like this is going to be a blocker to good performance under chaos in any realistic deployment.

You mentioned that you had a change to address this. Is it using an approach that you'd actually consider pursuing or is it too hacky? Mind expanding on your approach and your findings?

@tbg
Member

tbg commented Jun 12, 2018

@nvanbenschoten no change to address this except a hack that quiesces whenever at least two replicas are up to date, which is clearly not something we can pursue.

I'll experiment some more next week. For starters, we can query node liveness and ignore followers on non-live nodes when deciding to quiesce. One downside of that approach is that in the event of a liveness range problem, there will be spurious quiescence, but that could be worked around. Then there might be some performance hit from having these liveness checks in a hot loop as the active replicas are ticked, but again, we can find ways to mitigate that.

The liveness duration is kind of a natural threshold to use, since after an unexpected node outage, the cluster is likely unavailable for around a lease duration anyway. It's different for graceful quit, but we can inspect the Draining field as well (though we probably don't want to quiesce right away).
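
A minimal sketch of that liveness-aware quiescence rule, assuming a node-liveness snapshot is available at tick time; all names here are illustrative:

```go
package storage

// livenessMap is a snapshot of node liveness, e.g. refreshed periodically
// from gossip. Illustrative only.
type livenessMap map[int64]bool // nodeID -> is live

type follower struct {
	nodeID int64
	match  uint64 // highest log index known to be replicated to this follower
}

// shouldQuiesce sketches the rule: every lagging follower must be on a
// non-live node. A live-but-lagging follower still blocks quiescence, so
// catch-up behaves as before; only followers on dead nodes are ignored.
func shouldQuiesce(lastIndex uint64, followers []follower, live livenessMap) bool {
	for _, f := range followers {
		if f.match >= lastIndex {
			continue // caught up
		}
		if live[f.nodeID] {
			return false // live but behind: don't quiesce, let it catch up
		}
		// Node is not live: tolerate the lag and allow quiescence anyway.
	}
	return true
}
```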

@spencerkimball
Member

If we do make the decision to quiesce because a node is not live, what mechanism unquiesces and catches up the out-of-date replica?

@bdarnell
Contributor

I think we'd need to build something that would detect replicas that are out of date even though their range is quiesced. Maybe quiesced replicas would still get 1/N heartbeat ticks instead of going completely quiet.

We should also try to understand what exactly is so expensive about these unquiesced ranges - we may be able to partially address this by optimizing coalesced heartbeats, for example.
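
One way to sketch the reduced-rate ticking idea from the first paragraph, with made-up names and an arbitrary N:

```go
package storage

// Quiesced replicas aren't removed from the ticker entirely; they get ticked
// every Nth round so the leader can still detect followers that fell behind
// while the range was quiesced. N and the type names are made up.
const quiescedTickInterval = 10

type replica struct {
	quiesced bool
}

func (r *replica) tickActive()            { /* normal Raft tick would go here */ }
func (r *replica) maybeCatchUpFollowers() { /* check follower progress, send appends if needed */ }

func tickAll(replicas []*replica, tickCount int) {
	for _, r := range replicas {
		if !r.quiesced {
			r.tickActive()
			continue
		}
		// Quiesced replicas get 1/N of the ticks instead of going fully quiet.
		if tickCount%quiescedTickInterval == 0 {
			r.maybeCatchUpFollowers()
		}
	}
}
```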

@tbg
Member

tbg commented Jun 14, 2018 via email

@bdarnell
Contributor

If we're sending a lot more RPCs, maybe heartbeat coalescing is broken. The coalesced heartbeats should get bigger and more expensive, but there shouldn't be more of them.

@tbg
Member

tbg commented Jun 15, 2018

I didn't put this well. I didn't see more RPCs, just more heartbeats. I suspect either the heartbeats get clunky or it's just the large number of active ranges that has too much overhead.

spencerkimball added a commit to spencerkimball/cockroach that referenced this issue Jun 21, 2018
Previously all replicas had to be completely up to date in order to
quiesce ranges. This made the loss of a node in a cluster with many
ranges an expensive proposition, as a significant number of ranges
could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness`
object on every Raft ticker loop and then passes that to
`Replica.tick()` to allow the leader to disregard non-live nodes
when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters
with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires cockroachdb#26908 to function properly

Fixes cockroachdb#9446
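
Sketched very roughly, the shape described in the commit message (a liveness map refreshed once per ticker loop and passed into each replica's tick) might look like the following; the types and method names are illustrative, not the actual implementation:

```go
package storage

import "time"

// Illustrative stand-ins for the components named in the commit message.
type livenessMap map[int64]bool // nodeID -> live

type nodeLiveness interface {
	// GetIsLiveMap returns a snapshot of which nodes are currently live.
	GetIsLiveMap() livenessMap
}

type replica interface {
	// tick advances the Raft group; the liveness snapshot lets the leader
	// ignore non-live followers when deciding whether to quiesce.
	tick(live livenessMap)
}

// raftTickLoop refreshes the liveness map once per tick and hands the same
// snapshot to every replica, so liveness isn't queried per replica inside
// the hot loop.
func raftTickLoop(nl nodeLiveness, replicas []replica, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			live := nl.GetIsLiveMap()
			for _, r := range replicas {
				r.tick(live)
			}
		case <-stop:
			return
		}
	}
}
```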
craig bot pushed a commit that referenced this issue Jul 13, 2018
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball

Co-authored-by: Spencer Kimball <[email protected]>
@craig craig bot closed this as completed in #26911 Jul 13, 2018