
storage: allow quiescing in the presence of dead nodes #9446

Closed
petermattis opened this issue Sep 17, 2016 · 18 comments

Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-performance Perf of queries or internals. Solution not expected to change functional behavior.

@petermattis
Collaborator

Per a TODO comment:

// TODO(peter): When a node goes down, any range which has a replica on the
// down node will not quiesce. This could be a significant performance
// impact. Additionally, when the node comes back up we want to bring any
// replicas it contains back up to date. Right now this will be handled because
// those ranges never quiesce. One thought for handling both these scenarios is
// to hook into the StorePool and its notion of "down" nodes. But that might
// not be sensitive enough.

Cc @cockroachdb/stability

@bdarnell
Contributor

The StorePool currently takes 5 minutes to declare a node "down". We could lower this, although making it too low would result in more frequent repair operations for temporary blips. Maybe the StorePool should recognize different levels of "down":

  1. After any disruption at all, remove the node from the pool of rebalance targets
  2. After some fairly short time, allow ranges on this node to quiesce even though it isn't responding
  3. After enough time has passed that we're sure the node isn't coming back soon, schedule its ranges for repair.

If the downed node comes back while in state 2 (and we're assuming this will be relatively common; otherwise we should go straight to state 3), it will try to start elections, which will be disruptive.

No matter how quickly we declare a node down, though, there will be some time during which we are unable to quiesce and heartbeats will be flowing at full volume. The risk of these load spikes remains a major concern with quiescence as an alternative to coalescing heartbeats.
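
For illustration, a minimal Go sketch of the tiered classification described above; the thresholds, type, and constant names are made up, not the actual StorePool code:

```go
package storepool

import "time"

// nodeStatus is a hypothetical tiered view of node health, mirroring the three
// levels above. Thresholds and names are illustrative.
type nodeStatus int

const (
	statusLive       nodeStatus = iota
	statusSuspect               // any disruption: stop rebalancing onto the node
	statusQuiescible            // down long enough that its ranges may quiesce
	statusDead                  // down long enough to schedule its ranges for repair
)

const (
	quiesceAfter = 30 * time.Second // "fairly short time" (made up)
	repairAfter  = 5 * time.Minute  // matches the current 5-minute dead threshold
)

// classify maps the time since the node's last successful heartbeat to a status.
func classify(sinceLastHeartbeat time.Duration) nodeStatus {
	switch {
	case sinceLastHeartbeat >= repairAfter:
		return statusDead
	case sinceLastHeartbeat >= quiesceAfter:
		return statusQuiescible
	case sinceLastHeartbeat > 0:
		return statusSuspect
	default:
		return statusLive
	}
}
```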

@petermattis
Collaborator Author

Yeah, using StorePool for this purpose feels awkward.

When a node goes down, all traffic to that node should start failing fast. It feels like there is something we can notice there, perhaps at the RaftTransport level. Something I haven't worked out is how to determine when the node has come back up in the absence of other traffic to the node.

@bdarnell
Contributor

Networks can fail in many mysterious ways; not all of them result in something that will promptly trigger a fail-fast in the raft transport.

If the node was down for at least the StoreGossipInterval, it will tell you via gossip when it comes back up. If it's down for a shorter period of time, it will tell you that it's back up by calling raft elections, although that's disruptive and we want to avoid that outcome. We could have it open all its raft streams to all peer stores at startup instead of waiting to do them lazily; we could then un-quiesce in RaftMessage before we go into the Recv loop.
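
A rough sketch of that idea, with stand-in types rather than the real RaftTransport/Store APIs: because a restarted peer would open its streams eagerly, the arrival of a stream is itself the signal to un-quiesce, before the Recv loop starts.

```go
package transport

// Illustrative stand-ins; none of these are the real CockroachDB types.
type raftMessage struct{}

type raftMessageStream interface {
	RemoteStoreID() int64
	Recv() (raftMessage, error)
}

type store interface {
	// UnquiesceReplicasWithPeer wakes any quiesced ranges that have a
	// replica on the given store.
	UnquiesceReplicasWithPeer(storeID int64)
}

type raftTransport struct {
	store store
}

// handleRaftStream sketches the idea above: a restarted peer opens its Raft
// streams eagerly at startup, so the stream's arrival is the "I'm back"
// signal, and we un-quiesce before entering the Recv loop.
func (t *raftTransport) handleRaftStream(s raftMessageStream) error {
	t.store.UnquiesceReplicasWithPeer(s.RemoteStoreID())
	for {
		if _, err := s.Recv(); err != nil {
			return err
		}
		// Normal Raft message handling would go here.
	}
}
```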

@spencerkimball
Member

The Raft transport employs its own circuit breaker for failing fast. We could export a method that checks the breaker state to decide whether to exclude a node from the healthy set for quiescence purposes. The problem is that the signal only works if it's periodically primed by trying to send traffic to the node. But circuit breakers do have an events callback mechanism, so it would be pretty simple to give the Raft transport the ability to accept registrations and invoke a callback when it receives a BreakerReady event.
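
A hedged sketch of that registration mechanism; the event names and callback plumbing here are illustrative and may not match the circuit-breaker library's actual API:

```go
package transport

// breakerEvent is a stand-in for the circuit breaker's event type.
type breakerEvent int

const (
	breakerTripped breakerEvent = iota
	breakerReady // the breaker is willing to let a trial request through again
)

type raftTransport struct {
	// onBreakerReady holds callbacks to invoke when a node's breaker
	// transitions back to "ready", e.g. to let quiescence logic re-evaluate.
	onBreakerReady []func(nodeID int64)
}

// RegisterBreakerReadyCallback lets other components learn when a previously
// failing node looks reachable again.
func (t *raftTransport) RegisterBreakerReadyCallback(fn func(nodeID int64)) {
	t.onBreakerReady = append(t.onBreakerReady, fn)
}

// handleBreakerEvent would be hooked into the breaker's event stream.
func (t *raftTransport) handleBreakerEvent(nodeID int64, ev breakerEvent) {
	if ev == breakerReady {
		for _, fn := range t.onBreakerReady {
			fn(nodeID)
		}
	}
}
```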


@petermattis
Collaborator Author

True about fail-fast not being foolproof. If a node is partitioned away, we should stop receiving any traffic from it. Feels like there must be some reliable signal there.

Lazy Raft loading prevents a node from calling Raft elections when it comes back up.

@bdarnell
Contributor

Lazy Raft loading prevents a node from calling Raft elections when it comes back up.

If the process restarted. If it was a network-level problem, the process will still be there with its unquiesced raft groups calling for elections.

@petermattis petermattis added this to the Later milestone Feb 22, 2017
@a-robinson a-robinson modified the milestones: Later, 2.1 Feb 26, 2018
@a-robinson a-robinson removed their assignment Feb 26, 2018
@nvanbenschoten nvanbenschoten added C-performance Perf of queries or internals. Solution not expected to change functional behavior. A-kv-replication Relating to Raft, consensus, and coordination. labels Apr 24, 2018
@nvanbenschoten nvanbenschoten assigned tbg and unassigned benesch May 29, 2018
@tbg
Member

tbg commented Jun 7, 2018

In looking at this I first wanted to replicate the problem in a workload.

The following does it (typed from memory, so may need some adjustments):

```
roachprod create tobias-quiesce -n 4
roachprod put tobias-quiesce:1-3 cockroach
roachprod put tobias-quiesce:4 bin/workload
roachprod start tobias-quiesce:1-3
roachprod ssh tobias-quiesce:4 "./workload run kv $(roachprod pgurl tobias-quiesce:1) --seed $(date +%N) --max-ops 1 --splits 10000 --init"
roachprod stop tobias-quiesce:1-3
roachprod start tobias-quiesce:1-3
roachprod ssh tobias-quiesce:4 "./workload run kv $(roachprod pgurl tobias-quiesce:1) --seed $(date +%N) --concurrency 64"
# While kv is running:
roachprod stop tobias-quiesce:3
# QPS plummets from the 1.8k range to below 100.
roachprod start tobias-quiesce:3
# QPS recovers over the next few minutes.
```

@tbg
Member

tbg commented Jun 7, 2018

As an experiment, I have a change that allows quiescence whenever the leaseholder and some other replica are up to date (i.e. with 3x replication it allows one replica to remain behind, which is something we can get away with for this workload).

The below graphs show the effect of shutting down node3. (To be fair, I did it with kill -9 in the "before", and gracefully in "after", though it almost seems like it doesn't make a huge difference):

Before my change:

[graphs not included]

After my change:

[graphs not included]

QPS are even higher than before (since the replicas now quiesce more aggressively).
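
For concreteness, a minimal sketch of the relaxed quiescence condition used in this experiment, assuming a per-replica progress map; the names and types are illustrative, not the actual change:

```go
package storage

// Illustrative types; not the real CockroachDB structures.
type replicaID int

type progress struct {
	match         uint64 // highest log index known to be replicated to this replica
	isLeaseholder bool
}

// canQuiesceRelaxed reports whether a range may quiesce even though some
// followers lag: it requires the leaseholder plus at least one other replica
// to be caught up to lastIndex, instead of requiring all of them.
func canQuiesceRelaxed(lastIndex uint64, prs map[replicaID]progress) bool {
	leaseholderCaughtUp := false
	othersCaughtUp := 0
	for _, pr := range prs {
		if pr.match < lastIndex {
			continue
		}
		if pr.isLeaseholder {
			leaseholderCaughtUp = true
		} else {
			othersCaughtUp++
		}
	}
	return leaseholderCaughtUp && othersCaughtUp >= 1
}
```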

@tbg
Member

tbg commented Jun 7, 2018

@nvanbenschoten this tells me that we 100% have to address this issue to hit any kind of key result with TPCC under chaos, unless TPCC runs with very few ranges. The graphs above have 10k ranges.

@petermattis
Collaborator Author

@tschottdorf These results are in line with what I expected. I agree that fixing this issue will likely be required to have good TPCC perf under chaos if we run with a reasonable number of warehouses. I recall TPCC-10k had tens of thousands of ranges, but the precise number eludes me.

@a-robinson
Contributor

TPC-C 10k has ~56k ranges IIRC. This will be very important.

tbg added a commit to tbg/cockroach that referenced this issue Jun 8, 2018
This test currently fails, at least in remote mode (curiously in local
mode it works, at least on my laptop).

```
--- FAIL: kv/quiescence/nodes=3 [unstable] (1258.49s)
        kv.go:160: QPS dropped by more than 80%: from 2669.32 to 1582.14
```

Fixing this test is the goal for cockroachdb#9446.

Release note: None
craig bot pushed a commit that referenced this issue Jun 8, 2018
26542: roachtest: quiescence with dead node r=petermattis a=tschottdorf

Co-authored-by: Tobias Schottdorf <[email protected]>
@nvanbenschoten
Member

@tschottdorf thanks for exploring this and creating a roachtest to allow easy reproduction. It certainly seems like this is going to be a blocker to good performance under chaos in any realistic deployment.

You mentioned that you had a change to address this. Is it using an approach that you'd actually consider pursuing or is it too hacky? Mind expanding on your approach and your findings?

@tbg
Member

tbg commented Jun 12, 2018

@nvanbenschoten no change to address this except a hack that quiesces whenever at least two replicas are up to date, which is clearly not something we can pursue.

I'll experiment some more next week. For starters, we can query node liveness and ignore followers on non-live nodes when deciding to quiesce. One downside of that approach is that in the event of a liveness range problem, there will be spurious quiescence, but that could be worked around. Then there might be some performance hit from having these liveness checks in a hot loop as the active replicas are ticked, but again, we can find ways to mitigate that.

The liveness duration is kind of a natural threshold to use, since after an unexpected node outage, the cluster is likely unavailable for around a lease duration anyway. It's different for graceful quit, but we can inspect the Draining field as well (though we probably don't want to quiesce right away).
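
A minimal sketch of that liveness-aware quiescence rule, assuming a node-liveness snapshot is available at tick time; all names here are illustrative:

```go
package storage

// livenessMap is a snapshot of node liveness, e.g. refreshed periodically
// from gossip. Illustrative only.
type livenessMap map[int64]bool // nodeID -> is live

type follower struct {
	nodeID int64
	match  uint64 // highest log index known to be replicated to this follower
}

// shouldQuiesce sketches the rule: every lagging follower must be on a
// non-live node. A live-but-lagging follower still blocks quiescence, so
// catch-up behaves as before; only followers on dead nodes are ignored.
func shouldQuiesce(lastIndex uint64, followers []follower, live livenessMap) bool {
	for _, f := range followers {
		if f.match >= lastIndex {
			continue // caught up
		}
		if live[f.nodeID] {
			return false // live but behind: don't quiesce, let it catch up
		}
		// Node is not live: tolerate the lag and allow quiescence anyway.
	}
	return true
}
```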

@spencerkimball
Member

If we do make the decision to quiesce because a node is not live, what mechanism unquiesces and catches up the out-of-date replica?

@bdarnell
Contributor

I think we'd need to build something that would detect replicas that are out of date even though their range is quiesced. Maybe quiesced replicas would still get 1/N heartbeat ticks instead of going completely quiet.

We should also try to understand what exactly is so expensive about these unquiesced ranges - we may be able to partially address this by optimizing coalesced heartbeats, for example.
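
One way to sketch the reduced-rate ticking idea from the first paragraph, with made-up names and an arbitrary N:

```go
package storage

// Quiesced replicas aren't removed from the ticker entirely; they get ticked
// every Nth round so the leader can still detect followers that fell behind
// while the range was quiesced. N and the type names are made up.
const quiescedTickInterval = 10

type replica struct {
	quiesced bool
}

func (r *replica) tickActive()            { /* normal Raft tick would go here */ }
func (r *replica) maybeCatchUpFollowers() { /* check follower progress, send appends if needed */ }

func tickAll(replicas []*replica, tickCount int) {
	for _, r := range replicas {
		if !r.quiesced {
			r.tickActive()
			continue
		}
		// Quiesced replicas get 1/N of the ticks instead of going fully quiet.
		if tickCount%quiescedTickInterval == 0 {
			r.maybeCatchUpFollowers()
		}
	}
}
```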

@tbg
Member

tbg commented Jun 14, 2018 via email

@bdarnell
Contributor

If we're sending a lot more RPCs, maybe heartbeat coalescing is broken. The coalesced heartbeats should get bigger and more expensive, but there shouldn't be more of them.

@tbg
Member

tbg commented Jun 15, 2018

I didn't put this well. I didn't see more RPCs, just more heartbeats. I suspect either the heartbeats get clunky or it's just the large number of active ranges that has too much overhead.

spencerkimball added a commit to spencerkimball/cockroach that referenced this issue Jun 21, 2018
Previously all replicas had to be completely up to date in order to
quiesce ranges. This made the loss of a node in a cluster with many
ranges an expensive proposition, as a significant number of ranges
could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness`
object on every Raft ticker loop and then passes that to
`Replica.tick()` to allow the leader to disregard non-live nodes
when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters
with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires cockroachdb#26908 to function properly

Fixes cockroachdb#9446
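
Sketched very roughly, the shape described in the commit message (a liveness map refreshed once per ticker loop and passed into each replica's tick) might look like the following; the types and method names are illustrative, not the actual implementation:

```go
package storage

import "time"

// Illustrative stand-ins for the components named in the commit message.
type livenessMap map[int64]bool // nodeID -> live

type nodeLiveness interface {
	// GetIsLiveMap returns a snapshot of which nodes are currently live.
	GetIsLiveMap() livenessMap
}

type replica interface {
	// tick advances the Raft group; the liveness snapshot lets the leader
	// ignore non-live followers when deciding whether to quiesce.
	tick(live livenessMap)
}

// raftTickLoop refreshes the liveness map once per tick and hands the same
// snapshot to every replica, so liveness isn't queried per replica inside
// the hot loop.
func raftTickLoop(nl nodeLiveness, replicas []replica, interval time.Duration, stop <-chan struct{}) {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	for {
		select {
		case <-ticker.C:
			live := nl.GetIsLiveMap()
			for _, r := range replicas {
				r.tick(live)
			}
		case <-stop:
			return
		}
	}
}
```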
craig bot pushed a commit that referenced this issue Jul 13, 2018
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball

Co-authored-by: Spencer Kimball <[email protected]>
@craig craig bot closed this as completed in #26911 Jul 13, 2018