storage: allow quiescing in the presence of dead nodes #9446
The StorePool currently takes 5 minutes to declare a node "down". We could lower this, although making it too low would result in more frequent repair operations for temporary blips. Maybe the StorePool should recognize different levels of "down":
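Purely as an illustration of what such tiering could look like (a hypothetical sketch; the state names, thresholds, and the `storeDownState` type are made up here, not taken from the actual StorePool code): roughly, state 1 would be a short blip, state 2 a "suspect" node that quiescence decisions stop waiting for, and state 3 a node declared dead, triggering repair.

```go
package storage

import "time"

// storeDownState is a hypothetical tiering of "how down" a store is.
type storeDownState int

const (
	// stateBlip: very recent silence; treat as a transient hiccup.
	stateBlip storeDownState = iota + 1
	// stateSuspect: silent long enough that ranges should stop waiting for
	// it (e.g. when deciding whether to quiesce), but not long enough to
	// start moving replicas.
	stateSuspect
	// stateDead: silent past the repair threshold; start up-replicating
	// elsewhere.
	stateDead
)

// classify buckets a store by how long ago it was last heard from.
// The thresholds are placeholders, not CockroachDB's actual settings.
func classify(sinceLastHeartbeat time.Duration) storeDownState {
	switch {
	case sinceLastHeartbeat < 10*time.Second:
		return stateBlip
	case sinceLastHeartbeat < 5*time.Minute:
		return stateSuspect
	default:
		return stateDead
	}
}
```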
If the downed node comes back while in state 2 (and we're assuming this will be relatively common; otherwise we should go straight to state 3), it will try to start elections, which will be disruptive. No matter how quickly we declare a node down, though, there will be some time during which we are unable to quiesce and heartbeats will be flowing at full volume. The risk of these load spikes remains a major concern with quiescence as an alternative to coalescing heartbeats.
Yeah. When a node goes down, all traffic to that node should start failing fast. It feels like there is something we can notice there, perhaps at the Raft transport level.
Networks can fail in many mysterious ways; not all of them result in something that will promptly trigger a fail-fast in the Raft transport. If the node was down for at least the StoreGossipInterval, it will tell you via gossip when it comes back up. If it's down for a shorter period of time, it will tell you that it's back up by calling Raft elections, although that's disruptive and we want to avoid that outcome. We could have it open all of its Raft streams to all peer stores at startup instead of opening them lazily; we could then un-quiesce in RaftMessage before we go into the Recv loop.
Raft transport employs its own circuit breaker for failing fast.
True about fail-fast not being foolproof. If a node is partitioned away, we should stop receiving any traffic from it. Feels like there must be some reliable signal there. Lazy Raft loading prevents a node from calling Raft elections when it comes back up.
Only if the process restarted. If it was a network-level problem, the process will still be there with its unquiesced Raft groups calling for elections.
In looking at this, I first wanted to replicate the problem in a workload. A kv-style workload against a three-node cluster does it (the exact commands were typed from memory, so they may need some adjustments).
As an experiment, I have a change that allows quiescence whenever the leaseholder and some other replica are up to date (i.e. with 3x replication it allows one replica to remain behind, which is something we can get away with for this workload). The graphs below show the effect of shutting down node3, before and after my change.

[graph: before my change]

[graph: after my change]

QPS are even higher than before (since the replicas now quiesce more aggressively).
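As a rough illustration of that relaxed rule (a hypothetical sketch; the function name, signature, and replica-ID bookkeeping are made up and this is not the actual experimental diff), the leader-side check could allow quiescence as long as the leaseholder and at least one other replica have the committed entry:

```go
package storage

// canQuiesceWithLaggard reports whether a range can quiesce even though some
// followers lag. progress maps replica ID to the highest log index known to
// be on that replica (as in Raft's Progress.Match), and commit is the
// leader's committed index.
//
// Sketch of the relaxed rule: quiesce as soon as the leaseholder itself and
// at least one other replica have the committed entry, i.e. with 3x
// replication a single replica may remain behind.
func canQuiesceWithLaggard(
	leaseholderID uint64, progress map[uint64]uint64, commit uint64,
) bool {
	if progress[leaseholderID] < commit {
		return false // the leaseholder itself must be caught up
	}
	upToDate := 0
	for id, match := range progress {
		if id != leaseholderID && match >= commit {
			upToDate++
		}
	}
	return upToDate >= 1
}
```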
@nvanbenschoten this tells me that we 100% have to address this issue to hit any kind of key result with TPCC under chaos, unless TPCC runs with very few ranges. The graphs above have 10k ranges. |
@tschottdorf These results are in line with what I expected. I agree that fixing this issue will likely be required to have good TPCC perf under chaos if we run with a reasonable number of warehouses. I recall TPCC-10k had tens of thousands of ranges, but the precise number eludes me.
TPC-C 10k has ~56k ranges IIRC. This will be very important. |
26542: roachtest: quiescence with dead node r=petermattis a=tschottdorf

This test currently fails, at least in remote mode (curiously in local mode it works, at least on my laptop).

```
--- FAIL: kv/quiescence/nodes=3 [unstable] (1258.49s)
	kv.go:160: QPS dropped by more than 80%: from 2669.32 to 1582.14
```

Fixing this test is the goal for #9446.

Release note: None

Co-authored-by: Tobias Schottdorf <[email protected]>
@tschottdorf thanks for exploring this and creating a roachtest to allow easy reproduction. It certainly seems like this is going to be a blocker to good performance under chaos in any realistic deployment. You mentioned that you had a change to address this. Is it using an approach that you'd actually consider pursuing or is it too hacky? Mind expanding on your approach and your findings?
@nvanbenschoten no change to address this except a hack that quiesces whenever at least two replicas are up to date, which is clearly not something we can pursue. I'll experiment some more next week. For starters, we can query node liveness and ignore followers on non-live nodes when deciding to quiesce. One downside of that approach is that in the event of a liveness range problem there will be spurious quiescence, but that could be worked around. There might also be some performance hit from having these liveness checks in a hot loop as the active replicas are ticked, but again, we can find ways to mitigate that. The liveness duration is kind of a natural threshold to use, since after an unexpected node outage the cluster is likely unavailable for around a lease duration anyway. It's different for a graceful quit, but we can detect that case separately.
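A bare-bones sketch of that first step (hypothetical names and signatures, not the actual CockroachDB API): when deciding whether to quiesce, the leader simply skips followers whose nodes are not currently live.

```go
package storage

// shouldQuiesce is a hypothetical, simplified leader-side quiescence check.
// followers maps node ID to that follower's match index (the highest log
// index known to be on the follower), live maps node ID to the latest
// node-liveness verdict, and commit is the leader's committed index.
func shouldQuiesce(followers map[int32]uint64, live map[int32]bool, commit uint64) bool {
	for nodeID, match := range followers {
		if !live[nodeID] {
			// Followers on non-live nodes are disregarded; they will be
			// caught up (or the range unquiesced) once the node returns.
			continue
		}
		if match < commit {
			return false // a live follower is behind; keep ticking
		}
	}
	return true
}
```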
If we do make the decision to quiesce because a node is not live, what mechanism unquiesces and catches up the out-of-date replica?
I think we'd need to build something that would detect replicas that are out of date even though their range is quiesced. Maybe quiesced replicas would still get 1/N heartbeat ticks instead of going completely quiet. We should also try to understand what exactly is so expensive about these unquiesced ranges - we may be able to partially address this by optimizing coalesced heartbeats, for example.
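To make the 1/N idea concrete, here is a toy sketch (the names, the `replica` stub, and the fraction are all illustrative; this is not the actual ticker code): quiesced replicas are still visited on a small fraction of tick passes so that out-of-date followers are eventually noticed.

```go
package storage

// replica is a stand-in for the real Replica type, reduced to what this
// sketch needs.
type replica struct {
	quiesced bool
}

func (r *replica) tick() {
	// In the real code this would drive the Raft group and check follower
	// progress.
}

// quiescedTickFraction means quiesced replicas are ticked on one out of
// every ten passes instead of going completely quiet.
const quiescedTickFraction = 10

// tickAll visits every replica; quiesced ones are skipped on most passes.
func tickAll(replicas []*replica, pass int) {
	for _, r := range replicas {
		if r.quiesced && pass%quiescedTickFraction != 0 {
			continue // stay quiet on most ticks
		}
		r.tick()
	}
}
```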
Occasionally ticking quiesced replicas is also something that could be useful for using follower reads for recovery. I agree that we should figure out what's so expensive there. We're sending a lot more RPCs; my assumption was that that was it, but it may not be.
If we're sending a lot more rpcs, maybe heartbeat coalescing is broken. The coalesced heartbeats should get bigger and more expensive, but there shouldn't be more of them.
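For reference, a toy sketch of the coalescing idea (types and names are illustrative, not the real RaftTransport): all per-range heartbeats produced during one tick and destined for the same store are folded into one outgoing message, so the message count tracks the number of peer stores while the message size tracks the number of unquiesced ranges.

```go
package storage

// heartbeat identifies a single range's heartbeat to a replica on some store.
type heartbeat struct {
	rangeID   int64
	toStoreID int32
}

// coalesce groups the heartbeats generated during one tick by destination
// store. One message is then sent per destination store, so adding
// unquiesced ranges grows each message rather than the number of messages.
func coalesce(beats []heartbeat) map[int32][]heartbeat {
	byStore := make(map[int32][]heartbeat, len(beats))
	for _, hb := range beats {
		byStore[hb.toStoreID] = append(byStore[hb.toStoreID], hb)
	}
	return byStore
}
```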
I didn't put this well. I didn't see more RPCs, just more heartbeats. I suspect either the heartbeats get clunky or it's just the large number of active ranges that has too much overhead.
26911: storage: quiesce ranges which have non-live replicas r=spencerkimball a=spencerkimball

Previously all replicas had to be completely up to date in order to quiesce ranges. This made the loss of a node in a cluster with many ranges an expensive proposition, as a significant number of ranges could be kept unquiesced for as long as the node was down.

This change refreshes a liveness map from the `NodeLiveness` object on every Raft ticker loop and then passes that to `Replica.tick()` to allow the leader to disregard non-live nodes when making its should-quiesce determination.

Release note (performance improvement): prevent dead nodes in clusters with many ranges from causing unnecessarily high CPU usage.

Note that this PR requires #26908 to function properly.

Fixes #9446

Co-authored-by: Spencer Kimball <[email protected]>
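A condensed sketch of the shape of that change (hypothetical names and signatures, not the actual diff): the store-level Raft ticker refreshes a node-liveness snapshot once per pass, rather than once per replica, and threads it through to each replica's tick, where a liveness-aware check like the one sketched earlier decides whether to quiesce.

```go
package storage

import "time"

// livenessMap is a point-in-time snapshot of which nodes are live.
type livenessMap map[int32]bool

// replica is a stand-in for the real Replica; tick would drive the Raft
// group and apply a liveness-aware quiescence check.
type replica struct{ quiesced bool }

func (r *replica) tick(live livenessMap) {
	// Raft ticking and the should-quiesce decision elided; followers on
	// nodes that are not live in the snapshot would be disregarded.
}

// store is a stand-in for the real Store.
type store struct {
	replicas []*replica
	liveness func() livenessMap // e.g. backed by the NodeLiveness object
}

// raftTickLoop refreshes the liveness snapshot once per pass and threads it
// through to every replica's tick.
func (s *store) raftTickLoop(stop <-chan struct{}) {
	t := time.NewTicker(200 * time.Millisecond) // tick interval is illustrative
	defer t.Stop()
	for {
		select {
		case <-t.C:
			live := s.liveness() // one liveness lookup per pass
			for _, r := range s.replicas {
				r.tick(live)
			}
		case <-stop:
			return
		}
	}
}
```

Refreshing the snapshot once per pass, rather than inside each replica's tick, amortizes the liveness lookup across all replicas, which speaks to the earlier concern about putting liveness checks in a hot loop.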
Per a TODO comment:
Cc @cockroachdb/stability