
stability: leader-less range stalls entire cluster #12591

Closed
mberhault opened this issue Dec 28, 2016 · 39 comments

@mberhault

sha: 789f749

While testing linear scalability for https://github.com/cockroachlabs/production/issues/175, the cluster stalled shortly after going from 32 to 64 nodes:
https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-sql?from=1482932037968&to=1482946437968&var-cluster=sky&var-node=All&var-rate_interval=1m

sql requests show:

application error: range 614: replica {36 36 13} not lease holder; lease holder unknown

all mentions of r614 on all logs:
range-614.txt

errors only: trouble started around 17:10. nodes cockroach-sky-{33-64} had just finished starting around then.

20:
E161228 17:11:22.318248 11105 storage/queue.go:597  [replicaGC,n21,s21,r614/11:/Table/51/1/293{63395…-93477…},@c427905680] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded

35:
E161228 17:10:16.328463 144139 storage/replica_range_lease.go:185  [n36,s36,r614/13:/Table/51/1/293{63395…-93477…},@c4260ea900] heartbeat failed on epoch increment
E161228 17:10:16.785883 147755 storage/replica_range_lease.go:185  [n36,s36,r614/13:/Table/51/1/293{63395…-93477…},@c4260ea900] heartbeat failed on epoch increment

62:
E161228 17:10:23.292964 164466 storage/replica_range_lease.go:190  [n63,s63,r614/15:/Table/51/1/293{63395…-93477…},@c4276ebb00] mismatch incrementing epoch for node_id:21 epoch:1 expiration:<wall_time:1482945022425698115 logical:0 > ; actual is {NodeID:21 Epoch:1 Expiration:1482945031.442442362,0}
@mberhault

maybe @petermattis and @bdarnell? not the best week for debugging though

@mberhault

including log files for all nodes:
scaling-brokenness.zip

@mberhault

@a-robinson, you probably know a decent amount about the node liveness stuff; this seems to be in that chunk of code.

@a-robinson

I'll look around at things. Is the cluster still in this state?

@mberhault

nope, I took it down to resume my scalability testing. looking at the logs, there's lots of spam about slow heartbeats (eg, look at 35.stderr in the logs zip).

@mberhault

reran the same scenario but with 30 minutes between node additions.
At 64 nodes, crapped out again with:

E161228 20:51:36.888246 1131 storage/queue.go:597  [raftlog,n20,s20,r628/7:/Table/51/1/4{341497…-768829…},@c42036d680] failed to send RPC: sending to all 3 replicas failed; last error: range 628: replica {6 6 4} not lease holder; lease holder unknown

Node cockroach-sky-0020.eastus2.cloudapp.azure.com is the only one exporting a non-zero replicas_leaders_not_leaseholders

@a-robinson

The lease on range 628 looks corrupted. All the nodes I've checked say the lease is assigned to n1,s1,r628/1 even though n1,s1 doesn't have a replica on it.

@mberhault

mberhault commented Dec 28, 2016

something odd I noticed on both runs (failing around 17:10 and 20:45 respectively):

21.stderr:W161228 17:10:25.345304 3968825 storage/intent_resolver.go:338  [n22,s22,r1/9:/{Min-System/tsd/c…}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: range 614: replica {63 63 15} not lease holder; lease holder unknown
21.stderr:W161228 17:11:10.165271 4051672 storage/intent_resolver.go:334  [n22,s22,r1/9:/{Min-System/tsd/c…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: range 1: replica {22 22 9} not lease holder; lease holder unknown
21.stderr:W161228 20:45:38.550323 1693693 storage/intent_resolver.go:338  [n22,s22,r1/7:/{Min-System/tsd/c…}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: range 628: replica {20 20 7} not lease holder; lease holder unknown
21.stderr:W161228 20:45:54.333822 1717012 storage/intent_resolver.go:334  [n22,s22,r1/7:/{Min-System/tsd/c…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: range 1: replica {22 22 7} not lease holder; lease holder unknown

these are both about range 1, which explains why the whole cluster is wedged rather than just a single range.
another odd thing, most likely due to timing/similar scenarios/coincidence, is that it's the same node both times.

@andreimatei

andreimatei commented Dec 28, 2016 via email

@a-robinson

Yeah. Is this an example of our range descriptor cache problems from https://forum.cockroachlabs.com/t/rangedescriptorcache-and-uncertainty/393 / #10751?

@mberhault

ugh. odd that this happens so consistently. all I want is to run a 64 node cluster

@a-robinson

a-robinson commented Dec 28, 2016

Although there's also the fair question of why none of the nodes (particularly the raft leader, node 20 in this case) is able to take the lease.

@mberhault

could it be because that's the one with the odd r1 warnings above? the only one btw.

@mberhault

anyway, the latest run is still up, the one with n20,r628 leader-not-leaseholder

@a-robinson

There were a ton of "context deadline exceeded" errors trying to resolve intents on n22,s22,r1/7 while all the new nodes were starting up around 20:45. A lot of things were moving slowly, probably due to all the work involved in registering/rebalancing to a huge number of new nodes at once.

Performance also may have been influenced by the gossip network having problems -- new nodes kept getting bounced back and forth between nodes with no space for incoming connections. We may need better logic around when to refuse gossip connections. On node 22 (the range 1 leaseholder), 2186 out of 19907 total log lines are about refusing incoming gossip connections.

If all you want to do is test scalability, I'd guess that you'd be much less likely to see this problem if you added the nodes more gradually (or waited until after @petermattis and @spencerkimball make progress on their work to avoid performance dips while adding nodes).

@mberhault

yeah, the gossip spam is a bit much. part of it may be that the new nodes only have the first node in the join flag, so everything may just be crowding onto it.
If you don't need the cluster anymore (I saved the logs from the second round), I'll try more things.

@a-robinson

Feel free to start over as long as you save all the nodes' logs. I'll try to figure out why the cluster wasn't able to recover from the stale lease.

@mberhault

full logs from all nodes:
scaling-brokenness2.zip

@a-robinson

As for how the problem got started, this doesn't look good:

W161228 20:45:34.938146 1684745 kv/txn_coord_sender.go:773  [replicate,n20,s20,r628/7:/Table/51/1/4{341497…-768829…},@c42036d680] heartbeat to "change-replica" id=6865cdeb key=/Local/Range/"\xbb\x89\xfd\x06\x06h\xe5\xfadP\x1e\x12\x00\x01\x88"/RangeDescriptor rw=true pri=0.05586161 iso=SERIALIZABLE stat=PENDING epo=0 ts=1482957920.627389393,0 orig=1482957920.627389393,0 max=1482957920.877389393,0 wto=false rop=false failed: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 1 desc = context canceled
E161228 20:45:35.413792 1131 storage/queue.go:597  [raftlog,n20,s20,r628/7:/Table/51/1/4{341497…-768829…},@c42036d680] result is ambiguous (removing replica)

@mberhault

mberhault commented Dec 28, 2016 via email

@mberhault

the same node did it again with slower startup (30s between nodes). no stalls this time, but performance was below that of the 32-node cluster. It also got stuck on snapshots (the only thing left in the logs is complaints about too-large snapshots, so something got stuck before that). Dropped that VM, brought up another, and am performing the run again.

@mberhault

mberhault commented Dec 29, 2016

latest run without that node is happy. Will try more.

@a-robinson

Weird that the issue is with that specific node. I've pieced together more of what happened and why things haven't recovered. Hopefully I'll have some sort of proposal for a fix this afternoon.

Here's a bit more detail on how things went south:

Node 1 tried to take the lease but had outdated node liveness info about node 20 due to gossip network problems, so it failed to increment node 20's epoch (node 20 was the leaseholder at the time)

For a period of time, node 20 had trouble talking to the node liveness table (its prometheus graph suggests it thought all nodes had gone dead)

Node 20 failed to renew its node liveness, and some other node (presumably node 1) incremented its epoch:
W161228 20:45:33.467727 1687265 storage/node_liveness.go:217  [n20,s20,r628/7:/Table/51/1/4{341497…-768829…},@c42036d680] slow heartbeat took 9.2s
E161228 20:45:33.467738 1687265 storage/replica_range_lease.go:185  [n20,s20,r628/7:/Table/51/1/4{341497…-768829…},@c42036d680] heartbeat failed on epoch increment

Around the same time, node 1 proposed removing itself, presumably after taking the lease (plus a context cancellation):
I161228 20:45:33.208616 6894953 storage/replica.go:2364  [n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] proposing REMOVE_REPLICA {NodeID:1 StoreID:1 ReplicaID:1}: [{NodeID:34 StoreID:34 ReplicaID:8} {NodeID:6 StoreID:6 ReplicaID:4} {NodeID:20 StoreID:20 ReplicaID:7}]
W161228 20:45:33.219279 6899516 storage/replica.go:2081  [n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] context cancellation after 2.5s of attempting command PushTxn [/Local/Range/"\xbb\x89\xfd\x06\x06h\xe5\xfadP\x1e\x12\x00\x01\x88"/RangeDescriptor,/Min)

A couple seconds later, a peer node suggested node 1 remove itself (plus another context cancellation):
W161228 20:45:35.054068 6896349 storage/replica.go:2081  [n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] context cancellation after 7.7s of attempting command [txn: 6865cdeb], HeartbeatTxn [/Local/Range/"\xbb\x89\xfd\x06\x06h\xe5\xfadP\x1e\x12\x00\x01\x88"/RangeDescriptor,/Min)
I161228 20:45:35.386528 4707288 storage/store.go:3136  [n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] added to replica GC queue (peer suggestion)

As node 1 removed itself, it canceled a pending command from node 20:
E161228 20:45:35.413792 1131 storage/queue.go:597  [raftlog,n20,s20,r628/7:/Table/51/1/4{341497…-768829…},@c42036d680] result is ambiguous (removing replica)

A minute later, it failed to properly complete replica GC:
E161228 20:46:34.936921 116 storage/queue.go:597  [replicaGC,n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 4 desc = context deadline exceeded

@a-robinson

Learning that nodes can propose removing their own replica of a range that they hold the lease for has gotten me wondering something -- during the replica removal process, is there any explicit step where the replica transfers its lease away if it's the leaseholder? Such a step wouldn't have been needed under expiration-based leases, since the lease would just expire and another replica would grab it. It's clearly needed under epoch-based leases, since the lease remains valid as long as the node remains healthy. @spencerkimball

I'm still trying to figure out a good way for replicas to take a lease away from a live node that no longer holds a replica once we get into this state, but the question above is also worth digging into.

@petermattis

petermattis commented Dec 29, 2016

@a-robinson A node is not supposed to remove its own replica from a range. Instead, the node first transfers the lease to another node and that other node performs the replica removal. There could very well be bugs here, but that is the way it is supposed to work in theory. See the code in replicateQueue.process for the AllocatorRemove case.
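Roughly, the intended flow is something like this (a minimal sketch with made-up names, not the actual code in replicateQueue.process):

```go
// Illustrative sketch only: simplified stand-ins for the AllocatorRemove case
// in replicateQueue.process; the real logic (and names) live in
// storage/replicate_queue.go.
package sketch

type replica struct{ nodeID, storeID int }

type removalDecision struct {
	self         replica   // the replica whose replicate queue is running (the leaseholder)
	removeTarget replica   // the replica the allocator wants removed
	peers        []replica // other replicas, candidates for a lease transfer
}

// processRemove carries out the removal only if it doesn't target ourselves;
// otherwise it hands the lease away so another node can do the removal.
func processRemove(d removalDecision,
	transferLease func(to replica) error,
	removeReplica func(target replica) error) error {
	if d.removeTarget == d.self {
		// Never remove our own (leaseholder) replica: transfer the lease and
		// let the new leaseholder perform the removal on its next queue pass.
		return transferLease(d.peers[0])
	}
	return removeReplica(d.removeTarget)
}
```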

@a-robinson

Yeah, just found that. There must be some sort of race that's being hit here.

@petermattis

@a-robinson Coming back up to speed after the holiday break. Did you make any more progress on this? Perhaps we should add some assertions that we never try to remove the leaseholder for a range.

@a-robinson

Ok, I'm not positive, but I think that what happened is this:

  1. Node 20 held the lease initially, with the other replicas on nodes 1 and 6
  2. At 20:45:17.919860, n20 started the process of adding a new replica
  3. At 20:45:20.315365, n20 finished adding a new replica on node 34
  4. At 20:45:20.628063, n20 proposed removing the replica on node 1 (this is an educated guess -- the relevant log message just says it's changing replicas, not what the change is, but I can't imagine it was anything else)
  5. n20 gets bogged down with gossip thrashing and/or other load problems and nothing happens for a while
  6. At 20:45:33.208616, n1 (apparently now the leaseholder) processes the command to remove n1 (itself). Because the lease check only happens during replicate queue's process method and not on the node that actually executes the command, it carries out removing itself
  7. At 20:45:33.467727, n20 realizes that its node liveness heartbeat failed due to an epoch increment, saying its attempt took 9.2 seconds
  8. At 20:45:35.712303, n20's remove replica attempt came back as an ambiguous result error. Because the error makes us return from (*Replica).ChangeReplicas immediately, we didn't do the check for whether our view of the descriptor was accurate (originally motivated by #7224, "stability: replicaDescCache: clobbering {2 2 5} with {6 6 5}"). It's unlikely that check would have failed, but it's worth noting that we're not running that check for ambiguous commits.

I'm thinking that two reasonable steps to take (in addition to the bit of resiliency added by #12598) would be to (1) improve the ChangeReplicas log message to indicate that it's removing a replica and which replica it's removing, and (2) refuse to execute a raft configuration change that would remove the leaseholder that's proposing the change to raft (in defaultSubmitProposalLocked). I'll add these two little pieces onto #12598.
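For (2), a guard at proposal time could look roughly like this (a sketch with invented types; the actual check is part of #12598 and its shape in defaultSubmitProposalLocked differs):

```go
// Rough sketch of check (2) with invented types; the real guard would sit in
// defaultSubmitProposalLocked (see #12598) and look different in detail.
package sketch

import "errors"

type replicaDesc struct{ nodeID, storeID, replicaID int }

var errRemovesLeaseholder = errors.New(
	"refusing to propose a change that removes the proposing leaseholder")

// checkCanPropose rejects a replication change that would remove the replica
// that currently holds the lease and is proposing the change.
func checkCanPropose(removed []replicaDesc, leaseholder replicaDesc) error {
	for _, r := range removed {
		if r == leaseholder {
			return errRemovesLeaseholder
		}
	}
	return nil
}
```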

The part I don't really understand yet is how the two operations on the RangeDescriptor may have been interleaved. @andreimatei / @bdarnell, if you have thoughts on this I'd be interested to hear them. The only relevant entries I can find in the logs are outlined below.

Node 1 failed to push a transaction on the range descriptor at 20:45:33.219279 (which may have been the ChangeReplicas transaction?):
storage/replica.go:2081 [n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] context cancellation after 2.5s of attempting command PushTxn [/Local/Range/"\xbb\x89\xfd\x06\x06h\xe5\xfadP\x1e\x12\x00\x01\x88"/RangeDescriptor,/Min)

Node 1 also failed to heartbeat a transaction at 20:45:35.054068 (which may have been its own txn meant to take the lease?):
storage/replica.go:2081 [n1,s1,r628/1:/Table/51/1/4{341497…-768829…},@c4230f6900] context cancellation after 7.7s of attempting command [txn: 6865cdeb], HeartbeatTxn [/Local/Range/"\xbb\x89\xfd\x06\x06h\xe5\xfadP\x1e\x12\x00\x01\x88"/RangeDescriptor,/Min)

@tamird

tamird commented Jan 4, 2017 via email

@a-robinson

Thanks for bringing that up, I wasn't aware of that expectation. That may be central to what's going on here, since it doesn't appear to be true in this case, at least not at a high level.

Node 20 is the only replica that called ChangeReplicas via the replicate queue around the relevant time, and Node 1 is the replica that executed the REMOVE_REPLICA ChangeReplicasTrigger.

@bdarnell

bdarnell commented Jan 4, 2017

> I thought we already enforced that commands must be applied under the same lease under which they were proposed. Is that not the case?

There are multiple levels here: The replication queue checks that the current node holds the lease, then runs the allocator (which knows not to remove the lease holder) and calls ChangeReplicas. ChangeReplicas runs a transaction through the DistSender, which takes a nontrivial amount of time and tracks the lease holder as it moves (there are storage-level protections that commands don't execute if the lease changes while they're in raft, but that's irrelevant here - DistSender will just repropose the EndTransaction in that case). So the replication queue is running on node 20 and decides to remove the replica on node 1, but by the time the ChangeReplicas transaction finishes node 1 is now the lease holder.
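To illustrate the gap, here's a minimal sketch (invented types, not the real DistSender code) of a sender that follows "not lease holder" redirects, which is how the EndTransaction can still commit after the lease has moved away from the node that made the original decision:

```go
// Illustrative sketch only (not DistSender's actual code or types): a sender
// that keeps chasing the lease as it moves, which is why the ChangeReplicas
// EndTransaction can end up committing under a different leaseholder than the
// one the replicate queue checked.
package sketch

import "errors"

// notLeaseHolderError is a stand-in for the real error, which names the
// current leaseholder so the sender can redirect.
type notLeaseHolderError struct{ leaseHolder int }

func (e *notLeaseHolderError) Error() string { return "not lease holder" }

// sendFollowingLease retries a request, redirecting to whichever node the last
// error reported as the leaseholder.
func sendFollowingLease(send func(node int) error, firstGuess int) error {
	node := firstGuess
	for {
		err := send(node)
		if err == nil {
			return nil
		}
		var nlhe *notLeaseHolderError
		if errors.As(err, &nlhe) && nlhe.leaseHolder != 0 {
			node = nlhe.leaseHolder // follow the lease and retry
			continue
		}
		return err
	}
}
```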

> (2) refuse to execute a raft configuration change that would remove the leaseholder that's proposing the change to raft (in defaultSubmitProposalLocked)

Alternately, we could execute the command, but give the EndTransaction a side effect of voiding the lease held by the removed node. It's not the most graceful way of handling the situation (since it opens up a race for the next node to grab the lease), but it's self-contained and easy to implement.

It looks like somewhere between 32 and 64 nodes we hit some sort of critical threshold in the gossip system. The formula in Gossip.maxPeers grows very slowly (it returns minPeers (which is 3) for all clusters up to 82 nodes); maybe we need to allow more slack to avoid trouble when the cluster is growing (or maybe the formula is just wrong; there's no comment explaining why the formula is what it is).
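To make "grows very slowly" concrete, here's a back-of-the-envelope sketch; it assumes the formula is roughly the fourth root of the node count floored at minPeers = 3, which matches the numbers above but isn't necessarily the exact code:

```go
// Back-of-the-envelope sketch of how slowly a root-based maxPeers formula
// grows. This assumes maxPeers ≈ ceil(nodeCount^(1/4)) with a floor of
// minPeers = 3; the real constants are in gossip/gossip.go and may differ.
package main

import (
	"fmt"
	"math"
)

const minPeers = 3

func maxPeersApprox(nodeCount int) int {
	p := int(math.Ceil(math.Pow(float64(nodeCount), 0.25)))
	if p < minPeers {
		return minPeers
	}
	return p
}

func main() {
	// Under these assumptions a 64-node cluster still allows only 3 gossip
	// peers per node, which leaves little slack while the cluster is growing.
	for _, n := range []int{16, 32, 64, 128, 256} {
		fmt.Printf("nodes=%3d -> maxPeers≈%d\n", n, maxPeersApprox(n))
	}
}
```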

@petermattis

Nice analysis, @a-robinson.

> Alternately, we could execute the command, but give the EndTransaction a side effect of voiding the lease held by the removed node. It's not the most graceful way of handling the situation (since it opens up a race for the next node to grab the lease), but it's self-contained and easy to implement.

What race are you worried about here? My worry about this approach is that it gives a non-leaseholder replica an ability to revoke a lease (by removing the leaseholder replica). Doesn't that open a window where the removed replica still thinks it is the leaseholder and services reads even though it is no longer a member of the range?

@bdarnell

bdarnell commented Jan 4, 2017

> What race are you worried about here?

Poor choice of words - it's a (small) thundering herd instead of a race, since the lease becomes unassigned and all the remaining replicas can try to grab it at once.

> Doesn't that open a window where the removed replica still thinks it is the leaseholder and services reads even though it is no longer a member of the range?

Hmm, good point. I don't see any way to remove the leaseholder safely so we'll just have to prevent this from happening.

@a-robinson

While #12598 should fix this (testing underway on sky), it'd still be nice to understand what's going on with the replica descriptors here. If no one has any thoughts, I'll probably spend some more time looking at the code on Friday.

@andreimatei

Excellent investigation, Alex. Is there a mystery left? Isn't your explanation satisfying enough?

  1. n20 initiated a CHANGE_REPLICAS removing n1 (but didn't propose the EndTransaction right away)
  2. n20 heartbeat went stale
  3. n1 gets the lease
  4. n20 continues proposing the CHANGE_REPLICAS and everybody applies it. The fact that it gets an AmbiguousResultError means that the proposal was sitting in some queue somewhere and it got re-proposed.

As I said on #12598, I think a better solution than a custom check in defaultSubmitProposalLocked would be for the replication queue to specify the lease under which it wants the CHANGE_REPLICAS applied.
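Concretely, that could look something like this (a sketch with invented names, not an actual patch): the queue records the lease it observed, attaches it to the command, and the command is rejected at evaluation time if the lease has changed in the meantime:

```go
// Illustrative sketch (invented names) of pinning a CHANGE_REPLICAS to the
// lease the replicate queue observed: the command carries the expected lease
// and is rejected if the lease has changed in the meantime.
package sketch

import "fmt"

type lease struct {
	replicaID int // replica holding the lease
	epoch     int // node-liveness epoch backing the lease
}

type changeReplicasCmd struct {
	expectedLease lease // lease under which the queue decided on the change
}

// checkCommandLease runs where the command is evaluated; it fails the command
// rather than letting it execute under a different lease.
func checkCommandLease(cmd changeReplicasCmd, current lease) error {
	if cmd.expectedLease != current {
		return fmt.Errorf("lease changed since the replication change was proposed: expected %+v, now %+v",
			cmd.expectedLease, current)
	}
	return nil
}
```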

@bdarnell

bdarnell commented Jan 5, 2017

I think we've got a pretty complete explanation for what's going on with the range descriptors. What we don't yet understand is why nodes are losing their liveness status during this phase of growth (probably related to the same gossip thrashing as in #9819).

@a-robinson

Yeah, sorry, the explanation covers everything. I was still confused because I had been thinking that the lease information was stored as part of the RangeDescriptor, which led me to not understand how the CPut in ChangeReplicas's transaction was succeeding when interleaved with the lease change. It turns out that leases aren't stored in the RangeDescriptor, so there's no real reason the CPut wouldn't fail.

@a-robinson

That should read "there's no real reason the CPut would fail."

@a-robinson

Alright, the test yesterday checked out. Performance scaled admirably by 2x each time the number of nodes doubled, and the one time a leaseholder was asked to remove itself, it politely declined.

We should still wrap up the post-merge discussion on #12598, and probably give this a proper error type so that it doesn't get treated as an ambiguous error by the client, but I think we can consider this issue closed.

If anyone wants access to the dump of logs from the test, let me know. Right now they're just sitting on my laptop.
