stability: leader-less range stalls entire cluster #12591
Comments
maybe @petermattis and @bdarnell? not the best week for debugging though |
including log files for all nodes: |
@a-robinson, you probably know a decent amount of node liveness stuff, this seems to be in that chunk of code. |
I'll look around at things. Is the cluster still in this state? |
nope, I took it down to resume my scalability testing. looking at the logs, there's lots of spam about slow heartbeats (eg, look at 35.stderr in the logs zip). |
reran the same scenario but with 30 minutes between node additions.
Node cockroach-sky-0020.eastus2.cloudapp.azure.com is the only one exporting a non-zero |
The lease on range 628 looks corrupted. All the nodes I've checked say the lease is assigned to n1,s1,r628/1 even though n1,s1 doesn't have a replica of that range. |
something odd I noticed on both runs (failing around 17:10 and 20:45 respectively):
21.stderr:W161228 17:10:25.345304 3968825 storage/intent_resolver.go:338 [n22,s22,r1/9:/{Min-System/tsd/c…}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: range 614: replica {63 63 15} not lease holder; lease holder unknown
21.stderr:W161228 17:11:10.165271 4051672 storage/intent_resolver.go:334 [n22,s22,r1/9:/{Min-System/tsd/c…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: range 1: replica {22 22 9} not lease holder; lease holder unknown
21.stderr:W161228 20:45:38.550323 1693693 storage/intent_resolver.go:338 [n22,s22,r1/7:/{Min-System/tsd/c…}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: range 628: replica {20 20 7} not lease holder; lease holder unknown
21.stderr:W161228 20:45:54.333822 1717012 storage/intent_resolver.go:334 [n22,s22,r1/7:/{Min-System/tsd/c…}]: failed to resolve intents: failed to send RPC: sending to all 3 replicas failed; last error: range 1: replica {22 22 7} not lease holder; lease holder unknown
these are both about range 1, which explains why the whole cluster is wedged as opposed to just a single range. |
It could be an instance of our range descriptor cache problems: #10751
|
Yeah. Is this an example of our range descriptor cache problems from https://forum.cockroachlabs.com/t/rangedescriptorcache-and-uncertainty/393 / #10751? |
ugh. odd that this happens so consistently. all I want is to run a 64 node cluster |
Although there's also the fair question of why none of the nodes (particularly the raft leader, node 20 in this case) is able to take the lease. |
could it be because that's the one with the odd r1 warnings above? the only one btw. |
anyway, the latest run is still up, the one with n20,r628 leader-not-leaseholder |
There were a ton of "context deadline exceeded" errors trying to resolve intents. Performance also may have been influenced by the gossip network having problems -- new nodes kept getting bounced back and forth between nodes with no space for incoming connections. We may need better logic around when to refuse gossip connections. On node 22 (the range 1 leaseholder), 2186 out of 19907 total log lines are about refusing incoming gossip connections.
If all you want to do is test scalability, I'd guess that you'd be much less likely to see this problem if you added the nodes more gradually (or waited until after @petermattis and @spencerkimball make progress on their work to avoid performance dips while adding nodes). |
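For readers following along, here's a minimal sketch (in Go, with invented names like gossipServer and maxIncoming -- not the actual CockroachDB gossip code) of the kind of connection-refusal behavior being described: a node with full incoming slots bounces new peers, so when every node joins through the same first node, later nodes get refused repeatedly.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// gossipServer is a hypothetical, simplified stand-in for a gossip node that
// caps the number of incoming connections it will accept.
type gossipServer struct {
	mu          sync.Mutex
	incoming    map[string]bool // node IDs of currently connected peers
	maxIncoming int             // capacity; exceeding it triggers a refusal
}

var errRefused = errors.New("gossip: too many incoming connections, try another node")

// connect either admits the peer or refuses it, mirroring the
// "refusing incoming gossip connection" warnings seen in the logs.
func (s *gossipServer) connect(peerID string) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.incoming) >= s.maxIncoming {
		return errRefused // refused peers bounce to another node and retry
	}
	s.incoming[peerID] = true
	return nil
}

func main() {
	// With every new node pointed at the same --join target, the first node's
	// incoming slots fill immediately and later nodes get bounced repeatedly.
	s := &gossipServer{incoming: map[string]bool{}, maxIncoming: 3}
	for i := 1; i <= 6; i++ {
		err := s.connect(fmt.Sprintf("n%d", i))
		fmt.Printf("n%d -> %v\n", i, err)
	}
}
```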
yeah, the gossip spam is a bit much. part of it may be that they only have the first node in the join flag, so this may just crowd everything. |
Feel free to start over as long as you save all the nodes' logs. I'll try to figure out why the cluster wasn't able to recover from the stale lease. |
full logs from all nodes: |
As for how the problem got started, this doesn't look good:
W161228 20:45:34.938146 1684745 kv/txn_coord_sender.go:773 ***@***.*** heartbeat to "change-replica" id=6865cdeb key=/Local/Range/"\xbb\x89\xfd\x06\x06h\xe5\xfadP\x1e\x12\x00\x01\x88"/RangeDescriptor rw=true pri=0.05586161 iso=SERIALIZABLE stat=PENDING epo=0 ts=1482957920.627389393,0 orig=1482957920.627389393,0 max=1482957920.877389393,0 wto=false rop=false failed: failed to send RPC: sending to all 3 replicas failed; last error: rpc error: code = 1 desc = context canceled
E161228 20:45:35.413792 1131 storage/queue.go:597 ***@***.*** result is ambiguous (removing replica) |
yeah, we have quite a few of those. I'm wondering if our rebalance logic behaves strangely when the snapshot works but the replica removal doesn't.
|
the same node did it again with slower startup (30s between nodes). no stalls this time, but performance was below that of the 32-node cluster. Also stuck on snapshots (the only thing left in the logs is about too-large snapshots, so something got stuck before that). Dropped that VM, brought up another, and am performing the run again. |
latest run without that node is happy. Will try more. |
Weird that the issue is with that specific node. I've pieced together more of what happened and why things haven't recovered. Hopefully I'll have some sort of proposal for a fix this afternoon. Here's a bit more detail on how things went south:
|
Learning that nodes can propose removing their own replica of a range that they hold the lease for has gotten me wondering something -- during the replica removal process, is there any explicit step where the replica transfers its lease away if it's the leaseholder? Such a step wouldn't have been needed under expiration-based leases, since the lease would just expire and another replica would grab it. It's clearly needed under epoch-based leases, since the lease remains valid as long as the node remains healthy. @spencerkimball I'm still trying to figure out a good way for replicas to be able to take a lease away from a live node that no longer holds a replica, for cases where we do get into this state, but the above question is also worth digging into. |
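To illustrate the distinction being drawn here, a toy sketch of the two lease models (types and fields are invented for illustration; this is not CockroachDB's lease code): an expiration-based lease frees itself when the clock passes its expiration, while an epoch-based lease stays valid for as long as the holder's liveness epoch matches and the node keeps heartbeating -- even if that node no longer has a replica of the range.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified lease record for illustration only.
type lease struct {
	holder     string
	expiration time.Time // used by expiration-based leases
	epoch      int64     // used by epoch-based leases
}

// liveness is a stand-in for the node liveness record gossiped by each node.
type liveness struct {
	epoch int64
	live  bool
}

// expirationLeaseValid: the lease simply lapses once the clock passes its
// expiration, so an abandoned lease frees itself without any explicit step.
func expirationLeaseValid(l lease, now time.Time) bool {
	return now.Before(l.expiration)
}

// epochLeaseValid: the lease stays valid for as long as the holder's liveness
// epoch matches and the node keeps heartbeating, even if the holder no longer
// has a replica of the range -- which is the stuck state described above.
func epochLeaseValid(l lease, holderLiveness liveness) bool {
	return holderLiveness.live && holderLiveness.epoch == l.epoch
}

func main() {
	now := time.Now()
	exp := lease{holder: "n1", expiration: now.Add(-time.Second)}
	epo := lease{holder: "n1", epoch: 3}
	fmt.Println(expirationLeaseValid(exp, now))          // false: it lapsed on its own
	fmt.Println(epochLeaseValid(epo, liveness{3, true})) // true: stays valid while n1 is live
}
```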
@a-robinson A node is not supposed to remove its own replica from a range. Instead, the node first transfers the lease to another node and that other node performs the replica removal. There could very well be bugs here, but that is the way it is supposed to work in theory. See the code in |
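A rough sketch of the intended ordering described above, with hypothetical types and function names (the real logic lives in the replicate queue): if the replica picked for removal is the current leaseholder, transfer the lease first and let the new leaseholder carry out the removal on a later pass.

```go
// All types and functions here are invented stand-ins, not the real
// replicate-queue code.
package main

import (
	"errors"
	"fmt"
)

type replicaID int

type rangeState struct {
	leaseholder replicaID
	replicas    map[replicaID]bool
}

// removeReplica enforces the invariant from the comment above: a leaseholder
// never removes itself. If the replica chosen for removal currently holds the
// lease, transfer the lease and let the new leaseholder retry the removal.
func removeReplica(r *rangeState, target, transferTo replicaID) error {
	if target == r.leaseholder {
		if !r.replicas[transferTo] {
			return errors.New("no replica available to take the lease")
		}
		r.leaseholder = transferTo
		return fmt.Errorf("transferred lease to r%d; retry removal from the new leaseholder", transferTo)
	}
	delete(r.replicas, target)
	return nil
}

func main() {
	r := &rangeState{leaseholder: 1, replicas: map[replicaID]bool{1: true, 2: true, 3: true}}
	fmt.Println(removeReplica(r, 1, 2)) // first pass: transfer instead of removing
	fmt.Println(removeReplica(r, 1, 2)) // second pass: now safe to remove replica 1
}
```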
Yeah, just found that. There must be some sort of race that's being hit here. |
@a-robinson Coming back up to speed after the holiday break. Did you make any more progress on this? Perhaps we should add some assertions that we never try to remove the leaseholder for a range. |
Ok, I'm not positive, but I think that what happened is this:
I'm thinking that two reasonable steps to take (in addition to the bit of resiliency added by #12598) would be to (1) improve the ChangeReplicas log message to indicate that it's removing a replica and which replica it's removing, and (2) refuse to execute a raft configuration change that would remove the leaseholder that's proposing the change to raft. The part I don't really understand yet is how the two operations on the RangeDescriptor may have been interleaved. @andreimatei / @bdarnell, if you have thoughts on this I'd be interested to hear them. The only relevant entries I can find in the logs are outlined below. Node 1 failed to push a transaction on the range descriptor at 20:45:33.219279 (which may have been the …). Node 1 also failed to heartbeat a transaction at 20:45:35.054068 (which may have been its own txn meant to take the lease?): |
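A minimal sketch of what check (2) might look like, using invented names rather than the real command-application code: reject any replica-change that would remove the leaseholder proposing it.

```go
package main

import (
	"errors"
	"fmt"
)

type changeType int

const (
	addReplica changeType = iota
	removeReplica
)

type replicaChange struct {
	typ    changeType
	target int // store/replica being added or removed; simplified to an int ID
}

// checkCanApplyChange is a hypothetical guard corresponding to proposal (2):
// a leaseholder must refuse to propose a configuration change that removes
// itself, since doing so strands an epoch-based lease on a node that no
// longer has a replica of the range.
func checkCanApplyChange(c replicaChange, proposerIsLeaseholder bool, self int) error {
	if c.typ == removeReplica && c.target == self && proposerIsLeaseholder {
		return errors.New("refusing to remove the leaseholder; transfer the lease first")
	}
	return nil
}

func main() {
	// n1 holds the lease and is asked to remove itself: rejected.
	fmt.Println(checkCanApplyChange(replicaChange{removeReplica, 1}, true, 1))
	// Removing some other replica is fine.
	fmt.Println(checkCanApplyChange(replicaChange{removeReplica, 3}, true, 1))
}
```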
On Tue, Jan 3, 2017 at 9:32 PM, Alex Robinson ***@***.***> wrote:
- At 20:45:33.208616, n1 (apparently now the leaseholder) processes the command to remove n1 (itself). Because the lease check only happens during the replicate queue's process method and not on the node that actually executes the command, it carries out removing itself.
I thought we already enforced that commands must be applied under the same lease under which they were proposed. Is that not the case? |
Thanks for bringing that up, I wasn't aware of that expectation. That may be central to what's going on here, since it doesn't appear to be true in this case, at least not at a high level. Node 20 is the only replica that called |
There are multiple levels here: The replication queue checks that the current node holds the lease, then runs the allocator (which knows not to remove the lease holder) and calls ChangeReplicas. ChangeReplicas runs a transaction through the DistSender, which takes a nontrivial amount of time and tracks the lease holder as it moves (there are storage-level protections that commands don't execute if the lease changes while they're in raft, but that's irrelevant here - DistSender will just repropose the EndTransaction in that case). So the replication queue is running on node 20 and decides to remove the replica on node 1, but by the time the ChangeReplicas transaction finishes node 1 is now the lease holder.
Alternatively, we could execute the command but give the EndTransaction a side effect of voiding the lease held by the removed node. It's not the most graceful way of handling the situation (since it opens up a race for the next node to grab the lease), but it's self-contained and easy to implement.
It looks like somewhere between 32 and 64 nodes we hit some sort of critical threshold in the gossip system. The formula in |
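To make the window described above concrete, here's a toy timeline of the check-then-act race (all names invented; the real flow goes through the replicate queue, allocator, and a DistSender transaction): the lease check passes at proposal time, the lease moves while the ChangeReplicas transaction is in flight, and nothing re-checks the decision before it commits.

```go
package main

import "fmt"

// Toy model of the race: the lease check happens at proposal time on n20, but
// the ChangeReplicas transaction takes long enough that the lease can move to
// the removal target (n1) before the transaction commits.
type rangeState struct {
	leaseholder string
}

func main() {
	r := &rangeState{leaseholder: "n20"}

	// t0: the replicate queue on n20 checks that it holds the lease, then the
	// allocator picks n1 for removal (n1 is not the leaseholder at this point).
	fmt.Println("t0: proposer holds lease?", r.leaseholder == "n20")
	removalTarget := "n1"

	// t1: while the ChangeReplicas transaction is still in flight, the lease
	// moves to n1 (e.g. via an epoch-based lease acquisition).
	r.leaseholder = "n1"

	// t2: the transaction commits. Nothing re-checks the decision against the
	// current lease, so the range ends up removing its own leaseholder.
	fmt.Printf("t2: removing %s while leaseholder is %s\n", removalTarget, r.leaseholder)
	fmt.Println("removed the leaseholder?", removalTarget == r.leaseholder)
}
```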
Nice analysis, @a-robinson.
What race are you worried about here? My worry about this approach is that it gives a non-leaseholder replica an ability to revoke a lease (by removing the leaseholder replica). Doesn't that open a window where the removed replica still thinks it is the leaseholder and services reads even though it is no longer a member of the range? |
Poor choice of words - it's a (small) thundering herd instead of a race, since the lease becomes unassigned and all the remaining replicas can try to grab it at once.
Hmm, good point. I don't see any way to remove the leaseholder safely so we'll just have to prevent this from happening. |
While #12598 should fix this (testing underway on |
Excellent investigation, Alex. Is there a mystery left? Isn't your explanation satisfying enough?
As I said on #12598, I think a better solution than a custom check in |
I think we've got a pretty complete explanation for what's going on with the range descriptors. What we don't yet understand is why nodes are losing their liveness status during this phase of growth (probably related to the same gossip thrashing as in #9819). |
Yeah, sorry, the explanation covers everything. I was still confused because I had been thinking that the lease information was stored as part of the RangeDescriptor, which led me to not understand how the CPut in |
That should read "there's no real reason the CPut would fail." |
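For anyone who shared that mental model, a generic sketch of a conditional put guarding a descriptor update (an invented in-memory KV, not the actual ChangeReplicas transaction): since lease state isn't part of the descriptor value, a lease transfer alone gives the CPut nothing to fail on.

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal in-memory KV with a conditional put, standing in for the real store.
type kv struct {
	data map[string]string
}

var errConditionFailed = errors.New("cput: existing value does not match expected value")

// cput writes newVal only if the current value equals expVal, which is how a
// replica-change transaction guards against a concurrent descriptor update.
func (s *kv) cput(key, expVal, newVal string) error {
	if s.data[key] != expVal {
		return errConditionFailed
	}
	s.data[key] = newVal
	return nil
}

func main() {
	s := &kv{data: map[string]string{
		// The descriptor records the replica set; the lease is stored separately,
		// so a lease transfer does not change this value.
		"/Local/Range/.../RangeDescriptor": "replicas:{n1,n20,n22}",
	}}

	// Removing n1 succeeds because the descriptor still matches what the
	// transaction read, regardless of who holds the lease right now.
	err := s.cput("/Local/Range/.../RangeDescriptor",
		"replicas:{n1,n20,n22}", "replicas:{n20,n22}")
	fmt.Println(err) // <nil>: a lease change alone gives this no reason to fail
}
```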
Alright, the test yesterday checked out. Performance scaled admirably by 2x each time the number of nodes doubled, and the one time a leaseholder was asked to remove itself, it politely declined. We should still wrap up the post-merge discussion on #12598, and probably give this a proper error type so that it doesn't get treated as an ambiguous error by the client, but I think we can consider this issue closed. If anyone wants access to the dump of logs from the test, let me know. Right now they're just sitting on my laptop. |
sha: 789f749
While testing linear scalability for https://github.com/cockroachlabs/production/issues/175, the cluster stalled shortly after going from 32 to 64 nodes:
https://monitoring.gce.cockroachdb.com/dashboard/db/cockroach-sql?from=1482932037968&to=1482946437968&var-cluster=sky&var-node=All&var-rate_interval=1m
sql requests show:
all mentions of r614 on all logs:
range-614.txt
errors only: trouble started around 17:10. Nodes cockroach-sky-{33-64} were done being started around then.