storage: Lease owned by replica that no longer exists #15003
cc @RaduBerinde for triage |
It is probably due to the interaction between a queue and the scatter operation. I should set the cluster up in |
See #13469 and the review thread it links to. We can't force this node to give up its lease when it gets into this state. We need to prevent non-members from acquiring the lease in the first place (and/or not allow the current leaseholder to be removed from the range), so I think that means more synchronization between the scatter code and the replication queue. |
@bdarnell can you expand on what synchronization needs to be added? I wouldn't know where to start |
I can envision having a mechanism to serialize with queue operations on that range. But the problem is that we are moving the lease as part of relocation, and after we do that another host's queues have control over what happens to that range (right?). I don't have ideas on how to prevent that, other than maybe breaking up the relocation operation into multiple operations, and having the remove-replica step go to the new lease holder. |
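To make that ordering concrete, here is a minimal sketch (in Go, with made-up types and a hypothetical `relocateAwayFrom` helper, not the actual cockroach code) of the idea of splitting relocation into steps: the lease moves first, and only then is the old replica removed, by the new leaseholder rather than by the replica about to be removed:

```go
// Sketch only: illustrates the proposed ordering, not the real relocation code.
package main

import "fmt"

type replica struct{ nodeID int }

type rangeState struct {
	leaseholder replica
	replicas    []replica
}

// relocateAwayFrom moves the lease off `old` before any replica change, so the
// replica that performs the removal always holds a valid lease.
func relocateAwayFrom(r *rangeState, old, target replica) {
	// Step 1: transfer the lease; after this, the target's queues own the range.
	r.leaseholder = target
	// Step 2: the removal is carried out by (or routed to) the new leaseholder,
	// never by the replica that is being removed.
	kept := r.replicas[:0]
	for _, rep := range r.replicas {
		if rep != old {
			kept = append(kept, rep)
		}
	}
	r.replicas = kept
}

func main() {
	r := &rangeState{leaseholder: replica{1}, replicas: []replica{{1}, {2}, {3}}}
	relocateAwayFrom(r, replica{1}, replica{2})
	fmt.Printf("leaseholder=n%d replicas=%v\n", r.leaseholder.nodeID, r.replicas)
}
```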
The first question to answer here is whether this can happen even without scatter. The main protection here is that in the absence of the scatter command, the replication queue is normally the only thing that performs rebalances or lease relocations. However, it seems possible that there could be two replication queues active on a range (right after a leadership change) or a node grabbing the lease after a failover at the same time that a replication change is in progress. If it is possible to get into this state even without SCATTER, then we need to do some sort of checking in the storage layer, disallowing either lease transfers or replica changes when the change would result in an invalid lease holder. If the problem is unique to SCATTER, then the necessary synchronization might be between SCATTER and the replication queue to make sure they are not taking action at the same time. But my guess is that what we need is the former. |
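For illustration, here is a minimal sketch of the kind of storage-layer check described above: refuse any lease transfer or acquisition whose target replica is no longer in the range descriptor. The types and the `checkCanReceiveLease` function are simplified stand-ins, not the real roachpb/storage API:

```go
// Sketch only: a membership check before applying a lease, under simplified types.
package main

import (
	"errors"
	"fmt"
)

type ReplicaDescriptor struct {
	NodeID    int
	StoreID   int
	ReplicaID int
}

type RangeDescriptor struct {
	Replicas []ReplicaDescriptor
}

type Lease struct {
	Replica ReplicaDescriptor
}

// checkCanReceiveLease rejects a lease whose target replica is not listed in
// the current range descriptor, which is exactly the invariant violated here.
func checkCanReceiveLease(desc RangeDescriptor, next Lease) error {
	for _, r := range desc.Replicas {
		if r.ReplicaID == next.Replica.ReplicaID && r.StoreID == next.Replica.StoreID {
			return nil
		}
	}
	return errors.New("replica not in range descriptor; cannot hold lease")
}

func main() {
	desc := RangeDescriptor{Replicas: []ReplicaDescriptor{{NodeID: 2, StoreID: 2, ReplicaID: 2}}}
	// Replica 1 has been removed from the range; a lease naming it must be refused.
	err := checkCanReceiveLease(desc, Lease{Replica: ReplicaDescriptor{NodeID: 1, StoreID: 1, ReplicaID: 1}})
	fmt.Println(err)
}
```

Guarding at this layer catches the bad state regardless of whether the race comes from SCATTER, duplicate replication queues, or a lease acquired after a failover.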
@spencerkimball, make any progress on this yet? I'm freeing up and this seems pretty up my alley, so I'm going to assume you haven't started yet and take over. |
Sounds good. I didn't have a chance to investigate this. |
This just happened on one of my runs:
Is this surprising? |
It's tough to say given that you deleted your branch after merging it. If your branch was on top of something after 05acf1e, then it's very surprising. If not, then it's expected. |
Oops, I don't know. My branch might have been before that commit at the point of that failed test. Sorry for the noise.
FWIW, I plan on looking at Ben's PR (post-factum).
|
This error happened again on
This could be #15385. Prior to the crash, there was a lot of odd gossip behavior in the logs. Node 2 is continually trying to connect to itself, and this 6-node cluster has apparently built up a gossip network with a 44-hop path in it.
|
The node is trying to connect to itself, but it doesn't recognize the address as its own. This is strange. |
Something happened to gossip on The most notable thing to get merged that day was my change to enable propEvalKV, although propEvalKV was enabled on cyan before that merge. I'm not sure about the exact timing of that rollout. |
This was also the day that cyan was switched from stop-the-world to rolling restarts. This has a more plausible effect on gossip than anything else that happened around that time. The rolling restarts lead to a less-compact gossip graph and it starts thrashing trying to balance everything out. This may not be related to the lease errors seen here (I don't see why it would be, other than general availability problems that might be caused by gossip and making the race more likely), so let's move gossip-related topics to #9819 unless there's evidence that it's connected to the lease error. |
The gossip behavior is really strange, and I wouldn't be shocked if it did have something to do with the other issues. I'm looking into it now, and have these preliminary observations:
|
Well, the info key that was so many hops away was the system DB key, so that one mystery is solved. I'll move all follow-ups to #9819, even though this is a bit different from the standard gossip thrashing that issue was originally meant to cover. |
Ok, I have the gossip stuff figured out (#9819 (comment)), but I don't see how it would explain the leaseholder "no longer exists" crash, so that may be something else. |
Since the only instance of this we've seen was accompanied by weird gossip behavior (now fixed), I'm pushing this to 1.1. |
Looks like it might be fixed by #15754, though I can't prove it. I'll try repro'ing this on the SCATTER test. |
That's what I had been hoping, although I don't know whether the Scatter test will be particularly good at repro-ing it, since Scatter uses |
Hmm, that's a good point. |
Closing since we think that #15754 fixed this. |
Fingers crossed! :)
|
Same time on
|
Not a proof, but the telltale signs are there: n2 seems to be running some rebalances that time out (but ambiguously, so they may have committed), and then the lease panic happens. So it seems likely that n2 had just rebalanced the range off of n1 and n1 managed to grab the lease. |
Just to second @tschottdorf, it does look like the same issue that #15754 fixed. |
The following tests appear to have failed:
#226401:
Please assign, take a look and update the issue accordingly.