storage: Possible race between RequestLease and ChangeReplicas #15385
Comments
Since the only instance of this we've seen was accompanied by weird gossip behavior (now fixed), I'm pushing this to 1.1.
@bdarnell, work-stealing this from you.
I wonder what would happen when a replica winds up removing itself from the range. Unless I messed up, this implies having to make sure that a lease holder cannot be removed.
I thought we disallowed removing the leaseholder replica. I know for certain
What we currently have is this pretty hacky check in the proposal code path:
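A minimal sketch of what such a proposal-path guard could look like, using simplified stand-in types and hypothetical names rather than the actual cockroach code: refuse to propose a lease whose target replica is no longer in the range descriptor.

```go
package main

import (
	"errors"
	"fmt"
)

// Simplified stand-ins for the roachpb descriptor and lease types.
type ReplicaDescriptor struct{ ReplicaID int }

type RangeDescriptor struct{ Replicas []ReplicaDescriptor }

func (d RangeDescriptor) Contains(id int) bool {
	for _, r := range d.Replicas {
		if r.ReplicaID == id {
			return true
		}
	}
	return false
}

type Lease struct{ Replica ReplicaDescriptor }

// checkLeaseTarget is a hypothetical proposal-time guard: don't even propose a
// RequestLease/TransferLease whose target is not part of the descriptor.
func checkLeaseTarget(desc RangeDescriptor, lease Lease) error {
	if !desc.Contains(lease.Replica.ReplicaID) {
		return errors.New("lease target is not a member of the range")
	}
	return nil
}

func main() {
	desc := RangeDescriptor{Replicas: []ReplicaDescriptor{{1}, {2}, {3}}}
	fmt.Println(checkLeaseTarget(desc, Lease{Replica: ReplicaDescriptor{4}})) // rejected
	fmt.Println(checkLeaseTarget(desc, Lease{Replica: ReplicaDescriptor{2}})) // ok (nil)
}
```

Being above raft, a guard like this can only reduce the window for the race; it cannot close it, which is why the discussion below moves to a below-raft check.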
(but you're very right about the risk of a non-replica holding the lease -- node liveness will keep the lease alive forever and it has been the source of a few different headaches while you were gone)
@petermattis @a-robinson sorry, should've been a little more clear - I'm aware of both the proposal code and the replicate queue trying not to have this happen, but I didn't see us explicitly preventing it anywhere. But now I see (or at least think) that the
Since a proposal can only commit under the lease that proposed it, and since only a replica holding the lease proposes under that lease, we should be good. But the whole thing is pretty subtle. I'll make sure to add some commentary.
As for the rest, I agree with @bdarnell's suggestion.
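To make the invariant above concrete, here is a hedged sketch (hypothetical names and a lease sequence number used purely for illustration, not the real below-raft code): a command is rejected at apply time unless the lease it was proposed under is still the lease in effect.

```go
package main

import (
	"errors"
	"fmt"
)

// Lease is a stand-in identified here by a sequence number for illustration.
type Lease struct{ Sequence int }

// ProposedCommand carries the lease its proposer believed it held
// (a ProposerLease-like field).
type ProposedCommand struct{ ProposerLease Lease }

var errLeaseMismatch = errors.New("proposed under stale lease; rejecting below raft")

// applyCheck sketches the invariant: a command commits (applies) only if the
// lease it was proposed under is still the lease in effect at apply time.
// Because only the replica holding the lease proposes under that lease, any
// command that passes this check was proposed by the current leaseholder.
func applyCheck(cmd ProposedCommand, applied Lease) error {
	if cmd.ProposerLease.Sequence != applied.Sequence {
		return errLeaseMismatch
	}
	return nil
}

func main() {
	current := Lease{Sequence: 7}
	fmt.Println(applyCheck(ProposedCommand{ProposerLease: Lease{Sequence: 7}}, current)) // nil
	fmt.Println(applyCheck(ProposedCommand{ProposerLease: Lease{Sequence: 6}}, current)) // rejected
}
```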
Make sure that a lease with an intended lease holder that has since been removed catches a forcedError. See cockroachdb#15385.
Closed in #15754.
Hi @nvanbenschoten, please add branch-* labels to identify which branch(es) this C-bug affects. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf. |
The fix for this issue was to check below raft whether the leaseholder replica is still part of the current range descriptor: see cockroach/pkg/kv/kvserver/kvserverbase/forced_error.go, lines 189 to 207 at e59f52b.
Since that fix was introduced, we've added additional constraints on who in a range can hold a lease. Notably, learner replicas cannot hold the lease. It would appear that these additional constraints did not make their way to the below-raft check back when we began using learner replicas in v19.2.0, permitting a new variant of this race where a RequestLease lands after the target leaseholder has been demoted to a learner. To fix this, we need to replace the below-raft membership check with one that also enforces these lease-eligibility constraints.
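A hedged sketch of what the extended below-raft check could look like, with simplified types and made-up names (the real check lives in kvserverbase/forced_error.go and uses the roachpb types): instead of only asking whether the lease target is still a member of the range, also ask whether its replica type is allowed to hold the lease, so a target that was demoted to a learner is rejected as well.

```go
package main

import "fmt"

// Simplified stand-ins for the roachpb descriptor and lease types.
type ReplicaType int

const (
	VoterFull ReplicaType = iota
	Learner
)

type ReplicaDescriptor struct {
	ReplicaID int
	Type      ReplicaType
}

type RangeDescriptor struct{ Replicas []ReplicaDescriptor }

func (d RangeDescriptor) Get(id int) (ReplicaDescriptor, bool) {
	for _, r := range d.Replicas {
		if r.ReplicaID == id {
			return r, true
		}
	}
	return ReplicaDescriptor{}, false
}

type Lease struct{ Replica ReplicaDescriptor }

// checkCanReceiveLease sketches the stricter below-raft guard: the target must
// still be in the descriptor AND be of a type that may hold the lease. A
// target demoted to a learner after proposing RequestLease is caught here
// rather than slipping through a membership-only check.
func checkCanReceiveLease(desc RangeDescriptor, requested Lease) error {
	repl, ok := desc.Get(requested.Replica.ReplicaID)
	if !ok {
		return fmt.Errorf("replica %d not in descriptor", requested.Replica.ReplicaID)
	}
	if repl.Type != VoterFull {
		return fmt.Errorf("replica %d is not a full voter and may not hold the lease", repl.ReplicaID)
	}
	return nil
}

func main() {
	desc := RangeDescriptor{Replicas: []ReplicaDescriptor{
		{ReplicaID: 1, Type: VoterFull},
		{ReplicaID: 2, Type: Learner}, // demoted after it proposed RequestLease
	}}
	fmt.Println(checkCanReceiveLease(desc, Lease{Replica: ReplicaDescriptor{ReplicaID: 2}})) // rejected
	fmt.Println(checkCanReceiveLease(desc, Lease{Replica: ReplicaDescriptor{ReplicaID: 1}})) // ok (nil)
}
```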
As discussed in #15355, TransferLease is now sequenced with respect to concurrent replica changes by the command queue. RequestLease has no such protection because it is evaluated on followers (unlike all other commands). It is instead guarded by the RaftCommand.ProposerLease field, which ensures that the replica requesting the lease has an up-to-date view of the current lease. This allows for a race in which a replica is removed from the range immediately after it has attempted to take the lease. (This race is difficult to hit in practice because the range must be healthy to execute the ChangeReplicas transaction, but the current lease holder must be (or appear to be) unhealthy in order for a follower to attempt to grab the lease.) When this occurs, we will hit the log.Fatal which prevents ranges from getting stuck with a lease on a non-member store.

The simplest fix I see is to add a counter to roachpb.Lease which is incremented on every replica change, so that the ProposerLease will be seen as outdated when a RequestLease crosses over a rebalance.

Jira issue: CRDB-44008
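A minimal sketch of that suggested counter, with made-up field and function names (this is not the actual roachpb.Lease definition): every replica change bumps a generation recorded in the applied state, the lease carries the generation it was proposed under, and a RequestLease whose ProposerLease predates the latest replica change is treated as outdated.

```go
package main

import (
	"errors"
	"fmt"
)

// Lease is a stand-in for roachpb.Lease with the proposed counter added.
type Lease struct {
	// DescGeneration is the hypothetical counter: bumped by every replica
	// change, so a lease proposed before a rebalance can be recognized as stale.
	DescGeneration int
}

var errStaleProposerLease = errors.New("ProposerLease predates a replica change; rejecting")

// checkProposerLease sketches the proposed fix: appliedDescGeneration stands
// in for the generation recorded in the range's applied state, incremented by
// every ChangeReplicas. A RequestLease proposed under a lease from before the
// most recent replica change is seen as outdated.
func checkProposerLease(appliedDescGeneration int, proposerLease Lease) error {
	if proposerLease.DescGeneration < appliedDescGeneration {
		return errStaleProposerLease
	}
	return nil
}

func main() {
	// A RequestLease that raced with a ChangeReplicas carries the pre-change
	// generation and is rejected instead of hitting the log.Fatal.
	fmt.Println(checkProposerLease(4, Lease{DescGeneration: 3})) // rejected
	fmt.Println(checkProposerLease(4, Lease{DescGeneration: 4})) // ok (nil)
}
```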