storage: allow removing leaseholder in ChangeReplicas #40333
Comments
As of cockroachdb#40284, the replicate queue issues swaps (atomic add+remove) during rebalancing. TestInitialPartitioning helpfully points out (once atomic rebalancing is flipped on) that when the replication factor is one, there is no way to perform such a swap because it would necessarily remove the leaseholder. To work around this restriction (which, by the way, we dislike; see cockroachdb#40333), fall back to just adding a replica in this case without also removing one. In the next scanner cycle (which should happen immediately, since we requeue the range) the range will be over-replicated, and hopefully the lease will then be transferred to the new replica and the original leaseholder removed. I would be doubtful that this all works, but it is how things worked until cockroachdb#40284, so this PR merely falls back to the previous behavior in cases where we can't do better. Release note: None
40363: storage: work around can't-swap-leaseholder r=nvanbenschoten a=tbg As of #40284, the replicate queue issues swaps (atomic add+remove) during rebalancing. TestInitialPartitioning helpfully points out (once atomic rebalancing is flipped on) that when the replication factor is one, there is no way to perform such a swap because it would necessarily remove the leaseholder. To work around this restriction (which, by the way, we dislike; see #40333), fall back to just adding a replica in this case without also removing one. In the next scanner cycle (which should happen immediately, since we requeue the range) the range will be over-replicated, and hopefully the lease will then be transferred to the new replica and the original leaseholder removed. I would be doubtful that this all works, but it is how things worked until #40284, so this PR merely falls back to the previous behavior in cases where we can't do better. Release note: None Co-authored-by: Tobias Schottdorf <[email protected]>
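To make the fallback above concrete, here is a minimal Go sketch of degrading an atomic swap to an add-only change when the removal side is the leaseholder. The types and helper names (`ReplicaTarget`, `planRebalance`) are illustrative stand-ins, not CockroachDB's actual replicate-queue API.

```go
// Sketch: degrade an atomic add+remove ("swap") into an add-only change
// when the removal side would take out the current leaseholder.
// All types and helpers here are illustrative, not the real crdb API.
package main

import "fmt"

type ReplicaTarget struct {
	NodeID, StoreID int
}

type RebalanceOp struct {
	Add    ReplicaTarget
	Remove *ReplicaTarget // nil means "add only"
}

func planRebalance(add, remove, leaseholder ReplicaTarget) RebalanceOp {
	if remove == leaseholder {
		// Cannot atomically remove the leaseholder (see #40333).
		// Fall back to adding a replica; a later scanner pass should
		// transfer the lease and then remove the old leaseholder.
		return RebalanceOp{Add: add}
	}
	return RebalanceOp{Add: add, Remove: &remove}
}

func main() {
	lh := ReplicaTarget{NodeID: 1, StoreID: 1}
	op := planRebalance(ReplicaTarget{NodeID: 2, StoreID: 2}, lh, lh)
	fmt.Printf("add=%v removeLeaseholder=%v\n", op.Add, op.Remove != nil)
}
```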
Prior to this commit, the partitioning tests worked by creating a 3-node cluster and expressing constraints over the three nodes. They then validated that the cluster conformed to the constraints by querying data and examining the trace to determine which node held the data. This is problematic, for one, because it is susceptible to cockroachdb#40333: in rare cases we'll down-replicate to the wrong single node (e.g. if the right one is not live) and never fix it. It also doesn't exercise leaseholder preferences. This PR adds functionality to configure clusters with larger numbers of nodes, where each expectation in the config can now refer to a leaseholder_preference rather than a constraint, and we allocate the additional nodes to 3 datacenters. This larger test creates dramatically more data movement and has been useful when testing cockroachdb#40892. The PR also adds a flag to control how many of these subtests to run. Release justification: Only touches testing and is useful for testing a release blocker. Release note: None
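As a rough illustration of the test-configuration change described above, the sketch below models an expectation that can carry either a constraint or a leaseholder preference; the struct and field names are hypothetical, not the test's real types.

```go
// Sketch of a partitioning-test expectation that can express either a
// replication constraint or a leaseholder preference.
// Hypothetical names, not the actual test types.
package main

import "fmt"

type expectation struct {
	Partition             string
	Constraint            string // e.g. "+dc=dc1": where replicas must live
	LeaseholderPreference string // e.g. "+dc=dc2": where the lease should land
}

func main() {
	exps := []expectation{
		{Partition: "p_us", Constraint: "+dc=dc1"},
		{Partition: "p_eu", LeaseholderPreference: "+dc=dc2"},
	}
	for _, e := range exps {
		fmt.Printf("%s constraint=%q leasePref=%q\n",
			e.Partition, e.Constraint, e.LeaseholderPreference)
	}
}
```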
This was addressed by #74077.
The replicate queue code is littered with decisions to transfer a lease away simply because the leaseholder should be removed. When the replication factor is low or there are constraints on the range, it can happen that we can't transfer the lease away until we've added another replica. Adding another replica before removing the leaseholder is precisely what we are trying to avoid in #12768.
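For concreteness, here is a small Go sketch (with made-up names) of the decision the queue is forced into today: when the replica picked for removal holds the lease, the removal cannot proceed directly, and the lease has to be shed first (possibly after adding another replica), which is exactly what this issue wants to avoid.

```go
// Sketch of the decision the replicate queue is forced into today:
// the leaseholder cannot remove itself, so the lease must be shed
// before the removal can happen. Illustrative names only.
package main

import "fmt"

type replica struct{ storeID int }

func maybeRemove(target, leaseholder replica) string {
	if target == leaseholder {
		// Today: give the lease away first (possibly after adding a
		// replica) and retry the removal later.
		return "transfer lease away, requeue, remove later"
	}
	return "remove target now"
}

func main() {
	lh := replica{storeID: 1}
	fmt.Println(maybeRemove(lh, lh))         // leaseholder chosen for removal
	fmt.Println(maybeRemove(replica{2}, lh)) // safe to remove immediately
}
```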
TestInitialPartitioning definitely hits this problem (it will fail if rebalancing is always carried out atomically), so it's a good starting point for investigation. The lease transfer code that the queue uses (centered around allocator.TransferLeaseTarget) is also very hard to reason about, so I'm not even sure whether it sometimes fails to find a target when there is one.
With all that in mind, it would be nice if we could just issue any replication change, including one that removes the leaseholder. The reason we don't allow this today is that it would either wedge the range (because the lease remains active even though the leaseholder is no longer serving the range) or cause potential anomalies (if we allow another node to acquire the lease without properly invalidating the old one).
The right way to make this work, I think, is to use the fact that the replica change removing the leaseholder is necessarily evaluated and proposed on the leaseholder. If we intercept that request accordingly and make sure it causes the leaseholder to behave as if a TransferLease had been issued (setting its minProposedTS, etc.), then we could treat a lease whose leaseholder's store isn't in the descriptor as invalid (even if the epoch is still active), meaning that any member of the range could obtain the lease immediately after the replica change goes through.
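A minimal sketch of the proposed validity rule, under the assumption that the only addition is an extra membership check: a lease whose holder's store is no longer in the range descriptor is treated as invalid even if the holder's epoch is still live. The types below are simplified stand-ins for the real storage types.

```go
// Sketch of the proposed rule: a lease counts as invalid once its
// holder's store is no longer a member of the range descriptor,
// even if the holder's liveness epoch is still active.
// Simplified stand-ins for the real storage types.
package main

import "fmt"

type lease struct {
	holderStoreID int
	epochLive     bool
}

type rangeDescriptor struct {
	storeIDs map[int]bool
}

func leaseValid(l lease, desc rangeDescriptor) bool {
	if !l.epochLive {
		return false
	}
	// Proposed addition: a lease held by a store that was removed from
	// the range is invalid, so any remaining member may acquire a new one.
	return desc.storeIDs[l.holderStoreID]
}

func main() {
	desc := rangeDescriptor{storeIDs: map[int]bool{2: true, 3: true}}
	fmt.Println(leaseValid(lease{holderStoreID: 1, epochLive: true}, desc)) // false: holder was removed
	fmt.Println(leaseValid(lease{holderStoreID: 2, epochLive: true}, desc)) // true
}
```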
We'll need to allow leases to be obtained by VOTER_INCOMING replicas, but this is fine.
We could also add a leaseholder hint to the replica change to fix the lease transfer target, which is something the replicate queue (or allocator 2.0) would want to do.
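To illustrate the hint, here is a hypothetical sketch of a change-replicas payload that also names the desired next leaseholder; none of these fields exist in the real API, they only show where such a hint would live.

```go
// Hypothetical extension: a replication change that also names the
// replica that should pick up the lease once the current leaseholder
// is removed. None of these fields exist in the real API.
package main

import "fmt"

type replicationTarget struct{ nodeID, storeID int }

type changeReplicasHint struct {
	adds, removes []replicationTarget
	// nextLeaseholder, if set, tells the removed leaseholder which member
	// should acquire the lease after the change applies (allocator's choice).
	nextLeaseholder *replicationTarget
}

func main() {
	hint := replicationTarget{nodeID: 3, storeID: 3}
	change := changeReplicasHint{
		adds:            []replicationTarget{{nodeID: 3, storeID: 3}},
		removes:         []replicationTarget{{nodeID: 1, storeID: 1}},
		nextLeaseholder: &hint,
	}
	fmt.Printf("removes leaseholder, hinting next lease to n%d\n", change.nextLeaseholder.nodeID)
}
```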
Jira issue: CRDB-5549