Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: allow voter demoting to get a lease, in case there's an incoming voter #83686

Merged
merged 1 commit into from
Jul 8, 2022

Conversation

shralex
Copy link
Contributor

@shralex shralex commented Jul 1, 2022

Fixes #83687.

It is possible that we've entered a joint config where the leaseholder is being
replaced with an incoming voter. We try to transfer the lease away, but this fails.
In this case, we need the demoted voter to re-aquire the lease, as it might be an epoch
based lease, that doesn't expire. Otherwise we might be stuck without anyone else getting
the lease (the original leaseholder will be in the PROSCRIBED state, repeatedly trying to
re-aquire the lease, but will fail since its a VOTER_DEMOTING_LEARNER).

Release note (bug fix): Fixes a critical bug (#83687) introduced in 22.1.0 where
failure to transfer a lease in the joint config may result in range unavailability. The fix
allows the original leaseholder to re-aquire the lease so that lease transfer can be retried.

@shralex shralex requested a review from a team as a code owner July 1, 2022 01:56
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for getting this fix for #83687 out so quickly! This looks good.

It will be interesting to write the test that reproduces what we saw in that issue and demonstrates that this resolves the bug. Also, we'll want to update TestCheckCanReceiveLease.

@nvanbenschoten nvanbenschoten changed the title kvserver: allow voter demoting to get a lease, in case there's an inc… kvserver: allow voter demoting to get a lease, in case there's an incoming voter Jul 1, 2022
@shralex shralex force-pushed the voter_demoting_lease branch 3 times, most recently from c705905 to 723a8a8 Compare July 1, 2022 17:30
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 3 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @shralex)


pkg/roachpb/metadata_replicas.go line 548 at r1 (raw file):

	if !ok {
		return errReplicaNotFound
	} else if !(repDesc.IsVoterNewConfig() || (repDesc.IsAnyVoter() && replDescs.

We will need to plumb a cluster setting check in here, like we have in Replica.propose. Clusters that don't contain all 22.1 nodes shouldn't allow a voter demoting to receive the lease.


pkg/roachpb/metadata_replicas.go line 549 at r1 (raw file):

		return errReplicaNotFound
	} else if !(repDesc.IsVoterNewConfig() || (repDesc.IsAnyVoter() && replDescs.
		containsVoterIncoming())) {

Do you have ideas about unifying this logic and the logic in Replica.propose? Doing so would 1) ensure that they're coherent, and 2) give us a good place to justify the subtle condition and link it back to the config change protocol.

Is doing so as simple as calling

if !CheckCanReceiveLease(r.ReplicaID(), p.command.ReplicatedEvalResult.State.Desc.Replica) { ... }

in Replica.propose?


pkg/kv/kvserver/client_lease_test.go line 278 at r1 (raw file):

to VOTER_OUTGOING

Do you mean VOTER_DEMOTING_LEARNER? I didn't think we still used VOTER_OUTGOING anywhere.


pkg/kv/kvserver/client_lease_test.go line 280 at r1 (raw file):

	// command to move n3 to VOTER_OUTGOING has been evaluated, we'll send
	// the request to transfer the lease to n3. The hope is that it will
	// get past the sanity above latch acquisition prior to change replicas

"sanity check"

"prior to the change replicas command committing"


pkg/kv/kvserver/client_lease_test.go line 282 at r1 (raw file):

	// get past the sanity above latch acquisition prior to change replicas
	// command committing.
	var scratchRangeID atomic.Value

nit: consider an atomic.Int64 instead for the convenient zero value.


pkg/kv/kvserver/client_lease_test.go line 300 at r1 (raw file):

	knobs.Store.(*kvserver.StoreTestingKnobs).TestingProposalFilter = func(args kvserverbase.ProposalFilterArgs) *roachpb.Error {
		blockIfShould(args)
		return nil

Were you still planning to add a second test? This one is good, but it's not quite the same scenario as what we saw in #83687. In that failure mode, the proposal for a lease transfer failed, which required a lease request to make it through CheckCanReceiveLease. So when writing a test to demonstrate that, we'll want the TestingProposalFilter to return an error for the LeaseTransfer performed by the config change, not just block. We'll then want to make sure that the lease is re-acquired and the config change is eventually completed.


pkg/kv/kvserver/client_lease_test.go line 319 at r1 (raw file):

	//
	//  - Send an AdminChangeReplicasRequest to remove n3 and add n4
	//  - Block the step that moves n3 to VOTER_OUTGOING on changeReplicasChan

Same question about VOTER_OUTGOING.


pkg/kv/kvserver/batcheval/cmd_lease_test.go line 278 at r1 (raw file):

		{leaseholderType: roachpb.VOTER_INCOMING, anotherReplicaType: none, eligible: true},

		// A VOTER_OUTGOING should only be able to get the lease if there's a VOTER_INCOMING

nit: punctuation.

Copy link
Contributor Author

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @shralex)


pkg/roachpb/metadata_replicas.go line 548 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We will need to plumb a cluster setting check in here, like we have in Replica.propose. Clusters that don't contain all 22.1 nodes shouldn't allow a voter demoting to receive the lease.

Done


pkg/roachpb/metadata_replicas.go line 549 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do you have ideas about unifying this logic and the logic in Replica.propose? Doing so would 1) ensure that they're coherent, and 2) give us a good place to justify the subtle condition and link it back to the config change protocol.

Is doing so as simple as calling

if !CheckCanReceiveLease(r.ReplicaID(), p.command.ReplicatedEvalResult.State.Desc.Replica) { ... }

in Replica.propose?

Done


pkg/kv/kvserver/client_lease_test.go line 278 at r1 (raw file):

VOTER_DEMOTING_LEARNER
Done


pkg/kv/kvserver/client_lease_test.go line 280 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

"sanity check"

"prior to the change replicas command committing"

Done

Code quote:

get past the sanity above latch acquisition prior to change replicas

pkg/kv/kvserver/client_lease_test.go line 319 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Same question about VOTER_OUTGOING.

Done


pkg/kv/kvserver/batcheval/cmd_lease_test.go line 278 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: punctuation.

Done

Code quote:

A VOTER_OUTGOING should only be able to 

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 11 of 11 files at r2, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @shralex)


pkg/kv/kvclient/kvcoord/replica_slice.go line 84 at r2 (raw file):

	}
	canReceiveLease := func(rDesc roachpb.ReplicaDescriptor) bool {
		if err := roachpb.CheckCanReceiveLease(rDesc, desc.Replicas(), false); err != nil {

If we can't plumb a cluster setting in here, then I think we should pass true /* leaseHolderRemovalAllowed */ here so that we optimistically assume that a VOTER_DEMOTING can be a leaseholder. This function is called with filter = OnlyPotentialLeaseholders to determine the maximal set of replicas that a request may need to be routed to in order to find the leaseholder.


pkg/kv/kvserver/replica_proposal_buf.go line 15 at r2 (raw file):

import (
	"context"
	"github.com/cockroachdb/cockroach/pkg/clusterversion"

Have you configured crlfmt: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher?

crlfmt would move this line down to the block below.


pkg/kv/kvserver/replica_raft.go line 379 at r2 (raw file):

		// kind of voter in the next new config (potentially VOTER_DEMOTING).
		proposedDesc := p.command.ReplicatedEvalResult.State.Desc
		if err := roachpb.CheckCanReceiveLease(lhDescriptor, proposedDesc.Replicas(),

nit: this indentation looks foreign to this codebase and I think it might also be against our style guide. Consider:

if err := roachpb.CheckCanReceiveLease(
    lhDescriptor, proposedDesc.Replicas(), lhRemovalAllowed,
); err != nil {
    ...
}

Or pull proposedDesc.Replicas() into a variable and use a single line.


pkg/kv/kvserver/replica_raft.go line 379 at r2 (raw file):

		// kind of voter in the next new config (potentially VOTER_DEMOTING).
		proposedDesc := p.command.ReplicatedEvalResult.State.Desc
		if err := roachpb.CheckCanReceiveLease(lhDescriptor, proposedDesc.Replicas(),

We should move some of this explanation to CheckCanReceiveLease, but then explain why we're calling that function in the first place.


pkg/kv/kvserver/batcheval/cmd_lease_request.go line 15 at r2 (raw file):

import (
	"context"
	"github.com/cockroachdb/cockroach/pkg/clusterversion"

Same thing here.


pkg/kv/kvserver/batcheval/cmd_lease_request.go line 75 at r2 (raw file):

	// If this check is removed at some point, the filtering of learners on the
	// sending side would have to be removed as well.
	if err := roachpb.CheckCanReceiveLease(args.Lease.Replica, cArgs.EvalCtx.Desc().Replicas(),

nit: same point about indentation for lines that look like this (at least here and in cmd_lease_transfer.go)


pkg/roachpb/metadata_replicas.go line 549 at r1 (raw file):

See further info in replica_raft Propose().

Now that we're calling this method from replica.Propose, I think we should move the explanation here.


pkg/kv/kvserver/batcheval/cmd_lease_test.go line 314 at r2 (raw file):

				rngDesc.InternalReplicas = append(rngDesc.InternalReplicas, anotherDesc)
			}
			err := roachpb.CheckCanReceiveLease(rngDesc.InternalReplicas[0], rngDesc.Replicas(), true)

Should we extend this test to exercise the true and false case for leaseHolderRemovalAllowed?

@shralex shralex force-pushed the voter_demoting_lease branch from e5b9ffa to ce195e7 Compare July 6, 2022 04:51
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 7 of 7 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @shralex)


pkg/kv/kvclient/kvcoord/replica_slice.go line 84 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we can't plumb a cluster setting in here, then I think we should pass true /* leaseHolderRemovalAllowed */ here so that we optimistically assume that a VOTER_DEMOTING can be a leaseholder. This function is called with filter = OnlyPotentialLeaseholders to determine the maximal set of replicas that a request may need to be routed to in order to find the leaseholder.

With the /* leaseHolderRemovalAllowed */ comment. That's in the CRDB style guide for literal value params.


pkg/kv/kvserver/replica_raft.go line 366 at r3 (raw file):

		// we move to a joint config where v1 (VOTER_DEMOTING_LEARNER) transfer the
		// lease to v4 (VOTER_INCOMING) directly.
		// See also https://github.com/cockroachdb/cockroach/issues/67740.

I don't want to hold up this PR any further, but I think we should improve this comment to explain why we don't allow VOTER_DEMOTING to hold the lease if there is not a VOTER_INCOMING. Do you mind opening an issue?


pkg/kv/kvserver/client_lease_test.go line 300 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Were you still planning to add a second test? This one is good, but it's not quite the same scenario as what we saw in #83687. In that failure mode, the proposal for a lease transfer failed, which required a lease request to make it through CheckCanReceiveLease. So when writing a test to demonstrate that, we'll want the TestingProposalFilter to return an error for the LeaseTransfer performed by the config change, not just block. We'll then want to make sure that the lease is re-acquired and the config change is eventually completed.

Were you still planning to add this test?

Copy link
Contributor Author

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @shralex)


pkg/kv/kvclient/kvcoord/replica_slice.go line 84 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

If we can't plumb a cluster setting in here, then I think we should pass true /* leaseHolderRemovalAllowed */ here so that we optimistically assume that a VOTER_DEMOTING can be a leaseholder. This function is called with filter = OnlyPotentialLeaseholders to determine the maximal set of replicas that a request may need to be routed to in order to find the leaseholder.

Done


pkg/kv/kvserver/replica_proposal_buf.go line 15 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Have you configured crlfmt: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher?

crlfmt would move this line down to the block below.

Thanks! I did this multiple times, but it keeps disappearing, perhaps when GoLand updates...


pkg/kv/kvserver/replica_raft.go line 379 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this indentation looks foreign to this codebase and I think it might also be against our style guide. Consider:

if err := roachpb.CheckCanReceiveLease(
    lhDescriptor, proposedDesc.Replicas(), lhRemovalAllowed,
); err != nil {
    ...
}

Or pull proposedDesc.Replicas() into a variable and use a single line.

Done


pkg/kv/kvserver/replica_raft.go line 379 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

We should move some of this explanation to CheckCanReceiveLease, but then explain why we're calling that function in the first place.

Done


pkg/kv/kvserver/batcheval/cmd_lease_request.go line 15 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Same thing here.

Done


pkg/kv/kvserver/batcheval/cmd_lease_request.go line 75 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: same point about indentation for lines that look like this (at least here and in cmd_lease_transfer.go)

Done

Code quote:

clusterversion.EnableLeaseHolderRemoval)

pkg/roachpb/metadata_replicas.go line 549 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

See further info in replica_raft Propose().

Now that we're calling this method from replica.Propose, I think we should move the explanation here.

Done


pkg/kv/kvserver/client_lease_test.go line 300 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Were you still planning to add this test?

Yep


pkg/kv/kvserver/batcheval/cmd_lease_test.go line 314 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we extend this test to exercise the true and false case for leaseHolderRemovalAllowed?

Done

@shralex shralex force-pushed the voter_demoting_lease branch from ce195e7 to eb1aa71 Compare July 6, 2022 17:44
Copy link
Collaborator

@arulajmani arulajmani left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 3 of 11 files at r2, 5 of 7 files at r3, 2 of 3 files at r4, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten and @shralex)


-- commits line 7 at r4:
Might be worth adding some words about the PROSCRIBED status which prevents other nodes from being able to get the lease.


-- commits line 10 at r4:
nit: this can be removed now. Might also be worth linking the issue this PR resolves in the commit message itself.


pkg/kv/kvserver/replica_proposal_buf.go line 154 at r4 (raw file):

	withGroupLocked(func(proposerRaft) error) error
	registerProposalLocked(*ProposalData)
	leaderStatusRLocked(ctx context.Context, raftGroup proposerRaft) rangeLeaderInfo

testProposer in replica_proposal_buf_test.go seems in need of an update.


pkg/kv/kvserver/replica_raft.go line 387 at r4 (raw file):

		// leaseholder to be a VOTER_DEMOTING as long as there is a VOTER_INCOMING.
		// Otherwise, the leaseholder must be a full voter in the target config.
		// When the leaseholder is being removed, this check won't allow exiting

nit: this comment (the stuff after "When...") feels a bit out of place. Might be worth expanding a bit more or dropping it entirely.


pkg/roachpb/metadata_replicas.go line 548 at r1 (raw file):

Previously, shralex wrote…

Done

I'm not very familiar with this stuff, so I might be missing something, but is it possible for us to get into a joint configuration which includes a VOTER_DEMOTING and VOTER_INCOMING if all nodes aren't running 22.1 (and have been finalized to the appropriate version)?


pkg/roachpb/metadata_replicas.go line 551 at r4 (raw file):

// An error is also returned is the replica is not part of `replDescs`.
func CheckCanReceiveLease(
	wouldbeLeaseholder ReplicaDescriptor, replDescs ReplicaSet, leaseHolderRemovalAllowed bool,

Could we add a comment that says leaseHolderRemovalAllowed is intended to check if the cluster version is EnableLeaseHolderRemoval or higher? It might also be worth adding a TODO to clean this up in 22.2.


pkg/roachpb/metadata_replicas.go line 558 at r4 (raw file):

	}
	if !(repDesc.IsVoterNewConfig() ||
		(leaseHolderRemovalAllowed && repDesc.IsAnyVoter() && replDescs.containsVoterIncoming())) {

Would this conditional be clearer if instead of repDesc.IsAnyVoter, we instead called into a different function (say) repDesc.IsAnyDemotingVoter? That would help avoid the redundancy from the previous call and make the joint config inference here a bit clearer.


pkg/kv/kvserver/client_lease_test.go line 270 at r4 (raw file):

// requests for nodes which are already in the VOTER_DEMOTING_LEARNER state succeeds
// when there is a VOTER_INCOMING node.
func TestTransferLeaseToVoterDemotinggWithIncoming(t *testing.T) {

nit: "Demotingg"


pkg/kv/kvserver/batcheval/cmd_lease_test.go line 278 at r1 (raw file):

Previously, shralex wrote…

Done

Do we need to be accounting for these VOTER_OUTGOING cases (in tests and in code) given we no longer use it?

@shralex shralex force-pushed the voter_demoting_lease branch from eb1aa71 to 55dfa3e Compare July 6, 2022 23:23
Copy link
Contributor Author

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani, @nvanbenschoten, and @shralex)


pkg/kv/kvserver/replica_proposal_buf.go line 154 at r4 (raw file):

replica_proposal_buf_test.go
Thanks


pkg/kv/kvserver/replica_raft.go line 366 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't want to hold up this PR any further, but I think we should improve this comment to explain why we don't allow VOTER_DEMOTING to hold the lease if there is not a VOTER_INCOMING. Do you mind opening an issue?

Done

Code quote:

See also https://github.com/cockroachdb/cockroach/issues/67740.

pkg/kv/kvserver/replica_raft.go line 387 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

nit: this comment (the stuff after "When...") feels a bit out of place. Might be worth expanding a bit more or dropping it entirely.

I can understand why its confusing - it refers to a check made inside the function. I tried to clarify.


pkg/roachpb/metadata_replicas.go line 548 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

I'm not very familiar with this stuff, so I might be missing something, but is it possible for us to get into a joint configuration which includes a VOTER_DEMOTING and VOTER_INCOMING if all nodes aren't running 22.1 (and have been finalized to the appropriate version)?

It is still possible in previous releases, but only from 22.1 the VOTER_DEMOTING could be the leaseholder


pkg/roachpb/metadata_replicas.go line 551 at r4 (raw file):

leaseHolderRemovalAllowed is intended to check if the cluster version is EnableLeaseHolderRemoval or higher? It might also be worth adding a TODO to clean this up in 22.2.
Done


pkg/roachpb/metadata_replicas.go line 558 at r4 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Would this conditional be clearer if instead of repDesc.IsAnyVoter, we instead called into a different function (say) repDesc.IsAnyDemotingVoter? That would help avoid the redundancy from the previous call and make the joint config inference here a bit clearer.

I changed it to hopefully be more readable.


pkg/kv/kvserver/client_lease_test.go line 282 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: consider an atomic.Int64 instead for the convenient zero value.

Done


pkg/kv/kvserver/client_lease_test.go line 300 at r1 (raw file):

Previously, shralex wrote…

Yep

Added


pkg/kv/kvserver/client_lease_test.go line 270 at r4 (raw file):

Demotingg
Thanks


pkg/kv/kvserver/batcheval/cmd_lease_test.go line 278 at r1 (raw file):

Previously, arulajmani (Arul Ajmani) wrote…

Do we need to be accounting for these VOTER_OUTGOING cases (in tests and in code) given we no longer use it?

Discussed with Nathan offline, yes for now, given that we do assume this everywhere else for now.

@shralex shralex force-pushed the voter_demoting_lease branch from 55dfa3e to dd22457 Compare July 7, 2022 04:07
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think everything is here, so we'll just need to polish this off before merging.

Reviewed 1 of 3 files at r4, 4 of 4 files at r5, 1 of 1 files at r6, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @shralex)


-- commits line 2 at r6:
You updated the PR description but not the commit message. Let's copy that updated text over here.


pkg/kv/kvserver/batcheval/cmd_lease_request.go line 75 at r2 (raw file):

Previously, shralex wrote…

Done

nit: this line did not get updated.


pkg/kv/kvserver/client_lease_test.go line 371 at r6 (raw file):

// https://github.com/cockroachdb/cockroach/issues/83687
// and makes sure that if lease transfer fails during a joint configuration
// the previous leaseholder will successfully re-aquire the lease.

Could you add a sentence or two about the flow of the test to this comment? What does it do and why does this prove that the bug is fixed. The explanation should mention the revoked lease that needs to be reacquired.


pkg/kv/kvserver/client_lease_test.go line 378 at r6 (raw file):

	knobs := base.TestingKnobs{Store: &kvserver.StoreTestingKnobs{}}
	// Add a testing knob to allow us to fall a lease transfer command

s/fall a lease transfer command while it is/fail all lease transfer commands on this range when they are/


pkg/kv/kvserver/client_lease_test.go line 416 at r6 (raw file):

	require.Regexp(t, "Injecting lease transfer failure", err)

	desc, err = tc.LookupRange(scratchStartKey)

Add a comment about the state that we are in here.


pkg/kv/kvserver/client_lease_test.go line 420 at r6 (raw file):

	require.True(t, desc.Replicas().InAtomicReplicationChange())

	// Further lease transfers should succeed.

, allowing the atomic replication change to complete.

@shralex shralex force-pushed the voter_demoting_lease branch from dd22457 to 496e2a0 Compare July 7, 2022 16:56
Copy link
Contributor Author

@shralex shralex left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @arulajmani and @nvanbenschoten)


-- commits line 7 at r4:

Previously, arulajmani (Arul Ajmani) wrote…

Might be worth adding some words about the PROSCRIBED status which prevents other nodes from being able to get the lease.

Done


-- commits line 10 at r4:

Previously, arulajmani (Arul Ajmani) wrote…

nit: this can be removed now. Might also be worth linking the issue this PR resolves in the commit message itself.

Done


-- commits line 2 at r6:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

You updated the PR description but not the commit message. Let's copy that updated text over here.

Done


pkg/kv/kvserver/batcheval/cmd_lease_request.go line 75 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: this line did not get updated.

Done


pkg/kv/kvserver/client_lease_test.go line 371 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Could you add a sentence or two about the flow of the test to this comment? What does it do and why does this prove that the bug is fixed. The explanation should mention the revoked lease that needs to be reacquired.

Done

Code quote:

base.TestingKnobs{Store: &kvserver.StoreTestingKnobs{}}

pkg/kv/kvserver/client_lease_test.go line 378 at r6 (raw file):

fail all lease transfer commands on this range when they are
Done


pkg/kv/kvserver/client_lease_test.go line 416 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Add a comment about the state that we are in here.

Done


pkg/kv/kvserver/client_lease_test.go line 420 at r6 (raw file):

, allowing the atomic replication change to complete.
Done

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 2 of 2 files at r7, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani and @shralex)


pkg/kv/kvserver/client_lease_test.go line 379 at r7 (raw file):

//   is attempted to move the lease from n1 to n4 before exiting the joint config,
//   but that fails, causing us to remain in the joint configuration with the original
//   leaseholder believing the lease was given away, but everyone else thinking it's still

"believing the lease was given away"

This isn't quite correct. The original leaseholder knows that the lease was not given away, but it has already revoked its lease so it needs to re-acquire a new version of the lease.

Consider just saying "the original leaseholder having revoked its lease, but everyone else ..."


pkg/kv/kvserver/client_lease_test.go line 385 at r7 (raw file):

//   VOTER_DEMOTING_LEARNER (n1) replica to get the lease if there's also a VOTER_INCOMING
//   which is the case here (n4).
// - n1 transfers the lease away and the range leaves the joint configuration

nit: punctuation.


pkg/kv/kvserver/client_lease_test.go line 430 at r7 (raw file):

	require.Regexp(t, "Injecting lease transfer failure", err)

	// We're now in a joint configuration, n1 thinks the lease was given away

Same point about "thinks the lease was given away".

…oming voter

Fixes cockroachdb#83687.

It is possible that we've entered a joint config where the leaseholder is being
replaced with an incoming voter. We try to transfer the lease away, but this fails.
In this case, we need the demoted voter to re-aquire the lease, as it might be an epoch
based lease, that doesn't expire. Otherwise we might be stuck without anyone else getting
the lease (the original leaseholder will be in the PROSCRIBED state, repeatedly trying to
re-aquire the lease, but will fail since its a VOTER_DEMOTING_LEARNER).

Release note (bug fix): Fixes a critical bug (cockroachdb#83687) introduced in 22.1.0 where
failure to transfer a lease in the joint config may result in range unavailability. The fix
allows the original leaseholder to re-aquire the lease so that lease transfer can be retried.
@shralex shralex force-pushed the voter_demoting_lease branch from 496e2a0 to 139dc42 Compare July 7, 2022 21:45
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 1 of 1 files at r8, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @arulajmani)

@shralex
Copy link
Contributor Author

shralex commented Jul 7, 2022

bors r+

@craig
Copy link
Contributor

craig bot commented Jul 8, 2022

Build succeeded:

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: failure while transferring lease away when exiting joint config can lead to range unavailability
4 participants