Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kv: manual validation of fallible lease transfers #87121

Closed
nvanbenschoten opened this issue Aug 30, 2022 · 2 comments
Closed

kv: manual validation of fallible lease transfers #87121

nvanbenschoten opened this issue Aug 30, 2022 · 2 comments
Assignees
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-investigation Further steps needed to qualify. C-label will change. GA-blocker T-kv KV Team

Comments

@nvanbenschoten
Copy link
Member

nvanbenschoten commented Aug 30, 2022

In #82758, we made it more likely that a lease transfer will be rejected because the recipient has not yet received a lease. We've seen fallout from this in places where lease transfers are not being retried. For example, with #85405.

We should try to uncover cases where we are not correctly handling lease transfer failures. To do so, I propose we run the following experiment:

  1. give ReplicaMayNeedSnapshot a 10% chance of returning true.
  2. run tpcc/headroom/n4cpu16 10 times and make sure they all pass.
  3. stress TestKVNemesisMultiNode

Jira issue: CRDB-19159

Epic CRDB-19172

@nvanbenschoten nvanbenschoten added C-investigation Further steps needed to qualify. C-label will change. A-kv-distribution Relating to rebalancing and leasing. GA-blocker T-kv KV Team branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Aug 30, 2022
@nvanbenschoten nvanbenschoten self-assigned this Aug 30, 2022
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 30, 2022
…rror

Related to cockroachdb#87121.

This commit adds the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
error status to the set of expected error types that can be returned to a
`TransferLeaseOperation`.

This commit, along with the previous, deflakes `TestKVNemesisMultiNode`
when run with the following diff:
```
diff --git a/pkg/kv/kvserver/raftutil/util.go b/pkg/kv/kvserver/raftutil/util.go
@@ -181,5 +183,8 @@ func ReplicaMayNeedSnapshot(
        // 2. that we do not perform a log truncation between now and when our action
        //    goes into effect. In practice, this means serializing with Raft log
        //    truncation operations using latching.
+       if rand.Intn(10) == 0 {
+               return ReplicaStateProbe
+       }
        return NoSnapshotNeeded
 }
```
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this issue Aug 31, 2022
…rror

Related to cockroachdb#87121.

This commit adds the `LeaseTransferRejectedBecauseTargetMayNeedSnapshot`
error status to the set of expected error types that can be returned to a
`TransferLeaseOperation`.

This commit, along with the previous, deflakes `TestKVNemesisMultiNode`
when run with the following diff:
```
diff --git a/pkg/kv/kvserver/raftutil/util.go b/pkg/kv/kvserver/raftutil/util.go
@@ -181,5 +183,8 @@ func ReplicaMayNeedSnapshot(
        // 2. that we do not perform a log truncation between now and when our action
        //    goes into effect. In practice, this means serializing with Raft log
        //    truncation operations using latching.
+       if rand.Intn(10) == 0 {
+               return ReplicaStateProbe
+       }
        return NoSnapshotNeeded
 }
```
@nvanbenschoten nvanbenschoten added branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 and removed branch-release-22.1 Used to mark GA and release blockers, technical advisories, and bugs for 22.1 labels Sep 6, 2022
@nvanbenschoten
Copy link
Member Author

stress TestKVNemesisMultiNode

This was flaky before 6ee6501. With that change, TestKVNemesisMultiNode passes under stress without issue, even with a 50% chance of lease transfers failing.

@nvanbenschoten
Copy link
Member Author

All 10 iterations of tpcc/headroom/n4cpu16 passed, even with a 50% chance of lease transfers failing.

Together, these are good indications that we have retry loops around lease transfer attempts where necessary. I'm considering this complete.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-kv-distribution Relating to rebalancing and leasing. branch-release-22.2 Used to mark GA and release blockers, technical advisories, and bugs for 22.2 C-investigation Further steps needed to qualify. C-label will change. GA-blocker T-kv KV Team
Projects
None yet
Development

No branches or pull requests

1 participant