
kvserver: replicaGCQueue acquires latch reaching into another rangeID #102936

Closed · tbg opened this issue May 9, 2023 · 1 comment · Fixed by #105816
Labels: C-bug (Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.), T-kv (KV Team)

tbg (Member) commented May 9, 2023

Describe the problem

@irfansharif reported this log from a WIP test [1]. In the test, r64 lost quorum and the circuit breaker tripped. But then we see this on r1, which should be healthy (it's a single-voter group):

E230507 19:48:05.815591 17777 kv/kvserver/queue.go:1136 [T1,n2,replicaGC,s2,r64/2:/{Table/Max-Max}] 599 replica unavailable: (n1,s1):1 unable to serve request to r1:/{Min-System/NodeLiveness} [(n1,s1):1, next=2, gen=0]: closed timestamp: 1683488882.708838000,0 (2023-05-07 19:48:02); raft status: {"id":"1","term":6,"vote":"1","commit":51,"lead":"1","raftState":"StateLeader","applied":51,"progress":{"1":{"match":51,"next":52,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/RangeID/64/r/AbortSpan/"8dfa719f-0a1a-40fc-a56f-a174787edc56"@0,0

and

E230507 19:48:05.815646 653 kv/kvserver/reports/reporter.go:159 [T1,n1,replication-reporter] 598 failed to generate replication reports: failed to compute constraint conformance report: replica unavailable: (n1,s1):1 unable to serve request to r1:/{Min-System/NodeLiveness} [(n1,s1):1, next=2, gen=0]: closed timestamp: 1683488882.708838000,0 (2023-05-07 19:48:02); raft status: {"id":"1","term":6,"vote":"1","commit":51,"lead":"1","raftState":"StateLeader","applied":51,"progress":{"1":{"match":51,"next":52,"state":"StateReplicate"}},"leadtransferee":"0"}: encountered poisoned latch /Local/RangeID/64/r/AbortSpan/"8dfa719f-0a1a-40fc-a56f-a174787edc56"@0,0

Almost certainly this will have something to do with the fact that the first range has a start key of KeyMin (empty key) but the addressable keyspace only starts at keys.LocalMax (\x02). Both replicaGC and constraint reports scan meta2, which is contained in r1 in this test. In doing so they seem to acquire a latch that spans range-local replicated keys for r64 (and likely all other ranges). This is incorrect.

I don't understand how that latch gets into r1's latch manager. Each replica has its own latch manager, so even if r64 has a bunch of poisoned latches, why does one of them show up in r1?
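To make the overlap concrete, here's a minimal, self-contained sketch of the key ordering. The byte values are illustrative stand-ins rather than the exact encodings from pkg/keys, but they show why a latch span anchored at KeyMin covers range-ID-local keys that sort below keys.LocalMax:

```go
package main

import (
	"bytes"
	"fmt"
)

func main() {
	// Simplified key space: local keys carry the \x01 prefix and sort
	// before keys.LocalMax (\x02), where the addressable key space begins.
	// These values are stand-ins, not the exact pkg/keys encodings.
	keyMin := []byte{}                    // roachpb.KeyMin, r1's start key
	localMax := []byte{0x02}              // keys.LocalMax
	rangeIDLocal := []byte{0x01, 'i', 64} // stand-in for /Local/RangeID/64/r/...

	// Any latch span anchored at KeyMin therefore overlaps every
	// range-ID-local key, including r64's AbortSpan keys.
	overlaps := bytes.Compare(keyMin, rangeIDLocal) <= 0 &&
		bytes.Compare(rangeIDLocal, localMax) < 0
	fmt.Println(overlaps) // true
}
```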

Jira issue: CRDB-27742

Footnotes

  1. https://cockroachlabs.slack.com/archives/G01G8LK77DK/p1683495028679899

tbg added the C-bug and T-kv labels on May 9, 2023
irfansharif (Contributor) commented May 9, 2023

To reproduce, I used the following from #98308. One oddity: I was suppressing time-based election timeouts, so I used RaftElectionTimeoutTicks: 10000 (see the sketch after the command).

$ dev test pkg/kv/kvserver -f TestFlowControlRaftMembershipRemoveSelf/transfer-lease-first=true \
  -v --show-logs --stream-output --timeout 10m --stress --ignore-cache
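For completeness, here's a sketch of how that election-timeout suppression might be plumbed through the test server args. The field names (TestServerArgs.RaftConfig, RaftConfig.RaftElectionTimeoutTicks) are my assumption of the pkg/base layout at the time; treat this as an illustration, not the actual test diff:

```go
package kvserver_test

import "github.com/cockroachdb/cockroach/pkg/base"

// electionSuppressedArgs is a sketch: an extremely long election
// timeout effectively suppresses time-based Raft elections during
// the stress run, so only explicit campaigning changes leadership.
func electionSuppressedArgs() base.TestServerArgs {
	return base.TestServerArgs{
		RaftConfig: base.RaftConfig{
			RaftElectionTimeoutTicks: 10000,
		},
	}
}
```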
