kvserver: Raft CheckQuorum causes failover/partial/lease-liveness unavailability
#107060
Comments
cc @cockroachdb/replication
Confirmed that this is PreVote+CheckQuorum by running with
This scenario is covered by the integration test
Seems like
Liveness on n6 claims that n5 is still live, but it isn't.
Yeah, the local liveness record expired at 10:25:33, but it's still marked as
Right, it's because the node has an RPC connection to the partitioned leader, which overrides
cockroach/pkg/kv/kvserver/store_raft.go, lines 851 to 863 in e968604
cockroach/pkg/kv/kvserver/replica_raft.go, lines 2237 to 2242 in 4d7ec64
Confirmed that explicitly checking
Coincides with #104042. Given the gradual throughput increase, I'm guessing we're waiting for sequential Raft election timeouts or something.
https://roachperf.crdb.dev/?filter=&view=failover%2Fpartial%2Flease-liveness&tab=gce
Jira issue: CRDB-29865
Epic CRDB-25199