Commit
PR cockroachdb#26911 added a mechanism that, during quiescence decisions, ignores nodes which (at some recent point in time) seemed non-live. Unfortunately, the corresponding mechanism introduced to unquiesce when the node becomes live again is quite racy, at least when tormented with the wild timings we throw at it in unit testing.

The basic discrepancy is that unquiescing is triggered by an explicit liveness event (i.e. the kv record being overwritten), whereas a node may seem non-live for mere reasons of timing in between such events. In effect this means that quiescence is more aggressive than unquiescence.

I explored various avenues to address this (going as far as introducing an unquiesce message to the Raft scheduler and serializing unquiescence through the store Raft ticker goroutine), but it's hopeless. It's an equally idle diversion to try to adjust the test liveness durations.

So, in this commit, the liveness check becomes more lenient: even when a node seems non-live, we count it as live if there's a healthy connection established to it. This effectively deals with the issues in tests, but it also seems like a good addition for real-world workloads: when the liveness range breaks down, we don't want to quiesce aggressively.

Fixes cockroachdb#27545.
Fixes cockroachdb#27607.

Release note: None
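
The lenient check described in this message can be illustrated with a minimal Go sketch. It is not the code from this commit: the names `NodeID`, `replicaProgress`, `countsAsLive`, and `canQuiesce` are hypothetical, liveness is reduced to a plain map, and connection health is passed in as a function. It only shows the decision rule: a node whose liveness record looks expired still counts as live if a healthy connection to it exists, so it is no longer ignored when deciding whether to quiesce.

```go
package main

import "fmt"

// NodeID is a hypothetical stand-in for a node identifier.
type NodeID int

// replicaProgress is a hypothetical per-follower summary.
type replicaProgress struct {
	node     NodeID
	caughtUp bool // follower has acknowledged the leader's latest log index
}

// countsAsLive applies the lenient check: a node whose liveness record
// looks expired is still treated as live if a healthy connection exists.
func countsAsLive(node NodeID, looksLive map[NodeID]bool, connHealthy func(NodeID) bool) bool {
	if looksLive[node] {
		return true
	}
	return connHealthy(node)
}

// canQuiesce reports whether the range may quiesce. Every follower that
// counts as live must be caught up; followers that do not count as live
// are ignored for the decision.
func canQuiesce(followers []replicaProgress, looksLive map[NodeID]bool, connHealthy func(NodeID) bool) bool {
	for _, f := range followers {
		if !countsAsLive(f.node, looksLive, connHealthy) {
			continue // ignored: seems non-live and has no healthy connection
		}
		if !f.caughtUp {
			return false
		}
	}
	return true
}

func main() {
	followers := []replicaProgress{
		{node: 2, caughtUp: true},
		{node: 3, caughtUp: false},
	}
	// Node 3's liveness record looks expired, e.g. purely due to timing.
	looksLive := map[NodeID]bool{2: true, 3: false}

	noConn := func(NodeID) bool { return false }
	healthyConn := func(NodeID) bool { return true }

	// Without a healthy connection, node 3 is ignored and the range quiesces.
	fmt.Println(canQuiesce(followers, looksLive, noConn))
	// With a healthy connection, node 3 counts as live and blocks quiescence.
	fmt.Println(canQuiesce(followers, looksLive, healthyConn))
}
```

In this sketch the connection check is consulted only when the liveness record looks expired, so the check only ever makes quiescence less aggressive than a liveness-only rule.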