kvserver: ignore draining nodes in proposal quota #55806
Labels
A-kv-replication: Relating to Raft, consensus, and coordination.
C-enhancement: Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv: KV Team
Describe the problem
It doesn't seem like we take the Draining status of a node into account in the quota pool. This means that when the node terminates, from the POV of the quota pool it has just disappeared.
I think we mostly get this right, though perhaps accidentally:
cockroach/pkg/kv/kvserver/replica_proposal_quota.go, lines 151 to 161 in a8ae1bf
Note the `ConnHealth` check here, which presumably would go red fairly quickly, on the order of an RPC heartbeat interval (cockroach/pkg/base/config.go, lines 70 to 72 in 1c596ad), while the `isFollowerActiveSince` check will be a bit slower to fire (maybe a few seconds more? Didn't check). Either way, if we run out of quota in that time period, the range will stall until one of the checks clears.
Even if the current checks might be mostly good enough most of the time, it seems desirable to exclude a node from quota pool considerations the moment it becomes draining, to avoid write stalls that could last for seconds.
cc @aayushshah15 and @knz since you're both on related topics.
To Reproduce
I don't have a reproduction. One would involve going full speed on a certain range, and gracefully draining one of its members, while asserting that the write latency remains constant.
Expected behavior
Ignore the node for purposes of the quota pool when it has a Draining liveness record.
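A minimal sketch of the intended behavior, using a simplified stand-in for the quota pool rather than the actual code in `replica_proposal_quota.go`. The `follower` type, its fields, and `minAckedIndex` are hypothetical; `recentlyActive` and `connHealthy` merely stand in for the existing `isFollowerActiveSince` and `ConnHealth` checks, and `draining` is the proposed addition.

```go
package main

import "fmt"

// follower is a hypothetical, simplified view of a follower as seen by the
// leader when it decides how much proposal quota can be released.
type follower struct {
	nodeID         int
	matchIndex     uint64 // highest log index the follower has acknowledged
	recentlyActive bool   // stands in for the isFollowerActiveSince check
	connHealthy    bool   // stands in for the ConnHealth check
	draining       bool   // proposed: liveness record says the node is draining
}

// countsTowardQuota reports whether a follower should hold back quota release.
func countsTowardQuota(f follower) bool {
	return f.recentlyActive && f.connHealthy && !f.draining
}

// minAckedIndex returns the lowest acknowledged index among the followers
// that count, capped at the leader's own index. Quota for entries at or below
// this index can be returned to the pool.
func minAckedIndex(leaderIndex uint64, followers []follower) uint64 {
	lowest := leaderIndex
	for _, f := range followers {
		if !countsTowardQuota(f) {
			continue // ignored followers never delay quota release
		}
		if f.matchIndex < lowest {
			lowest = f.matchIndex
		}
	}
	return lowest
}

func main() {
	followers := []follower{
		{nodeID: 2, matchIndex: 100, recentlyActive: true, connHealthy: true},
		// Node 3 lags behind but is draining, so it is skipped.
		{nodeID: 3, matchIndex: 40, recentlyActive: true, connHealthy: true, draining: true},
	}
	fmt.Println(minAckedIndex(100, followers)) // prints 100
}
```

Without the `draining` check, node 3 would pin the result at 40 until one of the other two checks noticed it was gone, which is the window in which the range could stall.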
Additional context
Jira issue: CRDB-3627
Epic CRDB-39898