
kvserver: ignore draining nodes in proposal quota #55806

Open
tbg opened this issue Oct 21, 2020 · 3 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination.
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)
T-kv KV Team

Comments

@tbg
Member

tbg commented Oct 21, 2020

Describe the problem

It doesn't seem like we take the Draining status of a node into account in the quota pool. This means that when the node terminates, from the quota pool's point of view it has simply disappeared.

I think we mostly get this right, though perhaps accidentally:

	if !r.mu.lastUpdateTimes.isFollowerActiveSince(
		ctx, rep.ReplicaID, now, r.store.cfg.RangeLeaseActiveDuration(),
	) {
		return
	}
	// Only consider followers that have "healthy" RPC connections.
	if err := r.store.cfg.NodeDialer.ConnHealth(rep.NodeID, r.connectionClass.get()); err != nil {
		return
	}

Note the ConnHealth check here, which would presumably go red fairly quickly, on the order of an RPC heartbeat interval:

// defaultRPCHeartbeatInterval is the default value of RPCHeartbeatInterval
// used by the rpc context.
defaultRPCHeartbeatInterval = 3 * time.Second

The isFollowerActiveSince check, by contrast, will be somewhat slower to fire (perhaps a few seconds more; I didn't check). Either way, if we run out of quota during that window, the range will stall until one of the checks clears.
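
To make the failure mode concrete, here is a toy model of the stall (my own sketch, not the actual quota pool implementation; all names are made up): quota is acquired per proposal and only released once every follower the pool still tracks has acknowledged the entry, so a tracked follower that silently stops acking eventually pins all of the quota.

package main

import "fmt"

// toyQuotaPool is a deliberately simplified model, not the real kvserver
// code: quota is taken per proposal and only handed back once every follower
// the pool still tracks has acknowledged the entry.
type toyQuotaPool struct {
	available int64
}

// acquire takes quota for a proposal. The real pool blocks the proposer
// here; this toy just reports failure.
func (q *toyQuotaPool) acquire(size int64) bool {
	if q.available < size {
		return false
	}
	q.available -= size
	return true
}

// release hands quota back, but only if all tracked followers acked.
func (q *toyQuotaPool) release(size int64, ackedByAllTracked bool) {
	if ackedByAllTracked {
		q.available += size
	}
}

func main() {
	q := &toyQuotaPool{available: 3}
	for i := 0; i < 5; i++ {
		ok := q.acquire(1)
		fmt.Printf("proposal %d: acquired=%t, available=%d\n", i, ok, q.available)
		// A follower on a draining node that is still tracked never acks,
		// so no quota ever comes back.
		q.release(1, false)
	}
}

In the real pool the proposer blocks instead of failing fast, which is exactly the write stall described above.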

Even if the current checks are mostly good enough in practice, it seems desirable to exclude a node from quota pool considerations the moment it starts draining, to avoid write stalls that could last several seconds.

cc @aayushshah15 and @knz since you're both on related topics.

To Reproduce

I don't have a reproduction. One would involve writing at full speed to a given range, gracefully draining one of its members, and asserting that the write latency remains constant throughout.

Expected behavior
Ignore the node for purposes of the quota pool when it has a Draining liveness record.
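
A minimal sketch of what that could look like, layered on top of the two checks quoted above. Everything here (LivenessMap, shouldCountForQuota, the boolean parameters) is a hypothetical stand-in rather than the real kvserver/liveness API:

package main

import "fmt"

// Hypothetical stand-ins for kvserver/liveness types; the names do not
// correspond to the real CockroachDB code.
type NodeID int

type Liveness struct {
	NodeID   NodeID
	Draining bool
}

type LivenessMap map[NodeID]Liveness

// shouldCountForQuota mirrors the two existing checks (recent follower
// activity, healthy RPC connection) and adds the one proposed here: a
// follower on a draining node is excluded from quota accounting right away,
// rather than waiting for its connection or activity checks to go red.
func shouldCountForQuota(lm LivenessMap, id NodeID, recentlyActive, connHealthy bool) bool {
	if !recentlyActive || !connHealthy {
		return false
	}
	if l, ok := lm[id]; ok && l.Draining {
		return false
	}
	return true
}

func main() {
	lm := LivenessMap{
		1: {NodeID: 1},
		2: {NodeID: 2, Draining: true},
	}
	fmt.Println(shouldCountForQuota(lm, 1, true, true)) // true
	fmt.Println(shouldCountForQuota(lm, 2, true, true)) // false: draining, excluded immediately
}

The point is only the ordering: the draining check fires the moment the liveness record flips, without waiting for ConnHealth or the follower-activity check to notice anything.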


Jira issue: CRDB-3627

Epic CRDB-39898

@tbg added the A-kv-replication label Oct 21, 2020

@tbg added the C-enhancement label Oct 21, 2020
@jlinder added the T-kv label Jun 16, 2021
@erikgrinaker
Contributor

Related to #77251.


We have marked this issue as stale because it has been inactive for 18 months. If this issue is still relevant, removing the stale label or adding a comment will keep it active. Otherwise, we'll close it in 10 days to keep the issue queue tidy. Thank you for your contribution to CockroachDB!
