Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rpc: Dialer.ConnHealth must not cause network IO #69888

Closed
erikgrinaker opened this issue Sep 7, 2021 · 0 comments · Fixed by #70017
Closed

rpc: Dialer.ConnHealth must not cause network IO #69888

erikgrinaker opened this issue Sep 7, 2021 · 0 comments · Fixed by #70017
Assignees
Labels
A-server-networking Pertains to network addressing,routing,initialization C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team

Comments

@erikgrinaker
Copy link
Contributor

Dialer.ConnHealth is used to check whether a healthy RPC connection exists to a given node. This is often done to avoid interacting with unhealthy nodes. However, internally this method actually dials the remote node if no connection already exists:

conn := n.rpcContext.GRPCDialNode(addr.String(), nodeID, class)
return conn.Health()

The method is called synchronously in several performance-critical code paths, including the main Replica.handleRaftReady Raft processing path (inside Replica.updateProposalQuotaRaftMuLocked) and the Store.raftTickLoop Raft tick path (inside Store.updateLivenessMap). This is problematic because the node dial can hang for a long time — in particular, if the remote IP address or DNS server does not respond at all (which can happen e.g. with power loss, VM shutdown, network connectivity problems, etc) the TCP/IP stack or DNS client will keep retrying until it times out, often for tens of seconds.

ConnHealth must not cause any synchronous network IO at all in order for it to be safe to use in these code paths. We must also make sure that no code implicitly relies on ConnHealth dialing the remote node.

Related to #53410 and #68419 (comment).

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team A-server-networking Pertains to network addressing,routing,initialization labels Sep 7, 2021
@erikgrinaker erikgrinaker self-assigned this Sep 7, 2021
@craig craig bot closed this as completed in 73ae236 Sep 21, 2021
tbg added a commit to tbg/cockroach that referenced this issue Sep 30, 2021
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asychronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to inacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are
pretty clear about what they protect - but it is worth mentioning
as this consideration likely influenced the design of the original
breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007

Release note: None
tbg added a commit to tbg/cockroach that referenced this issue Oct 1, 2021
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based
circuit breakers. This breaker does *not* recruit the occasional
request to carry out the probing. Instead, the circuit breaker
is configured with an "asychronous probe" that effectively
determines when the breaker should reset.

We prefer this approach precisely because it avoids recruiting
regular traffic, which is often tied to end-user requests, and
led to inacceptable latencies there.

The potential downside of the probing approach is that the breaker setup
is more complex and there is residual risk of configuring the probe
differently from the actual client requests. In the worst case, the
breaker would be perpetually tripped even though everything should be
fine. This isn't expected - our two uses of circuit breakers are pretty
clear about what they protect - but it is worth mentioning as this
consideration likely influenced the design of the original breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for:
cockroachdb#33007
which will use the `Signal()` call.

Release note: None
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-server-networking Pertains to network addressing,routing,initialization C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv KV Team
Projects
None yet
Development

Successfully merging a pull request may close this issue.

1 participant