-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
rpc: Dialer.ConnHealth
must not cause network IO
#69888
Labels
A-server-networking
Pertains to network addressing,routing,initialization
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
Comments
erikgrinaker
added
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
A-server-networking
Pertains to network addressing,routing,initialization
labels
Sep 7, 2021
This was referenced Sep 7, 2021
This was referenced Sep 21, 2021
tbg
added a commit
to tbg/cockroach
that referenced
this issue
Sep 30, 2021
See cockroachdb#68419 (comment) for the original discussion. This commit adds a new `circuit` package that uses probing-based circuit breakers. This breaker does *not* recruit the occasional request to carry out the probing. Instead, the circuit breaker is configured with an "asychronous probe" that effectively determines when the breaker should reset. We prefer this approach precisely because it avoids recruiting regular traffic, which is often tied to end-user requests, and led to inacceptable latencies there. The potential downside of the probing approach is that the breaker setup is more complex and there is residual risk of configuring the probe differently from the actual client requests. In the worst case, the breaker would be perpetually tripped even though everything should be fine. This isn't expected - our two uses of circuit breakers are pretty clear about what they protect - but it is worth mentioning as this consideration likely influenced the design of the original breaker. Touches cockroachdb#69888 Touches cockroachdb#70111 Touches cockroachdb#53410 Also, this breaker was designed to be a good fit for: cockroachdb#33007 Release note: None
tbg
added a commit
to tbg/cockroach
that referenced
this issue
Oct 1, 2021
See cockroachdb#68419 (comment) for the original discussion. This commit adds a new `circuit` package that uses probing-based circuit breakers. This breaker does *not* recruit the occasional request to carry out the probing. Instead, the circuit breaker is configured with an "asychronous probe" that effectively determines when the breaker should reset. We prefer this approach precisely because it avoids recruiting regular traffic, which is often tied to end-user requests, and led to inacceptable latencies there. The potential downside of the probing approach is that the breaker setup is more complex and there is residual risk of configuring the probe differently from the actual client requests. In the worst case, the breaker would be perpetually tripped even though everything should be fine. This isn't expected - our two uses of circuit breakers are pretty clear about what they protect - but it is worth mentioning as this consideration likely influenced the design of the original breaker. Touches cockroachdb#69888 Touches cockroachdb#70111 Touches cockroachdb#53410 Also, this breaker was designed to be a good fit for: cockroachdb#33007 which will use the `Signal()` call. Release note: None
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Labels
A-server-networking
Pertains to network addressing,routing,initialization
C-bug
Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
T-kv
KV Team
Dialer.ConnHealth
is used to check whether a healthy RPC connection exists to a given node. This is often done to avoid interacting with unhealthy nodes. However, internally this method actually dials the remote node if no connection already exists:cockroach/pkg/rpc/nodedialer/nodedialer.go
Lines 223 to 224 in 793b4c8
The method is called synchronously in several performance-critical code paths, including the main
Replica.handleRaftReady
Raft processing path (insideReplica.updateProposalQuotaRaftMuLocked
) and theStore.raftTickLoop
Raft tick path (insideStore.updateLivenessMap
). This is problematic because the node dial can hang for a long time — in particular, if the remote IP address or DNS server does not respond at all (which can happen e.g. with power loss, VM shutdown, network connectivity problems, etc) the TCP/IP stack or DNS client will keep retrying until it times out, often for tens of seconds.ConnHealth
must not cause any synchronous network IO at all in order for it to be safe to use in these code paths. We must also make sure that no code implicitly relies onConnHealth
dialing the remote node.Related to #53410 and #68419 (comment).
The text was updated successfully, but these errors were encountered: