rpc: circuit breaker livelock if 2nd check comes too fast after the 1st #68419
Comments
I looked into how this should work instead a little bit. The basic idea is that instead of letting the occasional request through, a breaker should have a "watcher" goroutine associated with it. If the breaker trips, it is the watcher's job (and only the watcher's job) to try to determine when the target is available again. In the meantime, everyone else will fail fast 100% of the time.

In practice, the tangliness of our RPC layer raises additional questions, especially if we're also trying to fix parts of #53410. The circuit breakers sit at the level of

Lines 248 to 253 in 57ad801
But on certain errors the connection gets removed from

Lines 1097 to 1101 in 57ad801
and I'm not sure if …

I think what we want is:
The result should be that each node that "anyone" tries to actually inquire about has a watcher goroutine associated with it. Connection attempts from CRDB code will never be recruited as canary requests and so don't eat the possibly disastrous latencies that come with that. We also simplify the different layers of circuit breaking and notions of connection health, which should significantly simplify the code.

I think we are then also in a position to relatively easily address what must be the issues behind #53410: we make sure that the initial connection attempt to a node is timeboxed to, say, 1s (not sure what the current state is), and then the managed breakers do the rest. There is a lot of hard-earned wisdom in this tangly code, so we need to be careful.
I like this direction.
Do we actually need the breaker at all then? If we were to always keep a heartbeat loop running, presumably that could periodically try to connect to the remote node and fail any requests in the meantime. I suppose it might still be useful to keep the breaker as internal state, but I do like the idea of having a single actor responsible for managing the connection.
I think it becomes a game of naming things. There will be something like a circuit breaker; the question is whether there will be a …
That is how it works today, except... the code is pretty hard to understand and we are not confident how well it works; we definitely know it doesn't work that well for nodes for which we get an …
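To make the direction discussed above concrete, here is a minimal sketch of the watcher-goroutine idea. All names and details are assumed for illustration; this is not the eventual CockroachDB implementation.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// ProbeBreaker is a hypothetical probing-based breaker: once tripped, a
// single watcher goroutine owns recovery, and all other callers fail fast.
type ProbeBreaker struct {
	mu      sync.Mutex
	tripped error         // non-nil while the breaker is open
	probing bool          // true while the watcher goroutine is running
	probe   func() error  // e.g. dial and heartbeat the remote node
	backoff time.Duration // pause between probe attempts
}

func NewProbeBreaker(probe func() error, backoff time.Duration) *ProbeBreaker {
	return &ProbeBreaker{probe: probe, backoff: backoff}
}

// Err fails fast while the breaker is tripped. Regular callers are never
// recruited as canary requests.
func (b *ProbeBreaker) Err() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	return b.tripped
}

// Trip opens the breaker and launches the watcher goroutine (at most one).
func (b *ProbeBreaker) Trip(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.tripped = err
	if !b.probing {
		b.probing = true
		go b.watch()
	}
}

// watch is the only place that decides when the breaker resets.
func (b *ProbeBreaker) watch() {
	for {
		time.Sleep(b.backoff)
		if err := b.probe(); err == nil {
			b.mu.Lock()
			b.tripped = nil
			b.probing = false
			b.mu.Unlock()
			return
		}
	}
}

func main() {
	var healthy atomic.Bool // simulates whether the remote node is reachable

	b := NewProbeBreaker(func() error {
		if healthy.Load() {
			return nil
		}
		return errors.New("node unreachable")
	}, 100*time.Millisecond)

	b.Trip(errors.New("connection dropped"))
	fmt.Println(b.Err()) // fails fast: "connection dropped"

	healthy.Store(true)                // the remote node comes back
	time.Sleep(300 * time.Millisecond) // give the watcher time to probe
	fmt.Println(b.Err())               // <nil>: the watcher reset the breaker
}
```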
See cockroachdb#68419.

Release justification: bug fix

Release note (bug fix): Previously, after a temporary node outage, other nodes in the cluster could fail to connect to the restarted node due to their circuit breakers not resetting. This would manifest in the logs via messages "unable to dial nXX: breaker open", where `XX` is the ID of the restarted node. (Note that such errors are expected for nodes that are truly unreachable, and may still occur around the time of the restart, but for no longer than a few seconds.)
69405: kvserver: remove extraneous circuit breaker check in Raft transport r=erikgrinaker a=tbg

See #68419. We now use `DialNoBreaker` for the raft transport, taking into account the previous `Ready()` check. `DialNoBreaker` was previously bypassing the breaker as it ought to, but was also *not reporting to the breaker* the result of the operation; this is not ideal and was caught by the tests. This commit changes `DialNoBreaker` to report the result (i.e. fail or success).

Release justification: bug fix

Release note (bug fix): Previously, after a temporary node outage, other nodes in the cluster could fail to connect to the restarted node due to their circuit breakers not resetting. This would manifest in the logs via messages "unable to dial nXX: breaker open", where `XX` is the ID of the restarted node. (Note that such errors are expected for nodes that are truly unreachable, and may still occur around the time of the restart, but for no longer than a few seconds.)

Co-authored-by: Tobias Grieger <[email protected]>
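For illustration, the shape of the fix might look roughly like the sketch below: skip the breaker's admission check, but still feed the dial's outcome back into the breaker. The names (`dialNoBreaker`, the `breaker` interface) are assumptions for this sketch, not the actual `nodedialer` or circuitbreaker API.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"time"
)

// breaker is the minimal surface assumed here: a Ready/Success/Fail style
// breaker in the spirit of github.com/cockroachdb/circuitbreaker.
type breaker interface {
	Ready() bool // admission check (deliberately skipped below)
	Success()    // report a successful operation
	Fail()       // report a failed operation
}

// dialNoBreaker skips breaker.Ready() -- the caller has already decided to
// dial -- but still records the outcome so the breaker's state keeps
// tracking the actual health of the target.
func dialNoBreaker(ctx context.Context, b breaker, addr string) (net.Conn, error) {
	d := net.Dialer{Timeout: time.Second}
	conn, err := d.DialContext(ctx, "tcp", addr)
	if err != nil {
		b.Fail() // previously this result was dropped on the floor
		return nil, err
	}
	b.Success()
	return conn, nil
}

// countingBreaker is a stand-in implementation for the example.
type countingBreaker struct{ failures int }

func (cb *countingBreaker) Ready() bool { return true }
func (cb *countingBreaker) Success()    {}
func (cb *countingBreaker) Fail()       { cb.failures++ }

func main() {
	b := &countingBreaker{}
	_, err := dialNoBreaker(context.Background(), b, "127.0.0.1:0") // expected to fail
	fmt.Println(err != nil, b.failures)                             // true 1
}
```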
See cockroachdb#68419 (comment) for the original discussion.

This commit adds a new `circuit` package that uses probing-based circuit breakers. This breaker does *not* recruit the occasional request to carry out the probing. Instead, the circuit breaker is configured with an "asynchronous probe" that effectively determines when the breaker should reset. We prefer this approach precisely because it avoids recruiting regular traffic, which is often tied to end-user requests and led to unacceptable latencies there.

The potential downside of the probing approach is that the breaker setup is more complex and there is residual risk of configuring the probe differently from the actual client requests. In the worst case, the breaker would be perpetually tripped even though everything should be fine. This isn't expected - our two uses of circuit breakers are pretty clear about what they protect - but it is worth mentioning, as this consideration likely influenced the design of the original breaker.

Touches cockroachdb#69888
Touches cockroachdb#70111
Touches cockroachdb#53410

Also, this breaker was designed to be a good fit for cockroachdb#33007, which will use the `Signal()` call.

Release note: None
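As a rough illustration of how a caller might consume such a breaker, the `Signal()` idea is that a request can both check the breaker's current error up front and watch a channel to abort in-flight work if the breaker trips. The interface shape below is assumed for this sketch; the actual `circuit` package API may differ.

```go
package main

import (
	"context"
	"errors"
	"fmt"
)

// signal is the assumed shape of what a Signal()-style call returns.
type signal interface {
	Err() error         // non-nil if the breaker is currently tripped
	C() <-chan struct{} // receivable when the breaker trips
}

// sendRequest fails fast if the breaker is tripped and aborts early if it
// trips while the work is in flight. Regular callers do no probing.
func sendRequest(ctx context.Context, sig signal, work func(context.Context) error) error {
	if err := sig.Err(); err != nil {
		return err // fail fast; no canary traffic from regular callers
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()

	done := make(chan error, 1)
	go func() { done <- work(ctx) }()

	select {
	case <-sig.C():
		cancel()
		return errors.New("breaker tripped mid-request")
	case err := <-done:
		return err
	}
}

// healthySignal is a trivial stand-in used by the example below.
type healthySignal struct{}

func (healthySignal) Err() error         { return nil }
func (healthySignal) C() <-chan struct{} { return make(chan struct{}) }

func main() {
	err := sendRequest(context.Background(), healthySignal{},
		func(context.Context) error { return nil })
	fmt.Println(err) // <nil>
}
```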
Investigated by @nvanbenschoten and @erikgrinaker.
When a network connection drops, we also break its associated circuit breaker. For it to recover, the breaker needs to enter a "half open" state, where it lets an occasional (once per second) request through to try to re-establish the connection. If that succeeds, the breaker is moved to closed and considered recovered.

What we found is that the Raft transport was checking the circuit breaker associated with this `(destination, rpc class)` pair twice in order to establish a connection:

First, before creating a new `RaftTransport` queue:
cockroach/pkg/kv/kvserver/raft_transport.go
Line 553 in 9f15510
Second, when the `RaftTransport` queue started up and dialed the destination node:
cockroach/pkg/rpc/nodedialer/nodedialer.go
Line 154 in 9f15510
cockroach/pkg/kv/kvserver/raft_transport.go
Line 621 in e1d01d0
So the theory is that once a second, a request will make it through the first call to `Breaker.Ready`. However, when it does, it launches a new `RaftTransport` queue that immediately checks the breaker again. And since we haven't waited a second between calls to `Breaker.Ready`, this second call will always return false. So even in the cases where we pass the first breaker check, we always immediately fail the second. And since we're not passing the second check and successfully dialing, we never mark the breaker as closed here. Instead, we shut down the `RaftTransport` queue and start over again.

This is a fascinating pathology. In some sense, breakers are not reentrant. This patch to https://github.com/cockroachdb/circuitbreaker demonstrates that:
So any code that requires two consecutive calls to a breaker's `Ready()` function in order to reset the breaker is bound to be starved forever.

It's not yet clear what the best fix is for this. One solution is to expose an option from `nodedialer.Dialer.Dial` to skip the breaker check. Another is to do something more clever around breaker affinity.
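To make the starvation concrete, here is a self-contained toy (not the circuitbreaker patch referenced above) of a half-open breaker that admits at most one trial request per interval. Any code path that needs two consecutive admissions within the same interval never succeeds, so the breaker never resets.

```go
package main

import (
	"fmt"
	"time"
)

// toyBreaker models a tripped breaker that lets one trial request through
// per interval while half open.
type toyBreaker struct {
	tripped   bool
	lastTrial time.Time
	interval  time.Duration
}

// Ready admits at most one request per interval while tripped.
func (b *toyBreaker) Ready() bool {
	if !b.tripped {
		return true
	}
	if time.Since(b.lastTrial) >= b.interval {
		b.lastTrial = time.Now()
		return true
	}
	return false
}

func (b *toyBreaker) Success() { b.tripped = false }

func main() {
	b := &toyBreaker{tripped: true, interval: time.Second}

	for i := 0; i < 3; i++ {
		time.Sleep(1100 * time.Millisecond) // wait out the trial interval

		first := b.Ready()  // e.g. before creating the RaftTransport queue
		second := b.Ready() // e.g. inside the dial, immediately afterwards

		fmt.Printf("attempt %d: first=%t second=%t\n", i+1, first, second)
		if first && second {
			b.Success() // only reached if both checks pass -- never happens
		}
	}
	fmt.Println("breaker still tripped:", b.tripped) // true: livelock
}
```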