kvserver: remove extraneous circuit breaker check in Raft transport #69405
Conversation
This is going to result in the queue filling up when a node is unavailable, and all of those thousands of queued messages getting sent when it comes back up. Do we need some additional TTL logic or queue draining or something to prevent this, or do we already have safeguards for this?
That's not quite what will happen, I think. Because of the second check (cockroach/pkg/kv/kvserver/raft_transport.go, line 621 at e1d01d0), the newly created queue will quickly be torn down again (or slowly, depending on whether the breaker lets us through, and if so, whether we fail fast, fail slow, or, who knows, succeed). This isn't ideal, but it seems better than the status quo. I'm currently looking into adding a control loop to the breakers (at least in our usage of them in nodedialer), but I doubt that's something we'll want to land now. Also, the queue has a bounded size (at least in the number of messages), so there is some protection, though not the best one. Either way, the protection is no worse than for nodes which are live.
Right, I see that we actually drain the queue on dial failures; that's the sort of protection I was curious about (cockroach/pkg/kv/kvserver/raft_transport.go, lines 598 to 613 at e1d01d0).
I seem to recall the buffer size is something like 10-20k messages, though. A live node will generally be able to keep up, so the queue is generally small, but it's true that it would pile up if the remote node struggles.
Mind giving this another look? The tests failed, so I went the other route of bypassing the second circuit breaker check (while still reporting success to the breaker if things go smoothly). This keeps the behavioral change in this diff very small: it fixes exactly the bug, but no more. It was also forced upon me by the tests, which asserted that the queue doesn't even get started with the breaker open, and also that the second use of the breaker reports success. Surprisingly comprehensive, and we're better off for it imo.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/kv/kvserver/raft_transport.go, line 620 at r1 (raw file):
// checked the breaker. Checking it again can cause livelock, see:
// https://github.com/cockroachdb/cockroach/issues/68419
conn, err := t.dialer.DialNoBreaker(ctx, toNodeID, class)
This will bypass some internal health logging, which seems unfortunate. We may want to add some corresponding logging here.
cockroach/pkg/rpc/nodedialer/nodedialer.go
Lines 159 to 162 in f72206d
// Enforce a minimum interval between warnings for failed connections.
if err != nil && ctx.Err() == nil && breaker != nil && breaker.ShouldLog() {
	log.Health.Infof(ctx, "unable to connect to n%d: %s", nodeID, err)
}
pkg/kv/kvserver/raft_transport.go, line 624 at r1 (raw file):
breaker := t.dialer.GetCircuitBreaker(toNodeID, class)
if err != nil {
	breaker.Fail(err)
We need to check ctx.Err() here; otherwise, context cancellation will trip the breaker.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @tbg)
pkg/rpc/nodedialer/nodedialer.go, line 94 at r2 (raw file):
}

// DialNoBreaker ignores the breaker if there is an error dialing. This function
This comment should be updated to say it ignores the breaker state and always tries to dial, but reports the result.
e0b6f54 to d5bf49a
Almost there...
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker)
pkg/kv/kvserver/raft_transport.go, line 620 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This will bypass some internal health logging, which seems unfortunate. We may want to add some corresponding logging here.
cockroach/pkg/rpc/nodedialer/nodedialer.go
Lines 159 to 162 in f72206d
// Enforce a minimum interval between warnings for failed connections.
if err != nil && ctx.Err() == nil && breaker != nil && breaker.ShouldLog() {
	log.Health.Infof(ctx, "unable to connect to n%d: %s", nodeID, err)
}
I updated DialNoBreaker to notify the breaker instead. This makes everything work as expected. I needed to update DialNoBreaker to also break on resolver errors. The only other caller is the DistSQL outbox, which doesn't care much about this, and #40691 didn't offer any explanation. I think it was simply avoiding the breaker entirely, which in hindsight was not ideal. Now we do what should be the right thing: notify the breaker just like Dial would, but avoid failing fast.
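The resulting contract can be summarized with a toy model (illustrative types and names, not the real nodedialer/circuit breaker API): Dial fails fast when the breaker is open, while DialNoBreaker always attempts the connection but still reports the outcome so the breaker's state stays fresh.

```go
package main

import (
	"errors"
	"fmt"
)

// breaker is a toy circuit breaker: it trips after three failures
// and resets on success.
type breaker struct {
	tripped  bool
	failures int
}

func (b *breaker) Ready() bool { return !b.tripped }
func (b *breaker) Fail()       { b.failures++; b.tripped = b.failures >= 3 }
func (b *breaker) Success()    { b.failures, b.tripped = 0, false }

// dial models Dial: fail fast if the breaker is open.
func dial(b *breaker, connect func() error) error {
	if !b.Ready() {
		return errors.New("breaker open") // fail fast, no dial attempt
	}
	return dialNoBreaker(b, connect)
}

// dialNoBreaker models DialNoBreaker: always attempt the connection
// regardless of breaker state, but still report the result.
func dialNoBreaker(b *breaker, connect func() error) error {
	if err := connect(); err != nil {
		b.Fail()
		return err
	}
	b.Success()
	return nil
}

func main() {
	b := &breaker{tripped: true}
	fmt.Println(dial(b, func() error { return nil }) != nil)          // → true: fails fast
	fmt.Println(dialNoBreaker(b, func() error { return nil }) == nil) // → true: dialed anyway
	fmt.Println(b.Ready())                                            // → true: success reset the breaker
}
```

This is exactly why reporting the result matters for the bug at hand: without the Success call, a breaker tripped during an outage would never observe the node's recovery.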
pkg/kv/kvserver/raft_transport.go, line 624 at r1 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
We need to check ctx.Err() here, otherwise context cancellation will trip the breaker.
Obsolete.
pkg/rpc/nodedialer/nodedialer.go, line 94 at r2 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
This comment should be updated to say it ignores the breaker state and always tries to dial, but reports the result.
Rewrote the comment.
Maybe consider rewording the commit message, although I suppose it's accurate as is too.
Reviewed 1 of 2 files at r2, 2 of 2 files at r3, 1 of 1 files at r4, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @tbg)
See cockroachdb#68419. We now use `DialNoBreaker` for the raft transport, taking into account the previous `Ready()` check. `DialNoBreaker` was previously bypassing the breaker as it ought to, but was also *not reporting to the breaker* the result of the operation; this is not ideal and was caught by the tests. This commit changes `DialNoBreaker` to report the result (i.e. fail or success).

Release justification: bug fix

Release note (bug fix): Previously, after a temporary node outage, other nodes in the cluster could fail to connect to the restarted node due to their circuit breakers not resetting. This would manifest in the logs via messages "unable to dial nXX: breaker open", where `XX` is the ID of the restarted node. (Note that such errors are expected for nodes that are truly unreachable, and may still occur around the time of the restart, but for no longer than a few seconds.)
bors r=erikgrinaker
Build succeeded.
blathers backport 21.1 |
Can we get a backport to 20.2 as well? The original incident was with 20.2.13, so it applies. |
blathers backport 20.2 |
Encountered an error creating backports. Some common things that can go wrong:
1. The backport branch might have already existed.
2. There was a merge conflict.
3. The backport branch contained merge commits.
You might need to create your backport manually using the backport tool.
error creating merge commit from 4304289 to blathers/backport-release-20.2-69405: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []
you may need to manually resolve merge conflicts with the backport tool.
Backport to branch 20.2 failed. See errors above.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.
Ok, time to build RocksDB again then :-)