-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
release-21.1: colflow: cancel flow on ungraceful stream shutdown in outbox #64219
release-21.1: colflow: cancel flow on ungraceful stream shutdown in outbox #64219
Conversation
Previously, when an outbox `Recv`ed a non-`io.EOF` error from the gRPC stream, we would simply log it. Such an error indicates an ungraceful shutdown of the stream, and, instead, it should lead to the cancellation of the whole stream. Previously, the main goroutine of the outbox would keep on running and it would exit only the next time it attempted to `Send` something on the stream. This is now fixed, and the flow ctx will be canceled when a stream is shutdown ungracefully. This is what we do in the row-based flows, and this commit adds this behavior the vectorized outbox. Release note (bug fix): Previously, the remote flows of execution in the vectorized engine could take quite a long time to shut down whenever a node participating in the plan dies.
This wasn't a clean backport (some additional changes were needed in |
Given that we decided the issue this PR fixes is a GA-blocker, I added the corresponding label. (Not sure whether I should file a public issue for the root cause.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 15 of 15 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
Note that we hit a time out in the logic tests (#64294) which could be related to this change (but my current guess is it's not), so I'll hold off on merging it for now. |
Ok sounds good. Also wondering if we should revisit the decision to call this a GA Blocker. This bug existed in 20.2, correct? Seems like your original instinct to wait until 21.1.1 might be the right idea... |
No, the bug identified in #64294 only exists on master (because the "refactor Operator interface" PR); the "design flaw" exists on previous branches, but it is tapered over by a fix to a different problem. |
Ok, I'll trust your judgment about whether we should release 21.1.0 with or without this fix to the design flaw, but just want to make sure we're confident about this backport since we're getting into the RC stage now |
Yeah, makes sense. The fix to #64294 is not needed to be backported, but I want a bit more time to confirm that the fix is proper for the deadlock observed in that issue. I'm currently rerunning the logic tests with several fixes (none need to be backported) to make sure I understand the failure scenarios. So far looking good, so I'm optimistic. I believe it'll be very beneficial to backport this change to 21.1.0 given that we will be using the vectorized engine for the validation queries. If not for that change (which is beneficial in its own right), I would have voted to wait till 21.1.1, but with that change already in, this PR is a very nice to have, so I'll keep GA-blocker label for now. In terms of the timeline, I'll keep my gceworker running the logic tests for 24 hours, if no new failures come up, then I'll assume that I'm understanding all problems on master correctly, and I'll merge this backport tomorrow. If something new comes up, we'll likely have to let this PR go in for 21.1.1. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good!
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)
With the fixes cherry-picked on master I got 100 successful runs of the logic tests and didn't hit #64294, so I'm quite confident that this PR is not making things worse in any dimension, so I'll go ahead and merge it. IMO if we want to get this for 21.1.0 (which we do) it's better to merge it sooner in order to have a bit more baking time on release-21.1 branch. |
Backport 1/1 commits from #63772.
/cc @cockroachdb/release
Previously, when an outbox
Recv
ed a non-io.EOF
error from the gRPCstream, we would simply log it. Such an error indicates an ungraceful
shutdown of the stream, and, instead, it should lead to the cancellation
of the whole stream. Previously, the main goroutine of the outbox would
keep on running and it would exit only the next time it attempted to
Send
something on the stream. This is now fixed, and the flow ctx willbe canceled when a stream is shutdown ungracefully. This is what we do
in the row-based flows, and this commit adds this behavior the
vectorized outbox.
Fixes: https://github.com/cockroachlabs/support/issues/924.
Release note (bug fix): Previously, the remote flows of execution in the
vectorized engine could take quite a long time to shut down whenever a
node participating in the plan dies.