Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.1: colflow: cancel flow on ungraceful stream shutdown in outbox #64219

Merged
merged 1 commit into from
Apr 29, 2021

Conversation

yuzefovich
Copy link
Member

Backport 1/1 commits from #63772.

/cc @cockroachdb/release


Previously, when an outbox Recved a non-io.EOF error from the gRPC
stream, we would simply log it. Such an error indicates an ungraceful
shutdown of the stream, and, instead, it should lead to the cancellation
of the whole stream. Previously, the main goroutine of the outbox would
keep on running and it would exit only the next time it attempted to
Send something on the stream. This is now fixed, and the flow ctx will
be canceled when a stream is shutdown ungracefully. This is what we do
in the row-based flows, and this commit adds this behavior the
vectorized outbox.

Fixes: https://github.com/cockroachlabs/support/issues/924.

Release note (bug fix): Previously, the remote flows of execution in the
vectorized engine could take quite a long time to shut down whenever a
node participating in the plan dies.

Previously, when an outbox `Recv`ed a non-`io.EOF` error from the gRPC
stream, we would simply log it. Such an error indicates an ungraceful
shutdown of the stream, and, instead, it should lead to the cancellation
of the whole stream. Previously, the main goroutine of the outbox would
keep on running and it would exit only the next time it attempted to
`Send` something on the stream. This is now fixed, and the flow ctx will
be canceled when a stream is shutdown ungracefully. This is what we do
in the row-based flows, and this commit adds this behavior the
vectorized outbox.

Release note (bug fix): Previously, the remote flows of execution in the
vectorized engine could take quite a long time to shut down whenever a
node participating in the plan dies.
@yuzefovich yuzefovich added the do-not-merge bors won't merge a PR with this label. label Apr 26, 2021
@yuzefovich yuzefovich requested a review from jordanlewis April 26, 2021 16:37
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@yuzefovich
Copy link
Member Author

This wasn't a clean backport (some additional changes were needed in inbox.go). Will wait for 21.1.1.

@yuzefovich
Copy link
Member Author

yuzefovich commented Apr 27, 2021

Given that we decided the issue this PR fixes is a GA-blocker, I added the corresponding label. (Not sure whether I should file a public issue for the root cause.)

@yuzefovich yuzefovich requested a review from rytaft April 27, 2021 21:08
Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 15 of 15 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@yuzefovich
Copy link
Member Author

Note that we hit a time out in the logic tests (#64294) which could be related to this change (but my current guess is it's not), so I'll hold off on merging it for now.

@yuzefovich yuzefovich added the do-not-merge bors won't merge a PR with this label. label Apr 28, 2021
@rytaft
Copy link
Collaborator

rytaft commented Apr 28, 2021

Ok sounds good. Also wondering if we should revisit the decision to call this a GA Blocker. This bug existed in 20.2, correct? Seems like your original instinct to wait until 21.1.1 might be the right idea...

@yuzefovich
Copy link
Member Author

No, the bug identified in #64294 only exists on master (because the "refactor Operator interface" PR); the "design flaw" exists on previous branches, but it is tapered over by a fix to a different problem.

@rytaft
Copy link
Collaborator

rytaft commented Apr 28, 2021

Ok, I'll trust your judgment about whether we should release 21.1.0 with or without this fix to the design flaw, but just want to make sure we're confident about this backport since we're getting into the RC stage now

@yuzefovich
Copy link
Member Author

Yeah, makes sense. The fix to #64294 is not needed to be backported, but I want a bit more time to confirm that the fix is proper for the deadlock observed in that issue. I'm currently rerunning the logic tests with several fixes (none need to be backported) to make sure I understand the failure scenarios. So far looking good, so I'm optimistic.

I believe it'll be very beneficial to backport this change to 21.1.0 given that we will be using the vectorized engine for the validation queries. If not for that change (which is beneficial in its own right), I would have voted to wait till 21.1.1, but with that change already in, this PR is a very nice to have, so I'll keep GA-blocker label for now.

In terms of the timeline, I'll keep my gceworker running the logic tests for 24 hours, if no new failures come up, then I'll assume that I'm understanding all problems on master correctly, and I'll merge this backport tomorrow. If something new comes up, we'll likely have to let this PR go in for 21.1.1.

Copy link
Collaborator

@rytaft rytaft left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good!

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @jordanlewis)

@yuzefovich yuzefovich removed the do-not-merge bors won't merge a PR with this label. label Apr 29, 2021
@yuzefovich
Copy link
Member Author

With the fixes cherry-picked on master I got 100 successful runs of the logic tests and didn't hit #64294, so I'm quite confident that this PR is not making things worse in any dimension, so I'll go ahead and merge it. IMO if we want to get this for 21.1.0 (which we do) it's better to merge it sooner in order to have a bit more baking time on release-21.1 branch.

@yuzefovich yuzefovich merged commit 65bd373 into cockroachdb:release-21.1 Apr 29, 2021
@yuzefovich yuzefovich deleted the backport21.1-63772 branch April 29, 2021 15:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants