colrpc: propagate error immediately in Inbox #51143
Conversation
I think at some point I noticed this issue (not sure if it was before or after the PR that refactored how we perform draining) but probably forgot to follow that thought through.
Reviewed 1 of 1 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained
bors r=yuzefovich
No test?
I believe we have a LogicTest config which runs DistSQL in a mode where metadata is injected periodically and verified to be received. Was that supposed to have caught this?
I might be bundling different things here, but I feel like we've had more trouble recently with metadata flowing through so I guess I'd think about whether there's any new systemic testing to be done.
Reviewable status: complete! 1 of 0 LGTMs obtained
bors r- CI is red.
I think you're talking about …
Canceled
I wanted to get this out ASAP so that we can restart the alpha (passing …). I'll think about a good way to test this and follow up with another PR.
Where is the swallowing of the ReadWithinUncertaintyIntervalError during draining done? I can't find it.
I wonder if the right fix here (perhaps in addition to this patch) is to change that logic or the structure around there to swallow or not swallow depending on how the draining was initiated. Like, if the processor with the buffered error was asked by its consumer to drain, then the buffered error can be swallowed. If the processor itself initiated the draining because it hit this error, then we shouldn't swallow. Do you think there's anything here?
Or, do you think we should gear the signatures of methods dealing with meta, like RemoteProducerMetaToLocalMeta and AppendTrailingMeta, to deal with errors and make it hard for the caller to ignore them? Should we simply disallow the buffering of errors?
Reviewable status: complete! 1 of 0 LGTMs obtained
Updated unit tests and added a simple check that the error is propagated.
I think the logic you're talking about already exists in …
So I'm looking at this swallowing and I'm now a bit confused about what the problem is here. A processor doesn't seem to swallow a buffered error, only errors coming from its inputs. So I guess it's not the Inbox that's swallowing something it shouldn't? It's a downstream processor, I guess. But if the downstream processor is draining, then it should be OK to swallow that error regardless of when the Inbox received it (no?).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
The problem is that the …
I see. So with vectorized execution there's no way to return metadata before the root tells everybody to drain? If so, does swallowing errors in …
I guess the swallowing still makes sense, nvm.
This fix sounds good to me now.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
bors r=yuzefovich
Reviewed 2 of 2 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)
pkg/sql/colflow/colrpc/outbox_test.go, line 65 at r2 (raw file):
streamHandlerErrCh := handleStream(ctx, inbox, rpcLayer.server, func() { close(rpcLayer.server.csChan) })
// The outbox will be sending the panic as eagerly. This Next call will
nit: s/as eagerly/eagerly/
Build succeeded
I think this issue exists in the hash router as well. I'll look into seeing if we can maybe generate some …
I also think that the wrong buffering behavior is present on previous branches, e.g. https://github.com/cockroachdb/cockroach/blob/release-20.1/pkg/sql/colflow/colrpc/inbox.go#L317. I feel like we should be backporting this PR, no?
The wrong buffering behavior is present on other branches, but it doesn't matter because we mistakenly don't swallow errors in that version of the vectorized engine. #50388 fixed this bug, which uncovered this new bug.
The Inbox would previously buffer any metadata received from the remote side,
including errors. This could cause issues for special errors that are
swallowed during draining but not execution, because all errors would only be
returned during draining.
Release note: None (bug not present in release)
Fixes #50687