
colrpc: propagate error immediately in Inbox #51143

Merged
merged 1 commit into cockroachdb:master on Jul 8, 2020
Conversation

@asubiotto (Contributor)

The Inbox would previously buffer any metadata received from the remote side,
including errors. This could cause issues for special errors that are
swallowed during draining but not execution, because all errors would only be
returned during draining.

Release note: None (bug not present in release)

Fixes #50687

@asubiotto requested review from yuzefovich and a team on July 8, 2020 14:34
@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member) left a comment


:lgtm:

I think at some point I noticed this issue (not sure if it was before or after the PR that refactored how we perform draining) but probably forgot to follow that thought through.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@asubiotto (Contributor, Author)

bors r=yuzefovich

@celiala mentioned this pull request on Jul 8, 2020
@andreimatei (Contributor) left a comment


No test?

I believe we have a LogicTest config which runs DistSQL in a mode where metadata is injected periodically and verified to be received. Was that supposed to have caught this?
I might be bundling different things here, but I feel like we've had more trouble recently with metadata flowing through, so I'd think about whether there's any new systemic testing to be done.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@yuzefovich (Member)

bors r-

CI is red.

I believe we have a LogicTest config which runs DistSQL in a mode where metadata is injected periodically and verified to be received. Was that supposed to have caught this?

I think you're talking about fakedist-metadata config, and it does make sure that the injected metadata is received, but maybe it's not immediately applicable to this issue, not sure.

@craig (craig bot) commented on Jul 8, 2020

Canceled

@asubiotto (Contributor, Author)

I wanted to get this out ASAP so that we can restart the alpha (passing scaledata and CI is good enough for me). The reason the fakedist-metadata tests didn't catch this is that there is no observable difference for normal metadata (it's propagated either way). It's only when we're talking about ReadWithinUncertaintyInterval that it matters when the error is propagated.

I'll think about a good way to test this and follow up with another PR.

@andreimatei (Contributor) left a comment


Where is the swallowing of the ReadWithinUncertaintyIntervalError during draining done? I can't find it.
I wonder if the right fix here (perhaps in addition to this patch) is to change that logic or the structure around there to swallow or not swallow depending on how the draining was initiated. Like, if the processor with the buffered error was asked by its consumer to drain, then the buffered error can be swallowed. If the processor itself initiated the draining because it hit this error, then we shouldn't swallow. Do you think there's anything here?

Or, do you think we should gear the signatures of methods dealing with meta, like RemoteProducerMetaToLocalMeta and AppendTrailingMeta, to deal with errors and make it hard for the caller to ignore them? Should we simply disallow the buffering of errors?
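
For concreteness, here is one hypothetical shape such a signature change could take. This is only a sketch of the idea floated above; the real RemoteProducerMetaToLocalMeta and AppendTrailingMeta signatures in the codebase are different, and every name below is illustrative:

```go
package main

import "fmt"

// Hypothetical sketch of "making it hard for the caller to ignore errors".
// This is NOT the real execinfra/execinfrapb API, just an illustration.

// localMeta stands in for decoded, non-error producer metadata.
type localMeta struct {
	traceData []byte
}

// remoteProducerMetaToLocalMeta returns the error component of the remote
// metadata separately instead of folding it into the struct, so a caller that
// wants to buffer the metadata must explicitly decide what to do with err.
func remoteProducerMetaToLocalMeta(raw map[string][]byte) (localMeta, error) {
	if errMsg, ok := raw["error"]; ok {
		return localMeta{}, fmt.Errorf("remote error: %s", errMsg)
	}
	return localMeta{traceData: raw["trace"]}, nil
}

func main() {
	_, err := remoteProducerMetaToLocalMeta(map[string][]byte{"error": []byte("boom")})
	// The caller cannot silently buffer this: the error is a separate return
	// value that must be handled (or explicitly discarded).
	fmt.Println(err)
}
```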

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@yuzefovich (Member)

Where is the swallowing of the ReadWithinUncertaintyIntervalError during draining done?

execinfra/processorsbase.go:643

@asubiotto (Contributor, Author)

Updated unit tests and added a simple check that the error is propagated.

I wonder if the right fix here (perhaps in addition to this patch) is to change that logic or the structure around there to swallow or not swallow depending on how the draining was initiated.

I think the logic you're talking about already exists in processorsbase.go. MoveToDraining is called with the error that caused the transition, and that error is never swallowed. The other case is that the processor transitions to draining, notifies its upstream processors via ConsumerDone, and then proceeds to swallow any errors.
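
To spell out the two cases being described, here is a minimal, self-contained sketch. It is not the actual execinfra.ProcessorBase implementation; everything other than the names MoveToDraining and ConsumerDone is illustrative, and the real code only swallows certain error types (such as ReadWithinUncertaintyIntervalError) rather than all of them:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal sketch of the two draining cases; not the real ProcessorBase code.

type producerMetadata struct{ err error }

type processorSketch struct {
	draining     bool
	trailingMeta []producerMetadata
}

// MoveToDraining covers case 1: the processor itself hit an error during
// execution. That error is recorded as trailing metadata and is never
// swallowed.
func (p *processorSketch) MoveToDraining(err error) {
	if err != nil {
		p.trailingMeta = append(p.trailingMeta, producerMetadata{err: err})
	}
	p.draining = true
}

// ConsumerDone covers case 2: the consumer asked the processor to drain, so
// the transition happens without an error of its own.
func (p *processorSketch) ConsumerDone() {
	p.MoveToDraining(nil /* err */)
}

// drainInput consumes metadata from an input while draining; errors that only
// arrive from inputs at this point may be swallowed (the behavior in
// processorsbase.go discussed above).
func (p *processorSketch) drainInput(inputMeta []producerMetadata) []producerMetadata {
	out := p.trailingMeta
	for _, meta := range inputMeta {
		if p.draining && meta.err != nil {
			continue // swallowed: the error only showed up during draining
		}
		out = append(out, meta)
	}
	return out
}

func main() {
	p := &processorSketch{}
	p.ConsumerDone() // case 2: drain requested by the consumer
	inputMeta := []producerMetadata{{err: errors.New("buffered remote error")}}
	fmt.Println(p.drainInput(inputMeta)) // the input's error is swallowed: prints []
}
```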

@andreimatei (Contributor) left a comment


So I'm looking at this swallowing and I'm now a bit confused about what the problem is here. A processor doesn't seem to swallow a buffered error, only errors coming from its inputs. So I guess it's not the Inbox that's swallowing something it shouldn't? It's a downstream processor I guess. But if the downstream processor is draining, then it should be OK to swallow that error regardless of when the Inbox received it (no?).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@asubiotto (Contributor, Author)

The problem is that the Inbox silently buffers an error and returns a zero-length batch (equivalent to a nil row). At this point, the root of the flow (materializer) transitions to draining its inputs (which includes the Inbox). This is when the error is returned, but it is swallowed, since the materializer transitioned to draining before it encountered the error. This patch makes it so that the Inbox returns the error eagerly so that the materializer calls MoveToDraining with the error, which doesn't get swallowed, since it was received during execution.
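
To make the before/after concrete, here is a heavily simplified sketch of the behavioral change. It is not the actual colrpc.Inbox implementation; all types and names below are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// Heavily simplified sketch of the change; not the actual colrpc.Inbox code.

type remoteMessage struct {
	batch []int // stands in for a coldata.Batch
	err   error // error metadata received from the remote side
}

type inboxSketch struct {
	stream       chan remoteMessage
	bufferedErrs []error
}

// nextOld models the previous behavior: the error is buffered and a
// zero-length batch is returned, so the materializer transitions to draining
// without having seen an error, and the error it later receives while
// draining can be swallowed.
func (i *inboxSketch) nextOld() ([]int, error) {
	msg := <-i.stream
	if msg.err != nil {
		i.bufferedErrs = append(i.bufferedErrs, msg.err)
		return nil, nil // zero-length batch; error only surfaces during draining
	}
	return msg.batch, nil
}

// nextNew models the behavior after this patch: the error is propagated
// eagerly, so the materializer calls MoveToDraining with that error and it is
// not swallowed.
func (i *inboxSketch) nextNew() ([]int, error) {
	msg := <-i.stream
	if msg.err != nil {
		return nil, msg.err
	}
	return msg.batch, nil
}

func main() {
	i := &inboxSketch{stream: make(chan remoteMessage, 1)}
	i.stream <- remoteMessage{err: errors.New("ReadWithinUncertaintyIntervalError (illustrative)")}
	if _, err := i.nextNew(); err != nil {
		fmt.Println("propagated eagerly:", err)
	}
}
```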

@andreimatei (Contributor)

The problem is that the Inbox silently buffers an error and returns a zero-length batch (equivalent to a nil row).

I see. So with vectorized execution there's no way to return metadata before the root tells everybody to drain? If so, does swallowing errors in DrainHelper() ever make sense for a vectorized processor?

@andreimatei (Contributor) left a comment


I guess the swallowing still makes sense, nvm.
This fix sounds good to me now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@asubiotto (Contributor, Author)

bors r=yuzefovich

@yuzefovich (Member) left a comment


Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)


pkg/sql/colflow/colrpc/outbox_test.go, line 65 at r2 (raw file):

	streamHandlerErrCh := handleStream(ctx, inbox, rpcLayer.server, func() { close(rpcLayer.server.csChan) })

	// The outbox will be sending the panic as eagerly. This Next call will

nit: s/as eagerly/eagerly/.

@craig (craig bot) commented on Jul 8, 2020

Build succeeded

@craig (craig bot) merged commit bb6d476 into cockroachdb:master on Jul 8, 2020
@asubiotto (Contributor, Author)

I think this issue exists in the hash router as well. I'll look into whether we can generate some ReadWithinUncertaintyInterval errors in fakedist-metadata or introduce some other way to test this behavior generally.

@yuzefovich (Member)

I also think that the wrong buffering behavior is present on previous branches, e.g. https://github.com/cockroachdb/cockroach/blob/release-20.1/pkg/sql/colflow/colrpc/inbox.go#L317. I feel like we should be backporting this PR, no?

@asubiotto (Contributor, Author)

The wrong buffering behavior is present on other branches, but it doesn't matter there because we mistakenly don't swallow errors in that version of the vectorized engine. #50388 fixed that bug, which uncovered this new one.

@asubiotto deleted the fsdt branch on August 3, 2020 12:29
Successfully merging this pull request may close these issues.

roachtest: scaledata/filesystem_simulator/nodes=3 failed