
colrpc: propagate error immediately in Inbox #51143

Merged
merged 1 commit into cockroachdb:master on Jul 8, 2020
Conversation

@asubiotto (Contributor)

The Inbox would previously buffer any metadata received from the remote side,
including errors. This could cause issues for special errors that are
swallowed during draining but not execution, because all errors would only be
returned during draining.

Release note: None (bug not present in release)

Fixes #50687

@asubiotto requested review from yuzefovich and a team on July 8, 2020 14:34
@cockroach-teamcity (Member)

This change is Reviewable

@yuzefovich (Member) left a comment


:lgtm:

I think at some point I noticed this issue (not sure if it was before or after the PR that refactored how we perform draining) but probably forgot to follow that thought through.

Reviewed 1 of 1 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@asubiotto (Contributor, Author)

bors r=yuzefovich

@celiala mentioned this pull request on Jul 8, 2020
@andreimatei (Contributor) left a comment


No test?

I believe we have a LogicTest config which runs DistSQL in a mode where metadata is injected periodically and verified to be received. Was that supposed to have caught this?
I might be bundling different things here, but I feel like we've had more trouble recently with metadata flowing through, so I'd think about whether there's any new systemic testing to be done.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@yuzefovich (Member)

bors r-

CI is red.

I believe we have a LogicTest config which runs DistSQL in a mode where metadata is injected periodically and verified to be received. Was that supposed to have caught this?

I think you're talking about fakedist-metadata config, and it does make sure that the injected metadata is received, but maybe it's not immediately applicable to this issue, not sure.

@craig (craig bot) commented on Jul 8, 2020

Canceled

@asubiotto (Contributor, Author)

I wanted to get this out ASAP so that we can restart the alpha (passing scaledata and CI is good enough for me). The reason the fakedist-metadata tests didn't catch this is that there is no observable difference for normal metadata (it's propagated either way). It's only when we're talking about ReadWithinUncertaintyInterval that it matters when the error is propagated.

I'll think about a good way to test this and follow up with another PR.

@andreimatei (Contributor) left a comment


Where is the swallowing of the ReadWithinUncertaintyIntervalError during draining done? I can't find it.
I wonder if the right fix here (perhaps in addition to this patch) is to change that logic or the structure around there to swallow or not swallow depending on how the draining was initiated. Like, if the processor with the buffered error was asked by its consumer to drain, then the buffered error can be swallowed. If the processor itself initiated the draining because it hit this error, then we shouldn't swallow. Do you think there's anything here?

Or, do you think we should gear the signatures of methods dealing with meta, like RemoteProducerMetaToLocalMeta and AppendTrailingMeta, to deal with errors and make it hard for the caller to ignore them? Should we simply disallow the buffering of errors?
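
For concreteness, here is one hypothetical shape such a signature change could take. This is only a sketch of the idea floated above; the real RemoteProducerMetaToLocalMeta and AppendTrailingMeta signatures in the codebase are different, and every name below is illustrative:

```go
package main

import "fmt"

// Hypothetical sketch of "making it hard for the caller to ignore errors".
// This is NOT the real execinfra/execinfrapb API, just an illustration.

// localMeta stands in for decoded, non-error producer metadata.
type localMeta struct {
	traceData []byte
}

// remoteProducerMetaToLocalMeta returns the error component of the remote
// metadata separately instead of folding it into the struct, so a caller that
// wants to buffer the metadata must explicitly decide what to do with err.
func remoteProducerMetaToLocalMeta(raw map[string][]byte) (localMeta, error) {
	if errMsg, ok := raw["error"]; ok {
		return localMeta{}, fmt.Errorf("remote error: %s", errMsg)
	}
	return localMeta{traceData: raw["trace"]}, nil
}

func main() {
	_, err := remoteProducerMetaToLocalMeta(map[string][]byte{"error": []byte("boom")})
	// The caller cannot silently buffer this: the error is a separate return
	// value that must be handled (or explicitly discarded).
	fmt.Println(err)
}
```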

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@yuzefovich (Member)

Where is the swallowing of the ReadWithinUncertaintyIntervalError during draining done?

execinfra/processorsbase.go:643

@asubiotto (Contributor, Author)

Updated unit tests and added a simple check that the error is propagated.

I wonder if the right fix here (perhaps in addition to this patch) is to change that logic or the structure around there to swallow or not swallow depending on how the draining was initiated.

I think the logic you're talking about already exists in processorsbase.go. MoveToDraining is called with the error that caused the transition, and that error is never swallowed. The other case is that the processor transitions to draining, notifies its upstream processors via ConsumerDone, and then proceeds to swallow any errors.
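
To spell out the two cases being described, here is a minimal, self-contained sketch. It is not the actual execinfra.ProcessorBase implementation; everything other than the names MoveToDraining and ConsumerDone is illustrative, and the real code only swallows certain error types (such as ReadWithinUncertaintyIntervalError) rather than all of them:

```go
package main

import (
	"errors"
	"fmt"
)

// Minimal sketch of the two draining cases; not the real ProcessorBase code.

type producerMetadata struct{ err error }

type processorSketch struct {
	draining     bool
	trailingMeta []producerMetadata
}

// MoveToDraining covers case 1: the processor itself hit an error during
// execution. That error is recorded as trailing metadata and is never
// swallowed.
func (p *processorSketch) MoveToDraining(err error) {
	if err != nil {
		p.trailingMeta = append(p.trailingMeta, producerMetadata{err: err})
	}
	p.draining = true
}

// ConsumerDone covers case 2: the consumer asked the processor to drain, so
// the transition happens without an error of its own.
func (p *processorSketch) ConsumerDone() {
	p.MoveToDraining(nil /* err */)
}

// drainInput consumes metadata from an input while draining; errors that only
// arrive from inputs at this point may be swallowed (the behavior in
// processorsbase.go discussed above).
func (p *processorSketch) drainInput(inputMeta []producerMetadata) []producerMetadata {
	out := p.trailingMeta
	for _, meta := range inputMeta {
		if p.draining && meta.err != nil {
			continue // swallowed: the error only showed up during draining
		}
		out = append(out, meta)
	}
	return out
}

func main() {
	p := &processorSketch{}
	p.ConsumerDone() // case 2: drain requested by the consumer
	inputMeta := []producerMetadata{{err: errors.New("buffered remote error")}}
	fmt.Println(p.drainInput(inputMeta)) // the input's error is swallowed: prints []
}
```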

@andreimatei (Contributor) left a comment


So I'm looking at this swallowing and I'm now a bit confused about what the problem is here. A processor doesn't seem to swallow a buffered error, only errors coming from its inputs. So I guess it's not the Inbox that's swallowing something it shouldn't? It's a downstream processor I guess. But if the downstream processor is draining, then it should be OK to swallow that error regardless of when the Inbox received it (no?).

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@asubiotto (Contributor, Author)

The problem is that the Inbox silently buffers an error and returns a zero-length batch (equivalent to a nil row). At this point, the root of the flow (materializer) transitions to draining its inputs (which includes the Inbox). This is when the error is returned, but it is swallowed, since the materializer transitioned to draining before it encountered the error. This patch makes it so that the Inbox returns the error eagerly so that the materializer calls MoveToDraining with the error, which doesn't get swallowed, since it was received during execution.
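
To make the before/after concrete, here is a heavily simplified sketch of the behavioral change. It is not the actual colrpc.Inbox implementation; all types and names below are illustrative:

```go
package main

import (
	"errors"
	"fmt"
)

// Heavily simplified sketch of the change; not the actual colrpc.Inbox code.

type remoteMessage struct {
	batch []int // stands in for a coldata.Batch
	err   error // error metadata received from the remote side
}

type inboxSketch struct {
	stream       chan remoteMessage
	bufferedErrs []error
}

// nextOld models the previous behavior: the error is buffered and a
// zero-length batch is returned, so the materializer transitions to draining
// without having seen an error, and the error it later receives while
// draining can be swallowed.
func (i *inboxSketch) nextOld() ([]int, error) {
	msg := <-i.stream
	if msg.err != nil {
		i.bufferedErrs = append(i.bufferedErrs, msg.err)
		return nil, nil // zero-length batch; error only surfaces during draining
	}
	return msg.batch, nil
}

// nextNew models the behavior after this patch: the error is propagated
// eagerly, so the materializer calls MoveToDraining with that error and it is
// not swallowed.
func (i *inboxSketch) nextNew() ([]int, error) {
	msg := <-i.stream
	if msg.err != nil {
		return nil, msg.err
	}
	return msg.batch, nil
}

func main() {
	i := &inboxSketch{stream: make(chan remoteMessage, 1)}
	i.stream <- remoteMessage{err: errors.New("ReadWithinUncertaintyIntervalError (illustrative)")}
	if _, err := i.nextNew(); err != nil {
		fmt.Println("propagated eagerly:", err)
	}
}
```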

@andreimatei (Contributor)

The problem is that the Inbox silently buffers an error and returns a zero-length batch (equivalent to a nil row).

I see. So with vectorized execution there's no way to return metadata before the root tells everybody to drain? If so, does swallowing errors in DrainHelper() ever make sense for a vectorized processor?

@andreimatei (Contributor) left a comment


I guess the swallowing still makes sense, nvm.
This fix sounds good to me now.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@asubiotto (Contributor, Author)

bors r=yuzefovich

@yuzefovich (Member) left a comment


Reviewed 2 of 2 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @asubiotto)


pkg/sql/colflow/colrpc/outbox_test.go, line 65 at r2 (raw file):

	streamHandlerErrCh := handleStream(ctx, inbox, rpcLayer.server, func() { close(rpcLayer.server.csChan) })

	// The outbox will be sending the panic as eagerly. This Next call will

nit: s/as eagerly/eagerly/.

@craig (craig bot) commented on Jul 8, 2020

Build succeeded

@craig (craig bot) merged commit bb6d476 into cockroachdb:master on Jul 8, 2020
@asubiotto (Contributor, Author)

I think this issue exists in the hash router as well. I'll look into whether we can generate some ReadWithinUncertaintyInterval errors in fakedist-metadata or introduce some other way to test this behavior generally.

@yuzefovich (Member)

I also think that the wrong buffering behavior is present on previous branches, e.g. https://github.com/cockroachdb/cockroach/blob/release-20.1/pkg/sql/colflow/colrpc/inbox.go#L317. I feel like we should be backporting this PR, no?

@asubiotto (Contributor, Author)

The wrong buffering behavior is present on other branches, but it doesn't matter there because we mistakenly don't swallow errors in that version of the vectorized engine. #50388 fixed that bug, which uncovered this new one.

@asubiotto deleted the fsdt branch on August 3, 2020 12:29
Successfully merging this pull request may close these issues.

roachtest: scaledata/filesystem_simulator/nodes=3 failed