sql/opt: avoid swallowing TransactionAbortedErrors #33312

benesch · 2018-12-21T00:14:15Z

An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.
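
To make the failure mode concrete, here is a minimal sketch of swallowing versus propagating the error. Every name in it (`errTxnAborted`, `checkDeps`, `isStaleSwallowing`, `isStalePropagating`) is a hypothetical stand-in, not the actual CockroachDB code. It assumes the staleness check can hit an aborted transaction while it runs.

```go
package main

import (
	"errors"
	"fmt"
)

// errTxnAborted is a hypothetical stand-in for a transaction aborted error.
var errTxnAborted = errors.New("transaction aborted")

// checkDeps is a hypothetical dependency check. It is assumed to touch the
// transaction, so it can surface errTxnAborted.
func checkDeps(aborted bool) error {
	if aborted {
		return errTxnAborted
	}
	return nil
}

// isStaleSwallowing folds every error into "stale". The caller replans and
// never learns that the transaction was aborted.
func isStaleSwallowing(aborted bool) bool {
	return checkDeps(aborted) != nil
}

// isStalePropagating hands the error back so the caller can stop using the
// aborted transaction.
func isStalePropagating(aborted bool) (bool, error) {
	if err := checkDeps(aborted); err != nil {
		return false, err
	}
	return false, nil
}

func main() {
	fmt.Println("swallowing variant reports stale:", isStaleSwallowing(true))
	_, err := isStalePropagating(true)
	fmt.Println("propagating variant returns:", err)
}
```

In the swallowing variant the abort is indistinguishable from an ordinary schema change, which is the situation the description above calls disastrous.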

Fixes #32784.

Release note: None

/cc @tbg

RaduBerinde

Wow, what a subtle issue.. Awesome find, that couldn't have been easy to figure out!

Reviewable status: complete! 0 of 0 LGTMs obtained

pkg/sql/plan.go, line 442 at r1 (raw file):

		// If the prepared memo has been invalidated by schema or other changes,
		// re-prepare it.
		if ok, err := stmt.Prepared.Memo.IsStale(ctx, p.EvalContext(), &catalog); err != nil {

s/ok/isStale (same below)

pkg/sql/plan.go, line 444 at r1 (raw file):

		if ok, err := stmt.Prepared.Memo.IsStale(ctx, p.EvalContext(), &catalog); err != nil {
			return 0, err
		} else if !ok {

inverse check, should be if isStale

pkg/sql/opt/memo/memo.go, line 249 at r1 (raw file):

//   5. Data source privileges: current user may no longer have access to one or
//      more data sources.
//

I would add a comment about the error, that it can be associated with the transaction and has to be propagated (or something to that effect). Same for CheckDependencies

pkg/sql/opt/memo/memo.go, line 271 at r1 (raw file):

	// has changed, or if the current user no longer has sufficient privilege to
	// access the data source.
	return m.Metadata().CheckDependencies(ctx, catalog)

this should return the inverse of what CheckDeps returns. Or you can invert the function and rename to CheckStaleness
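
A rough sketch of the inversion being asked for, combined with the `isStale` rename suggested for the caller. The signatures are assumptions for illustration: `checkDependencies` is taken to report `true` when everything is still up to date, and `errLookup` and `plan` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// errLookup is a hypothetical error raised while resolving a dependency.
var errLookup = errors.New("lookup failed")

// checkDependencies is a hypothetical stand-in that reports true when all
// dependencies are still up to date.
func checkDependencies(schemaChanged bool, lookupErr error) (bool, error) {
	if lookupErr != nil {
		return false, lookupErr
	}
	return !schemaChanged, nil
}

// isStale returns the inverse of checkDependencies and passes errors through
// unchanged.
func isStale(schemaChanged bool, lookupErr error) (bool, error) {
	upToDate, err := checkDependencies(schemaChanged, lookupErr)
	if err != nil {
		return false, err
	}
	return !upToDate, nil
}

// plan mirrors the caller's shape: propagate the error first, then re-prepare
// only when the memo is stale.
func plan(schemaChanged bool, lookupErr error) (string, error) {
	if stale, err := isStale(schemaChanged, lookupErr); err != nil {
		return "", err
	} else if stale {
		return "re-prepare", nil
	}
	return "reuse memo", nil
}

func main() {
	fmt.Println(plan(false, nil))
	fmt.Println(plan(true, nil))
	fmt.Println(plan(false, errLookup))
}
```

Naming the result `stale` rather than `ok` makes the branch read in the direction the function name promises, which is the point of both review comments on `plan.go`.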

tbg

😓 thanks for finding this before we had to find it in the wild. #22615 seems well worth doing.
I'll defer to @RaduBerinde on the final LGTM

Reviewable status: complete! 0 of 0 LGTMs obtained

tbg · 2018-12-28T14:14:17Z

I think this would also fix #33375.

andy-kimball

LGTM, nice find.

Reviewable status: complete! 1 of 0 LGTMs obtained

andy-kimball · 2019-01-01T18:28:37Z

Also, will this need to be back-ported to 2.1?

benesch · 2019-01-01T18:57:03Z

Also, will this need to be back-ported to 2.1?

We'll definitely want to consider backporting. On the one hand there's no evidence that this bug can manifest without merges turned on, but on the other hand there is a good chance that there is some way to trigger it with merges turned off that I haven't thought of.

@andy-kimball @RaduBerinde I could actually use your advice on how to fix this test failure. This is the failure in question:

EXECUTE change_view
expected "pq: relation \"otherview\" does not exist", but found "pq: relation \"otherdb.public.otherview\" does not exist"

The logictest renames otherview to otherview2, then executes a prepared statement that references the view by its old name. Previously, IsStale would return true so that the query would be replanned in light of the new name of the view, and the replanning would fail with relation "otherview" does not exist, using the non-qualified name "otherview" because that's how the name is specified in the change_view prepared query.

Unfortunately, now IsStale returns the relation does not exist error directly. The optimizer always uses fully-qualified names in its memo, and the error message thus uses the fully-qualified name: relation "otherdb.public.otherview" does not exist.

I can't just change the test to expect the fully-qualified name because then it fails in the reverse direction, i.e., when the optimizer is turned off. The only thing I've come up with is teaching IsStale to return true, nil if it detects an undefined relation error, so that the query continues to planning, where it can fail with the proper undefined relation error. This seems kind of gross. Do either of you have a better suggestion?
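
For reference, the workaround described above could look roughly like this. It is a sketch only: `errUndefinedRelation` and `errTxnAborted` are hypothetical sentinel errors, and the real code would need its own way to recognize an undefined relation error.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errUndefinedRelation = errors.New("relation does not exist")
	errTxnAborted        = errors.New("transaction aborted")
)

// isStale maps an undefined relation error to (true, nil) so that planning
// runs again and reports the error itself. Every other error is propagated.
func isStale(depErr error) (bool, error) {
	switch {
	case depErr == nil:
		return false, nil
	case errors.Is(depErr, errUndefinedRelation):
		return true, nil
	default:
		return false, depErr
	}
}

func main() {
	fmt.Println(isStale(nil))
	fmt.Println(isStale(errUndefinedRelation))
	fmt.Println(isStale(errTxnAborted))
}
```

The special case keeps the error text identical on both planning paths, at the cost of a second name resolution and one error class that is deliberately not propagated.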

benesch · 2019-01-01T18:58:15Z

I suppose we could also declare that we don't care about whether the error message uses the fully-qualified name or not, and adjust the logictest accordingly.

rytaft · 2019-01-01T21:43:34Z

The logic tests allow you to specify two possible error messages since it's just a regex. We've done this in other places:

cockroach/pkg/sql/logictest/testdata/logic_test/insert

Line 642 in a8afcab

statement error [column "y" does not exist | column "y" is being backfilled]
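
Applied to the failing test, the same idea might look like the sketch below, which uses Go's standard `regexp` package to check one alternation pattern against both messages. The pattern and the `matches` helper are assumptions for illustration, not the expectation that was actually committed.

```go
package main

import (
	"fmt"
	"regexp"
)

// expected is a hypothetical pattern that accepts either spelling of the
// error: the unqualified name or the fully-qualified one.
var expected = regexp.MustCompile(
	`relation "otherview" does not exist|relation "otherdb\.public\.otherview" does not exist`)

// matches reports whether an error message satisfies the expectation.
func matches(msg string) bool {
	return expected.MatchString(msg)
}

func main() {
	fmt.Println(matches(`pq: relation "otherview" does not exist`))                // true
	fmt.Println(matches(`pq: relation "otherdb.public.otherview" does not exist`)) // true
	fmt.Println(matches(`pq: relation "otherview2" does not exist`))               // false
}
```

Because the pattern is unanchored, the `pq: ` prefix does not need to be part of the expectation.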

benesch · 2019-01-02T15:06:24Z

Oh, fantastic! Done. PTAL.

andreimatei

I can't just change the test to expect the fully-qualified name because then it fails in the reverse direction, i.e., when the optimizer is turned off.

Have you tried to unify the error between the two paths?

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)

pkg/sql/internal_test.go, line 305 at r2 (raw file):

// Note that a fix to our transaction API to eliminate this class of errors is
// proposed in #22615.
func TestInternalExecutorTxnAbortNotSwallowed(t *testing.T) {

Let's talk about this test. The bug this patch is fixing is in the optimizer, not in the internal executor. The test's comment claims otherwise. I don't think this test is warranted... Is it? At the very least please add more words.

If you keep it, consider using TxnAborter instead of what you're doing.

pkg/sql/internal_test.go, line 317 at r2 (raw file):

				if r.GetHeartbeatTxn() != nil && br.Responses[i].GetHeartbeatTxn().Txn.Status == roachpb.ABORTED {
					go func() {
						heartbeatSawAbortedTxn <- ba.Txn.ID

please comment on why the distinction between a heartbeat finding out that a txn was aborted versus the commit finding out about the abort is important.

pkg/sql/internal_test.go, line 380 at r2 (raw file):

		t.Fatalf("expected query execution on aborted txn to fail, but got %+v", err)
	}
}

don't forget to rollback the txn :) That's the contract.

benesch

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)

pkg/sql/plan.go, line 444 at r1 (raw file):

Previously, RaduBerinde wrote…

inverse check, should be if isStale

🤦‍♂️ thanks, done.

pkg/sql/opt/memo/memo.go, line 271 at r1 (raw file):

Previously, RaduBerinde wrote…

this should return the inverse of what CheckDeps returns. Or you can invert the function and rename to CheckStaleness

Wow, I was really asleep at the wheel here. Thanks. Fixed.

pkg/sql/internal_test.go, line 305 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Let's talk about this test. The bug this patch is fixing is in the optimizer, not in the internal executor. The test's comment claims otherwise. I don't think this test is warranted... Is it? At the very least please add more words.

If you keep it, consider using TxnAborter instead of what you're doing.

Ok, I'll just remove it. Preserved here just in case: https://gist.github.com/benesch/53203264cf72eb6a1efe9137a306a26f

pkg/sql/internal_test.go, line 317 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment on why the distinction between a heartbeat finding out that a txn was aborted versus the commit finding out about the abort is important.

Done. (Test removed.)

pkg/sql/internal_test.go, line 380 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

don't forget to rollback the txn :) That's the contract.

Done. (Test removed.)

RaduBerinde

Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)

benesch · 2019-01-03T16:44:39Z

TFTRs!

bors r+

@tbg

33312: sql/opt: avoid swallowing TransactionAbortedErrors r=benesch a=benesch

An optimizer code path could, in rare cases, fail to propagate a transaction aborted error. This proved disastrous as, due to a footgun in our transaction API (#22615), swallowing a transaction aborted error results in proceeding with a brand new transaction that has no knowledge of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split transaction uses an internal executor. The internal executor would occasionally silently return a new transaction that only had half of the necessary operations performed on it, and committing that partial transaction would result in a "range does not match splits" error.

Fixes #32784.

Release note: None

/cc @tbg

33417: opt: Inline constant values r=andy-kimball a=andy-kimball

Inline constants in expressions like:

  SELECT x, x+1 FROM (VALUES (1)) AS t(x) ;

with the new inlining rules, this becomes:

  VALUES (1, 2)

The new inlining rules are useful for mutation expressions (e.g. UPDATE), which can nest multiple Project and Values expressions that often use constant values. For example:

  CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a + 1) STORED);
  UPDATE ab SET a=1

This now gets mapped by the optimizer to this internal equivalent:

  UPDATE ab SET a=1, b=2

Release note: None

33421: opt: Tighten up InlineProjectInProject rule r=andy-kimball a=andy-kimball

Allow inlining nested Project in case where there are duplicate refs to same inner passthrough column. Previously, this case prevented inlining. However, only duplicate references to inner *synthesized* columns need to be detected.

Release note: None

Co-authored-by: Nikhil Benesch
Co-authored-by: Andrew Kimball

craig · 2019-01-03T17:06:41Z

Build succeeded

GitHub CI (Cockroach)

The merge queue was disabled in this test by df26cf6 due to the bug found in cockroachdb#32784. That bug was fixed by cockroachdb#33312, so we can address the TODO and re-enable merges in the test.

Release note: None

Release justification: test only

46408: opt: fix buggy EliminateUpsertDistinctOnNoColumns rule r=andy-kimball a=andy-kimball

The fix for #37880 added a new ErrorOnDup field to the UpsertDistinctOn operator. However, the EliminateErrorDistinctNoColumns rule wasn't changed to respect this flag, and so there's still a case where ON CONFLICT DO NOTHING returns an error.

This commit eliminates the error by separating out the ErrorOnDup=True case from the EliminateErrorDistinctNoColumns rule (which raises an error) and adding it instead to the EliminateDistinctNoColumns rule (which does not raise an error).

Fixes #46395

Release justification: Bug fixes and low-risk updates to new functionality. This was always an error, so it's low-risk to make it a non-error.

Release note: None

46431: kv: don't disable the merge queue needlessly in more tests r=nvanbenschoten a=nvanbenschoten

Follow up to #46383. These tests were disabling the queue to not interfere with its AdminSplits, but since the tests were written, AdminSplit got a TTL.

Release note: None

Release justification: test only

46433: kv: re-enable merge queue in TestSplitTriggerRaftSnapshotRace r=nvanbenschoten a=nvanbenschoten

The merge queue was disabled in this test by df26cf6 due to the bug found in #32784. That bug was fixed by #33312, so we can address the TODO and re-enable merges in the test.

Release note: None

Release justification: test only

46444: kv: deflake TestRangeInfo by using a manual clock r=nvanbenschoten a=nvanbenschoten

Fixes #46215. Fallout from #45984.

Release justification: testing only

Release note: None.

Co-authored-by: Andrew Kimball
Co-authored-by: Nathan VanBenschoten

benesch requested review from andreimatei, andy-kimball and a team December 21, 2018 00:14

benesch requested a review from a team as a code owner December 21, 2018 00:14

benesch requested a review from a team December 21, 2018 00:14

RaduBerinde reviewed Dec 21, 2018

tbg reviewed Dec 21, 2018

benesch force-pushed the merge-bug branch from 1649565 to a353c34 Compare December 21, 2018 16:23

andy-kimball approved these changes Jan 1, 2019

benesch force-pushed the merge-bug branch from a353c34 to a1884eb Compare January 2, 2019 15:05

benesch requested a review from a team January 2, 2019 15:05

andreimatei suggested changes Jan 2, 2019

benesch force-pushed the merge-bug branch from a1884eb to 0285c3c Compare January 3, 2019 04:48

benesch commented Jan 3, 2019

RaduBerinde approved these changes Jan 3, 2019

craig bot merged commit 0285c3c into cockroachdb:master Jan 3, 2019

tbg mentioned this pull request Jan 8, 2019

storage: chaos during merges leads to overlapping range descriptors #33120

Closed

nvanbenschoten mentioned this pull request Feb 3, 2019

roachtest: tpcc-1000 roachtests have failing checks #34025

Closed

benesch deleted the merge-bug branch June 21, 2019 22:00

nvanbenschoten mentioned this pull request Mar 23, 2020

kv: re-enable merge queue in TestSplitTriggerRaftSnapshotRace #46433

Merged
