Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql/opt: avoid swallowing TransactionAbortedErrors #33312

Merged
merged 1 commit into from
Jan 3, 2019

Conversation

benesch
Copy link
Contributor

@benesch benesch commented Dec 21, 2018

An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes #32784.

Release note: None

/cc @tbg

@benesch benesch requested review from andreimatei, andy-kimball and a team December 21, 2018 00:14
@benesch benesch requested a review from a team as a code owner December 21, 2018 00:14
@benesch benesch requested a review from a team December 21, 2018 00:14
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wow, what a subtle issue.. Awesome find, that couldn't have been easy to figure out!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/sql/plan.go, line 442 at r1 (raw file):

		// If the prepared memo has been invalidated by schema or other changes,
		// re-prepare it.
		if ok, err := stmt.Prepared.Memo.IsStale(ctx, p.EvalContext(), &catalog); err != nil {

s/ok/isStale (same below)


pkg/sql/plan.go, line 444 at r1 (raw file):

		if ok, err := stmt.Prepared.Memo.IsStale(ctx, p.EvalContext(), &catalog); err != nil {
			return 0, err
		} else if !ok {

inverse check, should be if isStale


pkg/sql/opt/memo/memo.go, line 249 at r1 (raw file):

//   5. Data source privileges: current user may no longer have access to one or
//      more data sources.
//

I would add a comment about the error, that it can be associated with the transaction and has to be propagated (or something to that effect). Same for CheckDependencies


pkg/sql/opt/memo/memo.go, line 271 at r1 (raw file):

	// has changed, or if the current user no longer has sufficient privilege to
	// access the data source.
	return m.Metadata().CheckDependencies(ctx, catalog)

this should return the inverse of what CheckDeps returns. Or you can invert the function and rename to CheckStaleness

Copy link
Member

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

😓 thanks for finding this before we had to find it in the wild. #22615 seems well worth doing.
I'll defer to @RaduBerinde on the final LGTM

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained

@tbg
Copy link
Member

tbg commented Dec 28, 2018

I think this would also fix #33375.

Copy link
Contributor

@andy-kimball andy-kimball left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:, nice find.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@andy-kimball
Copy link
Contributor

Also, will this need to be back-ported to 2.1?

@benesch
Copy link
Contributor Author

benesch commented Jan 1, 2019

Also, will this need to be back-ported to 2.1?

We'll definitely want to consider backporting. On the one hand there's no evidence that this bug can manifest without merges turned on, but on the other hand there is a good chance that there is some way to trigger it with merges turned off that I haven't thought of.

@andy-kimball @RaduBerinde I could actually use your advice on how to fix this test failure. This is the failure in question:

EXECUTE change_view
expected "pq: relation \"otherview\" does not exist", but found "pq: relation \"otherdb.public.otherview\" does not exist"

The logictest renames otherview to otherview2, then executes a prepared statement that references the view by its old name. Previously, IsStale would return true so that the query would be replanned in light of the new name of the view, and the replanning would fail with relation "otherview" does not exist, using the non-qualified name "otherview" because that's how the name is specified in the change_view prepared query.

Unfortunately, now IsStale returns the relation does not exist error directly. The optimizer always uses fully-qualified names in its memo, and the error message thus uses the fully-qualified name: relation "otherdb.public.otherview" does not exist.

I can't just change the test to expect the fully-qualified name because then it fails in the reverse direction, i.e., when the optimizer is turned off. The only thing I've come up with is teaching IsStale to return true, nil if it detects an undefined relation error, so that query continues to planning where it can fail with the proper undefined relation error. This seems kind of gross. Do either of you have a better suggestion?

@benesch
Copy link
Contributor Author

benesch commented Jan 1, 2019

I suppose we could also declare that we don't care about whether the error message uses the fully-qualified name or not, and adjust the logictest accordingly.

@rytaft
Copy link
Collaborator

rytaft commented Jan 1, 2019

The logic tests allow you to specify two possible error messages since it's just a regex. We've done this in other places:

statement error [column "y" does not exist | column "y" is being backfilled]

@benesch benesch requested a review from a team January 2, 2019 15:05
@benesch
Copy link
Contributor Author

benesch commented Jan 2, 2019

Oh, fantastic! Done. PTAL.

Copy link
Contributor

@andreimatei andreimatei left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't just change the test to expect the fully-qualified name because then it fails in the reverse direction, i.e., when the optimizer is turned off.

Have you tried to unify the error between the two paths?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/internal_test.go, line 305 at r2 (raw file):

// Note that a fix to our transaction API to eliminate this class of errors is
// proposed in #22615.
func TestInternalExecutorTxnAbortNotSwallowed(t *testing.T) {

Let's talk about this test. The bug this patch is fixing is in the optimizer, not in the internal executor. The test's comment claims otherwise. I don't think this test is warranted... Is it? At the very least please add more words.

If you keep it, consider using TxnAborter instead of what you're doing.


pkg/sql/internal_test.go, line 317 at r2 (raw file):

				if r.GetHeartbeatTxn() != nil && br.Responses[i].GetHeartbeatTxn().Txn.Status == roachpb.ABORTED {
					go func() {
						heartbeatSawAbortedTxn <- ba.Txn.ID

please comment on why the distinction between a heartbeat finding out that a txn was aborted versus the commit finding out about the abort is important.


pkg/sql/internal_test.go, line 380 at r2 (raw file):

		t.Fatalf("expected query execution on aborted txn to fail, but got %+v", err)
	}
}

don't forget to rollback the txn :) That's the contract.

An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (cockroachdb#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes cockroachdb#32784.

Release note: None
Copy link
Contributor Author

@benesch benesch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/sql/plan.go, line 444 at r1 (raw file):

Previously, RaduBerinde wrote…

inverse check, should be if isStale

🤦‍♂️ thanks, done.


pkg/sql/opt/memo/memo.go, line 271 at r1 (raw file):

Previously, RaduBerinde wrote…

this should return the inverse of what CheckDeps returns. Or you can invert the function and rename to CheckStaleness

Wow, I was really asleep at the wheel here. Thanks. Fixed.


pkg/sql/internal_test.go, line 305 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

Let's talk about this test. The bug this patch is fixing is in the optimizer, not in the internal executor. The test's comment claims otherwise. I don't think this test is warranted... Is it? At the very least please add more words.

If you keep it, consider using TxnAborter instead of what you're doing.

Ok, I'll just remove it. Preserved here just in case: https://gist.github.com/benesch/53203264cf72eb6a1efe9137a306a26f


pkg/sql/internal_test.go, line 317 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

please comment on why the distinction between a heartbeat finding out that a txn was aborted versus the commit finding out about the abort is important.

Done. (Test removed.)


pkg/sql/internal_test.go, line 380 at r2 (raw file):

Previously, andreimatei (Andrei Matei) wrote…

don't forget to rollback the txn :) That's the contract.

Done. (Test removed.)

Copy link
Member

@RaduBerinde RaduBerinde left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (and 1 stale)

@benesch
Copy link
Contributor Author

benesch commented Jan 3, 2019

TFTRs!

bors r+

craig bot pushed a commit that referenced this pull request Jan 3, 2019
33312: sql/opt: avoid swallowing TransactionAbortedErrors r=benesch a=benesch

An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.

This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.

Fixes #32784.

Release note: None

/cc @tbg

33417: opt: Inline constant values r=andy-kimball a=andy-kimball

Inline constants in expressions like:

  SELECT x, x+1 FROM (VALUES (1)) AS t(x) ;

with the new inlining rules, this becomes:

  VALUES (1, 2)

The new inlining rules are useful for mutation expressions (e.g. UPDATE),
which can nest multiple Project and Values expressions that often use
constant values. For example:

  CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a + 1) STORED);
  UPDATE ab SET a=1

This now gets mapped by the optimizer to this internal equivalent:

  UPDATE ab SET a=1, b=2

Release note: None

33421: opt: Tighten up InlineProjectInProject rule r=andy-kimball a=andy-kimball

Allow inlining nested Project in case where there are duplicate refs
to same inner passthrough column. Previously, this case prevented
inlining. However, only duplicate references to inner *synthesized*
columns need to be detected.

Release note: None

Co-authored-by: Nikhil Benesch <[email protected]>
Co-authored-by: Andrew Kimball <[email protected]>
@craig
Copy link
Contributor

craig bot commented Jan 3, 2019

Build succeeded

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Mar 23, 2020
The merge queue was disabled in this test by df26cf6 due to the bug
found in cockroachdb#32784. That bug was fixed by cockroachdb#33312, so we can address the
TODO and re-enable merges in the test.

Release note: None
Release justification: test only
craig bot pushed a commit that referenced this pull request Mar 23, 2020
46408: opt: fix buggy EliminateUpsertDistinctOnNoColumns rule r=andy-kimball a=andy-kimball

The fix for #37880 added a new ErrorOnDup field to the UpsertDistinctOn
operator. However, the EliminateErrorDistinctNoColumns rule wasn't changed
to respect this flag, and so there's still a case where ON CONFLICT DO
NOTHING returns an error.

This commit eliminates the error by separating out the ErrorOnDup=True case
from the EliminateErrorDistinctNoColumns rule (which raises an error) and
adding it instead to the EliminateDistinctNoColumns rule (which does not
raise an error).

Fixes #46395

Release justification: Bug fixes and low-risk updates to new functionality.
This was always an error, so it's low-risk to make it a non-error.

Release note: None

46431: kv: don't disable the merge queue needlessly in more tests r=nvanbenschoten a=nvanbenschoten

Follow up to #46383.

These tests were disabling the queue to not interfere with its
AdminSplits, but since the tests were written, AdminSplit got
a TTL.

Release note: None
Release justification: test only

46433: kv: re-enable merge queue in TestSplitTriggerRaftSnapshotRace r=nvanbenschoten a=nvanbenschoten

The merge queue was disabled in this test by df26cf6 due to the bug
found in #32784. That bug was fixed by #33312, so we can address the
TODO and re-enable merges in the test.

Release note: None
Release justification: test only

46444: kv: deflake TestRangeInfo by using a manual clock r=nvanbenschoten a=nvanbenschoten

Fixes #46215.

Fallout from #45984.

Release justification: testing only
Release note: None.

Co-authored-by: Andrew Kimball <[email protected]>
Co-authored-by: Nathan VanBenschoten <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants