-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql/opt: avoid swallowing TransactionAbortedErrors #33312
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow, what a subtle issue.. Awesome find, that couldn't have been easy to figure out!
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/sql/plan.go, line 442 at r1 (raw file):
// If the prepared memo has been invalidated by schema or other changes, // re-prepare it. if ok, err := stmt.Prepared.Memo.IsStale(ctx, p.EvalContext(), &catalog); err != nil {
s/ok/isStale
(same below)
pkg/sql/plan.go, line 444 at r1 (raw file):
if ok, err := stmt.Prepared.Memo.IsStale(ctx, p.EvalContext(), &catalog); err != nil { return 0, err } else if !ok {
inverse check, should be if isStale
pkg/sql/opt/memo/memo.go, line 249 at r1 (raw file):
// 5. Data source privileges: current user may no longer have access to one or // more data sources. //
I would add a comment about the error, that it can be associated with the transaction and has to be propagated (or something to that effect). Same for CheckDependencies
pkg/sql/opt/memo/memo.go, line 271 at r1 (raw file):
// has changed, or if the current user no longer has sufficient privilege to // access the data source. return m.Metadata().CheckDependencies(ctx, catalog)
this should return the inverse of what CheckDeps returns. Or you can invert the function and rename to CheckStaleness
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
😓 thanks for finding this before we had to find it in the wild. #22615 seems well worth doing.
I'll defer to @RaduBerinde on the final LGTM
Reviewable status: complete! 0 of 0 LGTMs obtained
I think this would also fix #33375. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained
Also, will this need to be back-ported to 2.1? |
We'll definitely want to consider backporting. On the one hand there's no evidence that this bug can manifest without merges turned on, but on the other hand there is a good chance that there is some way to trigger it with merges turned off that I haven't thought of. @andy-kimball @RaduBerinde I could actually use your advice on how to fix this test failure. This is the failure in question:
The logictest renames Unfortunately, now I can't just change the test to expect the fully-qualified name because then it fails in the reverse direction, i.e., when the optimizer is turned off. The only thing I've come up with is teaching IsStale to return |
I suppose we could also declare that we don't care about whether the error message uses the fully-qualified name or not, and adjust the logictest accordingly. |
The logic tests allow you to specify two possible error messages since it's just a regex. We've done this in other places:
|
Oh, fantastic! Done. PTAL. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't just change the test to expect the fully-qualified name because then it fails in the reverse direction, i.e., when the optimizer is turned off.
Have you tried to unify the error between the two paths?
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/internal_test.go, line 305 at r2 (raw file):
// Note that a fix to our transaction API to eliminate this class of errors is // proposed in #22615. func TestInternalExecutorTxnAbortNotSwallowed(t *testing.T) {
Let's talk about this test. The bug this patch is fixing is in the optimizer, not in the internal executor. The test's comment claims otherwise. I don't think this test is warranted... Is it? At the very least please add more words.
If you keep it, consider using TxnAborter
instead of what you're doing.
pkg/sql/internal_test.go, line 317 at r2 (raw file):
if r.GetHeartbeatTxn() != nil && br.Responses[i].GetHeartbeatTxn().Txn.Status == roachpb.ABORTED { go func() { heartbeatSawAbortedTxn <- ba.Txn.ID
please comment on why the distinction between a heartbeat finding out that a txn was aborted versus the commit finding out about the abort is important.
pkg/sql/internal_test.go, line 380 at r2 (raw file):
t.Fatalf("expected query execution on aborted txn to fail, but got %+v", err) } }
don't forget to rollback the txn :) That's the contract.
An optimizer code path could, in rare cases, fail to propagate a transaction aborted error. This proved disastrous as, due to a footgun in our transaction API (cockroachdb#22615), swallowing a transaction aborted error results in proceeding with a brand new transaction that has no knowledge of the earlier operations performed on the original transaction. This presented as a rare and confusing bug in splits, as the split transaction uses an internal executor. The internal executor would occasionally silently return a new transaction that only had half of the necessary operations performed on it, and committing that partial transaction would result in a "range does not match splits" error. Fixes cockroachdb#32784. Release note: None
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale)
pkg/sql/plan.go, line 444 at r1 (raw file):
Previously, RaduBerinde wrote…
inverse check, should be
if isStale
🤦♂️ thanks, done.
pkg/sql/opt/memo/memo.go, line 271 at r1 (raw file):
Previously, RaduBerinde wrote…
this should return the inverse of what CheckDeps returns. Or you can invert the function and rename to
CheckStaleness
Wow, I was really asleep at the wheel here. Thanks. Fixed.
pkg/sql/internal_test.go, line 305 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
Let's talk about this test. The bug this patch is fixing is in the optimizer, not in the internal executor. The test's comment claims otherwise. I don't think this test is warranted... Is it? At the very least please add more words.
If you keep it, consider using
TxnAborter
instead of what you're doing.
Ok, I'll just remove it. Preserved here just in case: https://gist.github.com/benesch/53203264cf72eb6a1efe9137a306a26f
pkg/sql/internal_test.go, line 317 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
please comment on why the distinction between a heartbeat finding out that a txn was aborted versus the commit finding out about the abort is important.
Done. (Test removed.)
pkg/sql/internal_test.go, line 380 at r2 (raw file):
Previously, andreimatei (Andrei Matei) wrote…
don't forget to rollback the txn :) That's the contract.
Done. (Test removed.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (and 1 stale)
TFTRs! bors r+ |
33312: sql/opt: avoid swallowing TransactionAbortedErrors r=benesch a=benesch An optimizer code path could, in rare cases, fail to propagate a transaction aborted error. This proved disastrous as, due to a footgun in our transaction API (#22615), swallowing a transaction aborted error results in proceeding with a brand new transaction that has no knowledge of the earlier operations performed on the original transaction. This presented as a rare and confusing bug in splits, as the split transaction uses an internal executor. The internal executor would occasionally silently return a new transaction that only had half of the necessary operations performed on it, and committing that partial transaction would result in a "range does not match splits" error. Fixes #32784. Release note: None /cc @tbg 33417: opt: Inline constant values r=andy-kimball a=andy-kimball Inline constants in expressions like: SELECT x, x+1 FROM (VALUES (1)) AS t(x) ; with the new inlining rules, this becomes: VALUES (1, 2) The new inlining rules are useful for mutation expressions (e.g. UPDATE), which can nest multiple Project and Values expressions that often use constant values. For example: CREATE TABLE ab (a INT PRIMARY KEY, b INT AS (a + 1) STORED); UPDATE ab SET a=1 This now gets mapped by the optimizer to this internal equivalent: UPDATE ab SET a=1, b=2 Release note: None 33421: opt: Tighten up InlineProjectInProject rule r=andy-kimball a=andy-kimball Allow inlining nested Project in case where there are duplicate refs to same inner passthrough column. Previously, this case prevented inlining. However, only duplicate references to inner *synthesized* columns need to be detected. Release note: None Co-authored-by: Nikhil Benesch <[email protected]> Co-authored-by: Andrew Kimball <[email protected]>
Build succeeded |
The merge queue was disabled in this test by df26cf6 due to the bug found in cockroachdb#32784. That bug was fixed by cockroachdb#33312, so we can address the TODO and re-enable merges in the test. Release note: None Release justification: test only
46408: opt: fix buggy EliminateUpsertDistinctOnNoColumns rule r=andy-kimball a=andy-kimball The fix for #37880 added a new ErrorOnDup field to the UpsertDistinctOn operator. However, the EliminateErrorDistinctNoColumns rule wasn't changed to respect this flag, and so there's still a case where ON CONFLICT DO NOTHING returns an error. This commit eliminates the error by separating out the ErrorOnDup=True case from the EliminateErrorDistinctNoColumns rule (which raises an error) and adding it instead to the EliminateDistinctNoColumns rule (which does not raise an error). Fixes #46395 Release justification: Bug fixes and low-risk updates to new functionality. This was always an error, so it's low-risk to make it a non-error. Release note: None 46431: kv: don't disable the merge queue needlessly in more tests r=nvanbenschoten a=nvanbenschoten Follow up to #46383. These tests were disabling the queue to not interfere with its AdminSplits, but since the tests were written, AdminSplit got a TTL. Release note: None Release justification: test only 46433: kv: re-enable merge queue in TestSplitTriggerRaftSnapshotRace r=nvanbenschoten a=nvanbenschoten The merge queue was disabled in this test by df26cf6 due to the bug found in #32784. That bug was fixed by #33312, so we can address the TODO and re-enable merges in the test. Release note: None Release justification: test only 46444: kv: deflake TestRangeInfo by using a manual clock r=nvanbenschoten a=nvanbenschoten Fixes #46215. Fallout from #45984. Release justification: testing only Release note: None. Co-authored-by: Andrew Kimball <[email protected]> Co-authored-by: Nathan VanBenschoten <[email protected]>
An optimizer code path could, in rare cases, fail to propagate a
transaction aborted error. This proved disastrous as, due to a footgun
in our transaction API (#22615), swallowing a transaction aborted error
results in proceeding with a brand new transaction that has no knowledge
of the earlier operations performed on the original transaction.
This presented as a rare and confusing bug in splits, as the split
transaction uses an internal executor. The internal executor would
occasionally silently return a new transaction that only had half of the
necessary operations performed on it, and committing that partial
transaction would result in a "range does not match splits" error.
Fixes #32784.
Release note: None
/cc @tbg