-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sql: fix handling of errors in schema change rollbacks #47446
sql: fix handling of errors in schema change rollbacks #47446
Conversation
❌ The GitHub CI (Cockroach) build has failed on e8f3287f. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
e8f3287
to
2e45b5e
Compare
❌ The GitHub CI (Cockroach) build has failed on 2e45b5ed. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
2e45b5e
to
0f27bdc
Compare
0f27bdc
to
be198c8
Compare
The schema change job migration tests were obliquely testing this behavior (and have now been updated), but I still need to write a few tests to focus on this specifically. I think it's ready for a first look, though. (Comments on the approach would be useful, since I wasn't planning to make change (2) described in the commit message until a few hours ago.) |
be198c8
to
2c26173
Compare
Added tests. |
❌ The GitHub CI (Cockroach) build has failed on 2c26173a. 🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan. |
2c26173
to
def588c
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just the encouragement to adopt
require
Reviewed 4 of 4 files at r1.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)
pkg/sql/schema_changer_test.go, line 5920 at r1 (raw file):
Quoted 5 lines of code…
if matched, err := regexp.MatchString("cannot be reverted, manual cleanup may be required", jobErr); err != nil { t.Fatal(err) } else if !matched { t.Fatalf("unexpected job error %s", jobErr) }
nit: I highly recommend require.Regexp:
this would all be:
require.Regexp(t, "cannot be reverted, manual cleanup may be required", jobErr)
Also applies to a bunch of other stuff in this test.
This PR does two main things to fix error handling for a schema change being rolled back in the job's `OnFailOrCancel` hook: (1) errors from `runStateMachineAndBackfill` (acting on the reversed schema change) are now returned instead of being logged and swallowed; and (2) we now return the error and cause the the job to fail permanently when we encounter any error that isn't one of the specific whitelisted "retriable" errors for the schema changer. Transient "retriable" errors in the schema changer now correctly lead to retries (which are handled by the job registry). The reason for implementing (2) in addition to (1) was that I discovered some pre-existing instances of erroneous behavior in 19.2 and earlier versions where the "expected" behavior was that we would never be able to clean up a failed schema change due to inadequacies in the schema changer and backfiller. (One notable example is that because the column backfiller can't recover previous values, if a DROP COLUMN schema change is being rolled back, trying to backfill a default value on a column with constraints won't work in general.) Given this, it seemed best to just return a result to the client right away instead of pointlessly retrying and blocking indefinitely. This seemed like the best quick fix that isn't a regression from 19.2 and that somewhat reproduces the existing behavior. This change also required a few changes to testing: There's now a sentinel for job retry errors used with the errors package, and a few testing knobs in the schema change job migration tests that were relying on the old erroneous behavior have now been corrected. Release note (bug fix): Fixed a bug introduced with the new schema change job implementation in beta.3 that caused errors when rolling back a schema change to be swallowed.
def588c
to
68b0c08
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/sql/schema_changer_test.go, line 5920 at r1 (raw file):
Previously, ajwerner wrote…
if matched, err := regexp.MatchString("cannot be reverted, manual cleanup may be required", jobErr); err != nil { t.Fatal(err) } else if !matched { t.Fatalf("unexpected job error %s", jobErr) }
nit: I highly recommend require.Regexp:
this would all be:
require.Regexp(t, "cannot be reverted, manual cleanup may be required", jobErr)
Also applies to a bunch of other stuff in this test.
Done.
TFTRs bors r+ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)
pkg/sql/schema_changer_test.go, line 5857 at r2 (raw file):
CREATE UNIQUE INDEX i ON t.test(v); `) require.Error(t, err)
totally not worth going back and changing but in the future you don't even need to NoError
this, Regexp
will check if the argument is nil, fail, and then if it's non-nil, detect if it's an error
will call Error()
on it and compare it to the regexp
pkg/sql/schema_changer_test.go, line 5908 at r2 (raw file):
CREATE UNIQUE INDEX i ON t.test(v); `) require.Error(t, err)
same here
Build succeeded |
46929: sql: wait to start index GC job until schema changer is done r=lucy-zhang a=lucy-zhang Previously, we were creating and starting the index GC job in a separate transaction from the one used in `PublishMultiple()` in `SchemaChanger.done()`, meaning that the index GC job could be run before the original transaction to finalize all schema changes on the table descriptor had been committed. This led to unpredictable behavior while testing. It could also potentially cause multiple GC jobs to be created if the `PublishMultiple()` closure were retried. In this PR, the GC job is now created as a `StartableJob` that is started only after the table descriptor in the finalized state has been published. Release justification: Bug fix for new feature. Release note (bug fix): Fixed a bug introduced in beta.3 that could cause multiple index GC jobs to be created for the same schema change in rare cases. 47540: sql: add unimplemented support for drop type r=rohany a=rohany This PR adds unimplemented support for the DROP TYPE command, as well as adding some missing help/annotations for the CREATE TYPE command. Release note: None 47553: sql: fix handling of errors in reversing schema change mutations r=lucy-zhang a=lucy-zhang As part of fixing error handling for rolling back schema changes in general, a bug was introduced: When an error was returned from reversing the direction of mutations, instead of returning that error, we would instead return the original resumer error. This would prevent some retriable errors from producing retries, and could lead to failed jobs that weren't cleaned up even though they could have been; fixing these problems was the point of the original PR. This patch replaces the wrong error with the right error. A new schema change testing knob has been added to inject errors for testing. See #47446 for the original error handling fix. Closes #47532. Release note (bug fix): Fixed a bug causing some schema change rollbacks to fail permanently even on transient errors. Co-authored-by: Lucy Zhang <[email protected]> Co-authored-by: Rohan Yadav <[email protected]>
This PR does two main things to fix error handling for a schema change
being rolled back in the job's
OnFailOrCancel
hook: (1) errors fromrunStateMachineAndBackfill
(acting on the reversed schema change) arenow returned instead of being logged and swallowed; and (2) we now
return the error and cause the the job to fail permanently when we
encounter any error that isn't one of the specific whitelisted
"retriable" errors for the schema changer. Transient "retriable" errors
in the schema changer now correctly lead to retries (which are handled
by the job registry).
The reason for implementing (2) in addition to (1) was that I discovered
some pre-existing instances of erroneous behavior in 19.2 and earlier
versions where the "expected" behavior was that we would never be able
to clean up a failed schema change due to inadequacies in the schema
changer and backfiller. (One notable example is that because the
column backfiller can't recover previous values, if a DROP COLUMN schema
change is being rolled back, trying to backfill a default value on a
column with constraints won't work in general.) Given this, it seemed
best to just return a result to the client right away instead of
pointlessly retrying and blocking indefinitely. This seemed like the
best quick fix that isn't a regression from 19.2 and that somewhat
reproduces the existing behavior.
This change also required a few changes to testing: There's now a
sentinel for job retry errors used with the errors package, and a few
testing knobs in the schema change job migration tests that were relying
on the old erroneous behavior have now been corrected.
See #46541 (comment) for an example of a schema change that can't be successfully rolled back.
Closes #47324.
Release note (bug fix): Fixed a bug introduced with the new schema
change job implementation in beta.3 that caused errors when rolling back
a schema change to be swallowed.