Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: fix handling of errors in schema change rollbacks #47446

Merged
merged 1 commit into from
Apr 15, 2020

Conversation

thoszhang
Copy link
Contributor

@thoszhang thoszhang commented Apr 13, 2020

This PR does two main things to fix error handling for a schema change
being rolled back in the job's OnFailOrCancel hook: (1) errors from
runStateMachineAndBackfill (acting on the reversed schema change) are
now returned instead of being logged and swallowed; and (2) we now
return the error and cause the the job to fail permanently when we
encounter any error that isn't one of the specific whitelisted
"retriable" errors for the schema changer. Transient "retriable" errors
in the schema changer now correctly lead to retries (which are handled
by the job registry).

The reason for implementing (2) in addition to (1) was that I discovered
some pre-existing instances of erroneous behavior in 19.2 and earlier
versions where the "expected" behavior was that we would never be able
to clean up a failed schema change due to inadequacies in the schema
changer and backfiller. (One notable example is that because the
column backfiller can't recover previous values, if a DROP COLUMN schema
change is being rolled back, trying to backfill a default value on a
column with constraints won't work in general.) Given this, it seemed
best to just return a result to the client right away instead of
pointlessly retrying and blocking indefinitely. This seemed like the
best quick fix that isn't a regression from 19.2 and that somewhat
reproduces the existing behavior.

This change also required a few changes to testing: There's now a
sentinel for job retry errors used with the errors package, and a few
testing knobs in the schema change job migration tests that were relying
on the old erroneous behavior have now been corrected.

See #46541 (comment) for an example of a schema change that can't be successfully rolled back.
Closes #47324.

Release note (bug fix): Fixed a bug introduced with the new schema
change job implementation in beta.3 that caused errors when rolling back
a schema change to be swallowed.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@blathers-crl
Copy link

blathers-crl bot commented Apr 13, 2020

❌ The GitHub CI (Cockroach) build has failed on e8f3287f.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang thoszhang force-pushed the schema-change-rollback-error branch from e8f3287 to 2e45b5e Compare April 13, 2020 20:20
@blathers-crl
Copy link

blathers-crl bot commented Apr 13, 2020

❌ The GitHub CI (Cockroach) build has failed on 2e45b5ed.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang thoszhang force-pushed the schema-change-rollback-error branch from 2e45b5e to 0f27bdc Compare April 13, 2020 22:42
@thoszhang thoszhang force-pushed the schema-change-rollback-error branch from 0f27bdc to be198c8 Compare April 14, 2020 00:28
@thoszhang thoszhang changed the title sql: permanently fail the job when a schema change rollback fails with a non-retriable error sql: fix handling of errors in schema change rollbacks Apr 14, 2020
@thoszhang thoszhang marked this pull request as ready for review April 14, 2020 00:31
@thoszhang
Copy link
Contributor Author

The schema change job migration tests were obliquely testing this behavior (and have now been updated), but I still need to write a few tests to focus on this specifically. I think it's ready for a first look, though. (Comments on the approach would be useful, since I wasn't planning to make change (2) described in the commit message until a few hours ago.)

@thoszhang thoszhang requested review from pbardea and ajwerner April 14, 2020 00:37
@thoszhang thoszhang force-pushed the schema-change-rollback-error branch from be198c8 to 2c26173 Compare April 14, 2020 01:57
@thoszhang
Copy link
Contributor Author

Added tests.

@blathers-crl
Copy link

blathers-crl bot commented Apr 14, 2020

❌ The GitHub CI (Cockroach) build has failed on 2c26173a.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

@thoszhang thoszhang force-pushed the schema-change-rollback-error branch from 2c26173 to def588c Compare April 14, 2020 05:01
Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: just the encouragement to adopt require

Reviewed 4 of 4 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @lucy-zhang and @pbardea)


pkg/sql/schema_changer_test.go, line 5920 at r1 (raw file):

Quoted 5 lines of code…
	if matched, err := regexp.MatchString("cannot be reverted, manual cleanup may be required", jobErr); err != nil {
		t.Fatal(err)
	} else if !matched {
		t.Fatalf("unexpected job error %s", jobErr)
	}

nit: I highly recommend require.Regexp:

this would all be:

require.Regexp(t, "cannot be reverted, manual cleanup may be required", jobErr)

Also applies to a bunch of other stuff in this test.

This PR does two main things to fix error handling for a schema change
being rolled back in the job's `OnFailOrCancel` hook: (1) errors from
`runStateMachineAndBackfill` (acting on the reversed schema change) are
now returned instead of being logged and swallowed; and (2) we now
return the error and cause the the job to fail permanently when we
encounter any error that isn't one of the specific whitelisted
"retriable" errors for the schema changer. Transient "retriable" errors
in the schema changer now correctly lead to retries (which are handled
by the job registry).

The reason for implementing (2) in addition to (1) was that I discovered
some pre-existing instances of erroneous behavior in 19.2 and earlier
versions where the "expected" behavior was that we would never be able
to clean up a failed schema change due to inadequacies in the schema
changer and backfiller. (One notable example is that because the
column backfiller can't recover previous values, if a DROP COLUMN schema
change is being rolled back, trying to backfill a default value on a
column with constraints won't work in general.) Given this, it seemed
best to just return a result to the client right away instead of
pointlessly retrying and blocking indefinitely. This seemed like the
best quick fix that isn't a regression from 19.2 and that somewhat
reproduces the existing behavior.

This change also required a few changes to testing: There's now a
sentinel for job retry errors used with the errors package, and a few
testing knobs in the schema change job migration tests that were relying
on the old erroneous behavior have now been corrected.

Release note (bug fix): Fixed a bug introduced with the new schema
change job implementation in beta.3 that caused errors when rolling back
a schema change to be swallowed.
@thoszhang thoszhang force-pushed the schema-change-rollback-error branch from def588c to 68b0c08 Compare April 14, 2020 23:51
Copy link
Contributor Author

@thoszhang thoszhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/sql/schema_changer_test.go, line 5920 at r1 (raw file):

Previously, ajwerner wrote…
	if matched, err := regexp.MatchString("cannot be reverted, manual cleanup may be required", jobErr); err != nil {
		t.Fatal(err)
	} else if !matched {
		t.Fatalf("unexpected job error %s", jobErr)
	}

nit: I highly recommend require.Regexp:

this would all be:

require.Regexp(t, "cannot be reverted, manual cleanup may be required", jobErr)

Also applies to a bunch of other stuff in this test.

Done.

@thoszhang
Copy link
Contributor Author

TFTRs

bors r+

Copy link
Contributor

@ajwerner ajwerner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner)


pkg/sql/schema_changer_test.go, line 5857 at r2 (raw file):

CREATE UNIQUE INDEX i ON t.test(v);
`)
	require.Error(t, err)

totally not worth going back and changing but in the future you don't even need to NoError this, Regexp will check if the argument is nil, fail, and then if it's non-nil, detect if it's an error will call Error() on it and compare it to the regexp


pkg/sql/schema_changer_test.go, line 5908 at r2 (raw file):

CREATE UNIQUE INDEX i ON t.test(v);
`)
	require.Error(t, err)

same here

@craig
Copy link
Contributor

craig bot commented Apr 15, 2020

Build succeeded

craig bot pushed a commit that referenced this pull request Apr 16, 2020
46929: sql: wait to start index GC job until schema changer is done r=lucy-zhang a=lucy-zhang

Previously, we were creating and starting the index GC job in a separate
transaction from the one used in `PublishMultiple()` in
`SchemaChanger.done()`, meaning that the index GC job could be run
before the original transaction to finalize all schema changes on the
table descriptor had been committed. This led to unpredictable behavior
while testing. It could also potentially cause multiple GC jobs to be
created if the `PublishMultiple()` closure were retried.

In this PR, the GC job is now created as a `StartableJob` that is
started only after the table descriptor in the finalized state has been
published.

Release justification: Bug fix for new feature.

Release note (bug fix): Fixed a bug introduced in beta.3 that could
cause multiple index GC jobs to be created for the same schema change in
rare cases.

47540: sql: add unimplemented support for drop type r=rohany a=rohany

This PR adds unimplemented support for the DROP TYPE command,
as well as adding some missing help/annotations for the CREATE
TYPE command.

Release note: None

47553: sql: fix handling of errors in reversing schema change mutations r=lucy-zhang a=lucy-zhang

As part of fixing error handling for rolling back schema changes in
general, a bug was introduced: When an error was returned from reversing
the direction of mutations, instead of returning that error, we would
instead return the original resumer error. This would prevent some
retriable errors from producing retries, and could lead to failed jobs
that weren't cleaned up even though they could have been; fixing these
problems was the point of the original PR. This patch replaces the wrong
error with the right error.

A new schema change testing knob has been added to inject errors for
testing.

See #47446 for the original error handling fix.
Closes #47532.

Release note (bug fix): Fixed a bug causing some schema change rollbacks
to fail permanently even on transient errors.

Co-authored-by: Lucy Zhang <[email protected]>
Co-authored-by: Rohan Yadav <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

sql: errors during schema change rollback are getting swallowed
5 participants