sql, backfill: rolling back a schema change that drops a column leads to restored column with missing data #46541
Edit: I've split off the content of this comment into #47712, since not being able to roll back or clean up the schema change at all, and thus blocking other schema changes from proceeding, is in some ways much worse than doing it successfully with unexpected data. We should prioritize fixing that separately.

Here's a variation on the same underlying problem: we also don't correctly handle backfills for columns with defaults (or, presumably, computed values) if the column was dropped and then the change was rolled back. This does not work (in 19.2.5):
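The snippet itself did not survive on this page, so the following is only a hedged sketch of the scenario being described; the table `t`, its defaulted column `d`, and the failing index are assumptions modeled on the `foo` repro later in this thread:

```sql
-- Hedged sketch (names and values are assumptions): drop a column that
-- has a DEFAULT in the same transaction as an index creation that will
-- fail, forcing the DROP COLUMN to be rolled back.
CREATE TABLE t (i INT PRIMARY KEY, d INT DEFAULT 42, k INT);
INSERT INTO t VALUES (1, 10, 1), (2, 20, 1);

BEGIN;
ALTER TABLE t DROP COLUMN d;
CREATE UNIQUE INDEX k_idx ON t (k); -- backfill will fail: k has duplicates
COMMIT;
-- COMMIT returns, but the async index backfill fails and the schema
-- change must be rolled back. Re-adding d requires backfilling DEFAULT 42,
-- which can't restore the original values (10, 20); per this comment,
-- in 19.2.5 the rollback errors instead of completing.
```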
47097: geo/geoindex: interface for geometry and geography indexing, and implementation of those interfaces by s2GMIndex and s2GGIndex r=sumeerbhola a=sumeerbhola

Both implementations use the S2 library to map 2D space to a single dimension. The implementations have TODOs regarding heuristics that will be addressed after we have end-to-end benchmarks.

The geometry index abuses S2 by using face 0 of the unit cube that will be projected to the unit sphere to index the rectangular bounding box of the index. This introduces two distortions: rectangle => square, and square side of unit cube => unit sphere. Geometries that exceed the bounding box are clipped, and indexed both as clipped and under a special cellid that is unused by S2.

Index acceleration is restricted to three relationships: Covers, CoveredBy, Intersection. Other relationships will need to be expressed using these. CoveredBy uses an "inner covering" to deal with the fact that coverings do not have the strong properties of bounding boxes. CoveredBy produces an expression involving set unions and intersections that is factored to eliminate common sub-expressions and output in Reverse Polish notation. How to compute this expression during query execution is an open topic.

Release note: None

47446: sql: fix handling of errors in schema change rollbacks r=lucy-zhang a=lucy-zhang

This PR does two main things to fix error handling for a schema change being rolled back in the job's `OnFailOrCancel` hook: (1) errors from `runStateMachineAndBackfill` (acting on the reversed schema change) are now returned instead of being logged and swallowed; and (2) we now return the error and cause the job to fail permanently when we encounter any error that isn't one of the specific whitelisted "retriable" errors for the schema changer. Transient "retriable" errors in the schema changer now correctly lead to retries (which are handled by the job registry).

The reason for implementing (2) in addition to (1) was that I discovered some pre-existing instances of erroneous behavior in 19.2 and earlier versions where the "expected" behavior was that we would never be able to clean up a failed schema change, due to inadequacies in the schema changer and backfiller. (One notable example: because the column backfiller can't recover previous values, if a DROP COLUMN schema change is being rolled back, trying to backfill a default value on a column with constraints won't work in general.) Given this, it seemed best to just return a result to the client right away instead of pointlessly retrying and blocking indefinitely. This seemed like the best quick fix that isn't a regression from 19.2 and that somewhat reproduces the existing behavior.

This change also required a few changes to testing: there's now a sentinel for job retry errors used with the errors package, and a few testing knobs in the schema change job migration tests that were relying on the old erroneous behavior have now been corrected.

See #46541 (comment) for an example of a schema change that can't be successfully rolled back.

Closes #47324.

Release note (bug fix): Fixed a bug introduced with the new schema change job implementation in beta.3 that caused errors when rolling back a schema change to be swallowed.

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Lucy Zhang <[email protected]>
This issue was discussed in some detail in the following comment thread, which additionally includes a proposal. The basic idea is that we should re-order the steps of the mutations due to a single transaction such that the phases which can fail can be safely rolled back. Pasting the proposal from that comment:

It seems like we'll be out of this pickle so long as we leave the dropped column public until all other mutations which can fail have completed, make sure we remove all constraints before dropping the column, and then make the drop-column mutation something which cannot be canceled.

Today we assume that the transaction which performs the drop sees the column in WRITE_ONLY for subsequent statements. My thinking is that this should remain true for that transaction, but that we should not actually commit the table descriptor with the column in WRITE_ONLY. This also gives us an opportunity to drop the constraints prior to making the dropped column non-public; dropping constraints is likewise hazardous to roll back. My sense is that as soon as we start dropping properties, we need to make rollback impossible. In short:

1. Leave the dropped column public until all other mutations that can fail have completed.
2. Drop all constraints on the column before making it non-public.
3. Once we start dropping the column's properties, make the mutation impossible to cancel.
A problem is that the job corresponding to all of these mutations will be cancelable during some of its lifetime but not the rest, and we don't currently have an API affordance to deal with that.
It's worth noting that without the completion of #47989, transactions which add and drop columns will need to perform two backfills. It's also worth noting that if we do defer dropping the column until later, this code will likely either directly solve or pave the path to solving #47719, as it will drop the constraint in one phase before dropping the column; thus the constraint will certainly have been dropped on all nodes by the time the backfill to drop the column tries to run. It's possible there's part of that issue I'm misunderstanding.
cc @RichardJCai
I realized that another complication in this case is that we're currently forced to implement this sequence of steps as two sequential jobs (or at least two calls to …). I think we should try to avoid having two jobs. The simplest way to do this would probably be to reorganize …
Turns out that this can be even worse than just losing data. It can also lead to tables being stuck in an unrecoverable state:

```sql
set sql_safe_updates=false;
CREATE TABLE foo (i INT PRIMARY KEY, j INT UNIQUE NOT NULL, k INT);
INSERT INTO foo VALUES (1, 1, 1), (2, 2, 1);
BEGIN;
ALTER TABLE foo DROP COLUMN j;
CREATE UNIQUE INDEX kuniq ON foo(k);
COMMIT;
-- table foo is now corrupt
```
The issue in the above comment is that we remove the column data (with the column backfiller) and drop the column's unique index before we find out about the unique constraint violation on the new index. We then attempt to revert, but that fails because the old column data is gone and the column's unique constraint is now violated. It's pretty bad.
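To make the failure mode concrete, here's a hedged sketch of what the revert is effectively being asked to do; this is an illustration of the mechanism, not the actual schema-changer code path:

```sql
-- Conceptually, rolling back the DROP COLUMN amounts to re-running:
ALTER TABLE foo ADD COLUMN j INT UNIQUE NOT NULL;
-- But the original values of j were already deleted by the column
-- backfiller, so there is nothing valid to backfill: any uniform
-- backfilled value violates UNIQUE across rows, and NULL violates
-- NOT NULL. The revert can never succeed.
```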
We do log that manual repair may be required 😉
This is like #47712, right?
If a schema change involves dropping a column, and the column doesn't have a default value (or anything that would require a backfill if the column were being added for the first time), rolling back the schema change for any reason can produce a restored column with missing data.
Here's an example, on 19.2.2, where we attempt to drop a column and add a unique index in the same transaction, and adding the unique index fails:
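The example itself was lost from this page; below is a hedged reconstruction of its shape. The table, column, and index names are assumptions, and the result shown reflects the behavior this issue describes rather than captured output:

```sql
-- Column j has no default and no constraints.
CREATE TABLE t (i INT PRIMARY KEY, j INT, k INT);
INSERT INTO t VALUES (1, 10, 1), (2, 20, 1);

BEGIN;
ALTER TABLE t DROP COLUMN j;
CREATE UNIQUE INDEX k_idx ON t (k); -- backfill fails: k has duplicates
COMMIT;

-- The failed index backfill rolls back the whole schema change. Column j
-- is restored, but its old values were already deleted, so it comes back
-- with missing data:
SELECT i, j, k FROM t;
--   i |  j   | k
-- ----+------+----
--   1 | NULL | 1
--   2 | NULL | 1
```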
I haven't directly verified that it's possible to only have some (or none) of the values in the column be nulled out, instead of all of them, but it should be possible in theory if we cancel the job before the backfill to drop the column is finished.
Ultimately, this is because we treat the rollback of a dropped column identically to adding the column in the first place. Maybe a relatively easy improvement would be to run a backfill that deletes all the values whenever we're doing a rollback that is initiated in the delete-only state, to at least avoid having a partially deleted column.
This is a problem that's existed for a long time (at least since 19.1, and maybe for as long as the column backfiller has existed), but I couldn't find an existing issue where it's discussed.
Jira issue: CRDB-5083