
sql, backfill: rolling back a schema change that drops a column leads to restored column with missing data #46541

thoszhang opened this issue Mar 24, 2020
Labels
A-schema-changes A-schema-fixed-by-declarative C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. docs-done docs-known-limitation T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

thoszhang commented Mar 24, 2020

If a schema change involves dropping a column, and the column doesn't have a default value (or anything that would require a backfill if the column were being added for the first time), rolling back the schema change for any reason can produce a restored column with missing data.

Here's an example, on 19.2.2, where we attempt to drop a column and add a unique index in the same transaction, and adding the unique index fails:

[email protected]:54369/movr> create table t (a int, b int); insert into t values (1, 1), (2, 1);
INSERT 2

Time: 13.643ms

[email protected]:54369/movr> begin; alter table t drop column a; create unique index on t(b); commit;
pq: transaction committed but schema change aborted with error: (23505): duplicate key value (b)=(1) violates unique constraint "t_b_key"
HINT: Some of the non-DDL statements may have committed successfully, but some of the DDL statement(s) failed.
Manual inspection may be required to determine the actual state of the database.
--
See: https://github.com/cockroachdb/cockroach/issues/42061
[email protected]:54369/movr> select * from t;
  b |  a
+---+------+
  1 | NULL
  1 | NULL
(2 rows)

Time: 2.192ms

I haven't directly verified that it's possible for only some (or none) of the values in the column to be nulled out, instead of all of them, but it should be possible in theory if we cancel the job before the backfill to drop the column is finished.

Ultimately, this is because we treat the rollback of a dropped column identically to adding the column in the first place. Maybe a relatively easy improvement would be to run a backfill that deletes all the values whenever we're doing a rollback that is initiated in the delete-only state, to at least avoid having a partially deleted column.
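
The behavior described above can be sketched as a toy model (all names here are hypothetical, not CockroachDB code): the column backfiller deletes the data when the column is dropped, and the rollback is just another add-column backfill of the default value, so the original values are unrecoverable.

```python
# Toy model of rollback-as-re-add. Table, drop_column_backfill, and
# rollback_drop are illustrative names, not CockroachDB APIs.

class Table:
    def __init__(self, rows):
        self.rows = rows  # list of dicts: column name -> value

    def drop_column_backfill(self, col):
        # The column backfiller deletes the data; no per-row backup is kept.
        for row in self.rows:
            row.pop(col, None)

    def rollback_drop(self, col, default=None):
        # Rollback is treated identically to adding the column for the
        # first time: backfill the default value (NULL when there is none).
        for row in self.rows:
            row.setdefault(col, default)

t = Table([{"a": 1, "b": 1}, {"a": 2, "b": 1}])
t.drop_column_backfill("a")
t.rollback_drop("a")  # column "a" comes back, but filled with None
print(t.rows)         # [{'b': 1, 'a': None}, {'b': 1, 'a': None}]
```

This mirrors the `SELECT * FROM t` output in the transcript above: the rows survive, but every restored value is NULL.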

This is a problem that's existed for a long time (at least since 19.1, and maybe for as long as the column backfiller has existed), but I couldn't find an existing issue where it's discussed.

Jira issue: CRDB-5083

@jordanlewis jordanlewis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 30, 2020
thoszhang commented Apr 13, 2020

Edit: I've split off the content of this comment into #47712. Not being able to roll back or clean up the schema change at all (thus blocking other schema changes from proceeding) is in some ways much worse than rolling it back successfully with unexpected data, so we should prioritize fixing that separately.

Here's a variation on the same underlying problem: We also don't correctly handle backfills for columns with defaults (or, presumably, computed values), if the column was dropped and then the change was rolled back. This does not work (in 19.2.5):

create table t (a int, b int unique default 1);
insert into t values (1, 1), (1, 2);
begin; alter table t drop column b; create unique index on t(a); commit;

The error that's returned is the one about a not having unique values, but there's also an error from b during the rollback that causes the async schema changer to get stuck retrying forever:

W200413 17:46:00.575033 1658 sql/schema_changer.go:1175  [n1,client=[::1]:50514,user=root,scExec] error purging mutation: candidate pg code: 23505
  - error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql/row.NewUniquenessConstraintViolationError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/errors.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).wrapDupError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).flush
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:113
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).mainLoop
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:225
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).doRun
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:120
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2.4
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:873
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:717
    github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).exec
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:700
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:716
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:789
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166
    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
    	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1337
  - error with embedded safe details: duplicate key value (%s)=(%s) violates unique constraint %q
    -- arg 1: <string>
    -- arg 2: <string>
    -- arg 3: <string>
  - duplicate key value (b)=(1) violates unique constraint "t_b_key"
while handling error: candidate pg code: 23505
  - error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql/row.NewUniquenessConstraintViolationError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/errors.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).wrapDupError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).flush
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:113
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).mainLoop
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:225
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).doRun
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:120
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2.4
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:873
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:717
    github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).exec
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:700
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:716
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:789
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166
    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
    	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1337
  - error with embedded safe details: duplicate key value (%s)=(%s) violates unique constraint %q
    -- arg 1: <string>
    -- arg 2: <string>
    -- arg 3: <string>
  - duplicate key value (a)=(1) violates unique constraint "t_a_key"

The error from b being swallowed poses particular problems for 20.1; see #47324.

craig bot pushed a commit that referenced this issue Apr 15, 2020
47097: geo/geoindex: interface for geometry and geography indexing, and r=sumeerbhola a=sumeerbhola

implementation of those interfaces by s2GMIndex and s2GGIndex

Both implementations use the S2 library to map 2D space to a
single dimension.

The implementations have TODOs regarding heuristics that will
be addressed after we have end-to-end benchmarks. The geometry
index abuses S2 by using face 0 of the unit cube that will be
projected to the unit sphere, to index the rectangular bounding
box of the index. This introduces two distortions:
rectangle => square, and square side of unit cube => unit sphere.
Geometries that exceed the bounding box are clipped, and indexed
both as clipped, and under a special cellid that is unused by S2.

Index acceleration is restricted to three relationships:
Covers, CoveredBy, Intersection. Other relationships will need
to be expressed using these. CoveredBy uses an "inner covering"
to deal with the fact that coverings do not have the strong
properties of bounding boxes. The CoveredBy produces an expression
involving set unions and intersections that is factored to
eliminate common sub-expressions and output in Reverse Polish
notation. How to compute this expression during query execution is
an open topic.

Release note: None

47446: sql: fix handling of errors in schema change rollbacks r=lucy-zhang a=lucy-zhang

This PR does two main things to fix error handling for a schema change
being rolled back in the job's `OnFailOrCancel` hook: (1) errors from
`runStateMachineAndBackfill` (acting on the reversed schema change) are
now returned instead of being logged and swallowed; and (2) we now
return the error and cause the job to fail permanently when we
encounter any error that isn't one of the specific whitelisted
"retriable" errors for the schema changer. Transient "retriable" errors
in the schema changer now correctly lead to retries (which are handled
by the job registry).

The reason for implementing (2) in addition to (1) was that I discovered
some pre-existing instances of erroneous behavior in 19.2 and earlier
versions where the "expected" behavior was that we would never be able
to clean up a failed schema change due to inadequacies in the schema
changer and backfiller. (One notable example is that because the
column backfiller can't recover previous values, if a DROP COLUMN schema
change is being rolled back, trying to backfill a default value on a
column with constraints won't work in general.) Given this, it seemed
best to just return a result to the client right away instead of
pointlessly retrying and blocking indefinitely. This seemed like the
best quick fix that isn't a regression from 19.2 and that somewhat
reproduces the existing behavior.

This change also required a few changes to testing: There's now a
sentinel for job retry errors used with the errors package, and a few
testing knobs in the schema change job migration tests that were relying
on the old erroneous behavior have now been corrected.

See #46541 (comment) for an example of a schema change that can't be successfully rolled back.
Closes #47324.

Release note (bug fix): Fixed a bug introduced with the new schema
change job implementation in beta.3 that caused errors when rolling back
a schema change to be swallowed.

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Lucy Zhang <[email protected]>
ajwerner commented May 5, 2020

This issue was discussed in some detail in the following comment thread which additionally includes a proposal.

#47989 (comment)

The basic idea is that we should re-order the steps of mutations due to a single transaction such that the phases which can fail can be safely rolled back. Pasting the proposal from that comment:

It seems like so long as we (1) leave the dropped column public until all other mutations which can fail have completed, (2) make sure we remove all constraints before dropping the column, and (3) make the drop-column mutation something which cannot be canceled, then we'll be out of this pickle.

Today we assume that the transaction which performs the drop sees the column in WRITE_ONLY for subsequent statements. My thinking is we should have that be true for that transaction but not actually commit the table descriptor in write only. This also gives us an opportunity to drop the constraints prior to making the dropped column non-public. That too is hazardous to roll back. My sense is as soon as we start dropping properties we need to make it impossible to roll back. In short:

  1. Commit Time
    • Create new columns and queue mutations
    • Add new columns in delete-only
  2. Add columns, indexes, and constraints
    • Still can cancel or fail
    • Add new columns in write-only, backfill new indexes, attempt to add constraints
  3. Drop constraints
    • Cannot cancel or fail
    • Drop constraints which have been dropped
    • Drop constraints for columns which are being dropped
  4. Drop columns
    • Cannot cancel or fail
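
The four phases above can be sketched as an ordering constraint (phase names and the cancelable flag are illustrative, not the actual schema-changer API): once the first destructive phase begins, the job as a whole must stop being cancelable.

```python
# Hypothetical sketch of the proposed four-phase ordering. The point is
# that all phases that can fail come strictly before all phases that
# cannot be canceled or rolled back.

PHASES = [
    # (description, cancelable)
    ("commit: create new columns in delete-only, queue mutations", True),
    ("add: columns to write-only, backfill indexes, add constraints", True),
    ("drop constraints (incl. those on columns being dropped)", False),
    ("drop columns", False),
]

def first_uncancelable(phases):
    # Index of the first destructive phase: the job's point of no return.
    for i, (_, cancelable) in enumerate(phases):
        if not cancelable:
            return i
    return len(phases)

point_of_no_return = first_uncancelable(PHASES)
print(point_of_no_return)  # 2: everything before index 2 can be rolled back
```

A rollback initiated before the point of no return only has to undo non-destructive work; after it, there is nothing left to restore, which is exactly the hazard the proposal is designed to avoid.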

A problem is that the job corresponding to all of these mutations will be cancelable for some of its lifetime but not for others. We don't currently have an API affordance to deal with that. We could either:

  1. Add logic to inspect the progress to determine whether a job is cancelable
  2. Create another job for the uncancelable portions of the change.
    • This approach seems worse for several reasons:
      • It will make waiting for all of the mutations harder
        • I suppose we could have the parent wait on the child after it creates it, mitigating this point.
      • It may exacerbate the problems in sql: schema changes can be very slow #47790

It's worth noting that without the completion of #47989, transactions which add and drop columns will need to perform two backfills.


It's also worth noting that if we do defer dropping the column until later, this change will likely either directly solve #47719 or pave the path to solving it, since the constraint would be dropped in an earlier phase than the column; the constraint will therefore certainly have been dropped on all nodes by the time the backfill to drop the column runs. It's possible there's part of that issue I'm misunderstanding.

ajwerner commented May 5, 2020

cc @RichardJCai

thoszhang commented:

> It's worth noting that without the completion of #47989, transactions which add and drop columns will need to perform two backfills.

I realized that another complication in this case is that we're currently forced to implement this sequence of steps as two sequential jobs (or at least two calls to SchemaChanger.exec()). (This is also true for the implementation of ALTER COLUMN TYPE.) The difficulty isn't running two backfills, per se, but that we only allow one iteration through the mutation state machine per "schema change." So it's not just a matter of rearranging the steps in SchemaChanger.runBackfill() as I'd hoped.

I think we should try to avoid having two jobs. The simplest way to do this would probably be to reorganize SchemaChanger.exec() (or just runStateMachineAndBackfill()) to follow the template above. It would not result in a general framework for doing any conceivable schema change we might want to support in the future with a bunch of modular schema change components, but I think a lot of this will get overhauled when we start implementing transactional schema changes anyway.

ajwerner commented Jan 5, 2021

Turns out that this can be even worse than just losing data. It can also leave tables stuck in an unrecoverable state.

set sql_safe_updates=false;
CREATE TABLE foo (i INT PRIMARY KEY, j INT UNIQUE NOT NULL, k INT);
INSERT INTO foo VALUES (1, 1, 1), (2, 2, 1);
BEGIN;
    ALTER TABLE foo DROP COLUMN j;
    CREATE UNIQUE INDEX kuniq ON foo(k);
COMMIT;
-- table foo is now corrupt

ajwerner commented Jan 5, 2021

The issue in the above comment is that we remove the column data (with the column backfiller) and drop the unique index before we discover the unique constraint violation. We then attempt to revert, but that fails because the old column data is gone and the restored column's unique constraint is now violated. It's pretty bad.
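
A toy model of this wedged state (the helper below is hypothetical, not CockroachDB code): the forward index build fails on k, and the rollback's attempt to rebuild the restored column's unique index fails too, because the old values are gone and every row carries the same backfilled value.

```python
# Simplified unique-index backfill: fails on the first duplicate key, as
# the real index backfiller does for a UNIQUE constraint. (This model
# glosses over SQL NULL semantics; the point is that the restored NOT
# NULL column has lost its distinct values either way.)

def unique_index_build(rows, col):
    seen = set()
    for row in rows:
        key = row[col]
        if key in seen:
            raise ValueError(f"duplicate key value ({col})=({key})")
        seen.add(key)

rows = [{"i": 1, "k": 1}, {"i": 2, "k": 1}]  # j's data already deleted

# Forward direction fails: k is not unique.
try:
    unique_index_build(rows, "k")
except ValueError as e:
    print("forward fails:", e)

# Rollback re-adds j, but the original values are unrecoverable, so every
# row gets the same backfilled value -- and rebuilding j's unique index
# fails as well. Neither direction can complete: the table is wedged.
for row in rows:
    row["j"] = None
try:
    unique_index_build(rows, "j")
except ValueError as e:
    print("rollback fails:", e)
```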

ajwerner commented Jan 5, 2021

We do log that manual repair may be required 😉

thoszhang commented:
This is like #47712, right?

@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023