
sql, backfill: rolling back a schema change that drops a column leads to restored column with missing data #46541

thoszhang opened this issue Mar 24, 2020
Labels
A-schema-changes A-schema-fixed-by-declarative C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. docs-done docs-known-limitation T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions)

thoszhang commented Mar 24, 2020

If a schema change involves dropping a column, and the column doesn't have a default value (or anything that would require a backfill if the column were being added for the first time), rolling back the schema change for any reason can produce a restored column with missing data.

Here's an example, on 19.2.2, where we attempt to drop a column and add a unique index in the same transaction, and adding the unique index fails:

[email protected]:54369/movr> create table t (a int, b int); insert into t values (1, 1), (2, 1);
INSERT 2

Time: 13.643ms

[email protected]:54369/movr> begin; alter table t drop column a; create unique index on t(b); commit;
pq: transaction committed but schema change aborted with error: (23505): duplicate key value (b)=(1) violates unique constraint "t_b_key"
HINT: Some of the non-DDL statements may have committed successfully, but some of the DDL statement(s) failed.
Manual inspection may be required to determine the actual state of the database.
--
See: https://github.com/cockroachdb/cockroach/issues/42061
[email protected]:54369/movr> select * from t;
  b |  a
+---+------+
  1 | NULL
  1 | NULL
(2 rows)

Time: 2.192ms

I haven't directly verified that it's possible for only some (or none) of the values in the column to be nulled out, instead of all of them, but it should be possible in theory if we cancel the job before the backfill to drop the column is finished.

Ultimately, this is because we treat the rollback of a dropped column identically to adding the column in the first place. Maybe a relatively easy improvement would be to run a backfill that deletes all the values whenever we're doing a rollback that is initiated in the delete-only state, to at least avoid having a partially deleted column.
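
The behavior described above can be sketched as a toy model (all names here are hypothetical, not CockroachDB code): the column backfiller deletes the data when the column is dropped, and the rollback is just another add-column backfill of the default value, so the original values are unrecoverable.

```python
# Toy model of rollback-as-re-add. Table, drop_column_backfill, and
# rollback_drop are illustrative names, not CockroachDB APIs.

class Table:
    def __init__(self, rows):
        self.rows = rows  # list of dicts: column name -> value

    def drop_column_backfill(self, col):
        # The column backfiller deletes the data; no per-row backup is kept.
        for row in self.rows:
            row.pop(col, None)

    def rollback_drop(self, col, default=None):
        # Rollback is treated identically to adding the column for the
        # first time: backfill the default value (NULL when there is none).
        for row in self.rows:
            row.setdefault(col, default)

t = Table([{"a": 1, "b": 1}, {"a": 2, "b": 1}])
t.drop_column_backfill("a")
t.rollback_drop("a")  # column "a" comes back, but filled with None
print(t.rows)         # [{'b': 1, 'a': None}, {'b': 1, 'a': None}]
```

This mirrors the `SELECT * FROM t` output in the transcript above: the rows survive, but every restored value is NULL.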

This is a problem that's existed for a long time (at least since 19.1, and maybe for as long as the column backfiller has existed), but I couldn't find an existing issue where it's discussed.

Jira issue: CRDB-5083

@jordanlewis jordanlewis added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Mar 30, 2020
thoszhang commented Apr 13, 2020

Edit: I've split off the content of this comment into #47712. Not being able to roll back or clean up the schema change at all (thus blocking other schema changes from proceeding) is in some ways much worse than rolling it back successfully with unexpected data, so we should prioritize fixing that separately.

Here's a variation on the same underlying problem: We also don't correctly handle backfills for columns with defaults (or, presumably, computed values), if the column was dropped and then the change was rolled back. This does not work (in 19.2.5):

create table t (a int, b int unique default 1);
insert into t values (1, 1), (1, 2);
begin; alter table t drop column b; create unique index on t(a); commit;

The error that's returned is the one about a not having unique values, but there's also an error from b during the rollback that causes the async schema changer to get stuck retrying forever:

W200413 17:46:00.575033 1658 sql/schema_changer.go:1175  [n1,client=[::1]:50514,user=root,scExec] error purging mutation: candidate pg code: 23505
  - error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql/row.NewUniquenessConstraintViolationError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/errors.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).wrapDupError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).flush
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:113
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).mainLoop
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:225
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).doRun
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:120
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2.4
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:873
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:717
    github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).exec
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:700
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:716
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:789
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166
    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
    	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1337
  - error with embedded safe details: duplicate key value (%s)=(%s) violates unique constraint %q
    -- arg 1: <string>
    -- arg 2: <string>
    -- arg 3: <string>
  - duplicate key value (b)=(1) violates unique constraint "t_b_key"
while handling error: candidate pg code: 23505
  - error with attached stack trace:
    github.com/cockroachdb/cockroach/pkg/sql/row.NewUniquenessConstraintViolationError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/row/errors.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).wrapDupError
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*indexBackfiller).flush
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/indexbackfiller.go:113
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).mainLoop
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:225
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).doRun
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:135
    github.com/cockroachdb/cockroach/pkg/sql/rowexec.(*backfiller).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/rowexec/backfiller.go:120
    github.com/cockroachdb/cockroach/pkg/sql/flowinfra.(*FlowBase).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/flowinfra/flow.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*DistSQLPlanner).Run
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/distsql_running.go:372
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2.4
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:873
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:717
    github.com/cockroachdb/cockroach/pkg/internal/client.(*Txn).exec
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/txn.go:700
    github.com/cockroachdb/cockroach/pkg/internal/client.(*DB).Txn
    	/go/src/github.com/cockroachdb/cockroach/pkg/internal/client/db.go:716
    github.com/cockroachdb/cockroach/pkg/sql.(*SchemaChanger).distBackfill.func2
    	/go/src/github.com/cockroachdb/cockroach/pkg/sql/backfill.go:789
    github.com/cockroachdb/cockroach/pkg/util/ctxgroup.Group.GoCtx.func1
    	/go/src/github.com/cockroachdb/cockroach/pkg/util/ctxgroup/ctxgroup.go:166
    github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup.(*Group).Go.func1
    	/go/src/github.com/cockroachdb/cockroach/vendor/golang.org/x/sync/errgroup/errgroup.go:57
    runtime.goexit
    	/usr/local/go/src/runtime/asm_amd64.s:1337
  - error with embedded safe details: duplicate key value (%s)=(%s) violates unique constraint %q
    -- arg 1: <string>
    -- arg 2: <string>
    -- arg 3: <string>
  - duplicate key value (a)=(1) violates unique constraint "t_a_key"

The error from b being swallowed poses particular problems for 20.1; see #47324.

craig bot pushed a commit that referenced this issue Apr 15, 2020
47097: geo/geoindex: interface for geometry and geography indexing, and r=sumeerbhola a=sumeerbhola

implementation of those interfaces by s2GMIndex and s2GGIndex

Both implementations use the S2 library to map 2D space to a
single dimension.

The implementations have TODOs regarding heuristics that will
be addressed after we have end-to-end benchmarks. The geometry
index abuses S2 by using face 0 of the unit cube that will be
projected to the unit sphere, to index the rectangular bounding
box of the index. This introduces two distortions:
rectangle => square, and square side of unit cube => unit sphere.
Geometries that exceed the bounding box are clipped, and indexed
both as clipped, and under a special cellid that is unused by S2.

Index acceleration is restricted to three relationships:
Covers, CoveredBy, Intersection. Other relationships will need
to be expressed using these. CoveredBy uses an "inner covering"
to deal with the fact that coverings do not have the strong
properties of bounding boxes. The CoveredBy produces an expression
involving set unions and intersections that is factored to
eliminate common sub-expressions and output in Reverse Polish
notation. How to compute this expression during query execution is
an open topic.

Release note: None

47446: sql: fix handling of errors in schema change rollbacks r=lucy-zhang a=lucy-zhang

This PR does two main things to fix error handling for a schema change
being rolled back in the job's `OnFailOrCancel` hook: (1) errors from
`runStateMachineAndBackfill` (acting on the reversed schema change) are
now returned instead of being logged and swallowed; and (2) we now
return the error and cause the job to fail permanently when we
encounter any error that isn't one of the specific whitelisted
"retriable" errors for the schema changer. Transient "retriable" errors
in the schema changer now correctly lead to retries (which are handled
by the job registry).

The reason for implementing (2) in addition to (1) was that I discovered
some pre-existing instances of erroneous behavior in 19.2 and earlier
versions where the "expected" behavior was that we would never be able
to clean up a failed schema change due to inadequacies in the schema
changer and backfiller. (One notable example is that because the
column backfiller can't recover previous values, if a DROP COLUMN schema
change is being rolled back, trying to backfill a default value on a
column with constraints won't work in general.) Given this, it seemed
best to just return a result to the client right away instead of
pointlessly retrying and blocking indefinitely. This seemed like the
best quick fix that isn't a regression from 19.2 and that somewhat
reproduces the existing behavior.

This change also required a few changes to testing: There's now a
sentinel for job retry errors used with the errors package, and a few
testing knobs in the schema change job migration tests that were relying
on the old erroneous behavior have now been corrected.

See #46541 (comment) for an example of a schema change that can't be successfully rolled back.
Closes #47324.

Release note (bug fix): Fixed a bug introduced with the new schema
change job implementation in beta.3 that caused errors when rolling back
a schema change to be swallowed.

Co-authored-by: sumeerbhola <[email protected]>
Co-authored-by: Lucy Zhang <[email protected]>
ajwerner commented May 5, 2020

This issue was discussed in some detail in the following comment thread which additionally includes a proposal.

#47989 (comment)

The basic idea is that we should re-order the steps of mutations due to a single transaction such that the phases which can fail can be safely rolled back. Pasting the proposal from that comment:

It seems like so long as we (1) leave the dropped column public until all other mutations which can fail have completed, (2) make sure we remove all constraints before dropping the column, and (3) make the drop-column mutation something which cannot be canceled, then we'll be out of this pickle.

Today we assume that the transaction which performs the drop sees the column in WRITE_ONLY for subsequent statements. My thinking is we should have that be true for that transaction but not actually commit the table descriptor in write only. This also gives us an opportunity to drop the constraints prior to making the dropped column non-public. That too is hazardous to roll back. My sense is as soon as we start dropping properties we need to make it impossible to roll back. In short:

  1. Commit Time
    • Create new columns and queue mutations
    • Add new columns in delete-only
  2. Add columns, indexes, and constraints
    • Still can cancel or fail
    • Add new columns in write-only, backfill new indexes, attempt to add constraints
  3. Drop constraints
    • Cannot cancel or fail
    • Drop constraints which have been dropped
    • Drop constraints for columns which are being dropped
  4. Drop columns
    • Cannot cancel or fail
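
The four phases above can be sketched as an ordering constraint (phase names and the cancelable flag are illustrative, not the actual schema-changer API): once the first destructive phase begins, the job as a whole must stop being cancelable.

```python
# Hypothetical sketch of the proposed four-phase ordering. The point is
# that all phases that can fail come strictly before all phases that
# cannot be canceled or rolled back.

PHASES = [
    # (description, cancelable)
    ("commit: create new columns in delete-only, queue mutations", True),
    ("add: columns to write-only, backfill indexes, add constraints", True),
    ("drop constraints (incl. those on columns being dropped)", False),
    ("drop columns", False),
]

def first_uncancelable(phases):
    # Index of the first destructive phase: the job's point of no return.
    for i, (_, cancelable) in enumerate(phases):
        if not cancelable:
            return i
    return len(phases)

point_of_no_return = first_uncancelable(PHASES)
print(point_of_no_return)  # 2: everything before index 2 can be rolled back
```

A rollback initiated before the point of no return only has to undo non-destructive work; after it, there is nothing left to restore, which is exactly the hazard the proposal is designed to avoid.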

A problem is that the job corresponding to all of these mutations will be cancelable for some of its lifetime but not for others. We don't currently have an API affordance to deal with that. We could either:

  1. Add logic to inspect the progress to determine whether a job is cancelable
  2. Create another job for the uncancelable portions of the change.
    • This approach seems worse for several reasons:
      • It will make waiting for all of the mutations harder
        • I suppose we could have the parent wait on the child after it creates it, mitigating this point.
      • It may exacerbate the problems in sql: schema changes can be very slow #47790

It's worth noting that without the completion of #47989, transactions which add and drop columns will need to perform two backfills.


It's also worth noting that if we do defer dropping the column until later, this change will likely either directly solve #47719 or pave the path to solving it, since the constraint would be dropped in an earlier phase than the column; the constraint will therefore certainly have been dropped on all nodes by the time the backfill to drop the column runs. It's possible there's part of that issue I'm misunderstanding.

ajwerner commented May 5, 2020

cc @RichardJCai

thoszhang commented:

> It's worth noting that without the completion of #47989, transactions which add and drop columns will need to perform two backfills.

I realized that another complication in this case is that we're currently forced to implement this sequence of steps as two sequential jobs (or at least two calls to SchemaChanger.exec()). (This is also true for the implementation of ALTER COLUMN TYPE.) The difficulty isn't running two backfills, per se, but that we only allow one iteration through the mutation state machine per "schema change." So it's not just a matter of rearranging the steps in SchemaChanger.runBackfill() as I'd hoped.

I think we should try to avoid having two jobs. The simplest way to do this would probably be to reorganize SchemaChanger.exec() (or just runStateMachineAndBackfill()) to follow the template above. It would not result in a general framework for doing any conceivable schema change we might want to support in the future with a bunch of modular schema change components, but I think a lot of this will get overhauled when we start implementing transactional schema changes anyway.

ajwerner commented Jan 5, 2021

Turns out that this can be even worse than just losing data. It can also leave tables stuck in an unrecoverable state.

set sql_safe_updates=false;
CREATE TABLE foo (i INT PRIMARY KEY, j INT UNIQUE NOT NULL, k INT);
INSERT INTO foo VALUES (1, 1, 1), (2, 2, 1);
BEGIN;
    ALTER TABLE foo DROP COLUMN j;
    CREATE UNIQUE INDEX kuniq ON foo(k);
COMMIT;
-- table foo is now corrupt

ajwerner commented Jan 5, 2021

The issue in the above comment is that we remove the column data (with the column backfiller) and drop the unique index before we discover the unique constraint violation. We then attempt to revert, but that fails because the old column data is gone and the restored column's unique constraint is now violated. It's pretty bad.
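
A toy model of this wedged state (the helper below is hypothetical, not CockroachDB code): the forward index build fails on k, and the rollback's attempt to rebuild the restored column's unique index fails too, because the old values are gone and every row carries the same backfilled value.

```python
# Simplified unique-index backfill: fails on the first duplicate key, as
# the real index backfiller does for a UNIQUE constraint. (This model
# glosses over SQL NULL semantics; the point is that the restored NOT
# NULL column has lost its distinct values either way.)

def unique_index_build(rows, col):
    seen = set()
    for row in rows:
        key = row[col]
        if key in seen:
            raise ValueError(f"duplicate key value ({col})=({key})")
        seen.add(key)

rows = [{"i": 1, "k": 1}, {"i": 2, "k": 1}]  # j's data already deleted

# Forward direction fails: k is not unique.
try:
    unique_index_build(rows, "k")
except ValueError as e:
    print("forward fails:", e)

# Rollback re-adds j, but the original values are unrecoverable, so every
# row gets the same backfilled value -- and rebuilding j's unique index
# fails as well. Neither direction can complete: the table is wedged.
for row in rows:
    row["j"] = None
try:
    unique_index_build(rows, "j")
except ValueError as e:
    print("rollback fails:", e)
```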

ajwerner commented Jan 5, 2021

We do log that manual repair may be required 😉

thoszhang commented:
This is like #47712, right?

@jlinder jlinder added the T-sql-schema-deprecated Use T-sql-foundations instead label Jun 16, 2021
@exalate-issue-sync exalate-issue-sync bot added T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) and removed T-sql-schema-deprecated Use T-sql-foundations instead labels May 10, 2023