sql: transactions containing certain DDL statements are not atomic #42061
CockroachDB currently allows an explicit BEGIN..COMMIT txn containing DDL to fail atomicity in certain cases. These cases are detected during COMMIT and reported to the client as error XXA00. This patch extends the error message to link to issue cockroachdb#42061, which describes and handles this situation further.

Release note (sql change): the error message generated when a txn containing DDL is both partially committed and partially rolled back (error XXA00) now contains a link to GitHub issue cockroachdb#42061, where this situation is discussed further.
Proposing a session flag.
Perhaps related (but unlikely): #42064
Previously, CockroachDB always accepted DDL inside explicit txns (BEGIN...COMMIT) and would issue some of these DDL statements as asynchronous jobs after the rest of the txn was committed successfully. If that DDL subsequently failed, a client would get error XXA00 to signal this error. However, at that point the txn would be both partially committed and partially rolled back: in effect, an atomicity violation.

This was considered acceptable because in practice most clients using DDL inside BEGIN...COMMIT do not, in fact, expect ACID semantics in those cases. So it was considered OK to not deliver ACID semantics in certain cases. Except that we now have clients that do want us to "do the right thing" and never allow a situation that _could_ yield atomicity violations.

This patch makes a step in this direction by introducing a new session variable `strict_ddl_atomicity`, which defaults to `false` for compatibility with clients of CockroachDB 19.2 and previous versions. When this setting is enabled / set to `true`, any DDL that would otherwise be processed asynchronously and risk triggering error XXA00 is now fully disallowed, with the following error:

```
pq: unimplemented: cannot run this DDL statement inside BEGIN..COMMIT as its atomicity cannot be guaranteed
HINT: You have attempted to use a feature that is not yet implemented.
See: cockroachdb#42061
--
You can set the session variable 'strict_ddl_atomicity' to false if you are willing to accept atomicity violations.
```

Other DDL statements without async processing are still allowed without triggering that error. Examples of rejected DDL include, but are not limited to: CREATE INDEX, ALTER ADD CONSTRAINT, ADD COLUMN with DEFAULT, DROP INDEX. Examples of DDL that's still processed: RENAME COLUMN, RENAME TABLE, ALTER INDEX CONFIGURE ZONE.

Release note (sql change): a SQL client can now request strict atomicity for mixed DDL/DML transactions with the new session variable `strict_ddl_atomicity`, which defaults to `false`. When this variable is set to `true`, CockroachDB will refuse to process those specific DDL statements inside BEGIN...COMMIT for which it cannot guarantee atomic processing (other DDL statements are still allowed).
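To make the intended behavior concrete, here is a minimal sketch of a client session, assuming a build that includes the proposed `strict_ddl_atomicity` variable (table and index names are illustrative, not from the patch):

```sql
-- Hypothetical session against a build carrying this patch.
SET strict_ddl_atomicity = true;

BEGIN;
INSERT INTO t (k) VALUES (1);
-- CREATE INDEX needs an async backfill job, so under strict mode it is
-- rejected up front instead of risking error XXA00 at COMMIT time:
CREATE INDEX t_k_idx ON t (k);
-- ERROR: unimplemented: cannot run this DDL statement inside
--        BEGIN..COMMIT as its atomicity cannot be guaranteed
ROLLBACK;

-- Metadata-only DDL remains allowed inside a txn:
BEGIN;
ALTER TABLE t RENAME COLUMN k TO k2;
COMMIT;
```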
I created an inventory of all the currently recognized DDL and how they are affected by this error:
Note about CREATE TABLE AS: this statement processes the insert
@andy-kimball in a separate thread suggests that every "long running" DDL should be considered as split between a sync part and an async part, with the understanding that the async part may fail. A key part of this proposal is that a client must be able to verify whether the async part has completed successfully. When the complex DDL is related to a constraint, this is trivially possible, as both pg and crdb have a VALIDATE CONSTRAINT statement for this specific purpose. However, as per the table in my previous comment above, the following statements do not have a clear way for clients to confirm completion:
For reference here is what pg 12 has to say about ALTER TABLE SET TYPE, ADD DEFAULT and other ALTER forms that require a table rewrite: https://www.postgresql.org/docs/12/mvcc-caveats.html
For reference here is what MSSQL has to say about online changes, including type changes and new columns: https://docs.microsoft.com/en-us/sql/t-sql/statements/alter-table-transact-sql?view=sql-server-ver15
The language is a bit obfuscated, but this says (in other words) that ALTER COLUMN SET TYPE really adds a new column to perform the change, then converts, then drops the original column. Until/unless the conversion completes, concurrent txns either continue to observe the original column (until the DDL-containing txn commits) or get suspended while waiting on a lock for the new column (after the DDL-containing txn logically commits). Similarly for new columns, the new column remains invisible until the backfill completes. (It's not clear how clients are meant to observe whether the backfill completed successfully or not.)
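The add-convert-drop strategy described above can be sketched as the following sequence of statements. This is only an illustration of what such engines do internally (and under locks); the names `c_new` etc. are hypothetical:

```sql
-- Hypothetical expansion of: ALTER TABLE t ALTER COLUMN c SET TYPE DECIMAL
ALTER TABLE t ADD COLUMN c_new DECIMAL;    -- 1. add a shadow column
UPDATE t SET c_new = c::DECIMAL;           -- 2. convert/backfill (the long-running part)
ALTER TABLE t DROP COLUMN c;               -- 3. drop the original column
ALTER TABLE t RENAME COLUMN c_new TO c;    -- 4. swap the new column into place
```

Step 2 is where a conversion can fail per row, which is why the overall change cannot be made synchronously atomic without locking the table for the duration.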
As per today's SIG notes, @lucy-zhang and @dt to review my analysis above (including the complete DDL table).
Separately, regarding Andy's proposal to avoid an error in COMMIT even when the async DDL fails, @rafiss contributes:
Example Go ORM affected: https://github.com/gobuffalo/pop, especially https://github.com/gobuffalo/pop/blob/master/migrator.go#L93. It will yield invalid ORM migration behavior if error XXA00 occurs.
Yeah, this is the key, and suggests another path forward: require the NOT VALID modifier on all constraint additions in transactions.
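Under that path, the transaction would only ever contain the synchronous, metadata-only part of the change, and the long-running validation would be an explicit follow-up step whose outcome the client can observe directly. A minimal sketch in pg-style SQL (table and constraint names are illustrative):

```sql
-- Inside the txn: add the constraint unvalidated. This enforces the
-- constraint for new writes going forward but does not scan existing
-- rows, so no async job is needed and COMMIT stays atomic.
BEGIN;
ALTER TABLE orders
    ADD CONSTRAINT orders_customer_fk
    FOREIGN KEY (customer_id) REFERENCES customers (id)
    NOT VALID;
COMMIT;

-- Later, outside the txn: validate existing rows. This is the
-- long-running O(# rows) part, and its success or failure is
-- reported directly to the client that runs it.
ALTER TABLE orders VALIDATE CONSTRAINT orders_customer_fk;
```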
This works for constraints but does not help with ALTER SET TYPE and ADD COLUMN DEFAULT (and adding computed columns).
For DEFAULT columns *maybe* we can engineer a restriction that only scalar expressions that are guaranteed to never fail are allowed. That'd be hard and of very dubious UX. Also it would be much harder / impossible to engineer this type of restriction for computed columns.
And we're still left in the dark for SET TYPE.
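To illustrate why these cases are hard: the backfill expression is evaluated per row, so a runtime error can surface only once the async job reaches a particular row, after the txn has already committed. A hypothetical example (names and data invented for illustration):

```sql
CREATE TABLE accounts (id INT PRIMARY KEY, balance INT);
INSERT INTO accounts VALUES (1, 100), (2, 0);

-- A constant DEFAULT can never fail during backfill, so it is the
-- kind of expression a restriction could plausibly allow:
ALTER TABLE accounts ADD COLUMN flagged BOOL DEFAULT false;

-- A computed column referencing row data is much harder to prove
-- safe: 1000 / balance fails for the row with balance = 0, but only
-- when the backfill reaches that row, after COMMIT.
ALTER TABLE accounts
    ADD COLUMN inv INT AS (1000 / balance) STORED;
```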
Are unique constraints/indexes included in the category of "constraints"? It's hard to see what the equivalent of an "unvalidated"/
In postgres, a
I guess the other question is whether we've considered supporting DDL statements (or mixed DDL/DML transactions) that do require the table to be offline, in order to solve some of these problems. edit 2: I've been assuming throughout that unique constraints are always backed by unique indexes, and uniqueness violations manifest as key collisions in the index; if I'm wrong, please correct me.
I think it's still somewhat useful to keep checking for uniqueness violations on some subset. At least things are not getting worse. That said, I'm not as opinionated on what to do with the index once the backfill job has failed. More important is being very deliberate and explicit about separating long-running O(# rows) operations into a sync and an async phase, and then ensuring that the sync phase is fully transactional and predictable (and documenting that). Unique constraints don't always need to be enforced via index key collisions. See #41535 for an example where they're not.
It seems to me that it's clear what is going on: if it's unvalidated, that means it's inactive and there can still be duplicate rows. Also, UNIQUE can be handled via the constraint system (the SQL standard and pg's dialect have all unique constraints receive a name in the constraint table, so we can attach state to it).
But the guarantee provided by "unvalidated" for FK/check constraints is stronger: it means that the constraint is enforced for all writes going forward. This is not the case for unvalidated/invalid unique constraints with a partially built index, because we can't guarantee that a new row is actually globally unique in the table, even if we can guarantee that the row is unique across all rows updated since the constraint was added. This distinction seems important to me, but maybe it's unimportant for users in practice, as long as we flag the constraint as unvalidated/invalid. That's a relatively minor point. I'm mostly wondering about how well this plan (having the transaction succeed if the synchronous portion of the schema change succeeds, and alerting failures during the async phase via another channel) will actually work for users' most common use cases, especially when they're using migration tools and ORMs. Will they need to manually monitor their migration to see if it succeeded?
42063: sql: new opt-in option `strict_ddl_atomicity` r=fqazi,chrisseto,dikshant a=knz

Informs #42061. Informs #87236. Fixes #108737.

Previously, CockroachDB always accepted DDL inside explicit txns (BEGIN...COMMIT) and would issue some of these DDL statements as asynchronous jobs after the rest of the txn was committed successfully. If that DDL subsequently failed, a client would get error `XXA00` to signal this error. However, at that point the txn would be both partially committed and partially rolled back: in effect, an atomicity violation.

This was considered acceptable because in practice most clients using DDL inside BEGIN...COMMIT do not, in fact, expect ACID semantics in those cases. So it was considered OK to not deliver ACID semantics in certain cases. Except that we now have clients that do want us to "do the right thing" and never allow a situation that _could_ yield atomicity violations.

This patch makes a step in this direction by introducing a new session variable `strict_ddl_atomicity`, which defaults to `false` for compatibility with clients of CockroachDB 23.1 and previous versions. When this setting is enabled / set to `true`, any DDL that would otherwise be processed asynchronously and risk triggering error `XXA00` is now fully disallowed, with the following error:

```
pq: unimplemented: cannot run this DDL statement inside BEGIN..COMMIT as its atomicity cannot be guaranteed
HINT: You have attempted to use a feature that is not yet implemented.
See: https://go.crdb.dev/issue-v/42061
--
You can set the session variable 'strict_ddl_atomicity' to false if you are willing to accept atomicity violations.
```

Other DDL statements without async processing are still allowed without triggering that error. Examples of rejected DDL include, but are not limited to: CREATE INDEX, ALTER ADD CONSTRAINT, ADD COLUMN with DEFAULT, DROP INDEX. Examples of DDL that's still processed: RENAME COLUMN, RENAME TABLE, ALTER INDEX CONFIGURE ZONE.

Release note (sql change): a SQL client can now request strict atomicity for mixed DDL/DML transactions with the new session variable `strict_ddl_atomicity`, which defaults to `false`. When this variable is set to `true`, CockroachDB will refuse to process those specific DDL statements inside BEGIN...COMMIT for which it cannot guarantee atomic processing (other DDL statements are still allowed). Note that schema changes implicit in certain operations (e.g. IMPORT) are not protected via the new mechanism and can still fail with `XXA00` errors.

Epic: CRDB-28893

Co-authored-by: Raphael 'kena' Poss <[email protected]>
Related to #26508, #24785.
Filing this in reaction to a recent thread with @mberhault
Today, CockroachDB will accept DDL statements inside explicit txns (BEGIN...COMMIT), but process some of them only after the non-DDL statements have been committed successfully.

If the DDL subsequently fails, the transaction is then both partially committed and partially rolled back, i.e. an atomicity violation.

If/when this happens, it is currently reported to the client upon COMMIT via the special error code XXA00 (TransactionCommittedWithSchemaChangeFailure).
There is a discussion to be had about whether the situation should be allowed to occur at all, but until that is resolved this issue will exist so it can be linked from the source code and docs for explanatory purposes.
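For reference, a minimal sketch of how the violation can arise (table, column, and data are invented for illustration; the unique-index backfill runs as an async job after COMMIT and fails on the pre-existing duplicates):

```sql
CREATE TABLE t (id INT PRIMARY KEY, v INT);
INSERT INTO t VALUES (1, 7), (2, 7);     -- duplicate values of v

BEGIN;
INSERT INTO t VALUES (3, 8);             -- DML part of the txn
CREATE UNIQUE INDEX t_v_key ON t (v);    -- DDL part, needs an async backfill
COMMIT;
-- COMMIT reports error XXA00 (TransactionCommittedWithSchemaChangeFailure):
-- the INSERT was committed, but the index creation was rolled back.
```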
Jira issue: CRDB-5394