copy: copy fails with transaction error #90656
Comments
Assigning to myself just to triage...
Some background here: in 22.1, COPY commands were split into 100-row chunks, each of which got its own transaction; in 22.2, COPY is atomic, so it all happens in one big transaction. If errors like this can happen, we might want to retry automatically. The concern is that this kind of error might become more common with an atomic COPY, so we might want to try to work out a fix and get ahead of that.
Do you mean 22.1?
I'm not certain on this one. The lease transfer acquires latches over the keyspace for each range. However, the specifics of whether a large COPY in a txn will be forced to abort or could just be bumped aren't clear to me. cc @arulajmani do you know why this is the case?
Yes sorry, typo
If it helps in any way, these COPY chunks contain large JSONB blobs (I think up to 10MB).
This happens because we're unable to create the transaction record for the COPY statement. The transaction record is created on the first range the statement writes to, and this particular error indicates that record creation is failing because that range had a lease change, and the new lease started after the transaction's start timestamp.
Yeah, we shouldn't see this if we retry the transaction. If we're seeing this repeatedly, it might be worth digging into what is causing these lease transfers.
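For reference, a minimal client-side retry sketch (not from this issue): it assumes the pgx driver, a placeholder table `t` with columns `id` and `blob`, and that a failed atomic COPY leaves no rows behind, so re-running the whole statement is safe. CockroachDB surfaces retriable transaction errors like this one to clients as SQLSTATE 40001.

```go
package copyrepro

import (
	"context"
	"errors"
	"fmt"
	"log"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgconn"
)

// copyWithRetry re-issues the COPY when the server returns a retriable
// transaction error (SQLSTATE 40001), up to a fixed number of attempts.
func copyWithRetry(ctx context.Context, conn *pgx.Conn, rows [][]any) error {
	const maxAttempts = 5
	for attempt := 1; ; attempt++ {
		// CopyFrom issues COPY ... FROM STDIN under the hood.
		_, err := conn.CopyFrom(ctx, pgx.Identifier{"t"},
			[]string{"id", "blob"}, pgx.CopyFromRows(rows))
		if err == nil {
			return nil
		}
		var pgErr *pgconn.PgError
		// Retriable transaction errors are reported as SQLSTATE 40001.
		if errors.As(err, &pgErr) && pgErr.Code == "40001" && attempt < maxAttempts {
			log.Printf("retrying COPY after attempt %d: %v", attempt, err)
			continue
		}
		return fmt.Errorf("COPY failed after %d attempt(s): %w", attempt, err)
	}
}
```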
Yeah, I don't think this stuff has changed at all recently.
This seems to happen fairly consistently with a certain user we have; however, I'm unable to reproduce it with a 3-node roachprod cluster on v22.1.10 trying these big COPYs like they are. Maybe it's worth diving in. For posterity, the script for trying to reproduce the issue: https://github.com/otan-cockroach/big-copy/blob/main/copy.go
Ooh, actually I was inserting empty JSON blobs. I've fixed the script above slightly; now the COPY takes forever, but I am still unable to reproduce it.
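For anyone trying to reproduce with non-empty payloads, here is a hypothetical generator that pads a JSON document to roughly a target size (the thread mentions blobs of up to 10MB). The shape of the real JSONB data is unknown, so this is only a stand-in.

```go
package copyrepro

import (
	"encoding/json"
	"math/rand"
	"strings"
)

// bigJSON builds a JSON document padded to roughly targetBytes, for use
// as a non-empty JSONB payload in a repro attempt.
func bigJSON(targetBytes int) (string, error) {
	letters := []rune("abcdefghijklmnopqrstuvwxyz ")
	var sb strings.Builder
	for sb.Len() < targetBytes {
		sb.WriteRune(letters[rand.Intn(len(letters))])
	}
	doc := map[string]any{
		"id":      rand.Int63(),
		"payload": sb.String(),
	}
	b, err := json.Marshal(doc)
	return string(b), err
}
```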
I have the pgdump now, will try my hand at reproducing... |
I can hit this semi-reliably in v22.2 on a multi-node cluster.
[Quoted log excerpts omitted: the line before has …, raising …, coinciding with …]
As noted above, this happens when the transaction record creation races with a lease transfer of the range anchoring the txn record. I suppose it's possible that this could also happen in other cases, e.g. where the txn record is removed because of a txn abort or some such, but that's the primary cause. If this happens very frequently, consider disabling e.g. load-based lease rebalancing or some such and see if that helps.
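One way to experiment with that suggestion is to turn off load-based lease rebalancing while running the COPY workload. This is only a sketch: the cluster setting name below is my assumption about the relevant knob, so verify it against your CockroachDB version before relying on it.

```go
package copyrepro

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// disableLoadBasedLeaseRebalancing turns off load-based lease rebalancing
// for the whole cluster. Setting name is assumed, not taken from this issue.
func disableLoadBasedLeaseRebalancing(ctx context.Context, conn *pgx.Conn) error {
	_, err := conn.Exec(ctx,
		"SET CLUSTER SETTING kv.allocator.load_based_lease_rebalancing.enabled = false")
	return err
}
```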
Either an async intent write failed, or someone aborted the transaction and removed the intent.
A contending transaction aborted this transaction.
Indicates that we're writing a lot of data in the transaction, so we're switching from point intent resolution to ranged intent resolution. Could be pretty bad prior to 21.2, when we implemented separated intents, but shouldn't be a huge problem anymore.
Large, long-running transactions are problematic, and we regularly discourage our users from using them because of the issues that come with them, instead recommending they break up their writes into batches (like COPY did in 22.1). The longer the transaction runs, and the larger the number of keys it touches, the more likely it is to run into conflicts with other transactions. From what I see here, it seems likely that the transaction is getting aborted by a different transaction that it contends with. This could be stuff like backups, which will start pushing the long-running transaction, possibly aborting it. I also see from the above that this transaction is running with a fairly low priority.
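A sketch of the batching recommendation above, in the spirit of the 22.1 behavior: split the rows into fixed-size chunks and issue a separate COPY for each, so no single transaction grows too large. Table and column names are placeholders, and this gives up the atomicity of a single COPY.

```go
package copyrepro

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// copyInChunks issues one COPY per chunk of rows; each COPY runs in its
// own implicit transaction.
func copyInChunks(ctx context.Context, conn *pgx.Conn, rows [][]any, chunkSize int) error {
	for start := 0; start < len(rows); start += chunkSize {
		end := start + chunkSize
		if end > len(rows) {
			end = len(rows)
		}
		if _, err := conn.CopyFrom(ctx, pgx.Identifier{"t"},
			[]string{"id", "blob"}, pgx.CopyFromRows(rows[start:end])); err != nil {
			return err
		}
	}
	return nil
}
```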
Since COPY is inserting into a temp table, there shouldn't be any contention, right @otan? I wonder if this could be caused by auto-stats creation. Could we try this again with auto stats turned off on the COPY temp table (probably we'd just have to disable it on the whole cluster, but maybe it could be done at the schema level)?
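A way to test that theory is to disable automatic statistics collection cluster-wide for the duration of the experiment. The setting name below is my best understanding of the relevant knob rather than something stated in this issue; the table-level variant the comment asks about is not shown here.

```go
package copyrepro

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// disableAutoStats turns off automatic statistics collection cluster-wide.
// Setting name is assumed, not taken from this issue.
func disableAutoStats(ctx context.Context, conn *pgx.Conn) error {
	_, err := conn.Exec(ctx,
		"SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false")
	return err
}
```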
95275: sql: add retries on COPY under certain conditions r=rafiss,cucaroach a=otan

Release note (sql change): If `copy_from_retries_enabled` is set, COPY is now able to retry under certain safe circumstances, namely when `copy_from_atomic_enabled` is false, there is no transaction running the COPY, and the error returned is retriable. This prevents users who keep running into `TransactionRetryWithProtoRefreshError` from having issues.

Informs #90656

Release justification: feature-gated new feature critical for unblocking DMS

Co-authored-by: Oliver Tan <[email protected]>
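Based on that release note, a client could opt into the server-side retries roughly as follows. This is a sketch: the session setting names come from the PR text above, but exact availability and defaults depend on the CockroachDB version, and the table and columns are placeholders.

```go
package copyrepro

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// copyWithServerSideRetries enables the COPY retry behavior described in
// the PR above, then runs a COPY outside an explicit transaction (the
// retries only apply when COPY is not inside an explicit transaction).
func copyWithServerSideRetries(ctx context.Context, conn *pgx.Conn, rows [][]any) error {
	if _, err := conn.Exec(ctx, "SET copy_from_retries_enabled = true"); err != nil {
		return err
	}
	if _, err := conn.Exec(ctx, "SET copy_from_atomic_enabled = false"); err != nil {
		return err
	}
	_, err := conn.CopyFrom(ctx, pgx.Identifier{"t"},
		[]string{"id", "blob"}, pgx.CopyFromRows(rows))
	return err
}
```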
117840: kv: promote expiration-based lease to epoch without sequence number change r=erikgrinaker a=nvanbenschoten

Fixes #117630. Fixes #90656. Fixes #98553. Informs #61986. Informs #115191.

This commit updates the post-lease transfer promotion of expiration-based leases to epoch-based leases to not change the sequence number of the lease. This avoids invalidating all requests proposed under the original expiration-based lease, which can lead to `RETRY_ASYNC_WRITE_FAILURE` errors. The change accomplishes this by updating the `Lease.Equivalent` method to consider an expiration-based lease to be equivalent to an epoch-based lease that is held by the same replica and has the same start time. Doing so requires some care, because lease equivalency is checked below Raft and needs to remain deterministic across binary versions. This change requires a cluster version check, so it cannot be backported.

Release note (bug fix): Improved an interaction during range lease transfers which could previously cause `RETRY_ASYNC_WRITE_FAILURE` errors to be returned to clients.

117899: backupccl: skip `TestBackupRestoreAppend` under `deadlock` r=rail a=rickystewart

These tests are likely to time out.

Epic: CRDB-8308
Release note: None

117940: backupccl,sql: skip a couple more tests under duress r=rail a=rickystewart

These tests are all timing out. For the failures that seem suspect in some way, I have filed GitHub issues.

Epic: CRDB-8308
Release note: None

117950: copy: skip TestCopyFromRetries for now r=yuzefovich a=yuzefovich

We recently expanded this test and it became flaky. Skip it until we stabilize it.

Informs: #117912.
Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
ERROR: restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN): "unnamed" meta={id=62b88940 key=REDACTED pri=0.00316934 epo=0 ts=1666717645.181476639,1 min=1666717645.179767679,0 seq=101} lock=true stat=PENDING rts=1666717645.181476639,1 wto=false gul=1666717645.679767679,0
HINT: See: https://www.cockroachlabs.com/docs/dev/transaction-retry-error-reference.html
abort_reason_new_lease_prevents_txn
Open questions:
Jira issue: CRDB-20873
Epic: CRDB-34178
gz#16766
gz#17451