
copy: copy fails with transaction error #90656

Closed
cucaroach opened this issue Oct 25, 2022 · 19 comments · Fixed by #117840
Labels
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
  • P-2: Issues/test failures with a fix SLA of 3 months
  • T-sql-queries: SQL Queries Team

Comments

@cucaroach
Contributor

cucaroach commented Oct 25, 2022

ERROR:
restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN):
"unnamed" meta={id=62b88940 key=REDACTED pri=0.00316934 epo=0 ts=1666717645.181476639,1 min=1666717645.179767679,0 seq=101} lock=true stat=PENDING rts=1666717645.181476639,1 wto=false gul=1666717645.679767679,0
HINT: See: https://www.cockroachlabs.com/docs/dev/transaction-retry-error-reference.html
abort_reason_new_lease_prevents_txn

Open questions:

  • How does this happen?
  • Can we fix it with a retry?
  • Does it happen in 22.2?

Jira issue: CRDB-20873

Epic CRDB-34178

gz#16766

gz#17451

@cucaroach cucaroach added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 25, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Oct 25, 2022
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Oct 25, 2022
@cucaroach cucaroach self-assigned this Oct 25, 2022
@cucaroach
Contributor Author

Assigning to myself just to triage...

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Oct 28, 2022
@cucaroach
Contributor Author

cucaroach commented Nov 1, 2022

Some background: in 22.1, COPY commands were split into 100-row chunks, each of which got its own transaction; in 22.2, COPY is atomic, so it all happens in one big transaction. If errors like this can happen, we might want to retry automatically. The concern is that this kind of error might become more common with an atomic COPY, so we might want to work out a fix and get ahead of that.
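
To make the tradeoff concrete, here is a minimal client-side sketch (not CockroachDB's server-side implementation) of the 22.1-style behavior, assuming the pgx driver and a hypothetical table t(a INT, b JSONB): rows are sent in 100-row chunks, each COPY runs in its own implicit transaction, and a retryable (SQLSTATE 40001) error only forces the current chunk to be re-sent. The cost is that the overall load is no longer atomic.

```go
package main

import (
	"context"
	"errors"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgconn"
)

// copyInChunks mimics the 22.1-era behavior from the client side: rows are
// sent in fixed-size chunks, each COPY runs in its own implicit transaction,
// and a retryable error only forces the current chunk to be re-sent.
func copyInChunks(ctx context.Context, conn *pgx.Conn, rows [][]any, chunkSize int) error {
	for start := 0; start < len(rows); start += chunkSize {
		end := start + chunkSize
		if end > len(rows) {
			end = len(rows)
		}
		chunk := rows[start:end]
		for {
			_, err := conn.CopyFrom(ctx, pgx.Identifier{"t"}, []string{"a", "b"}, pgx.CopyFromRows(chunk))
			if err == nil {
				break
			}
			var pgErr *pgconn.PgError
			if errors.As(err, &pgErr) && pgErr.Code == "40001" {
				continue // retryable error (e.g. TransactionRetryWithProtoRefreshError): re-send this chunk
			}
			return err
		}
	}
	return nil
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Placeholder data for the hypothetical table t(a INT, b JSONB).
	rows := [][]any{{1, `{"k": "v"}`}, {2, `{"k": "w"}`}}
	if err := copyInChunks(ctx, conn, rows, 100); err != nil {
		log.Fatal(err)
	}
}
```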

@kvoli
Collaborator

kvoli commented Nov 1, 2022

22.2 COPY commands were split into 100 row chunks

Do you mean 22.1?

How does this happen?

I'm not certain about this one. The lease transfer acquires latches over the keyspace for each range. However, the specifics of whether a large COPY in a txn will be forced to abort or could just be bumped aren't clear to me.

cc @arulajmani do you know why this is the case?

@cucaroach
Contributor Author

Do you mean 22.1?

Yes sorry, typo

@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 1, 2022
@otan
Contributor

otan commented Nov 1, 2022

If it helps in any way, these COPY chunks contain large JSONB blobs (I think up to 10 MB).

@arulajmani
Collaborator

How does this happen?

This happens because a lease transfer prevents us from creating the transaction record for the COPY statement. The transaction record is created on the first range the statement writes to. This particular error indicates that creating the transaction record failed because the range it would have been created on had a lease change, and the new lease started after the transaction's start timestamp.

Can we fix it with a retry?

Yeah, we shouldn't see this if we retry the transaction. If we're seeing this repeatedly, it might be worth digging into what is causing these lease transfers.

Does it happen in 22.2?

Yeah, I don't think this stuff has changed at all recently.
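
A conceptual sketch of the race described above, using hypothetical types rather than CockroachDB's actual lease or transaction code: the abort fires when the new lease on the range that would hold the transaction record starts after the transaction's start timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// Conceptual types only; these are not CockroachDB's actual structures.
type txn struct {
	startTS time.Time
}

type lease struct {
	start time.Time
}

// canCreateTxnRecord illustrates the race: if the range that would anchor the
// transaction record got a new lease whose start time is later than the
// transaction's start timestamp, the record cannot be created and the
// transaction is aborted.
func canCreateTxnRecord(t txn, currentLease lease) error {
	if currentLease.start.After(t.startTS) {
		return fmt.Errorf("TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN)")
	}
	return nil
}

func main() {
	txnStart := time.Now()
	// A lease transfer lands 50ms after the transaction started.
	newLease := lease{start: txnStart.Add(50 * time.Millisecond)}
	fmt.Println(canCreateTxnRecord(txn{startTS: txnStart}, newLease))
}
```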

@otan
Contributor

otan commented Nov 1, 2022

If we're seeing this repeatedly, it might be worth digging into what is causing these lease transfers.

This seems to happen fairly consistently with a certain user we have; however, I'm unable to reproduce this with a 3-node roachprod cluster on v22.1.10 trying these big COPYs like they are. Maybe it's worth diving in.

For posterity, the script I'm using to try to reproduce the issue: https://github.com/otan-cockroach/big-copy/blob/main/copy.go
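
For readers without access to that script, a rough self-contained sketch of the kind of load being attempted is below; the table name, schema, row count, and blob size are illustrative, and the real copy.go may differ.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Hypothetical target table; the real script's schema may differ.
	if _, err := conn.Exec(ctx, `CREATE TABLE IF NOT EXISTS big_copy (id INT PRIMARY KEY, payload JSONB)`); err != nil {
		log.Fatal(err)
	}

	// Build a ~1 MB JSON value; raise the repeat count toward 10 MB to match
	// the blob sizes mentioned earlier in the thread.
	blob := fmt.Sprintf(`{"filler": %q}`, strings.Repeat("x", 1<<20))

	// Stream many rows through a single COPY so the whole load runs as one
	// big (atomic) transaction, as in 22.2.
	rows := make([][]any, 0, 5000)
	for i := 0; i < 5000; i++ {
		rows = append(rows, []any{i, blob})
	}
	n, err := conn.CopyFrom(ctx, pgx.Identifier{"big_copy"}, []string{"id", "payload"}, pgx.CopyFromRows(rows))
	if err != nil {
		log.Fatalf("COPY failed after %d rows: %v", n, err)
	}
	log.Printf("copied %d rows", n)
}
```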

@otan
Contributor

otan commented Nov 1, 2022

Ooh, actually I was inserting empty JSON blobs. I've fixed the script above slightly; now the COPY takes forever, but I am still unable to reproduce it.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Nov 4, 2022
@cucaroach
Contributor Author

I have the pgdump now, will try my hand at reproducing...

@exalate-issue-sync exalate-issue-sync bot added T-sql-queries SQL Queries Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Nov 17, 2022
@otan
Contributor

otan commented Jan 11, 2023

I can hit this semi-reliably in v22.2 on a multi-node cluster:

TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE - missing intent on: /Table/116/4/-9182806513006833952/0): "unnamed" meta={id=9dd74a0d key=/Table/116/4/-9183063663728498211/0 pri=0.02945731 epo=0 ts=1673403903.445208120,2 min=1673403898.070322218,0 seq=90512} lock=true stat=PENDING rts=1673403903.445208120,2 wto=false gul=1673403898.570322218,0

The line before has:

W230111 03:18:14.202402 6157 kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:633 ⋮ [n1,client=107.21.216.116:42858,user=‹meuser›] 360  a transaction has hit the intent tracking limit (kv.transaction.max_intents_bytes); is it a bulk operation? Intent cleanup will be slower. txn: "unnamed" meta={id=6885d22e key=/Table/130/4/‹-9207216068494971191›/‹0› pri=0.03283671 epo=1 ts=1673407090.281118737,2 min=1673407087.633802747,0 seq=0} lock=true stat=PENDING rts=1673407090.281118737,2 wto=false gul=1673407088.133802747,0 ba: ‹1 EndTxn›

Raising kv.transaction.max_intents_bytes gives me:

TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "unnamed" meta={id=b8785fb0 key=/Table/133/4/-9215285449855244736/0 pri=0.01049782 epo=0 ts=1673407288.634694165,1 min=1673407288.572750695,0 seq=300} lock=true stat=PENDING rts=1673407288.634694165,1 wto=false gul=1673407289.072750695,0
E230111 03:21:29.927159 13453 9@sql/conn_executor.go:2394 ⋮ [n1,client=107.21.216.116:52966,user=‹meuser›] 592 +Error types: (1) *roachpb.TransactionRetryWithProtoRefreshError

Coinciding with:

I230111 03:21:28.483186 13523 kv/kvserver/replica_command.go:416 ⋮ [n1,split,s1,r271/1:‹/Table/13{3/4-4/2}›] 590  initiating a split of this range at key /Table/134 [r277] (‹span config›)‹›
I230111 03:21:28.683159 13803 kv/kvserver/replica_command.go:416 ⋮ [n1,split,s1,r273/1:‹/Table/13{4/4-5/2}›] 591  initiating a split of this range at key /Table/135 [r278] (‹span config›)‹›
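
For reference, a small sketch of inspecting and raising the setting named in that warning (pgx driver, admin privileges assumed; the 16 MiB value is only an example). Raising it changes when the intent-tracking warning fires, but it does not by itself make a huge atomic COPY immune to retry errors.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Inspect the intent-tracking budget the warning above refers to.
	var current any
	if err := conn.QueryRow(ctx, `SHOW CLUSTER SETTING kv.transaction.max_intents_bytes`).Scan(&current); err != nil {
		log.Fatal(err)
	}
	log.Printf("kv.transaction.max_intents_bytes = %v", current)

	// Raise it; 16 MiB expressed in bytes is only an illustrative value.
	if _, err := conn.Exec(ctx, `SET CLUSTER SETTING kv.transaction.max_intents_bytes = 16777216`); err != nil {
		log.Fatal(err)
	}
}
```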

@erikgrinaker
Contributor

erikgrinaker commented Jan 11, 2023

ABORT_REASON_NEW_LEASE_PREVENTS_TXN

As noted above, this happens when the transaction record creation races with a lease transfer of the range anchoring the txn record. I suppose it's possible that this could also happen in other cases, e.g. where the txn record is removed because of a txn abort, but the lease-transfer race is the primary cause. If this happens very frequently, consider disabling load-based lease rebalancing (or similar) and see if that helps.

RETRY_ASYNC_WRITE_FAILURE

Either an async intent write failed, or someone aborted the transaction and removed the intent.

ABORT_REASON_CLIENT_REJECT

A contending transaction aborted this transaction.

a transaction has hit the intent tracking limit (kv.transaction.max_intents_bytes); is it a bulk operation?

Indicates that we're writing a lot of data in the transaction, so we're switching from point intent resolution to ranged intent resolution. Could be pretty bad prior to 21.2, when we implemented separated intents, but shouldn't be a huge problem anymore.

in 22.2 COPY is atomic so it all happens in one big transaction

Large, long-running transactions are problematic, and we regularly discourage our users from using them because of the issues that come with them, instead recommending they break up their writes into batches (like COPY did in 22.1).

The longer the transaction runs, and the larger the number of keys it touches, the more likely it is to run into conflicts with other transactions. From what I see here, it seems likely that the transaction is getting aborted by a different transaction that it contends with. This could be stuff like backups, which will start pushing the long-running transaction, possibly aborting it.

I see from the above that this transaction is running with a fairly low priority (e.g. pri=0.02945731). You can consider running it as a high-priority transaction, which will reduce the chance of it getting aborted by a contending transaction, but note that this can then end up blocking other transactions until it completes (so it may e.g. prevent backups).
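
A minimal sketch of the high-priority approach, assuming the pgx driver, a hypothetical table t(a INT, b JSONB), and that running COPY inside an explicit transaction is acceptable for this workload:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	tx, err := conn.Begin(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback(ctx) // no-op once the transaction commits

	// Bump this transaction's priority so contending transactions are less
	// likely to abort it; as noted above, it may then block those other
	// transactions (including backups) until the COPY finishes.
	if _, err := tx.Exec(ctx, `SET TRANSACTION PRIORITY HIGH`); err != nil {
		log.Fatal(err)
	}

	// Placeholder data for the hypothetical table t(a INT, b JSONB).
	rows := [][]any{{1, `{"k": "v"}`}}
	if _, err := tx.CopyFrom(ctx, pgx.Identifier{"t"}, []string{"a", "b"}, pgx.CopyFromRows(rows)); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}
```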

@cucaroach
Contributor Author

Since COPY is inserting into a temp table, there shouldn't be any contention, right @otan? I wonder if this could be caused by auto-stats creation; could we try this again with auto stats turned off on the COPY temp table? (We probably just have to disable it on the whole cluster, but maybe it could be done at the schema level?)
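
A sketch of the two knobs that seem relevant here: the cluster-wide setting sql.stats.automatic_collection.enabled and the per-table storage parameter sql_stats_automatic_collection_enabled. The table name is a placeholder, and whether the per-table parameter is available (and applies to temp tables) depends on the version.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Cluster-wide switch for automatic statistics collection.
	if _, err := conn.Exec(ctx, `SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false`); err != nil {
		log.Fatal(err)
	}

	// Narrower per-table alternative via a storage parameter; "t" is a
	// placeholder table name.
	if _, err := conn.Exec(ctx, `ALTER TABLE t SET (sql_stats_automatic_collection_enabled = false)`); err != nil {
		log.Fatal(err)
	}
}
```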

craig bot pushed a commit that referenced this issue Jan 19, 2023
95275: sql: add retries on COPY under certain conditions r=rafiss,cucaroach a=otan

Release note (sql change): If `copy_from_retries_enabled` is set, COPY is now able to retry under certain safe circumstances, namely when `copy_from_atomic_enabled` is false, there is no transaction running the COPY, and the error returned is retriable. This prevents users who keep running into `TransactionRetryWithProtoRefreshError` from having issues.

Informs #90656

Release justification: feature gated new feature critical for unblocking DMS

Co-authored-by: Oliver Tan <[email protected]>
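
Assuming a build that includes this change, a minimal client-side sketch of opting in: both session variables come from the release note above, and the COPY must run outside an explicit transaction for the retry path to apply.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Opt in to the behavior described in the release note above: a
	// non-atomic COPY with retries, issued outside an explicit transaction.
	if _, err := conn.Exec(ctx, `SET copy_from_atomic_enabled = false`); err != nil {
		log.Fatal(err)
	}
	if _, err := conn.Exec(ctx, `SET copy_from_retries_enabled = true`); err != nil {
		log.Fatal(err)
	}

	// Placeholder data for the hypothetical table t(a INT, b JSONB).
	rows := [][]any{{1, `{"k": "v"}`}}
	if _, err := conn.CopyFrom(ctx, pgx.Identifier{"t"}, []string{"a", "b"}, pgx.CopyFromRows(rows)); err != nil {
		log.Fatal(err)
	}
}
```
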
@nickelnickel nickelnickel added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 7, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 7, 2023
@michae2 michae2 moved this from Triage to 24.1 Release in SQL Queries Nov 21, 2023
@michae2 michae2 added the P-2 Issues/test failures with a fix SLA of 3 months label Nov 21, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 27, 2023
@yuzefovich yuzefovich moved this from 24.2 Release to 24.1 Release in SQL Queries Nov 30, 2023
@kvoli kvoli removed their assignment Dec 1, 2023
@yuzefovich yuzefovich removed their assignment Dec 1, 2023
craig bot pushed a commit that referenced this issue Jan 19, 2024
117840: kv: promote expiration-based lease to epoch without sequence number change r=erikgrinaker a=nvanbenschoten

Fixes #117630.
Fixes #90656.
Fixes #98553.

Informs #61986.
Informs #115191.

This commit updates the post-lease transfer promotion of expiration-based leases to epoch-based leases to not change the sequence number of the lease. This avoids invalidating all requests proposed under the original expiration-based lease, which can lead to `RETRY_ASYNC_WRITE_FAILURE` errors.

The change accomplishes this by updating the `Lease.Equivalent` method to consider an expiration-based lease to be equivalent to an epoch-based lease that is held by the same replica and has the same start time. Doing so requires some care, because lease equivalency is checked below Raft and needs to remain deterministic across binary versions.

This change requires a cluster version check, so it cannot be backported.

Release note (bug fix): Improved an interaction during range lease transfers which could previously cause `RETRY_ASYNC_WRITE_FAILURE` errors to be returned to clients.
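
A conceptual illustration of the equivalence rule described in this commit message, using hypothetical types rather than the actual Lease proto or Equivalent method: an expiration-based lease promoted to an epoch-based lease held by the same replica with the same start time is treated as equivalent, so the sequence number does not change and in-flight proposals are not invalidated.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified types; not the actual Lease proto or the real
// Equivalent method.
type leaseKind int

const (
	expirationBased leaseKind = iota
	epochBased
)

type leaseInfo struct {
	kind      leaseKind
	replicaID int
	start     time.Time
	sequence  int
}

// equivalent sketches the rule described above: an expiration-based lease
// promoted to an epoch-based lease held by the same replica with the same
// start time counts as equivalent, so the lease sequence number (and the
// requests proposed under the old lease) need not be invalidated.
func equivalent(prev, next leaseInfo) bool {
	sameHolderAndStart := prev.replicaID == next.replicaID && prev.start.Equal(next.start)
	if prev.kind == expirationBased && next.kind == epochBased {
		return sameHolderAndStart // promotion keeps the sequence number
	}
	return sameHolderAndStart && prev.kind == next.kind
}

func main() {
	start := time.Now()
	prev := leaseInfo{kind: expirationBased, replicaID: 3, start: start, sequence: 7}
	next := leaseInfo{kind: epochBased, replicaID: 3, start: start, sequence: 7}
	fmt.Println(equivalent(prev, next)) // true
}
```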

117899: backupccl: skip `TestBackupRestoreAppend` under `deadlock` r=rail a=rickystewart

These tests are likely to time out.

Epic: CRDB-8308
Release note: None

117940: backupccl,sql: skip a couple more tests under duress r=rail a=rickystewart

These tests are all timing out. For the failures that seem suspect in some way, I have filed GitHub issues.

Epic: CRDB-8308
Release note: None

117950: copy: skip TestCopyFromRetries for now r=yuzefovich a=yuzefovich

We recently expanded this test and it became flaky. Skip it until we stabilize it.

Informs: #117912.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 54ebbac Jan 19, 2024
@github-project-automation github-project-automation bot moved this from 24.1 Release to Done in SQL Queries Jan 19, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024