
copy: copy fails with transaction error #90656

Closed
cucaroach opened this issue Oct 25, 2022 · 19 comments · Fixed by #117840
Labels
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs
  • P-2: Issues/test failures with a fix SLA of 3 months
  • T-sql-queries: SQL Queries Team

Comments

@cucaroach
Contributor

cucaroach commented Oct 25, 2022

ERROR:
restart transaction: TransactionRetryWithProtoRefreshError: TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN):
"unnamed" meta={id=62b88940 key=REDACTED pri=0.00316934 epo=0 ts=1666717645.181476639,1 min=1666717645.179767679,0 seq=101} lock=true stat=PENDING rts=1666717645.181476639,1 wto=false gul=1666717645.679767679,0
HINT: See: https://www.cockroachlabs.com/docs/dev/transaction-retry-error-reference.html
abort_reason_new_lease_prevents_txn

Open questions:

  • How does this happen?
  • Can we fix it with a retry?
  • Does it happen in 22.2?

Jira issue: CRDB-20873

Epic CRDB-34178

gz#16766

gz#17451

@cucaroach cucaroach added the C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. label Oct 25, 2022
@blathers-crl blathers-crl bot added the T-sql-queries SQL Queries Team label Oct 25, 2022
@blathers-crl blathers-crl bot added the T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) label Oct 25, 2022
@cucaroach cucaroach self-assigned this Oct 25, 2022
@cucaroach
Contributor Author

Assigning to myself just to triage...

@exalate-issue-sync exalate-issue-sync bot removed the T-sql-queries SQL Queries Team label Oct 28, 2022
@cucaroach
Contributor Author

cucaroach commented Nov 1, 2022

Some background: in 22.1, COPY commands were split into 100-row chunks, each of which got its own transaction; in 22.2, COPY is atomic, so it all happens in one big transaction. If errors like this can happen, we might want to retry automatically. The concern is that this kind of error might become more common with an atomic COPY, so we might want to work out a fix and get ahead of that.
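
To make the tradeoff concrete, here is a minimal client-side sketch (not CockroachDB's server-side implementation) of the 22.1-style behavior, assuming the pgx driver and a hypothetical table t(a INT, b JSONB): rows are sent in 100-row chunks, each COPY runs in its own implicit transaction, and a retryable (SQLSTATE 40001) error only forces the current chunk to be re-sent. The cost is that the overall load is no longer atomic.

```go
package main

import (
	"context"
	"errors"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
	"github.com/jackc/pgx/v5/pgconn"
)

// copyInChunks mimics the 22.1-era behavior from the client side: rows are
// sent in fixed-size chunks, each COPY runs in its own implicit transaction,
// and a retryable error only forces the current chunk to be re-sent.
func copyInChunks(ctx context.Context, conn *pgx.Conn, rows [][]any, chunkSize int) error {
	for start := 0; start < len(rows); start += chunkSize {
		end := start + chunkSize
		if end > len(rows) {
			end = len(rows)
		}
		chunk := rows[start:end]
		for {
			_, err := conn.CopyFrom(ctx, pgx.Identifier{"t"}, []string{"a", "b"}, pgx.CopyFromRows(chunk))
			if err == nil {
				break
			}
			var pgErr *pgconn.PgError
			if errors.As(err, &pgErr) && pgErr.Code == "40001" {
				continue // retryable error (e.g. TransactionRetryWithProtoRefreshError): re-send this chunk
			}
			return err
		}
	}
	return nil
}

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Placeholder data for the hypothetical table t(a INT, b JSONB).
	rows := [][]any{{1, `{"k": "v"}`}, {2, `{"k": "w"}`}}
	if err := copyInChunks(ctx, conn, rows, 100); err != nil {
		log.Fatal(err)
	}
}
```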

@kvoli
Collaborator

kvoli commented Nov 1, 2022

22.2 COPY commands were split into 100 row chunks

Do you mean 22.1?

How does this happen?

I'm not certain about this one. The lease transfer acquires latches over the keyspace for each range. However, the specifics of whether a large COPY in a txn will be forced to abort or could just be bumped aren't clear to me.

cc @arulajmani do you know why this is the case?

@cucaroach
Contributor Author

Do you mean 22.1?

Yes sorry, typo

@blathers-crl blathers-crl bot added the T-kv KV Team label Nov 1, 2022
@otan
Contributor

otan commented Nov 1, 2022

If it helps in any way, these COPY chunks contain large JSONB blobs (I think up to 10 MB).

@arulajmani
Collaborator

How does this happen?

This happens because a lease transfer prevents us from creating the transaction record for the COPY statement. The transaction record is created on the first range the statement writes to. This particular error indicates that creating the transaction record failed because the range it would have been created on had a lease change, and the new lease started after the transaction's start timestamp.

Can we fix it with a retry?

Yeah, we shouldn't see this if we retry the transaction. If we're seeing this repeatedly, it might be worth digging into what is causing these lease transfers.

Does it happen in 22.2?

Yeah, I don't think this stuff has changed at all recently.
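
A conceptual sketch of the race described above, using hypothetical types rather than CockroachDB's actual lease or transaction code: the abort fires when the new lease on the range that would hold the transaction record starts after the transaction's start timestamp.

```go
package main

import (
	"fmt"
	"time"
)

// Conceptual types only; these are not CockroachDB's actual structures.
type txn struct {
	startTS time.Time
}

type lease struct {
	start time.Time
}

// canCreateTxnRecord illustrates the race: if the range that would anchor the
// transaction record got a new lease whose start time is later than the
// transaction's start timestamp, the record cannot be created and the
// transaction is aborted.
func canCreateTxnRecord(t txn, currentLease lease) error {
	if currentLease.start.After(t.startTS) {
		return fmt.Errorf("TransactionAbortedError(ABORT_REASON_NEW_LEASE_PREVENTS_TXN)")
	}
	return nil
}

func main() {
	txnStart := time.Now()
	// A lease transfer lands 50ms after the transaction started.
	newLease := lease{start: txnStart.Add(50 * time.Millisecond)}
	fmt.Println(canCreateTxnRecord(txn{startTS: txnStart}, newLease))
}
```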

@otan
Contributor

otan commented Nov 1, 2022

If we're seeing this repeatedly, it might be worth digging into what is causing these lease transfers.

This seems to happen fairly consistently with a certain user we have; however, I'm unable to reproduce this with a 3-node roachprod cluster on v22.1.10 trying these big COPYs like they are. Maybe it's worth diving in.

For posterity, the script I'm using to try to reproduce the issue: https://github.com/otan-cockroach/big-copy/blob/main/copy.go
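
For readers without access to that script, a rough self-contained sketch of the kind of load being attempted is below; the table name, schema, row count, and blob size are illustrative, and the real copy.go may differ.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"strings"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Hypothetical target table; the real script's schema may differ.
	if _, err := conn.Exec(ctx, `CREATE TABLE IF NOT EXISTS big_copy (id INT PRIMARY KEY, payload JSONB)`); err != nil {
		log.Fatal(err)
	}

	// Build a ~1 MB JSON value; raise the repeat count toward 10 MB to match
	// the blob sizes mentioned earlier in the thread.
	blob := fmt.Sprintf(`{"filler": %q}`, strings.Repeat("x", 1<<20))

	// Stream many rows through a single COPY so the whole load runs as one
	// big (atomic) transaction, as in 22.2.
	rows := make([][]any, 0, 5000)
	for i := 0; i < 5000; i++ {
		rows = append(rows, []any{i, blob})
	}
	n, err := conn.CopyFrom(ctx, pgx.Identifier{"big_copy"}, []string{"id", "payload"}, pgx.CopyFromRows(rows))
	if err != nil {
		log.Fatalf("COPY failed after %d rows: %v", n, err)
	}
	log.Printf("copied %d rows", n)
}
```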

@otan
Contributor

otan commented Nov 1, 2022

Ooh, actually I was inserting empty JSON blobs. I've fixed the script above slightly; now the COPY takes forever, but I am still unable to reproduce it.

@exalate-issue-sync exalate-issue-sync bot removed the T-kv KV Team label Nov 4, 2022
@cucaroach
Contributor Author

I have the pgdump now, will try my hand at reproducing...

@exalate-issue-sync exalate-issue-sync bot added T-sql-queries SQL Queries Team and removed T-sql-foundations SQL Foundations Team (formerly SQL Schema + SQL Sessions) labels Nov 17, 2022
@otan
Contributor

otan commented Jan 11, 2023

I can hit this semi-reliably in v22.2 on a multi-node cluster:

TransactionRetryWithProtoRefreshError: TransactionRetryError: retry txn (RETRY_ASYNC_WRITE_FAILURE - missing intent on: /Table/116/4/-9182806513006833952/0): "unnamed" meta={id=9dd74a0d key=/Table/116/4/-9183063663728498211/0 pri=0.02945731 epo=0 ts=1673403903.445208120,2 min=1673403898.070322218,0 seq=90512} lock=true stat=PENDING rts=1673403903.445208120,2 wto=false gul=1673403898.570322218,0

The line before has:

W230111 03:18:14.202402 6157 kv/kvclient/kvcoord/txn_interceptor_pipeliner.go:633 ⋮ [n1,client=107.21.216.116:42858,user=‹meuser›] 360  a transaction has hit the intent tracking limit (kv.transaction.max_intents_bytes); is it a bulk operation? Intent cleanup will be slower. txn: "unnamed" meta={id=6885d22e key=/Table/130/4/‹-9207216068494971191›/‹0› pri=0.03283671 epo=1 ts=1673407090.281118737,2 min=1673407087.633802747,0 seq=0} lock=true stat=PENDING rts=1673407090.281118737,2 wto=false gul=1673407088.133802747,0 ba: ‹1 EndTxn›

Raising kv.transaction.max_intents_bytes gives me:

TransactionAbortedError(ABORT_REASON_CLIENT_REJECT): "unnamed" meta={id=b8785fb0 key=/Table/133/4/-9215285449855244736/0 pri=0.01049782 epo=0 ts=1673407288.634694165,1 min=1673407288.572750695,0 seq=300} lock=true stat=PENDING rts=1673407288.634694165,1 wto=false gul=1673407289.072750695,0
E230111 03:21:29.927159 13453 9@sql/conn_executor.go:2394 ⋮ [n1,client=107.21.216.116:52966,user=‹meuser›] 592 +Error types: (1) *roachpb.TransactionRetryWithProtoRefreshError

Coinciding with:

I230111 03:21:28.483186 13523 kv/kvserver/replica_command.go:416 ⋮ [n1,split,s1,r271/1:‹/Table/13{3/4-4/2}›] 590  initiating a split of this range at key /Table/134 [r277] (‹span config›)‹›
I230111 03:21:28.683159 13803 kv/kvserver/replica_command.go:416 ⋮ [n1,split,s1,r273/1:‹/Table/13{4/4-5/2}›] 591  initiating a split of this range at key /Table/135 [r278] (‹span config›)‹›
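
For reference, a small sketch of inspecting and raising the setting named in that warning (pgx driver, admin privileges assumed; the 16 MiB value is only an example). Raising it changes when the intent-tracking warning fires, but it does not by itself make a huge atomic COPY immune to retry errors.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Inspect the intent-tracking budget the warning above refers to.
	var current any
	if err := conn.QueryRow(ctx, `SHOW CLUSTER SETTING kv.transaction.max_intents_bytes`).Scan(&current); err != nil {
		log.Fatal(err)
	}
	log.Printf("kv.transaction.max_intents_bytes = %v", current)

	// Raise it; 16 MiB expressed in bytes is only an illustrative value.
	if _, err := conn.Exec(ctx, `SET CLUSTER SETTING kv.transaction.max_intents_bytes = 16777216`); err != nil {
		log.Fatal(err)
	}
}
```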

@erikgrinaker
Contributor

erikgrinaker commented Jan 11, 2023

ABORT_REASON_NEW_LEASE_PREVENTS_TXN

As noted above, this happens when the transaction record creation races with a lease transfer of the range anchoring the txn record. I suppose it's possible that this could also happen in other cases, e.g. where the txn record is removed because of a txn abort, but the lease-transfer race is the primary cause. If this happens very frequently, consider disabling load-based lease rebalancing (or similar) and see if that helps.

RETRY_ASYNC_WRITE_FAILURE

Either an async intent write failed, or someone aborted the transaction and removed the intent.

ABORT_REASON_CLIENT_REJECT

A contending transaction aborted this transaction.

a transaction has hit the intent tracking limit (kv.transaction.max_intents_bytes); is it a bulk operation?

Indicates that we're writing a lot of data in the transaction, so we're switching from point intent resolution to ranged intent resolution. Could be pretty bad prior to 21.2, when we implemented separated intents, but shouldn't be a huge problem anymore.

in 22.2 COPY is atomic so it all happens in one big transaction

Large, long-running transactions are problematic, and we regularly discourage our users from using them because of the issues that come with them, instead recommending they break up their writes into batches (like COPY did in 22.1).

The longer the transaction runs, and the larger the number of keys it touches, the more likely it is to run into conflicts with other transactions. From what I see here, it seems likely that the transaction is getting aborted by a different transaction that it contends with. This could be stuff like backups, which will start pushing the long-running transaction, possibly aborting it.

I see from the above that this transaction is running with a fairly low priority (e.g. pri=0.02945731). You can consider running it as a high-priority transaction, which will reduce the chance of it getting aborted by a contending transaction, but note that this can then end up blocking other transactions until it completes (so it may e.g. prevent backups).
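
A minimal sketch of the high-priority approach, assuming the pgx driver, a hypothetical table t(a INT, b JSONB), and that running COPY inside an explicit transaction is acceptable for this workload:

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	tx, err := conn.Begin(ctx)
	if err != nil {
		log.Fatal(err)
	}
	defer tx.Rollback(ctx) // no-op once the transaction commits

	// Bump this transaction's priority so contending transactions are less
	// likely to abort it; as noted above, it may then block those other
	// transactions (including backups) until the COPY finishes.
	if _, err := tx.Exec(ctx, `SET TRANSACTION PRIORITY HIGH`); err != nil {
		log.Fatal(err)
	}

	// Placeholder data for the hypothetical table t(a INT, b JSONB).
	rows := [][]any{{1, `{"k": "v"}`}}
	if _, err := tx.CopyFrom(ctx, pgx.Identifier{"t"}, []string{"a", "b"}, pgx.CopyFromRows(rows)); err != nil {
		log.Fatal(err)
	}
	if err := tx.Commit(ctx); err != nil {
		log.Fatal(err)
	}
}
```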

@cucaroach
Contributor Author

Since COPY is inserting into a temp table, there shouldn't be any contention, right @otan? I wonder if this could be caused by auto-stats creation; could we try this again with auto stats turned off on the COPY temp table? (We probably just have to disable it on the whole cluster, but maybe it could be done at the schema level?)
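
A sketch of the two knobs that seem relevant here: the cluster-wide setting sql.stats.automatic_collection.enabled and the per-table storage parameter sql_stats_automatic_collection_enabled. The table name is a placeholder, and whether the per-table parameter is available (and applies to temp tables) depends on the version.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Cluster-wide switch for automatic statistics collection.
	if _, err := conn.Exec(ctx, `SET CLUSTER SETTING sql.stats.automatic_collection.enabled = false`); err != nil {
		log.Fatal(err)
	}

	// Narrower per-table alternative via a storage parameter; "t" is a
	// placeholder table name.
	if _, err := conn.Exec(ctx, `ALTER TABLE t SET (sql_stats_automatic_collection_enabled = false)`); err != nil {
		log.Fatal(err)
	}
}
```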

craig bot pushed a commit that referenced this issue Jan 19, 2023
95275: sql: add retries on COPY under certain conditions r=rafiss,cucaroach a=otan

Release note (sql change): If `copy_from_retries_enabled` is set, COPY is now able to retry under certain safe circumstances, namely when `copy_from_atomic_enabled` is false, there is no transaction running the COPY, and the error returned is retriable. This prevents users who keep running into `TransactionRetryWithProtoRefreshError` from having issues.

Informs #90656

Release justification: feature gated new feature critical for unblocking DMS

Co-authored-by: Oliver Tan <[email protected]>
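
Assuming a build that includes this change, a minimal client-side sketch of opting in: both session variables come from the release note above, and the COPY must run outside an explicit transaction for the retry path to apply.

```go
package main

import (
	"context"
	"log"
	"os"

	"github.com/jackc/pgx/v5"
)

func main() {
	ctx := context.Background()
	conn, err := pgx.Connect(ctx, os.Getenv("DATABASE_URL"))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close(ctx)

	// Opt in to the behavior described in the release note above: a
	// non-atomic COPY with retries, issued outside an explicit transaction.
	if _, err := conn.Exec(ctx, `SET copy_from_atomic_enabled = false`); err != nil {
		log.Fatal(err)
	}
	if _, err := conn.Exec(ctx, `SET copy_from_retries_enabled = true`); err != nil {
		log.Fatal(err)
	}

	// Placeholder data for the hypothetical table t(a INT, b JSONB).
	rows := [][]any{{1, `{"k": "v"}`}}
	if _, err := conn.CopyFrom(ctx, pgx.Identifier{"t"}, []string{"a", "b"}, pgx.CopyFromRows(rows)); err != nil {
		log.Fatal(err)
	}
}
```
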
@nickelnickel nickelnickel added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Nov 7, 2023
@github-project-automation github-project-automation bot moved this to Triage in SQL Queries Nov 7, 2023
@michae2 michae2 moved this from Triage to 24.1 Release in SQL Queries Nov 21, 2023
@michae2 michae2 added the P-2 Issues/test failures with a fix SLA of 3 months label Nov 21, 2023
@mgartner mgartner moved this from 24.1 Release to 24.2 Release in SQL Queries Nov 27, 2023
@yuzefovich yuzefovich moved this from 24.2 Release to 24.1 Release in SQL Queries Nov 30, 2023
@kvoli kvoli removed their assignment Dec 1, 2023
@yuzefovich yuzefovich removed their assignment Dec 1, 2023
craig bot pushed a commit that referenced this issue Jan 19, 2024
117840: kv: promote expiration-based lease to epoch without sequence number change r=erikgrinaker a=nvanbenschoten

Fixes #117630.
Fixes #90656.
Fixes #98553.

Informs #61986.
Informs #115191.

This commit updates the post-lease transfer promotion of expiration-based leases to epoch-based leases to not change the sequence number of the lease. This avoids invalidating all requests proposed under the original expiration-based lease, which can lead to `RETRY_ASYNC_WRITE_FAILURE` errors.

The change accomplishes this by updating the `Lease.Equivalent` method to consider an expiration-based lease to be equivalent to an epoch-based lease that is held by the same replica and has the same start time. Doing so requires some care, because lease equivalency is checked below Raft and needs to remain deterministic across binary versions.

This change requires a cluster version check, so it cannot be backported.

Release note (bug fix): Improved an interaction during range lease transfers which could previously cause `RETRY_ASYNC_WRITE_FAILURE` errors to be returned to clients.
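
A conceptual illustration of the equivalence rule described in this commit message, using hypothetical types rather than the actual Lease proto or Equivalent method: an expiration-based lease promoted to an epoch-based lease held by the same replica with the same start time is treated as equivalent, so the sequence number does not change and in-flight proposals are not invalidated.

```go
package main

import (
	"fmt"
	"time"
)

// Hypothetical, simplified types; not the actual Lease proto or the real
// Equivalent method.
type leaseKind int

const (
	expirationBased leaseKind = iota
	epochBased
)

type leaseInfo struct {
	kind      leaseKind
	replicaID int
	start     time.Time
	sequence  int
}

// equivalent sketches the rule described above: an expiration-based lease
// promoted to an epoch-based lease held by the same replica with the same
// start time counts as equivalent, so the lease sequence number (and the
// requests proposed under the old lease) need not be invalidated.
func equivalent(prev, next leaseInfo) bool {
	sameHolderAndStart := prev.replicaID == next.replicaID && prev.start.Equal(next.start)
	if prev.kind == expirationBased && next.kind == epochBased {
		return sameHolderAndStart // promotion keeps the sequence number
	}
	return sameHolderAndStart && prev.kind == next.kind
}

func main() {
	start := time.Now()
	prev := leaseInfo{kind: expirationBased, replicaID: 3, start: start, sequence: 7}
	next := leaseInfo{kind: epochBased, replicaID: 3, start: start, sequence: 7}
	fmt.Println(equivalent(prev, next)) // true
}
```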

117899: backupccl: skip `TestBackupRestoreAppend` under `deadlock` r=rail a=rickystewart

These tests are likely to time out.

Epic: CRDB-8308
Release note: None

117940: backupccl,sql: skip a couple more tests under duress r=rail a=rickystewart

These tests are all timing out. For the failures that seem suspect in some way, I have filed GitHub issues.

Epic: CRDB-8308
Release note: None

117950: copy: skip TestCopyFromRetries for now r=yuzefovich a=yuzefovich

We recently expanded this test and it became flaky. Skip it until we stabilize it.

Informs: #117912.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Ricky Stewart <[email protected]>
Co-authored-by: Yahor Yuzefovich <[email protected]>
@craig craig bot closed this as completed in 54ebbac Jan 19, 2024
@github-project-automation github-project-automation bot moved this from 24.1 Release to Done in SQL Queries Jan 19, 2024
@github-project-automation github-project-automation bot moved this to Closed in KV Aug 28, 2024