
kvnemesis: add backoff to retry loop on txn aborts #63799

Merged

Conversation

nvanbenschoten
Member

This commit adds an exponential backoff to the transaction retry loop
when it detects that a transaction has been aborted. This was observed
to prevent thrashing under heavy read-write contention on global_read
ranges, which are added to kvnemesis in #63747. These ranges have an
added propensity to cause thrashing because every write to these ranges
gets bumped to a higher timestamp, so it is currently imperative that a
transaction be able to refresh its reads after writing to a global_read
range. If other transactions continue to invalidate a read-write
transaction's reads, it may never complete and will repeatedly abort
conflicting txns after detecting deadlocks. This commit prevents this
from stalling kvnemesis indefinitely.

I see two ways that we can improve this situation in the future.

  1. The first option is that we could introduce some form of pessimistic
    read-locking for long running read-write transactions, so that they can
    eventually prevent other transactions from invalidating their reads as
    they proceed to write to a global_reads range and get their write
    timestamp bumped. This ensures that when the long-running transaction
    returns to refresh (if it even needs to, depending on the durability of
    the read locks) its reads, the refresh will have a high likelihood of
    succeeding. This is discussed in #52768 (kv: pessimistic-mode, replicated read locks to enable large, long running transactions).
  2. The second option is to allow a transaction to re-write its existing
    intents in new epochs without being bumped by the closed timestamp. If a
    transaction only got bumped by the closed timestamp when writing new
    intents, then after a transaction was forced to retry, it would have a
    high likelihood of succeeding on its second epoch as long as it didn't
    write to a new set of keys. This is discussed in #63796 (kv: don't bump transaction on closed timestamp when re-writing existing intent).

@cockroach-teamcity
Member

This change is Reviewable

Contributor

@aayushshah15 left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)


pkg/kv/kvnemesis/applier.go, line 98 at r1 (raw file):

```go
var savedTxn *kv.Txn
txnErr := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	if savedTxn != nil && savedTxn.TestingCloneTxn().Epoch == 0 {
```

I think I'm missing something basic here, but I don't follow why we're checking the Epoch of the previous attempt. IIUC the way this is written, we would only backoff if the previous attempt was not a retry right?

@nvanbenschoten force-pushed the nvanbenschoten/kvnemesisRetry branch from 15187bb to af85ccf on April 23, 2021 06:28
Member Author

@nvanbenschoten left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)


pkg/kv/kvnemesis/applier.go, line 98 at r1 (raw file):

Previously, aayushshah15 (Aayush Shah) wrote…

I think I'm missing something basic here, but I don't follow why we're checking the Epoch of the previous attempt. IIUC the way this is written, we would only backoff if the previous attempt was not a retry right?

Good catch! You're not missing anything. This was supposed to check the Epoch of the new txn. I made the change and confirmed that it still solves the starvation.

Contributor

@aayushshah15 left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)

@nvanbenschoten
Member Author

bors r+

@craig
Contributor

craig bot commented Apr 23, 2021

Build succeeded.

@craig craig bot merged commit 74022e7 into cockroachdb:master Apr 23, 2021
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/kvnemesisRetry branch April 27, 2021 02:39