kvnemesis: add backoff to retry loop on txn aborts #63799
Conversation
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @nvanbenschoten)
pkg/kv/kvnemesis/applier.go, line 98 at r1 (raw file):
```go
var savedTxn *kv.Txn
txnErr := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
	if savedTxn != nil && savedTxn.TestingCloneTxn().Epoch == 0 {
```
I think I'm missing something basic here, but I don't follow why we're checking the `Epoch` of the previous attempt. IIUC, the way this is written, we would only back off if the previous attempt was not a retry, right?
15187bb to af85ccf
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aayushshah15)
pkg/kv/kvnemesis/applier.go, line 98 at r1 (raw file):
Previously, aayushshah15 (Aayush Shah) wrote…
I think I'm missing something basic here, but I don't follow why we're checking the `Epoch` of the previous attempt. IIUC, the way this is written, we would only back off if the previous attempt was not a retry, right?
Good catch! You're not missing anything. This was supposed to check the Epoch of the new txn. I made the change and confirmed that it still solves the starvation.
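For illustration, here is a minimal, self-contained sketch of the corrected backoff loop. It is not the exact kvnemesis code: `runWithAbortBackoff`, the `work` closure, and the 1ms–250ms backoff parameters are made up for this example (the real change presumably reuses CockroachDB's retry utilities), but the check matches the fix described above: back off when the closure is re-entered with a brand-new txn at epoch 0, i.e. the previous attempt's txn was aborted rather than retried at a higher epoch of the same txn.

```go
import (
	"context"
	"time"

	"github.com/cockroachdb/cockroach/pkg/kv"
)

// runWithAbortBackoff is a hypothetical helper sketching the fix: it wraps
// db.Txn and sleeps with a capped exponential backoff whenever the retry
// closure is re-invoked with a fresh transaction at epoch 0, which indicates
// the previous attempt's txn was aborted (an in-txn retry would instead bump
// the epoch of the same txn).
func runWithAbortBackoff(ctx context.Context, db *kv.DB, work func(context.Context, *kv.Txn) error) error {
	backoff := time.Millisecond
	const maxBackoff = 250 * time.Millisecond
	var savedTxn *kv.Txn
	return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		if savedTxn != nil && txn.TestingCloneTxn().Epoch == 0 {
			// A previous attempt ran, yet this attempt starts at epoch 0, so the
			// prior txn was aborted and replaced. Back off before trying again to
			// avoid thrashing under heavy contention.
			select {
			case <-time.After(backoff):
			case <-ctx.Done():
				return ctx.Err()
			}
			if backoff *= 2; backoff > maxBackoff {
				backoff = maxBackoff
			}
		}
		savedTxn = txn
		return work(ctx, txn)
	})
}
```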
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aayushshah15)
bors r+
Build succeeded.
This commit adds an exponential backoff to the transaction retry loop
when it detects that a transaction has been aborted. This was observed
to prevent thrashing under heavy read-write contention on `global_read`
ranges, which are added to kvnemesis in #63747. These ranges have an
added propensity to cause thrashing because every write to these ranges
gets bumped to a higher timestamp, so it is currently imperative that a
transaction be able to refresh its reads after writing to a global_read
range. If other transactions continue to invalidate a read-write
transaction's reads, it may never complete and will repeatedly abort
conflicting txns after detecting deadlocks. This commit prevents this
from stalling kvnemesis indefinitely.
I see two ways that we can improve this situation in the future.

1. The first option is that we could introduce some form of pessimistic
read-locking for long running read-write transactions, so that they can
eventually prevent other transactions from invalidating their reads as
they proceed to write to a global_reads range and get their write
timestamp bumped. This ensures that when the long-running transaction
returns to refresh its reads (if it even needs to, depending on the
durability of the read locks), the refresh will have a high likelihood of
succeeding. This is discussed in #52768 (kv: pessimistic-mode, replicated
read locks to enable large, long running transactions); a rough sketch of
the idea follows below.
2. The second option is to allow a transaction to re-write its existing
intents in new epochs without being bumped by the closed timestamp. If a
transaction only got bumped by the closed timestamp when writing new
intents, then after a transaction was forced to retry, it would have a
high likelihood of succeeding on its second epoch as long as it didn't
write to a new set of keys. This is discussed in #63796 (kv: don't bump
transaction on closed timestamp when re-writing existing intent).
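To make the first option above a bit more concrete, here is a rough, hypothetical sketch using the client-side locking read available around the time of this PR, `Txn.GetForUpdate(ctx, key)`. It is only an approximation: #52768 proposes shared, replicated read locks, whereas `GetForUpdate` acquires an exclusive, unreplicated lock, and the function name, keys, and `db` handle below are placeholders.

```go
import (
	"context"

	"github.com/cockroachdb/cockroach/pkg/kv"
)

// lockReadsThenWrite is a hypothetical illustration of option 1: take
// pessimistic locks on the keys a long-running txn reads so that conflicting
// writers block instead of invalidating its reads, making the eventual
// refresh (after the write timestamp gets bumped on a global_read range)
// more likely to succeed. The issue proposes shared, replicated read locks;
// GetForUpdate (exclusive, unreplicated) only approximates the idea.
func lockReadsThenWrite(ctx context.Context, db *kv.DB) error {
	return db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
		// Locking read on the key this transaction depends on.
		if _, err := txn.GetForUpdate(ctx, "read-key"); err != nil {
			return err
		}
		// ... long-running work based on the read ...
		return txn.Put(ctx, "write-key", "value")
	})
}
```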