Skip to content

Commit

Permalink
Merge #63799
Browse files Browse the repository at this point in the history
63799: kvnemesis: add backoff to retry loop on txn aborts r=nvanbenschoten a=nvanbenschoten

This commit adds an exponential backoff to the transaction retry loop
when it detects that a transaction has been aborted. This was observed
to prevent thrashing under heavy read-write contention on `global_read`
ranges, which are added to kvnemesis in #63747. These ranges have an
added propensity to cause thrashing because every write to these ranges
gets bumped to a higher timestamp, so it is currently imperative that a
transaction be able to refresh its reads after writing to a global_read
range. If other transactions continue to invalidate a read-write
transaction's reads, it may never complete and will repeatedly abort
conflicting txns after detecting deadlocks. This commit prevents this
from stalling kvnemesis indefinitely.

I see two ways that we can improve this situation in the future.
1. The first option is that we could introduce some form of pessimistic
   read-locking for long running read-write transactions, so that they can
   eventually prevent other transactions from invalidating their reads as
   they proceed to write to a global_reads range and get their write
   timestamp bumped. This ensures that when the long-running transaction
   returns to refresh (if it even needs to, depending on the durability of
   the read locks) its reads, the refresh will have a high likelihood of
   succeeding. This is discussed in #52768.
2. The second option is to allow a transaction to re-write its existing
   intents in new epochs without being bumped by the closed timestamp. If a
   transaction only got bumped by the closed timestamp when writing new
   intents, then after a transaction was forced to retry, it would have a
   high likelihood of succeeding on its second epoch as long as it didn't
   write to a new set of keys. This is discussed in #63796.

Co-authored-by: Nathan VanBenschoten <[email protected]>
  • Loading branch information
craig[bot] and nvanbenschoten committed Apr 23, 2021
2 parents dd5b836 + af85ccf commit 74022e7
Showing 1 changed file with 13 additions and 1 deletion.
14 changes: 13 additions & 1 deletion pkg/kv/kvnemesis/applier.go
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,7 @@ package kvnemesis

import (
"context"
"time"

"github.com/cockroachdb/cockroach/pkg/kv"
"github.com/cockroachdb/cockroach/pkg/roachpb"
Expand Down Expand Up @@ -86,8 +87,19 @@ func applyOp(ctx context.Context, db *kv.DB, op *Operation) {
err := db.AdminTransferLease(ctx, o.Key, o.Target)
o.Result = resultError(ctx, err)
case *ClosureTxnOperation:
// Use a backoff loop to avoid thrashing on txn aborts. Don't wait between
// epochs of the same transaction to avoid waiting while holding locks.
retryOnAbort := retry.StartWithCtx(ctx, retry.Options{
InitialBackoff: 1 * time.Millisecond,
MaxBackoff: 250 * time.Millisecond,
})
var savedTxn *kv.Txn
txnErr := db.Txn(ctx, func(ctx context.Context, txn *kv.Txn) error {
if savedTxn != nil && txn.TestingCloneTxn().Epoch == 0 {
// If the txn's current epoch is 0 and we've run at least one prior
// iteration, we were just aborted.
retryOnAbort.Next()
}
savedTxn = txn
for i := range o.Ops {
op := &o.Ops[i]
Expand Down Expand Up @@ -116,7 +128,7 @@ func applyOp(ctx context.Context, db *kv.DB, op *Operation) {
})
o.Result = resultError(ctx, txnErr)
if txnErr == nil {
o.Txn = savedTxn.Sender().TestingCloneTxn()
o.Txn = savedTxn.TestingCloneTxn()
}
default:
panic(errors.AssertionFailedf(`unknown operation type: %T %v`, o, o))
Expand Down

0 comments on commit 74022e7

Please sign in to comment.