kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563

lidorcarmel · 2022-01-07T03:59:12Z

Previously kv users could lose parts of a transaction without getting an
error. After Send() returned a retryable error the state of txn got reset
which made it usable again. If the caller ignored the error they could
continue applying more operations without realizing the first part of the
transaction was discarded. See more details in the issue (#22615).

The simple case example is where the retryable closure of DB.Txn() returns
nil instead of returning the retryable error back to the retry loop - in this
case the retry loop declares success without realizing we lost the first part
of the transaction (all the operations before the retryable error).

This PR leaves the txn in a "poisoned" state after encountering an error, so
that all future operations fail fast. The caller is therefore expected to
reset the txn handle back to a usable state intentionally, by calling
Txn.PrepareForRetry(). In the simple case of DB.Txn() the retry loop will
reset the handle and run the retry even if the callback returned nil.

Closes #22615

Release note: None

cockroach-teamcity · 2022-01-07T03:59:19Z

This change is

nvanbenschoten

Nice job! It will be a big step forward to finally resolve this issue.

Reviewed 10 of 10 files at r1, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @andreimatei and @lidorcarmel)

pkg/kv/txn.go, line 978 at r1 (raw file):

		if err == nil {
			txn.mu.Lock()
			txnErr := txn.mu.sender.GetTxnRetryableErr(ctx)

Why do we need to explicitly check this here? Shouldn't the call to txn.Commit() below and the logic in maybeRejectClientLocked be enough to detect a poisoned txn and return an error?

pkg/kv/txn.go, line 1421 at r1 (raw file):

	}
	// Verify we have an initialized Transaction before creating a new root sender.
	if newTxn.ID == (uuid.UUID{}) {

Is this a valid case to hit?

pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 115 at r1 (raw file):

		txnState txnState

		// storedRetryableErr is set when tsnState == txnRetryableError. This

s/tsnState/txnState/

pkg/sql/conn_fsm.go, line 377 at r1 (raw file):

			Description: "ROLLBACK TO SAVEPOINT (not cockroach_restart) failed because txn needs restart",
			Next:        stateAborted{},
			Action:      prepareTxnForRetry,

Are you sure this is where we wanted the call to PrepareForRetry, as opposed to the next event which handles the cockroach_restart savepoint and advances the txnState with a txnRestart event? I'm a little surprised to not see a 1:1 mapping between uses of txnRestart and calls to PrepareForRetry.

Is this related to the need to provide an error to PrepareForRetry?

pkg/kv/db_test.go, line 724 at r1 (raw file):

			// Only write if we have not written before, because otherwise we will keep aborting
			// the other txn forever.
			r, e := hpTxn.Get(ctx, "aa")

nit: s/e/err/ throughout.

pkg/kv/db_test.go, line 755 at r1 (raw file):

		return e
	})
	require.NoError(t, err)

Want to add require.Equal(t, 1, runNumber) here to assert that we performed exactly one retry?

pkg/kv/db_test.go, line 758 at r1 (raw file):

	err1 := db.Txn(context.Background(), func(ctx context.Context, txn *kv.Txn) error {
		// The high priority txn was overridden by the successful  retry.

nit: extra space in "successful retry". Also, I think s/overridden/overwritten/ is slightly more accurate.

pkg/kv/db_test.go, line 822 at r1 (raw file):

	// db.Txn should return the retryable error that poisoned txn.
	expectedErr := (*roachpb.TransactionRetryWithProtoRefreshError)(nil)
	require.True(t, errors.As(err, &expectedErr))

Same thing here. Consider a require.Equal(t, 0, runNumber).

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 705 at r1 (raw file):

		{
			// On abort, nothing changes but we get a new priority to use for
			// the next attempt.

It looks like we broke what the test was trying to demonstrate. Can we revive it by adding a call to PrepareForRetry?

lidorcarmel

hi @nvanbenschoten , looks like that test failure in rttanalysis was real, the issue was that when things are slow (for some reason only on bazel) then the txn may get aborted by a heartbeat failure, BUT, when we try to PrepareForRetry without an error we fail because we cannot find the error - a retryable error created by the heartbeat failure did not poison the handle, so it's a real bug! thanks bazel :)
I tried adding a test, please feel free to tell me that there is a much better/easier way to do this, but this test does the job.
BTW I think that regardless of this change we should call handleRetryableErrLocked() when we see that heartbeat failure.
Last - the lock/unlock stuff is not related to the test failure.
PTAL, thanks!

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)

nvanbenschoten

Reviewed 2 of 5 files at r5, 4 of 4 files at r6, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, and @yuzefovich)

pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 731 at r6 (raw file):

		abortedErr := roachpb.NewErrorWithTxn(
			roachpb.NewTransactionAbortedError(roachpb.ABORT_REASON_CLIENT_REJECT), &tc.mu.txn)
		if tc.typ == kv.LeafTxn {

LeafTxns don't use txnHeartbeaters, so I don't see how this case could be possible. I think I just missed that when this was last touched in 1a84480. Do you mind instrumenting it with a fatal and seeing whether we ever hit it in unit tests? If not, I think we should remove it.

pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 737 at r6 (raw file):

		}
		// Root txns handle retriable errors.
		return roachpb.NewError(tc.handleRetryableErrLocked(ctx, abortedErr))

This is a good catch. I agree with you that we should make this change. Can we do so in a separate commit though? That way, we can better pinpoint any fallout from it.

pkg/sql/conn_fsm.go, line 506 at r6 (raw file):

	ts := args.Extended.(*txnState)
	ts.mu.Lock()
	ts.mu.txn.PrepareForRetry(ts.Ctx)

I don't think it needs to hold up this PR, but one thing in here that I'd be interested to get @andreimatei's take on is whether it is more appropriate for these calls to PrepareForRetry to live in these fsm Action functions or whether they should be moved into txnStateTransitionsApplyWrapper and handled on the txnRestart event. I don't think I ever understood what belongs where and why all connExecutor state updates weren't handled directly in an fsm Action.

pkg/kv/db_test.go, line 17 at r6 (raw file):

	"context"
	"fmt"
	"github.com/cockroachdb/cockroach/pkg/kv/kvclient/kvcoord"

nit: move this into the grouping below. crlfmt should do this automatically, so I wonder whether you have that set up: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher. If not, I'd recommend it.

pkg/kv/db_test.go, line 802 at r6 (raw file):

}

// Verify the txn sees a retryable error without using the handle: Normally the

This might be a more appropriate test for pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go.

pkg/kv/db_test.go, line 810 at r6 (raw file):

// handle stays in txnPending then this means we do not have the error. This
// error is needed for resetting the handle in PrepareForRetry.
func TestDB_PrepareForRetryAfterHeartbeatFailure(t *testing.T) {

How long does this test to run? At least 1 second, right? If you replace setup with a call to serverutils.StartServer directly, you should be able to configure a shorter txn heartbeat interval?

pkg/kv/db_test.go, line 833 at r6 (raw file):

		// This is here because we want to call maybeRejectClientLocked, which will
		// poison the handle if there is a heartbeat failure.
		tc.GetLeafTxnFinalState(ctx, kv.OnlyPending)

nit: _, _ = tc.GetLeafTxnFinalState(ctx, kv.OnlyPending) to show that you're intentionally ignoring the result. CI might actually be complaining about this already.

pkg/kv/db_test.go, line 841 at r6 (raw file):

			return errors.New("the handle is not poisoned yet")
		}
		return nil

Should we assert on the contents of the error after we've checked that it's non-nil?

lidorcarmel

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)

pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 731 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

LeafTxns don't use txnHeartbeaters, so I don't see how this case could be possible. I think I just missed that when this was last touched in 1a84480. Do you mind instrumenting it with a fatal and seeing whether we ever hit it in unit tests? If not, I think we should remove it.

Tested locally with some unit tests, and also pushed to test in CI.
Removing this in a separate commit (I'm getting used to thinking about commits!).
CI is still running, if it fails I'll revert.

pkg/kv/kvclient/kvcoord/txn_coord_sender.go, line 737 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is a good catch. I agree with you that we should make this change. Can we do so in a separate commit though? That way, we can better pinpoint any fallout from it.

Done.

pkg/sql/conn_fsm.go, line 377 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

These changes look very good. Thank you for making them.

Thank you!

pkg/sql/conn_fsm.go, line 506 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I don't think it needs to hold up this PR, but one thing in here that I'd be interested to get @andreimatei's take on is whether it is more appropriate for these calls to PrepareForRetry to live in these fsm Action functions or whether they should be moved into txnStateTransitionsApplyWrapper and handled on the txnRestart event. I don't think I ever understood what belongs where and why all connExecutor state updates weren't handled directly in an fsm Action.

Looks like we're good here, verified with Andrei, conn_fsm should manage the txn object.

pkg/kv/db_test.go, line 17 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: move this into the grouping below. crlfmt should do this automatically, so I wonder whether you have that set up: https://cockroachlabs.atlassian.net/wiki/spaces/ENG/pages/154206209/Goland+Tips+and+Tricks#Enable-crlfmt-Watcher. If not, I'd recommend it.

Done. Strange! I had crlfmt.. and I'm sure it was working until recently but now it was disabled.... mmm.

pkg/kv/db_test.go, line 802 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This might be a more appropriate test for pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go.

Done.

pkg/kv/db_test.go, line 810 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

How long does this test to run? At least 1 second, right? If you replace setup with a call to serverutils.StartServer directly, you should be able to configure a shorter txn heartbeat interval?

Done.

pkg/kv/db_test.go, line 833 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: _, _ = tc.GetLeafTxnFinalState(ctx, kv.OnlyPending) to show that you're intentionally ignoring the result. CI might actually be complaining about this already.

Done.
No idea why it didn't complain :(

Code quote (from pkg/kv/kvclient/kvcoord/txn_coord_sender.go):

		newTxn := roachpb.PrepareTransactionForRetry(
			ctx, abortedErr, roachpb.NormalUserPriority, tc.clock)
		return roachpb.NewError(roachpb.NewTransactionRetryWithProtoRefreshError(
			abortedErr.String(), tc.mu.txn.ID, newTxn))

pkg/kv/db_test.go, line 841 at r6 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we assert on the contents of the error after we've checked that it's non-nil?

I added
require.Equal(t, txn.ID(), pErr.TxnID)
I'm happy to add more..

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):

		// happens due to a heartbeat failure. On the other hand
		// GetLeafTxnFinalState() calls maybeRejectClientLocked() without trying a
		// Send().

Please take a look at this comment Nathan, I forgot this when we talked - this is why Get() doesn't work here. We can do a sleep and then Get() but this is cleaner.

nvanbenschoten

This , although I think we can clean up the test just a bit further!

Reviewed 13 of 13 files at r7, 12 of 13 files at r8, 1 of 1 files at r9, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 272 at r8 (raw file):

	// Create a DB with a short heartbeat interval.
	s := createTestDBWithKnobs(t, &kvserver.StoreTestingKnobs{
		DisableScanner:    true,

Do we need these three knobs?

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

Please take a look at this comment Nathan, I forgot this when we talked - this is why Get() doesn't work here. We can do a sleep and then Get() but this is cleaner.

Interesting. I wasn't expecting the Get to notice that the txn was aborted and return an error. This must be because we check the abort span during the read, which is populated on the target range when the high-priority txn aborts our intent. See WriteAbortSpanOnResolve and the pkg/kv/kvserver/abortspan package for a bit of an explanation. The high-level role of the "abort span" is to prevent zombie transactions from failing to read their own writes after those writes have been aborted.

One way to get around this is to send the put and the get to different ranges. For instance, you could perform a split on key "b" (keyB := roachpb.Key("b")) using s.DB.AdminSplit and then perform Get("b") requests in this SucceedsSoon loop. What do you think about trying that out?

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

	pErr := tc.GetTxnRetryableErr(ctx)
	require.NotNil(t, pErr)
	require.Equal(t, txn.ID(), pErr.TxnID)

Should we assert the error type here as well?

lidorcarmel

yep, thanks, PTAL.

Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andreimatei, @nvanbenschoten, and @yuzefovich)

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 272 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we need these three knobs?

No, removed.

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 323 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Interesting. I wasn't expecting the Get to notice that the txn was aborted and return an error. This must be because we check the abort span during the read, which is populated on the target range when the high-priority txn aborts our intent. See WriteAbortSpanOnResolve and the pkg/kv/kvserver/abortspan package for a bit of an explanation. The high-level role of the "abort span" is to prevent zombie transactions from failing to read their own writes after those writes have been aborted.

One way to get around this is to send the put and the get to different ranges. For instance, you could perform a split on key "b" (keyB := roachpb.Key("b")) using s.DB.AdminSplit and then perform Get("b") requests in this SucceedsSoon loop. What do you think about trying that out?

Done, works like a charm.
I did like my "creative" approach but this is probably better, thanks for the details!

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Should we assert the error type here as well?

so, GetTxnRetryableErr() returns *roachpb.TransactionRetryWithProtoRefreshError, not "error", so the type will always be fine, right? or maybe I'm missing something..

nvanbenschoten

feel free to merge if/when you feel good about the last two comments. No need for another round of reviews.

Reviewed all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @lidorcarmel, @nvanbenschoten, and @yuzefovich)

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

Previously, lidorcarmel (Lidor Carmel) wrote…

so, GetTxnRetryableErr() returns *roachpb.TransactionRetryWithProtoRefreshError, not "error", so the type will always be fine, right? or maybe I'm missing something..

Ah, ok, then maybe just add an assertion that pErr.Transaction.ID is not equal to pErr.TxnID, which indicates an aborted txn.

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 300 at r10 (raw file):

	// Note that if we read from the same range then we will cause the txn to be
	// poisoned immediately (Send() will fail), even before the heartbeat failure,
	// which is not the case we want to test here.

Let's mention the abort span here to explain why we need to send the Get to a different range.

lidorcarmel

Thanks! done.

Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @andreimatei, @nvanbenschoten, and @yuzefovich)

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 336 at r8 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Ah, ok, then maybe just add an assertion that pErr.Transaction.ID is not equal to pErr.TxnID, which indicates an aborted txn.

Done.

pkg/kv/kvclient/kvcoord/txn_coord_sender_test.go, line 300 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Let's mention the abort span here to explain why we need to send the Get to a different range.

Done.

Instead of calling handleRetryableErrLocked() we did similar work when the transaction sees a retryable error from a heartbeat failure. Instead, it's better to call the existing handleRetryableErrLocked() to take care of those errors in one place. Release note: None

Previously kv users could lose parts of a transaction without getting an error. After Send() returned a retryable error the state of txn got reset which made it usable again. If the caller ignored the error they could continue applying more operations without realizing the first part of the transaction was discarded. See more details in the issue (cockroachdb#22615). The simple case example is where the retryable closure of DB.Txn() returns nil instead of returning the retryable error back to the retry loop - in this case the retry loop declares success without realizing we lost the first part of the transaction (all the operations before the retryable error). This PR leaves the txn in a "poisoned" state after encountering an error, so that all future operations fail fast. The caller is therefore expected to reset the txn handle back to a usable state intentionally, by calling Txn.PrepareForRetry(). In the simple case of DB.Txn() the retry loop will reset the handle and run the retry even if the callback returned nil. Closes cockroachdb#22615 Release note: None

LeafTxns don't use heartbeats, therefore we should never hit this condition. Release note: None

lidorcarmel · 2022-02-15T16:37:02Z

TFTR Nathan!
bors r+

craig · 2022-02-15T17:42:03Z

Build succeeded:

GitHub CI (Cockroach)

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 8 times, most recently from 4e3d064 to e23a68a Compare January 7, 2022 23:01

lidorcarmel changed the title ~~Lidor txn retry fix v2~~ kv,kvcoord: poison txnCoordSender after a retryable error Jan 7, 2022

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 10 times, most recently from b7c0300 to 5c7dfef Compare January 14, 2022 22:36

lidorcarmel marked this pull request as ready for review January 14, 2022 22:38

lidorcarmel requested a review from a team as a code owner January 14, 2022 22:38

lidorcarmel requested review from nvanbenschoten and andreimatei January 14, 2022 22:38

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 2 times, most recently from 6ded62d to 05b749d Compare January 15, 2022 00:10

lidorcarmel changed the title ~~kv,kvcoord: poison txnCoordSender after a retryable error~~ kv,kvcoord,sql: poison txnCoordSender after a retryable error Jan 15, 2022

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch from 05b749d to 39718c4 Compare January 15, 2022 02:42

lidorcarmel mentioned this pull request Jan 18, 2022

kv: test for issue #22615 #74183

Closed

nvanbenschoten reviewed Jan 19, 2022

View reviewed changes

lidorcarmel removed the request for review from srosenberg February 8, 2022 02:36

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch from cdccc69 to 75223e2 Compare February 8, 2022 05:45

lidorcarmel commented Feb 8, 2022

View reviewed changes

nvanbenschoten reviewed Feb 8, 2022

View reviewed changes

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch 7 times, most recently from 35d5423 to cdf18bf Compare February 10, 2022 06:06

lidorcarmel commented Feb 10, 2022

View reviewed changes

nvanbenschoten approved these changes Feb 11, 2022

View reviewed changes

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch from cdf18bf to 9b5eea7 Compare February 12, 2022 01:12

lidorcarmel commented Feb 12, 2022

View reviewed changes

nvanbenschoten approved these changes Feb 14, 2022

View reviewed changes

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch from 9b5eea7 to 8c92071 Compare February 15, 2022 00:50

lidorcarmel commented Feb 15, 2022

View reviewed changes

lidorcarmel added 3 commits February 14, 2022 19:49

kvcoord: removing unnecessary handling in LeafTxn

a098962

LeafTxns don't use heartbeats, therefore we should never hit this condition. Release note: None

lidorcarmel force-pushed the lidor_txn_retry_fix_v2 branch from 8c92071 to a098962 Compare February 15, 2022 04:15

craig bot merged commit 9132eac into cockroachdb:master Feb 15, 2022

lidorcarmel deleted the lidor_txn_retry_fix_v2 branch February 15, 2022 18:05

rafiss mentioned this pull request Aug 11, 2022

kvclient: require explicit acknowledgement from client on restart #85980

Closed

This was referenced Aug 11, 2022

kv/kvclient/kvcoord: GetLeafTxnFinalState should return a retry error if there was one #85999

Open

kv: remove PrepareRetryableError from the (*kv.Txn) API #86361

Closed

nvanbenschoten mentioned this pull request Mar 13, 2023

roachtest: tpcc/headroom failed during release qualification #97141

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563

kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563

lidorcarmel commented Jan 7, 2022 •

edited

Loading

cockroach-teamcity commented Jan 7, 2022

nvanbenschoten left a comment

lidorcarmel left a comment

nvanbenschoten left a comment

lidorcarmel left a comment

nvanbenschoten left a comment

lidorcarmel left a comment

nvanbenschoten left a comment

lidorcarmel left a comment

lidorcarmel commented Feb 15, 2022

craig bot commented Feb 15, 2022

kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563

kv,kvcoord,sql: poison txnCoordSender after a retryable error #74563

Conversation

lidorcarmel commented Jan 7, 2022 • edited Loading

cockroach-teamcity commented Jan 7, 2022

nvanbenschoten left a comment

Choose a reason for hiding this comment

lidorcarmel left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

lidorcarmel left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

lidorcarmel left a comment

Choose a reason for hiding this comment

nvanbenschoten left a comment

Choose a reason for hiding this comment

lidorcarmel left a comment

Choose a reason for hiding this comment

lidorcarmel commented Feb 15, 2022

craig bot commented Feb 15, 2022

lidorcarmel commented Jan 7, 2022 •

edited

Loading