-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
storage: unexpected COMMITTED state, likely due to parallel commits #37866
Comments
See cockroachdb#37866. This has been failing in roachtests like cockroachdb#37488. This will make it easier to debug until the issue is fixed. Release note: None
37867: kv: replace unexpected txn state assertion with error r=nvanbenschoten a=nvanbenschoten See #37866. This has been failing in roachtests like #37488. This will make it easier to debug until the issue is fixed. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
I went on a little survey of all of those crashes that I could find to see if I could spot any commonalities. Ended up with seven results (https://gist.github.com/tbg/a3a40b6db38c079ff88122c537bc765b):
|
I sprinkled some assertions that should fire earlier (i.e. when the weird update is made). Branch is at https://github.com/tbg/cockroach/tree/repro/tpcc-37866 There have been enough test failures here to hopefully catch this. |
One of my assertions fired (the fatalf block below, the rest is regular TxnCoordSender code). The stack trace has a commit attempt ( // If we succeeded to commit, or we attempted to rollback, we move to
// txnFinalized.
req, ok := ba.GetArg(roachpb.EndTransaction)
if ok {
etReq := req.(*roachpb.EndTransactionRequest)
if etReq.Commit {
if pErr == nil {
tc.mu.txnState = txnFinalized
tc.cleanupTxnLocked(ctx)
tc.maybeSleepForLinearizable(ctx, br, startNs)
}
} else {
// Rollbacks always move us to txnFinalized.
tc.mu.txnState = txnFinalized
tc.cleanupTxnLocked(ctx)
}
}
if tc.mu.txnState == txnPending && tc.mu.txn.Status == roachpb.COMMITTED {
log.Fatalf(ctx, "TBG txnPending but committed %+v: %s; pErr=%v", req, tc.mu.txn, pErr)
}
This doesn't exactly tell us when the COMMITTED status leaked into the proto (it wasn't from
What's interesting here is the epoch zero. The failures all have epoch one, which means there was some restart attempt involved. I'll have to dig some more, but tentatively I assume that the bug is something like this
The problem of course is real. If the txn is in fact committed, the client shouldn't be able to accidentally restart and commit. I think the write timestamp cache prevents it from committing (have to double check that, but this sounds like something @nvanbenschoten thought of by using the epoch zero timestamp), so the major problem might be getting the bookkeeping right. |
There's no code path (that I don't have an assertion on) that would reset I looked some more and discovered that heartbeats also write to the txn (although under a different name). This led me to the following interesting piece of code: cockroach/pkg/kv/txn_interceptor_heartbeater.go Lines 435 to 464 in db482d3
Sure seems likely that the txn gets committed, and then the heartbeat picks up the committed proto but doesn't do the bookkeeping around it. I'll verify with a print statement before I fix it. |
This is roughly what I was thinking was going on as well. I mentioned above that it's likely that the assertion failure happens when a heartbeat request finds the COMMITTED record and the TransactionStatusErrors happens when the EndTransaction finds the COMMITTED record first. This would allow a single issue to explain both of these symptoms.
Yes it does, so the transaction wouldn't be able to commit a second time. The big question is why the transaction is being COMMITTED (probably by a txn recovery attempt) when the client isn't expecting it to be. There should be no ambiguity to the TxnCoordSender about whether a parallel commit attempt resulted in an implicit commit state (mod AmbiguousCommitErrors). Maybe there's a bug in I meant to mention above that one thing that might help us reproduce this is to enable |
Confirmed the sequence of events above. No idea yet why the client is attempting to commit, going to dig into that next. As @nvanbenschoten mentioned, we'd expect the client to either know that the parallel commit succeeded, or know that it definitely failed, or not know either way. The "not know either way" case is the one likely to be mishandled here. For example, I could see this code cockroach/pkg/kv/txn_interceptor_committer.go Lines 175 to 183 in 9d874cc
downgrade from STAGING to PENDING, and if the error doesn't terminate the txn we can see a restart. Naively this might happen if the error is just a txn retry error. I'll add logging there to see if we get anything interesting.
|
I made cockroach/pkg/storage/batcheval/cmd_end_transaction.go Lines 210 to 211 in db9c121
I'll make that assertion a warning to see if it translates itself into an assertion failure higher up in the stack. |
We expect to hit that code path frequently (especially with However, it would be unexpected to hit that code path when the |
Hmm, there's something going on that I don't understand. I am indeed seeing this code path hit by staging EndTransactions with inflight writes all the time (in fact 100% of the time in the log I'm looking at which has maybe a dozen occurrences). The only explanation I have for this is that
I'm running with "nuclear logging" (logging every request that doesn't early exit in the txn committer), so I should know more soon. |
I got another repro with a bit more logging, which I think corresponds to the following sequence of events:
As an additional bit of support for this sequence of events, the pErr logged in that assertion is a At first glance, the unwrapping of the MixedSuccessError seems fishy and removing that (turning it into an ambiguous commit instead) would likely solve the problem (I'll try that tomorrow just to validate the above sequence of events further). However, it seems that the underlying issue is a bit deeper than that. Once we get rid of MixedSuccessError we'd have the same kind of problem again: we might get |
Good investigation. |
I agree that we should not introduce ambiguous results willy-nilly, but as a short term fix (until Nathan is back next week) to get back on safe ground I think it'd be acceptable. That said, I haven't seen this cause test failures since we toned down the assertion so if that stays like that I'd rather put work into the real fix, whatever that may be. |
Toned down the assertion? Like, the user gets a non-ambiguous error but the txn is committed? This can't stay so, right? |
I meant this PR: #37867 |
This commit adds write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits spec. This causes an assertion to be triggered by the hazard discussed in cockroachdb#37866. The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed. Running the model right now hits this issue. The next commit will propose a solution to fixing it. Release note: None
…mmit This commit proposes a medium-term solution to cockroachdb#37866. In doing so, it resolved the model failure from the previous commit. The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from the pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore. 3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. Release note: None
Thanks again for tracking this down @tbg. I proposed a medium-term solution to this over in #37900. I don't know that we can completely eliminate the ambiguity due to a combination of loss of information during intent resolution and transaction record GC, but we can get pretty close and throw |
37900: tla-plus: QueryTxn on ambiguous QueryIntent failure during ParallelCommit r=nvanbenschoten a=nvanbenschoten This PR starts by adding write pipelining and concurrent intent resolution by conflicting transactions to the parallel commits TLA+ spec. This causes an assertion to be triggered by the hazard discussed in #37866. The assertion fires when the pre-commit QueryIntent for a pipelined write gets confused about an already resolved intent. This triggers a transaction retry, at which point the transaction record is unexpectedly already committed. This PR then proposes a medium-term solution to #37866. In doing so, it resolved the model failure from the previous commit. The medium-term solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from the pre-commit QueryIntent batch (right around [here](https://github.com/cockroachdb/cockroach/blob/2428567cba5e0838615521cbc9d0e1310f0ee6ad/pkg/kv/dist_sender.go#L916)). When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from #37866. Ignore. 3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. This is a significant reduction in the cases where an AmbiguousCommitError will be needed and I suspect it will be able to tide us over until we're able to eliminate the loss of information caused by intent resolution almost entirely (e.g. by storing transaction IDs in committed values). There will still be some loss of information if we're not careful about MVCC GC, and it's still not completely clear to me how we'll need to handle that in every case. That's a discussion for a different time. 37918: roachtest: Move jepsen-initialized marker out of defer r=tbg a=bdarnell This was running whether the test succeeded or failed, which is silly. Fixes #37831 Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]> Co-authored-by: Ben Darnell <[email protected]>
…lel commit Fixes cockroachdb#37866. This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore. 3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. Release note: None
…lel commit Fixes cockroachdb#37866. This commit implements the medium-term solution to cockroachdb#37866 proposed and modeled in cockroachdb#37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from cockroachdb#37866. Ignore. 3. COMMITTED: Unambiguously the issue from cockroachdb#37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. Release note: None
37987: kv: detect if missing intent is due to intent resolution during parallel commit r=nvanbenschoten a=nvanbenschoten Fixes #37866. This commit implements the medium-term solution to #37866 proposed and modeled in #37900. The solution is to catch `IntentMissingErrors` in `DistSender`'s `divideAndSendParallelCommit` method coming from a parallel commit's pre-commit QueryIntent batch. When we see one of these errors, we immediately send a `QueryTxn` request to the transaction record. This will result in one of the four statuses: 1. PENDING: Unexpected because the parallel commit `EndTransactionRequest` succeeded. Ignore. 2. STAGING: Unambiguously not the issue from #37866. Ignore. 3. COMMITTED: Unambiguously the issue from #37866. Strip the error and return the updated proto. 4. ABORTED: Still ambiguous. Transform error into an AmbiguousCommitError and return. This solution isolates the ambiguity caused by the loss of information during intent resolution to just the case where the result of the QueryTxn is ABORTED. This is because an ABORTED record can mean either 1) the transaction was ABORTED and the missing intent was removed or 2) the transaction was COMMITTED, all intents were resolved, and the transaction record was GCed. Release note: None Co-authored-by: Nathan VanBenschoten <[email protected]>
We've seen a few instances of this assertion fire because a TxnCoordSender in the
txnPending
state has a COMMITTED transaction proto. The assertion failure looks like:I've been able to reproduce this once by running TPC-C. When attempting to do so, I also saw a
TransactionStatusError("already committed")
error from cmd_end_transaction.go:211.It's likely that these errors have the same underlying cause. Somehow a transaction record is being
COMMITTED
when a transaction coordinator does not expect it to be, which is probably related to parallel commits somehow. The first error comes when the coordinator's heartbeat finds theCOMMITTED
record and the second error comes when an EndTransactionRequest finds theCOMMITTED
record. The cause of the unexpectedCOMMITTED
status is currently unknown.I've downgraded the assertion to an error so that it is easier to debug in roachtest failures. Past that point, I won't be around for the next week to do too much debugging and track down the root cause. If this causes enough instability that we deem it necessary to temporarily disable parallel commits, here's the patch to do so (cc. @tbg):
Click to expand
The text was updated successfully, but these errors were encountered: