roachtest: tpcc/multiregion/survive=region/chaos=true failed #85711
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 524fd14da3fefcd849f44a835cc5f88f5dbdadcc:
Parameters: Same failure on other branches
|
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ f59620ec646d1181d358d0dc41ab60815ecf59c9:
Parameters: Same failure on other branches
|
In the last failure, a node crashed with the fatal error:
We'll want KV to investigate, so moving this to KV. |
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 4dcb32c0346e20a95847763f89b9b0796d9ed4dc:
Parameters: Same failure on other branches
|
Looking. |
There are two failure modes in this roachtest.
cockroach/pkg/kv/kvclient/kvcoord/txn_interceptor_span_refresher.go Lines 219 to 224 in 4e8b0bc
|
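For context, the span refresher code referenced above is tied to an invariant on the client's request path: a batch issued by the TxnCoordSender should not carry a txn with the WriteTooOld bit already set. The following is a rough illustration only, with stand-in types rather than the actual roachpb/kvcoord API:

```go
package main

import "fmt"

// Stand-in types; the real roachpb.Transaction and BatchRequest carry
// many more fields.
type Transaction struct{ WriteTooOld bool }
type BatchRequest struct{ Txn *Transaction }

// checkBatch sketches the kind of invariant the client stack asserts on
// the request path: an outgoing batch should not already carry a txn
// with the WriteTooOld flag set.
func checkBatch(ba *BatchRequest) error {
	if ba.Txn != nil && ba.Txn.WriteTooOld {
		return fmt.Errorf("unexpected WriteTooOld bit on outgoing batch")
	}
	return nil
}

func main() {
	// A batch whose txn has the bit set trips the assertion.
	fmt.Println(checkBatch(&BatchRequest{Txn: &Transaction{WriteTooOld: true}}))
}
```

The debugging below is about how that bit could end up on the embedded txn in the first place.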
Logs weren't informative. Kicking off a few runs with the following additional logs:

diff --git c/pkg/kv/kvclient/kvcoord/txn_coord_sender.go i/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
index 7f3156a332..10f801d0bf 100644
--- c/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
+++ i/pkg/kv/kvclient/kvcoord/txn_coord_sender.go
@@ -130,7 +130,7 @@ type TxnCoordSender struct {
closed bool
// txn is the Transaction proto attached to all the requests and updated on
- // all the responses.
+ // all the responses. // XXX: Is there a response this is updated on?
txn roachpb.Transaction
// userPriority is the txn's priority. Used when restarting the transaction.
@@ -519,7 +519,13 @@ func (tc *TxnCoordSender) Send(
// Clone the Txn's Proto so that future modifications can be made without
// worrying about synchronization.
+ // XXX: Here? There's code elsewhere that mutates the embedded txn on WTOE
+ // retry. There's a rollback attempt as well. So we could've mucked with
+ // this state on an error.
ba.Txn = tc.mu.txn.Clone()
+ if ba.Txn.WriteTooOld {
+ log.Infof(ctx, "xxx: set wto bit on batch request's txn")
+ }
// Send the command through the txnInterceptor stack.
br, pErr := tc.interceptorStack[0].SendLocked(ctx, ba)
@@ -770,7 +776,7 @@ func (tc *TxnCoordSender) UpdateStateOnRemoteRetryableErr(
// not be usable afterwards (in case of TransactionAbortedError). The caller is
// expected to check the ID of the resulting transaction. If the TxnCoordSender
// can still be used, it will have been prepared for a new epoch.
-func (tc *TxnCoordSender) handleRetryableErrLocked(
+func (tc *TxnCoordSender) handleRetryableErrLocked( // XXX: Latest. Look at callstacks.
ctx context.Context, pErr *roachpb.Error,
) *roachpb.TransactionRetryWithProtoRefreshError {
// If the error is a transaction retry error, update metrics to
@@ -808,7 +814,7 @@ func (tc *TxnCoordSender) handleRetryableErrLocked(
tc.metrics.RestartsUnknown.Inc()
}
errTxnID := pErr.GetTxn().ID
- newTxn := roachpb.PrepareTransactionForRetry(ctx, pErr, tc.mu.userPriority, tc.clock)
+ newTxn := roachpb.PrepareTransactionForRetry(ctx, pErr, tc.mu.userPriority, tc.clock) // XXX: Here? We update the embedded txn
// We'll pass a TransactionRetryWithProtoRefreshError up to the next layer.
retErr := roachpb.NewTransactionRetryWithProtoRefreshError(
@@ -837,6 +843,9 @@ func (tc *TxnCoordSender) handleRetryableErrLocked(
// This is where we get a new epoch.
tc.mu.txn.Update(&newTxn)
+ if tc.mu.txn.WriteTooOld {
+ log.Infof(ctx, "xxx: set wto bit on embedded txn")
+ }

Using:
|
Can't read much from the test history; it first failed around August 7th (when this issue was filed) and has failed sporadically since. We've been spamming 22.1 failures, but that was something else: #78619 (comment). |
Still speculative since my repros are still running (these are really long running tests and expensive -- 10 nodes!), but my money's on #85101 (+cc @yuzefovich). |
6 runs in parallel, taking 2 hrs each, didn't turn up anything. I'll try just reproducing it more directly next, maybe seeing what codepaths #85101 changed. This feels like a valid release blocker. |
This code we deleted claims that only 19.2 code "might give us an error with the WriteTooOld flag set". But isn't it possible on master, due to the following sequence: cockroach/pkg/kv/kvserver/replica_evaluate.go Line 364 in ad17678
cockroach/pkg/kv/kvserver/replica_evaluate.go Line 390 in ad17678
I'm wholly unfamiliar with this code and am grasping at straws. @yuzefovich, do you mind taking a quick pass to see whether the above looks sound to you? We may want to bring back that code in light of this, which you say has data-race implications with the work in #84946. |
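A heavily simplified sketch of what the deleted termination code plausibly did. Stand-in types and function name; the real interceptor operates on roachpb protos and surfaces a retry/refresh error rather than a bool:

```go
package main

import "fmt"

// Stand-ins for roachpb types; only the field relevant here is modeled.
type Transaction struct{ WriteTooOld bool }
type BatchResponse struct{ Txn *Transaction }

// terminateWTO strips the WriteTooOld flag from a response before it can
// be folded into the TxnCoordSender's embedded txn, and reports that the
// client needs to refresh its read spans and retry instead.
func terminateWTO(br *BatchResponse) (needsRefresh bool) {
	if br.Txn != nil && br.Txn.WriteTooOld {
		br.Txn.WriteTooOld = false
		return true
	}
	return false
}

func main() {
	br := &BatchResponse{Txn: &Transaction{WriteTooOld: true}}
	fmt.Println(terminateWTO(br), br.Txn.WriteTooOld) // flag stripped, refresh signalled
}
```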
Nice debugging. This sounds correct to me. Would it be worthwhile to try to write a test that hits that sequence and returns an error with the WriteTooOld flag set? If we demonstrate that this is possible, I don't see why we need to revert all of #85101. Isn't reverting the change to |
Yes, this sounds right to me. |
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ 95677eb5f8d006629b16024fb7d87d55344c1470:
Parameters: Same failure on other branches
|
@irfansharif looks like this is the only beta blocker - AFAIU #87739 should fix it, so it'd be good to merge that change. |
I haven’t written a test or repro for it, but I'm happy to merge to unblock the beta. Want to LGTM? |
#85711 (comment) was mistaken; I missed this defer clause, which unsets the WriteTooOld bit on the server side for errors. cockroach/pkg/kv/kvserver/replica_evaluate.go Lines 157 to 164 in d06a355
I'm back to being confused about what's happening here. |
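A minimal Go sketch of the defer pattern being described, with hypothetical types and a hypothetical function, not the actual replica_evaluate.go code: when evaluation ends in an error, the deferred closure unsets the WriteTooOld bit on the outgoing txn, so a plain error response should not carry the flag.

```go
package main

import (
	"errors"
	"fmt"
)

// Stand-in for roachpb.Transaction; only the relevant field is modeled.
type Transaction struct{ WriteTooOld bool }

// evaluate sketches the server-side defer: on the error path the
// WriteTooOld bit is stripped from the outgoing txn before returning.
func evaluate(fail bool) (outTxn *Transaction, err error) {
	outTxn = &Transaction{}
	defer func() {
		if err != nil {
			outTxn.WriteTooOld = false
		}
	}()
	outTxn.WriteTooOld = true // some write in the batch ran into a newer value
	if fail {
		return outTxn, errors.New("a later request in the batch failed")
	}
	return outTxn, nil
}

func main() {
	txn, err := evaluate(true)
	fmt.Println(err != nil, txn.WriteTooOld) // error path: flag stripped
	txn, _ = evaluate(false)
	fmt.Println(txn.WriteTooOld) // success path: flag survives
}
```

Note the use of named result parameters, which is what lets the deferred closure observe the error being returned.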
We should be stripping the
|
NVM. Nathan pointed out that we combine WTO bits set on specific BatchResponses with errors from others: cockroach/pkg/kv/kvclient/kvcoord/dist_sender.go Lines 1289 to 1292 in c8c29f7
So it's still possible to bubble up a pErr with the WTO bit set up to the client. |
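A toy illustration of that combining behavior, using stand-in types (the real DistSender merges full Transaction protos via their Update method): one sub-batch succeeds with the WriteTooOld flag set, another fails, and folding the successful response's txn into the error leaves the pErr carrying the bit.

```go
package main

import "fmt"

// Stand-ins for roachpb.Transaction and roachpb.Error.
type Transaction struct{ WriteTooOld bool }
type Error struct{ Txn *Transaction }

// combine sketches the DistSender behavior pointed out above: the error's
// txn is updated from a successful sub-batch response's txn, and the
// WriteTooOld flag is sticky under that merge.
func combine(pErr *Error, respTxn *Transaction) *Error {
	if pErr.Txn == nil {
		pErr.Txn = &Transaction{}
	}
	pErr.Txn.WriteTooOld = pErr.Txn.WriteTooOld || respTxn.WriteTooOld
	return pErr
}

func main() {
	// One sub-batch errored (pErr), another succeeded with WTO set.
	pErr := combine(&Error{}, &Transaction{WriteTooOld: true})
	fmt.Println(pErr.Txn.WriteTooOld) // the error now carries the bit
}
```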
Touches cockroachdb#85711 fixing one of the failure modes. In cockroachdb#85101 we deleted code in the span refresher interceptor that terminated WriteTooOld flags. We did so assuming these flags were only set in 19.2 servers, but that's not the case -- TestWTOBitTerminatedOnErrorResponses demonstrates that it's possible for the server to return error responses with the bit set if a response is combined with an error from another request in the same batch request. Since we were no longer terminating the flag, it was possible to update the TxnCoordSender's embedded txn with this bit, and then use it when issuing subsequent batch requests -- something we were asserting against. Release note: None Release justification: Bug fix
87739: kvcoord: (partially) de-flake tpcc/multiregion r=irfansharif a=irfansharif
The remaining failure mode is listed in #85711 (comment). Leaving this issue open to track that. |
Regarding the "no inbound stream" error when validating unique constraints at the end of the import: this looks somewhat similar to #87104, since nodes are being randomly restarted (due to the chaos) and we are likely shutting down a node that is part of the distributed query that validates the unique constraints. I agree with Irfan that we should be more resilient here - it's a shame to effectively complete the import and then fail it altogether due to a transient error when validating the constraints. |
roachtest.tpcc/multiregion/survive=region/chaos=true failed with artifacts on master @ a7c91f06d8ee0fa2096bcd626f689009024947bb:
Parameters: ROACHTEST_cloud=gce, ROACHTEST_cpu=4, ROACHTEST_ssd=0
Jira issue: CRDB-18395
Epic CRDB-19172