roachtest: restore2TB/nodes=10 failed #53197
(roachtest).restore2TB/nodes=10 failed on master@ef55609797c92e46eb9efd08facc9db49558291d:
Artifacts: /restore2TB/nodes=10
See this test on roachdash
Looks like the first failure timed out:
(roachtest).restore2TB/nodes=10 failed on master@cd8267ca84f6a0267a898220cfaa6630188b45c5:
Artifacts: /restore2TB/nodes=10
See this test on roachdash
Most Recent Failure
I guess the question here is whether we should be tolerant of such failures.
2nd most recent failure
Full stack
Which reminds me of #52468 (comment). @adityamaru has been trying to reproduce this stuck behaviour while running restore2TB without much luck. I tried looking through the debug.zip to see if there were any logs that might help us uncover the root cause, but my searching didn't turn up anything. (Going to assign to him for now since he's been digging into it.) cc @andreimatei since the symptoms look familiar to #52468 (comment):
Release justification: small change improving logging, hoping to help debug cockroachdb#53197.

This patch moves the logging that the DistSender does when it detects an RPC that is slow such that it kicks in in more cases:
1) if the RPC was successful (but slow)
2) if the lower layer in the DistSender failed to contact any replica and returned a sendError{}. In this case the range descriptor will be refreshed and the RPC will be retried, similar to the case in which a replica returned RangeKeyMismatchError. The RKME case was already covered by the logging, but the sendError case wasn't.

Touches cockroachdb#53197

Release note: None
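To make the intent of that logging change concrete, here is a minimal, hypothetical Go sketch (not the actual DistSender code; `sendWithSlowLogging`, `slowRPCThreshold`, and the `sendError` type are assumptions made for illustration) of logging a slow RPC attempt regardless of whether it succeeded or came back with a send error that triggers a descriptor refresh and retry:

```go
package main

import (
	"errors"
	"log"
	"time"
)

// sendError is a hypothetical stand-in for the DistSender-internal error
// returned when no replica could be contacted.
type sendError struct{ msg string }

func (e *sendError) Error() string { return e.msg }

// slowRPCThreshold is an assumed cutoff for this sketch; the real threshold
// lives in the DistSender and is not reproduced here.
const slowRPCThreshold = 5 * time.Millisecond

// sendWithSlowLogging wraps one send attempt and logs it if it was slow,
// regardless of whether it succeeded, failed with a sendError (which leads to
// a descriptor refresh and retry), or failed with some other error.
func sendWithSlowLogging(attempt func() error) error {
	start := time.Now()
	err := attempt()
	if dur := time.Since(start); dur > slowRPCThreshold {
		var sendErr *sendError
		switch {
		case err == nil:
			log.Printf("slow RPC finished after %s (success)", dur)
		case errors.As(err, &sendErr):
			log.Printf("slow RPC finished after %s; no replica reachable, refreshing descriptor and retrying: %v", dur, err)
		default:
			log.Printf("slow RPC finished after %s: %v", dur, err)
		}
	}
	return err
}

func main() {
	// Simulate a slow-but-successful attempt and a slow attempt that fails
	// with a sendError.
	_ = sendWithSlowLogging(func() error { time.Sleep(10 * time.Millisecond); return nil })
	_ = sendWithSlowLogging(func() error {
		time.Sleep(10 * time.Millisecond)
		return &sendError{msg: "no replica reachable"}
	})
}
```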
I'm not sure what to say. This looks different from #52468 because here, if goroutine 913610 is indeed looping constantly, it's doing it at a higher level than in #52468.
53705: kvclient: improve logging of slow RPCs in DistSender r=andreimatei a=andreimatei

Release justification: small change improving logging, hoping to help debug #53197. This patch moves the logging that the DistSender does when it detects an RPC that is slow such that it kicks in in more cases: 1) if the RPC was successful (but slow); 2) if the lower layer in the DistSender failed to contact any replica and returned a sendError{}. In this case the range descriptor will be refreshed and the RPC will be retried, similar to the case in which a replica returned RangeKeyMismatchError. The RKME case was already covered by the logging, but the sendError case wasn't. Touches #53197. Release note: None

53721: geo: rebrand geospatial to spatial with minor refactors r=sumeerbhola a=otan

* Rename geospatial to spatial (for outwards facing docs).
* Replace all (args[d].(*tree.DType)) with tree.MustBeDType(args[d]).
* Add tests to ensure all geo builtin info has a capital letter and full stop.

Release justification: low risk update to new functionality
Release note: None

53778: backupccl: fixes cluster backup error string r=pbardea,dt a=adityamaru

Fixes: #52018
Release justification: low risk, high benefit changes to existing functionality

53785: opt: remove xfunc references and clean up EnsureKey r=RaduBerinde a=mgartner

#### norm: clean up EnsureKey decorrelation function
This commit replaces a single-case type switch statement in EnsureKey with an `if` with a type assertion. It also adds clarifying comments.
Release justification: This is a low-risk change to new and minor changes to existing functionality.
Release note: None

#### opt: fix comment in xform custom_funcs.go
This commit removes references to the non-existent `xfunc` package.
Release justification: This is a non-production code change.
Release note: None

53793: sql: skip TestCancelQueryPermissions r=solongordon a=solongordon

This test is exposing more data races. Skipping until we can fix them. Refs: #53791
Release justification: non-production code changes
Release note: None

Co-authored-by: Andrei Matei <[email protected]>
Co-authored-by: Oliver Tan <[email protected]>
Co-authored-by: Aditya Maru <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Solon Gordon <[email protected]>
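As a small aside on the EnsureKey cleanup in 53785 above, here is a self-contained Go illustration (not the actual `opt/norm` code; `Expr`, `ScanExpr`, and the function names are made-up stand-ins) of replacing a single-case type switch with an `if` plus a type assertion:

```go
package main

import "fmt"

// Expr and ScanExpr are hypothetical stand-ins for optimizer expression types;
// they are not the real opt package types.
type Expr interface{ Op() string }

type ScanExpr struct{ Table string }

func (s *ScanExpr) Op() string { return "scan" }

// before: a type switch with a single case.
func handleBefore(e Expr) {
	switch t := e.(type) {
	case *ScanExpr:
		fmt.Println("scan over", t.Table)
	}
}

// after: an if with a type assertion, which reads more directly when only one
// type is interesting.
func handleAfter(e Expr) {
	if scan, ok := e.(*ScanExpr); ok {
		fmt.Println("scan over", scan.Table)
	}
}

func main() {
	e := &ScanExpr{Table: "bank"}
	handleBefore(e)
	handleAfter(e)
}
```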
(roachtest).restore2TB/nodes=10 failed on master@9fae2abb48a00c26ac0344f35cd100df7faa2aec:
Looks like another timeout, got a bit of logs this time.
Hanging adminsplit request stack:
Some of the logs:
Grepping the stuck goroutine above:
And all slow range logs:
@pbardea If these are still hanging
I think that makes sense -- I'm not sure if there's more investigation that we can do on the Bulk IO side. Happy to help if there's anything that Bulk IO can help with.
This reproduced for me in 1/20 runs.
Finally figured it out. Will have a fix tomorrow.
This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction did not work as intended if a copy of the token had been used to perform a lease update beforehand. Operations that update the cache take a token and return an updated token, and only the most up-to-date such token can be used to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up-to-date token and tried to evict using an older copy, which made the eviction a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas.
- sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch.
- No replica can serve the request because the descriptor is stale, and control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up-to-date entry than the one that sendPartialBatch is trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1) perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest.

This patch fixes the problem by changing the token's interface: now the token methods take the receiver by pointer and mutate it, and the DistSender methods share a single token instance between them.

Fixes cockroachdb#54118
Fixes cockroachdb#53197

Release note: None
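To illustrate the failure mode described above, here is a minimal, hypothetical Go sketch (not the real EvictionToken or RangeCache code; the types, fields, and generation counter are assumptions) of how handing a token around by value lets the caller's copy go stale, so its later eviction attempt is silently ignored, while sharing a single token via a pointer keeps caller and callee in sync:

```go
package main

import "fmt"

// cacheEntry is a hypothetical stand-in for a range cache entry.
type cacheEntry struct {
	generation int // bumped on every update (e.g. a lease update)
	lease      string
}

// evictionToken is a hypothetical stand-in for the DistSender's EvictionToken:
// it remembers which generation of the entry it last saw.
type evictionToken struct {
	entry *cacheEntry
	gen   int
}

// updateLease mutates the cache entry and, because the receiver is a pointer,
// also advances the token itself, so whoever shares this token stays in sync.
func (t *evictionToken) updateLease(lease string) {
	t.entry.lease = lease
	t.entry.generation++
	t.gen = t.entry.generation
}

// evict only succeeds if the token is as new as the cached entry; otherwise the
// cache assumes it already has fresher information and ignores the request.
func (t *evictionToken) evict() bool {
	return t.gen >= t.entry.generation
}

func main() {
	// Buggy pattern: hand a *copy* of the token to the callee. The callee's
	// lease update advances only its own copy, so the caller's token goes
	// stale and its eviction attempt becomes a no-op.
	tok := evictionToken{entry: &cacheEntry{}}
	callee := tok
	callee.updateLease("n1")
	fmt.Println("eviction with stale caller copy succeeds?", tok.evict()) // false

	// Fixed pattern: share a single token (pass it by pointer), so the caller
	// sees the update and can still evict.
	tok2 := evictionToken{entry: &cacheEntry{}}
	shared := &tok2
	shared.updateLease("n1")
	fmt.Println("eviction with shared token succeeds?", tok2.evict()) // true
}
```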
54468: kvclient: fix a request routing bug r=andreimatei a=andreimatei

This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction did not work as intended if a copy of the token had been used to perform a lease update beforehand. Operations that update the cache take a token and return an updated token, and only the most up-to-date such token can be used to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up-to-date token and tried to evict using an older copy, which made the eviction a no-op.

Specifically, the bad scenario was the following:
- DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas.
- sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch.
- No replica can serve the request because the descriptor is stale, and control flow gets back to sendPartialBatch.
- sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up-to-date entry than the one that sendPartialBatch is trying to evict (which is true).
- sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we added before, but that doesn't help very much.
- All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1) perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest.

This patch fixes the problem by changing how descriptor evictions work: instead of comparing a token's pointer to the current cache entry and refusing to do anything if they don't match, evicting a descriptor now compares the descriptor's value with what's in the cache. The cache code generally moves away from comparing pointers anywhere, and an EvictionToken stops storing pointers to the cache's insides, for clarity.

Fixes #54118
Fixes #53197

Release note: None

Co-authored-by: Andrei Matei <[email protected]>
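For the approach the merged PR describes, here is another hedged, hypothetical Go sketch (again not the real range cache code; the types and field names are invented) of evicting by comparing the descriptor's value against what is currently cached, rather than requiring pointer identity with the cached entry:

```go
package main

import "fmt"

// rangeDesc is a hypothetical, simplified range descriptor.
type rangeDesc struct {
	rangeID    int
	generation int
}

// rangeCache is a hypothetical cache keyed by range ID.
type rangeCache struct {
	entries map[int]rangeDesc
}

// evictByValue removes the cached descriptor only if it is still equal (by
// value) to the descriptor the caller believes is stale. A pointer comparison
// would wrongly refuse the eviction whenever the caller holds a copy, even
// though the contents are identical.
func (c *rangeCache) evictByValue(stale rangeDesc) bool {
	cur, ok := c.entries[stale.rangeID]
	if !ok || cur != stale {
		// The cache already has different (presumably fresher) information.
		return false
	}
	delete(c.entries, stale.rangeID)
	return true
}

func main() {
	c := &rangeCache{entries: map[int]rangeDesc{7: {rangeID: 7, generation: 3}}}

	// The caller holds a copy of the descriptor, not a pointer into the cache,
	// yet the eviction still goes through because the values match.
	myCopy := rangeDesc{rangeID: 7, generation: 3}
	fmt.Println("evicted:", c.evictByValue(myCopy)) // true

	// A genuinely stale copy (older generation) is ignored.
	c.entries[7] = rangeDesc{rangeID: 7, generation: 5}
	fmt.Println("evicted:", c.evictByValue(myCopy)) // false
}
```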
reopening until backported
now backported
(roachtest).restore2TB/nodes=10 failed on master@bbbedabbf6ea0b1ff6fc799a0c04a75295a9f4c2:
Artifacts: /restore2TB/nodes=10
Related:
- roachtest: restore2TB/nodes=10 failed #51199 (C-test-failure, O-roachtest, O-robot, branch-provisional_202007081918_v20.2.0-alpha.2, release-blocker)
- roachtest: restore2TB/nodes=10 failed #51127 (C-test-failure, O-roachtest, O-robot, branch-provisional_202007071743_v20.2.0-alpha.2, release-blocker)
- roachtest: restore2TB/nodes=10 failed #49503 (C-test-failure, O-roachtest, O-robot, branch-release-19.1, release-blocker)
- roachtest: restore2TB/nodes=10 failed #49480 (C-test-failure, O-roachtest, O-robot, branch-release-19.2, release-blocker)
- roachtest: restore2TB/nodes=10 failed #49440 (C-test-failure, O-roachtest, O-robot, branch-release-20.1, release-blocker)
See this test on roachdash
powered by pkg/cmd/internal/issues