-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
roachtest: restore2TB/nodes=10 failed #54118
Comments
timed out; must be the same as #53197 |
(roachtest).restore2TB/nodes=10 failed on release-20.2@a504fc7f26be39586c14c3fa22ad728ed8fa9b6d:
More
Artifacts: /restore2TB/nodes=10
See this test on roachdash |
I'm going crazy with this one. There's something misbehaving in the range cache, which I've tracked down to the
not being equal (on the machines where it happens) to what the caller gets here
I've got no idea what's going on... Still looking. Relatedly, it has come to my attention that they way in which we're using the |
This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction does not work as intended if a copy of the token had been used to perform a lease update prior. Operations that update the cache take a token and return an updated token, and only the most up to date such token can be used in order to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up to date token, and tried to evict using an older copy - which eviction was a no-op. Specifically, the bad scenario was the following: - DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas. - sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch. - No replica can serve the request cause the descriptor is stale, and control flow gets back to sendPartialBatch. - sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up to date entry than the one that sendPartialBatch is trying to evict (which is true). - sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we have added before, but that doesn't help very much. - All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1), perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest. This patch fixes the problem by changing the token's interface - now different token methods take the receiver by pointer and mutate it, and the DistSender methods share a single the token instance between them. Fixes cockroachdb#54118 Fixes cockroachdb#53197 Release note: None
This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction does not work as intended if a copy of the token had been used to perform a lease update prior. Operations that update the cache take a token and return an updated token, and only the most up to date such token can be used in order to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up to date token, and tried to evict using an older copy - which eviction was a no-op. Specifically, the bad scenario was the following: - DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas. - sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch. - No replica can serve the request cause the descriptor is stale, and control flow gets back to sendPartialBatch. - sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up to date entry than the one that sendPartialBatch is trying to evict (which is true). - sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we have added before, but that doesn't help very much. - All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1), perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest. This patch fixes the problem by changing how descriptor evictions work: instead of comparing a token's pointer to the current cache entry and refusing to do anything if they don't match, now evicting a descriptor compares the descriptor's value with what's in the cache. The cache code generally moves away from comparing pointers anywhere, and an EvictionToken stops storing pointers to the cache's insides for clarity. Fixes cockroachdb#54118 Fixes cockroachdb#53197 Release note: None
This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction does not work as intended if a copy of the token had been used to perform a lease update prior. Operations that update the cache take a token and return an updated token, and only the most up to date such token can be used in order to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up to date token, and tried to evict using an older copy - which eviction was a no-op. Specifically, the bad scenario was the following: - DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas. - sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch. - No replica can serve the request cause the descriptor is stale, and control flow gets back to sendPartialBatch. - sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up to date entry than the one that sendPartialBatch is trying to evict (which is true). - sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we have added before, but that doesn't help very much. - All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1), perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest. This patch fixes the problem by changing how descriptor evictions work: instead of comparing a token's pointer to the current cache entry and refusing to do anything if they don't match, now evicting a descriptor compares the descriptor's value with what's in the cache. The cache code generally moves away from comparing pointers anywhere, and an EvictionToken stops storing pointers to the cache's insides for clarity. Fixes cockroachdb#54118 Fixes cockroachdb#53197 Release note: None
This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction does not work as intended if a copy of the token had been used to perform a lease update prior. Operations that update the cache take a token and return an updated token, and only the most up to date such token can be used in order to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up to date token, and tried to evict using an older copy - which eviction was a no-op. Specifically, the bad scenario was the following: - DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas. - sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch. - No replica can serve the request cause the descriptor is stale, and control flow gets back to sendPartialBatch. - sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up to date entry than the one that sendPartialBatch is trying to evict (which is true). - sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we have added before, but that doesn't help very much. - All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1), perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest. This patch fixes the problem by changing how descriptor evictions work: instead of comparing a token's pointer to the current cache entry and refusing to do anything if they don't match, now evicting a descriptor compares the descriptor's value with what's in the cache. The cache code generally moves away from comparing pointers anywhere, and an EvictionToken stops storing pointers to the cache's insides for clarity. Fixes cockroachdb#54118 Fixes cockroachdb#53197 Release note: None
54468: kvclient: fix a request routing bug r=andreimatei a=andreimatei This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction does not work as intended if a copy of the token had been used to perform a lease update prior. Operations that update the cache take a token and return an updated token, and only the most up to date such token can be used in order to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up to date token, and tried to evict using an older copy - which eviction was a no-op. Specifically, the bad scenario was the following: - DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas. - sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch. - No replica can serve the request cause the descriptor is stale, and control flow gets back to sendPartialBatch. - sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up to date entry than the one that sendPartialBatch is trying to evict (which is true). - sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we have added before, but that doesn't help very much. - All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1), perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest. This patch fixes the problem by changing how descriptor evictions work: instead of comparing a token's pointer to the current cache entry and refusing to do anything if they don't match, now evicting a descriptor compares the descriptor's value with what's in the cache. The cache code generally moves away from comparing pointers anywhere, and an EvictionToken stops storing pointers to the cache's insides for clarity. Fixes #54118 Fixes #53197 Release note: None Co-authored-by: Andrei Matei <[email protected]>
reopening until backported |
now backported |
This patch fixes a bug in the DistSender's interaction with the range cache. The DistSender interacts with the cache through an EvictionToken; this token is used to update the lease information for a cache entry and to evict the entry completely if the descriptor is found to be too stale to be useful. The problem was that the eviction does not work as intended if a copy of the token had been used to perform a lease update prior. Operations that update the cache take a token and return an updated token, and only the most up to date such token can be used in order to perform subsequent cache updates (in particular, to evict the cache entry). The bug was that the DistSender sometimes lost track of this most up to date token, and tried to evict using an older copy - which eviction was a no-op. Specifically, the bad scenario was the following: - DistSender.sendPartialBatch made a copy of the token and handed it to sendToReplicas. - sendToReplicas updated the cached lease through this token (through tok.UpdateLease). The updated token was further used by sendToReplicas but not returned to sendPartialBatch. - No replica can serve the request cause the descriptor is stale, and control flow gets back to sendPartialBatch. - sendPartialBatch uses its own, stale, copy of the token to attempt a cache eviction. The cache ignores the eviction request, claiming that it has a more up to date entry than the one that sendPartialBatch is trying to evict (which is true). - sendPartialBatch then asks the cache for a new descriptor, and the cache returns the same one as before. It also returns the lease that we have added before, but that doesn't help very much. - All this can happen over and over again in some case that I don't fully understand. In the restore2TB test, what happens is that the node that's supposed to be the leaseholder (say, n1), perpetually doesn't know about the range, and another node perpetually thinks that n1 is the leaseholder. I'm not sure why the latter happens, but this range is probably in an unusual state because the request that's spinning in this way is an ImportRequest. This patch fixes the problem by changing how descriptor evictions work: instead of comparing a token's pointer to the current cache entry and refusing to do anything if they don't match, now evicting a descriptor compares the descriptor's value with what's in the cache. The cache code generally moves away from comparing pointers anywhere, and an EvictionToken stops storing pointers to the cache's insides for clarity. Fixes #54118 Fixes #53197 Release note: None
(roachtest).restore2TB/nodes=10 failed on release-20.2@d9ff57ed3d5891ab93b7756be657ba868ecc87a6:
More
Artifacts: /restore2TB/nodes=10
Related:
See this test on roachdash
powered by pkg/cmd/internal/issues
The text was updated successfully, but these errors were encountered: