release-20.2: kvclient: fix a request routing bug #54533
Merged
Backport 1/1 commits from #54468.
/cc @cockroachdb/release
This patch fixes a bug in the DistSender's interaction with the range
cache. The DistSender interacts with the cache through an EvictionToken;
this token is used to update the lease information for a cache entry and
to evict the entry completely if the descriptor is found to be too stale
to be useful. The problem was that the eviction did not work as intended
if a copy of the token had previously been used to perform a lease update.
Operations that update the cache take a token and return an updated
token, and only the most up to date such token can be used in order to
perform subsequent cache updates (in particular, to evict the cache
entry). The bug was that the DistSender sometimes lost track of this
most up-to-date token and tried to evict using an older copy, which
made the eviction a no-op.
Specifically, the bad scenario was the following:
1. sendPartialBatch made a copy of the token and handed it to
   sendToReplicas.
2. sendToReplicas updated the cached lease through its copy (via
   tok.UpdateLease). The updated token was further used by sendToReplicas
   but not returned to sendPartialBatch.
3. The request could not be served using the stale descriptor, and
   control flow gets back to sendPartialBatch.
4. sendPartialBatch uses its own, stale copy of the token to attempt a
   cache eviction. The cache ignores the eviction request, claiming that
   it has a more up-to-date entry than the one that sendPartialBatch is
   trying to evict (which is true).
5. sendPartialBatch then asks the cache for a new descriptor, and the
   cache returns the same one as before. It also returns the lease that
   we have added before, but that doesn't help very much.
6. All of this can repeat over and over, in circumstances that I don't
   fully understand. In the restore2TB test, what happens is that the
   node that's supposed to be the leaseholder (say, n1), perpetually
   doesn't know about the range, and another node perpetually thinks that
   n1 is the leaseholder. I'm not sure why the latter happens, but this
   range is probably in an unusual state because the request that's
   spinning in this way is an ImportRequest.
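
To make the failure mode concrete, here is a minimal, self-contained sketch of a pointer-compared eviction going wrong. It is not the actual range cache code: the `cache`, `entry`, `descriptor` and `token` types and the `updateLease` / `evictByPointer` methods are hypothetical simplifications, and the cache holds a single entry. Only the names sendPartialBatch and sendToReplicas mirror the description above.

```go
package main

import "fmt"

// descriptor is a hypothetical stand-in for a range descriptor.
type descriptor struct {
	rangeID    int
	generation int
}

// entry is a hypothetical cache entry: a descriptor plus a leaseholder hint.
type entry struct {
	desc        descriptor
	leaseholder string
}

// cache holds a single entry to keep the sketch short.
type cache struct {
	cur *entry
}

// token remembers a pointer to the cache entry it was created from.
type token struct {
	c *cache
	e *entry
}

// lookup returns a token pointing at the current entry.
func (c *cache) lookup() token { return token{c: c, e: c.cur} }

// updateLease swaps in a new entry and returns a token for it.
// Only the returned token points at the entry now in the cache.
func (t token) updateLease(leaseholder string) token {
	if t.c.cur != t.e {
		return t // stale token: nothing to update
	}
	updated := &entry{desc: t.e.desc, leaseholder: leaseholder}
	t.c.cur = updated
	return token{c: t.c, e: updated}
}

// evictByPointer evicts only if the token still points at the cached entry.
func (t token) evictByPointer() bool {
	if t.c.cur != t.e {
		return false // the cache claims to hold something newer
	}
	t.c.cur = nil
	return true
}

// sendToReplicas receives a copy of the token, updates the lease through it,
// and never hands the updated token back to its caller.
func sendToReplicas(tok token) {
	tok = tok.updateLease("n1")
	_ = tok
}

func main() {
	c := &cache{cur: &entry{desc: descriptor{rangeID: 1, generation: 1}}}

	// Plays the role of sendPartialBatch.
	tok := c.lookup()
	sendToReplicas(tok)

	// The caller's copy is now stale, so the eviction is silently ignored.
	evicted := tok.evictByPointer()
	fmt.Println(evicted, c.cur != nil) // prints "false true"
}
```

Because the stale descriptor stays in the cache, the next lookup returns it again, which is what lets the request spin.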
This patch fixes the problem by changing how descriptor evictions work:
instead of comparing a token's pointer to the current cache entry and
refusing to do anything if they don't match, now evicting a descriptor
compares the descriptor's value with what's in the cache.
The cache code generally moves away from comparing pointers anywhere,
and an EvictionToken stops storing pointers to the cache's insides for
clarity.
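
The value-based approach can be sketched as follows. Again, this is an illustration under assumptions rather than the patch itself: the types and the `updateLease` / `evict` methods are hypothetical, the cache holds one entry, and descriptors are compared with Go's struct equality. The point is only that a token carrying a copy of the descriptor can still evict after another copy of the token has updated the lease.

```go
package main

import "fmt"

// descriptor is a hypothetical stand-in for a range descriptor.
type descriptor struct {
	rangeID    int
	generation int
}

// entry is a hypothetical cache entry: a descriptor plus a leaseholder hint.
type entry struct {
	desc        descriptor
	leaseholder string
}

// cache holds a single entry to keep the sketch short.
type cache struct {
	cur *entry
}

// token stores a copy of the descriptor, not a pointer into the cache.
type token struct {
	c    *cache
	desc descriptor
}

// lookup returns a token for the current entry.
// This sketch assumes the cache is populated when lookup is called.
func (c *cache) lookup() token { return token{c: c, desc: c.cur.desc} }

// updateLease replaces the entry if the cached descriptor still matches.
func (t token) updateLease(leaseholder string) token {
	if t.c.cur == nil || t.c.cur.desc != t.desc {
		return t
	}
	t.c.cur = &entry{desc: t.desc, leaseholder: leaseholder}
	return t
}

// evict removes the entry if the cached descriptor equals the token's
// descriptor by value, regardless of which entry object holds it.
func (t token) evict() bool {
	if t.c.cur == nil || t.c.cur.desc != t.desc {
		return false
	}
	t.c.cur = nil
	return true
}

func main() {
	c := &cache{cur: &entry{desc: descriptor{rangeID: 1, generation: 1}}}

	tok := c.lookup()
	copyOfTok := tok
	copyOfTok.updateLease("n1") // the entry object in the cache changes

	// The original token still evicts: the descriptor value is unchanged.
	fmt.Println(tok.evict(), c.cur == nil) // prints "true true"

	// A token for an older descriptor does not evict a newer one.
	c.cur = &entry{desc: descriptor{rangeID: 1, generation: 2}}
	fmt.Println(tok.evict(), c.cur != nil) // prints "false true"
}
```

Comparing by value keeps the useful half of the old check (a token for an outdated descriptor cannot evict a newer one) while dropping the dependence on which token copy performed the last update.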
Fixes #54118
Fixes #53197
Release note: None