-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: cancel consistency checks more reliably #86883
kvserver: cancel consistency checks more reliably #86883
Conversation
This forked off of #86591. Ready for review, PTAL. |
Review commit by commit might be best. |
d9a6a25
to
1a2dd2c
Compare
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)
-- commits
line 34 at r4:
doesn't really reject useful ones, but makes them really slow since they have to share a lot of resources.
-- commits
line 49 at r4:
This isn't exactly true, right? When the remote call never arrives, the checksum computation will run to completion. So this commit without the next one isn't going to make things better. That's fine as we discussed (once the computation is started in parallel, the case in which a call never arrives before the caller times out is hopefully rare), but let's be precise. (We didn't want the receiver to also cancel itself after 5s because then we get the mixed-version concern when the caller still processes itself synchronously).
-- commits
line 76 at r5:
Good explanation.
pkg/kv/kvserver/replica_consistency.go
line 453 at r4 (raw file):
// getReplicaChecksum creates a replicaChecksum tracker for the given ID, or // gets the existing one. The start flag indicates whether the computation is // starting, so the record is updated accordingly.
Mention the contract to call at most once with start=true
for a given ID.
pkg/kv/kvserver/replica_consistency.go
line 472 at r4 (raw file):
c, ok := r.mu.checksums[id] if !ok { taskCtx, taskCancel := context.WithCancel(context.Background())
r.store.Stopper().WithCancelOnQuiesce(r.AnnotateCtx(context.Background)))
🥪
pkg/kv/kvserver/replica_consistency.go
line 481 at r4 (raw file):
} else if start { if c.started { log.Fatalf(ctx, "attempted to start checksum computation twice for ID %s", id)
Not your code, I know, but could we return an error from this method instead? We ought to avoid unnecessary fatals as a matter of principle; if there were ever an ill-timed replay or something the like, we shouldn't have it have a blast radius larger than necessary. There is no need to shut down the node here. (We can't just ignore this problem since otherwise we probably double-close c.notify
)
pkg/kv/kvserver/replica_consistency.go
line 489 at r4 (raw file):
} func (r *Replica) releaseReplicaChecksum(ctx context.Context, id uuid.UUID, gcAt time.Time) {
I stumbled over this method name a bit. release
somehow suggests that we "hold a reference" which we don't. This simply updates the gc timestamp. updateReplicaChecksumExpiration
?
pkg/kv/kvserver/replica_consistency.go
line 493 at r4 (raw file):
defer r.mu.Unlock() if c, ok := r.mu.checksums[id]; ok { c.gcTimestamp = gcAt
Should we pick the smaller of the existing timestamps? At least add a comment about what constellations are expected here.
pkg/kv/kvserver/replica_consistency.go
line 494 at r4 (raw file):
if c, ok := r.mu.checksums[id]; ok { c.gcTimestamp = gcAt r.mu.checksums[id] = c
This write-back would be unnecessary if the map were map[uuid.UUID]*replicaChecksum
. I'd be in favor of that change; having to write back is an invitation for bugs.
Also, having the accessors sit on *Replica
is a bit annoying; we just didn't know better when we first wrote this stuff.
I'll defer to your judgment on whether this is worth your while right now, but we could add a commit that replaces the (Replica).mu.checksums
field with
type consistencyCheckMap map[uuid.UUID]*replicaChecksum
func (ccm consistencyCheckMap) setExpiration(id uuid.UUID, gcAt time.Time) { /* ... */ }
func (ccm consistencyCheckMap) getOrCreate(/* ... */)
then you can also have true unit tests for the logic, etc, and it doesn't clutter *Replica
by being better encapsulated with in.
This is a reasonable clean-up in my eyes (I've done quite a few like this over the last release) but if you're not so sure I won't press it.
pkg/kv/kvserver/replica_consistency.go
line 505 at r4 (raw file):
func (r *Replica) getChecksum(ctx context.Context, id uuid.UUID) (CollectChecksumResponse, error) { now := timeutil.Now() c := r.getReplicaChecksum(ctx, id, now, false)
false /* start */
is somewhere in our guidelines though of course Goland already shows the var name inline. (So it's mostly for reviews I guess).
pkg/kv/kvserver/replica_consistency.go
line 514 at r4 (raw file):
// All the code paths below must c.cancel(), except when we received the // computation result (successful or not). In the latter case we allow the // task to stop gracefully, so we can't simply defer c.cancel() here.
A robust way to do this would be
maybeCancel := c.cancel
defer func() {
maybeCancel() // wrapped func allows overriding maybeCancel below
}()
if err := foo(); err != nil {
return err // will cancel
}
if err := bar(); err != nil {
return err // will cancel
}
maybeCancel = func() // got a result!
return nil
pkg/kv/kvserver/replica_consistency.go
line 523 at r4 (raw file):
} // If the checksum has not completed, commit to waiting the full deadline. if !computed {
It's a bit odd that there is a computed
path here. If computed
comes back as true
, shouldn't we get the same result if we enter the checksumWait
branch anyway, just without blocking? I think this would simplify matters by avoiding some conditionals here. There is no medal for avoiding an extra non-blocking select. (I know, not your code originally)
pkg/kv/kvserver/replica_consistency.go
line 801 at r4 (raw file):
stopper := r.store.Stopper() c := r.getReplicaChecksum(ctx, cc.ChecksumID, timeutil.Now(), true)
/* start */
pkg/kv/kvserver/replica_consistency.go
line 802 at r4 (raw file):
c := r.getReplicaChecksum(ctx, cc.ChecksumID, timeutil.Now(), true) // All the code paths below must call computeChecksumDone and c.cancel(). We
Ditto for the pattern
pkg/kv/kvserver/replica_consistency.go
line 924 at r4 (raw file):
}); err != nil { defer snap.Close()
This defer shouldn't be there.
pkg/kv/kvserver/replica_consistency.go
line 927 at r4 (raw file):
c.cancel() r.computeChecksumDone(ctx, cc.ChecksumID, nil, nil) log.Errorf(ctx, "could not run async checksum computation (ID = %s): %v", cc.ChecksumID, err)
I know not your code, but I doubt this is worth logging. This is just the error branch of RunAsyncTask
, and an error is returned only if the stopper is quiescing, i.e. the surrounding node is shutting down. So this is just going to produce spurious logging during some unit tests and not much else.
pkg/kv/kvserver/replica_consistency.go
line 56 at r5 (raw file):
// check could take a long time due to rate limiting and range size. Now, all // requests are sent immediately in parallel, so we can remove the GC altogether // and synchronize the request with the task.
You could mention the release timing here. 22.1 waits for the first check to complete, so when interacting with 22.1 we want to be lenient with the timeouts.
If the parallelism commit makes it into 22.2.0 (note the 0) then the code that will be 23.1 (master today) can be aggressive. If parallelism is only in 22.2.N (N>0) then technically we can't make that assumption, though I would be ready to throw a few consistency checks against 22.2.0 under the bus since upgrades from that version into 23.1 are rare anyway and the worst case is bounded.
pkg/kv/kvserver/replica_consistency.go
line 352 at r5 (raw file):
RangeID: r.RangeID, ChecksumID: id, Checksum: nil, // TODO(pavelkalinnikov): Deprecate this field.
You could throw out this field in an extra commit, vs writing the TODO. I see below that the receiver of this proto already checks whether the field is set. So you can just remove the field from the proto and be done with it.
pkg/kv/kvserver/replica_consistency.go
line 381 at r5 (raw file):
} replicas := r.Desc().Replicas().Descriptors()
nit: try to get localReplica
from replicas
, simply because there isn't any synchronization between this line and the earlier call r.GetReplicaDescriptor
; the replica descriptor might've changed. The only consequential change could be that "self" was removed in which case the replica will now be destroyed, so this change is mostly cosmetic, but it is better to tighten it anyway.
pkg/kv/kvserver/replica_consistency.go
line 406 at r5 (raw file):
}, ); err != nil { // If we can't start tasks, the node is likely draining. Return the error
I think we'd want to cancel everyone in this case, since you don't want the node shutdown to block on some remote RPC. Consider adding WithCancelOnQuiesce
to the context used in this method if it isn't there already.
pkg/kv/kvserver/replica_consistency_test.go
line 108 at r3 (raw file):
tc.repl.mu.Unlock() rc, err = tc.repl.getChecksum(ctx, id) require.ErrorContains(t, err, "context deadline exceeded")
nit: here we should ideally check for require.True(t, errors.Is(ctx, context.DeadlineExceeded))
pkg/kv/kvserver/stores_server.go
line 75 at r5 (raw file):
// // TODO(pavelkalinnikov): The Checksum field is no longer populated. // Drop this check, or move it to where the ComputeChecksum.Terminate
Let's drop it, not worth it.
Also can't you do this now? The code below shows that 22.1 nodes can already handle the case where it's not set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice work! I basically agree with Tobi's comments here.
Reviewed 3 of 3 files at r1, 2 of 2 files at r2, 1 of 1 files at r3, 2 of 2 files at r4, 3 of 3 files at r5, all commit messages.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @pavelkalinnikov and @tbg)
Previously, tbg (Tobias Grieger) wrote…
doesn't really reject useful ones, but makes them really slow since they have to share a lot of resources.
Nah, we do reject them beyond 7 concurrent checks (consistencyCheckAsyncConcurrency
). I'm hoping we can remove that limit once we land this work, but let's do that later.
Previously, tbg (Tobias Grieger) wrote…
Good explanation.
+1, nice writeup.
pkg/kv/kvserver/replica_consistency.go
line 475 at r4 (raw file):
c = replicaChecksum{ ctx: taskCtx, cancel: taskCancel,
I'm slightly uneasy about storing a context here, since it's considered a bit of an anti-pattern. Then again, we do this in a few other places already, and it gets the job done. We could perhaps consider either using a context given by computeChecksumPostApply
, hiding the inner context behind a more general API, or use some other primitive -- but I won't press for it, this is fine too.
pkg/kv/kvserver/replica_consistency.go
line 481 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Not your code, I know, but could we return an error from this method instead? We ought to avoid unnecessary fatals as a matter of principle; if there were ever an ill-timed replay or something the like, we shouldn't have it have a blast radius larger than necessary. There is no need to shut down the node here. (We can't just ignore this problem since otherwise we probably double-close
c.notify
)
+1, would really prefer an errors.AssertionFailedf
error return path here.
pkg/kv/kvserver/replica_consistency.go
line 388 at r5 (raw file):
ctx, cancel := context.WithCancel(ctx) defer close(resultCh) // close the channel only after
nit: we don't really need to close the channel here in this case, since there is no reader anymore (the for
-loop can't still be running)
pkg/kv/kvserver/replica_consistency.go
line 406 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I think we'd want to cancel everyone in this case, since you don't want the node shutdown to block on some remote RPC. Consider adding
WithCancelOnQuiesce
to the context used in this method if it isn't there already.
Won't the defer cancel()
do just that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker, @pavelkalinnikov, and @tbg)
pkg/kv/kvserver/replica_consistency.go
line 406 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
Won't the
defer cancel()
do just that?
Yep, the idea was that all exit paths will have cancel()+wg.Wait()
, so we are covered for the case when err != nil
here. An additional WithCancelOnQuiesce
would be nice though too, for the case when all tasks have started, an quiescing happened mid-RPC.
570daa4
to
07024c9
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)
Previously, erikgrinaker (Erik Grinaker) wrote…
Nah, we do reject them beyond 7 concurrent checks (
consistencyCheckAsyncConcurrency
). I'm hoping we can remove that limit once we land this work, but let's do that later.
Yeah, there is a WaitSem: false
, which drops tasks on the floor if the semaphore is full. We may be able to remove it when leaks are minimal, e.g. when we make the checksum computation require an active incoming collection request (in a follow-up PR).
Erik and I discussed previously that it would be nice to add some metrics here. This could further help supporting decisions on whether to block or drop requests, etc. Something like: the number of checksum computations initiated / started / dropped / returned.
Previously, tbg (Tobias Grieger) wrote…
This isn't exactly true, right? When the remote call never arrives, the checksum computation will run to completion. So this commit without the next one isn't going to make things better. That's fine as we discussed (once the computation is started in parallel, the case in which a call never arrives before the caller times out is hopefully rare), but let's be precise. (We didn't want the receiver to also cancel itself after 5s because then we get the mixed-version concern when the caller still processes itself synchronously).
This is perhaps the wording issue, reworded.
What I meant is that the cancelation of RPCs that have arrived is now respected with this commit, which is strictly better than the status quo, regardless of the next commit (that's why I ordered this one first). The next commit increases the chance of RPCs making it through in the first place, and improves timing.
pkg/kv/kvserver/replica_consistency.go
line 453 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Mention the contract to call at most once with
start=true
for a given ID.
Obsolete now.
pkg/kv/kvserver/replica_consistency.go
line 472 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
r.store.Stopper().WithCancelOnQuiesce(r.AnnotateCtx(context.Background)))
🥪
Done.
pkg/kv/kvserver/replica_consistency.go
line 475 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
I'm slightly uneasy about storing a context here, since it's considered a bit of an anti-pattern. Then again, we do this in a few other places already, and it gets the job done. We could perhaps consider either using a context given by
computeChecksumPostApply
, hiding the inner context behind a more general API, or use some other primitive -- but I won't press for it, this is fine too.
I made a refactoring after which storing the context is unnecessary.
pkg/kv/kvserver/replica_consistency.go
line 481 at r4 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
+1, would really prefer an
errors.AssertionFailedf
error return path here.
Obsolete.
pkg/kv/kvserver/replica_consistency.go
line 489 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I stumbled over this method name a bit.
release
somehow suggests that we "hold a reference" which we don't. This simply updates the gc timestamp.updateReplicaChecksumExpiration
?
Went with a slightly shorter setReplicaChecksumGC
.
pkg/kv/kvserver/replica_consistency.go
line 493 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Should we pick the smaller of the existing timestamps? At least add a comment about what constellations are expected here.
I can see it both ways. I think it's okay if the last one wins, we don't have to be super precise.
See the refactoring, we now always update to now.Add(gcInterval), so it's always a "touch" where the latest wins.
pkg/kv/kvserver/replica_consistency.go
line 494 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This write-back would be unnecessary if the map were
map[uuid.UUID]*replicaChecksum
. I'd be in favor of that change; having to write back is an invitation for bugs.Also, having the accessors sit on
*Replica
is a bit annoying; we just didn't know better when we first wrote this stuff.I'll defer to your judgment on whether this is worth your while right now, but we could add a commit that replaces the
(Replica).mu.checksums
field withtype consistencyCheckMap map[uuid.UUID]*replicaChecksum func (ccm consistencyCheckMap) setExpiration(id uuid.UUID, gcAt time.Time) { /* ... */ } func (ccm consistencyCheckMap) getOrCreate(/* ... */)then you can also have true unit tests for the logic, etc, and it doesn't clutter
*Replica
by being better encapsulated with in.This is a reasonable clean-up in my eyes (I've done quite a few like this over the last release) but if you're not so sure I won't press it.
SGTM, I added another prereq commit to this PR with a refactoring a similar kind.
I did this in #86591 first, and wanted it to be a follow-up PR. I tried to push back the refactoring, so that we could land simple but important fixes first. It was hard to reason about though, so I decided to refactor first in this PR, and then build on top of it.
pkg/kv/kvserver/replica_consistency.go
line 505 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
false /* start */
is somewhere in our guidelines though of course Goland already shows the var name inline. (So it's mostly for reviews I guess).
Not relevant now.
pkg/kv/kvserver/replica_consistency.go
line 514 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
A robust way to do this would be
maybeCancel := c.cancel defer func() { maybeCancel() // wrapped func allows overriding maybeCancel below }() if err := foo(); err != nil { return err // will cancel } if err := bar(); err != nil { return err // will cancel } maybeCancel = func() // got a result! return nil
Nice suggestion. Done first, but then realised it can be simpler.
pkg/kv/kvserver/replica_consistency.go
line 523 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
It's a bit odd that there is a
computed
path here. Ifcomputed
comes back astrue
, shouldn't we get the same result if we enter thechecksumWait
branch anyway, just without blocking? I think this would simplify matters by avoiding some conditionals here. There is no medal for avoiding an extra non-blocking select. (I know, not your code originally)
Agreed. Originally I addressed this in a follow-up PR, but now I pushed it to this PR.
pkg/kv/kvserver/replica_consistency.go
line 801 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
/* start */
Not relevant now.
pkg/kv/kvserver/replica_consistency.go
line 802 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Ditto for the pattern
Done.
pkg/kv/kvserver/replica_consistency.go
line 924 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This defer shouldn't be there.
Do you suggest removing defer
(seems fine), or removing the whole defer snap.Close()
(not clear to me why we can safely do it)?
We grabbed the snapshot in this func above, so attempting to close it in the async task, or, if the task is not started, here. If we don't close it here, wouldn't it leak?
Since this statement pre-existed, and is not a clear cut, I would refrain from removing it in this PR, but we can remove it in another one if needed?
pkg/kv/kvserver/replica_consistency.go
line 927 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I know not your code, but I doubt this is worth logging. This is just the error branch of
RunAsyncTask
, and an error is returned only if the stopper is quiescing, i.e. the surrounding node is shutting down. So this is just going to produce spurious logging during some unit tests and not much else.
Removed.
pkg/kv/kvserver/replica_consistency.go
line 56 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You could mention the release timing here. 22.1 waits for the first check to complete, so when interacting with 22.1 we want to be lenient with the timeouts.
If the parallelism commit makes it into 22.2.0 (note the 0) then the code that will be 23.1 (master today) can be aggressive. If parallelism is only in 22.2.N (N>0) then technically we can't make that assumption, though I would be ready to throw a few consistency checks against 22.2.0 under the bus since upgrades from that version into 23.1 are rare anyway and the worst case is bounded.
Done. Better?
pkg/kv/kvserver/replica_consistency.go
line 352 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
You could throw out this field in an extra commit, vs writing the TODO. I see below that the receiver of this proto already checks whether the field is set. So you can just remove the field from the proto and be done with it.
Done.
pkg/kv/kvserver/replica_consistency.go
line 381 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: try to get
localReplica
fromreplicas
, simply because there isn't any synchronization between this line and the earlier callr.GetReplicaDescriptor
; the replica descriptor might've changed. The only consequential change could be that "self" was removed in which case the replica will now be destroyed, so this change is mostly cosmetic, but it is better to tighten it anyway.
Done.
pkg/kv/kvserver/replica_consistency.go
line 388 at r5 (raw file):
Previously, erikgrinaker (Erik Grinaker) wrote…
nit: we don't really need to close the channel here in this case, since there is no reader anymore (the
for
-loop can't still be running)
Agree. This would remove the Haiku though :(
pkg/kv/kvserver/replica_consistency_test.go
line 108 at r3 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: here we should ideally check for
require.True(t, errors.Is(ctx, context.DeadlineExceeded))
Made it a require.ErrorIs
pkg/kv/kvserver/stores_server.go
line 75 at r5 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Let's drop it, not worth it.
Also can't you do this now? The code below shows that 22.1 nodes can already handle the case where it's not set.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the reviews! PTAL, this is ready for another round. I think all commits are in a good shape now, but the last one needs more consideration. I still don't have a good feel for when the long-poll must or must not return a Snapshot
.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 4 of 4 files at r6, 1 of 1 files at r7, 2 of 2 files at r8, 1 of 1 files at r9, 3 of 3 files at r10, 4 of 4 files at r11, 2 of 2 files at r12, 2 of 2 files at r13, 4 of 4 files at r14, all commit messages.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @pavelkalinnikov)
pkg/kv/kvserver/replica_consistency.go
line 924 at r4 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
Do you suggest removing
defer
(seems fine), or removing the wholedefer snap.Close()
(not clear to me why we can safely do it)?We grabbed the snapshot in this func above, so attempting to close it in the async task, or, if the task is not started, here. If we don't close it here, wouldn't it leak?
Since this statement pre-existed, and is not a clear cut, I would refrain from removing it in this PR, but we can remove it in another one if needed?
I meant that it's odd that we're deferring it, rather than just calling snap.Close()
.
pkg/kv/kvserver/replica_consistency.go
line 927 at r4 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
Removed.
Now you're returning the error but the caller calls log.Error
, doesn't it? The caller could avoid this logging if errors.Is(err, stop.ErrUnavailable)
.
pkg/kv/kvserver/replica_consistency.go
line 550 at r11 (raw file):
) { // TODO(pavelkalinnikov): Communicate through the replicaChecksum directly, // without using r.mu. E.g. send the CollectChecksumResponse through c.notify.
The typical pattern is
type foo struct {
notify <-chan struct{}
result something
}
and result
is written by whoever closes notify
, before they do. So consumers can
select {
case <-c.notify:
// continue
case <-ctx.Done
return ctx.Err()
}
fmt.Println("hey look here: ", c.result)
so you could probably reinterpret what is happening here under this pattern.
But since the result is only consumed once here we can also send it through the channel, that definitely avoids a class of data races.
Seeing that you're sending to the channel, looks good.
pkg/kv/kvserver/replica_consistency.go
line 461 at r12 (raw file):
// is the second request for this ID, then both got the hold of it, and we // no longer need to keep it in the map for GC. delete(r.mu.checksums, id)
This sure seems convenient but also brittle. Now a call getReplicaChecksum
has the opaque side effect of deleting the checksum from the map if it already exists. It's a very very subtle refcounting scheme. The alternative is actual refcounting, which with the channels you have there right now can't be implemented, and I don't think there's large benefit in adding more. But let's rename this method to createOrDetachReplicaChecksum
so that the semantics are more obvious.
pkg/kv/kvserver/replica_consistency_test.go
line 108 at r3 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
Made it a
require.ErrorIs
That's fine here, just a heads up that it doesn't always work, since require.ErrorIs
doesn't use cockroachdb/errors
and thus can't type match all errors. (I don't know the specifics but I know it is an issue).
The interesting bits of code are here. We have already run a consistency check, we are the consistency checker queue, and now we're trying to do a second round of consistency checks, that will
cockroach/pkg/kv/kvserver/replica_consistency.go Lines 287 to 302 in 07024c9
The last commit makes it such that all nodes return the contents of their replica. Before, the nodes would be given the SHA the leaseholder had already computed for itself, and if it matched the result of their own computation, they'd deduct that their diff wasn't needed, and elide the snapshot. Instead what we should be doing is to set the As a follow-up, when we re-work the diff mechanism to be offline, instead we will only tell the minority nodes to checkpoint and terminate. But, we will also need a checkpoint off one of the good nodes, so we will probably also ask the local node (I think that's guaranteed to be idx zero) to also take a checkpoint (but not terminate). That way, we end up with a checkpoint of the "good" SHA, and (at least one) checkpoint of the "bad" SHA, and can write a tool that computes the diff. |
Sounds doable. One issue with this approach is that the second check is almost completely independent from the first one, it will be run on a different snapshot. It's possible that it fails on a different set of replicas (or even succeeds), so the snapshots only from the original inconsistent replicas won't be enough. How about collecting snapshots from all replicas during the second round? |
I have yet to see that in practice but yes, it's possible. Getting snapshots on all replicas sounds good, this might also be helpful when investigating, since then there are a few more LSMs to poke at. |
Control request cancelation in the event of store stopper quiescing higher up the stack for convenience. Release justification: part of a performance improvement PR Release note: None
07024c9
to
bacd591
Compare
The checksum computation can take long, so if the store quiesces, it's better to cancel it. Release justification: part of a performance improvement PR Release note: None
Release justification: part of a performance improvement PR Release note: None
Release justification: part of a performance improvement PR Release note: None
Return error from Replica.computeChecksumPostApply for a better introspection in tests, and to avoid lengthy log messages in favor of compositing them. Release justification: part of a performance improvement PR Release note: None
This refactoring makes replica.checksums map store *replicaChecksum pointers instead of values. This way we can modify the entries directly without doing another map roundtrip, which was error-prone and required handling situations when an entry was in the map and then disappeared. We still need to lock Replica.mu for reading/writing the entries, but we can avoid this too by putting sync primitives in the entry itself (see the follow-up commit). Release justification: part of a performance improvement PR Release note: None
bacd591
to
4daec23
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done (I think).
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @erikgrinaker and @tbg)
pkg/kv/kvserver/replica_consistency.go
line 924 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
I meant that it's odd that we're deferring it, rather than just calling
snap.Close()
.
Ah, right. De-deferred it. One difference is that defer
-red stmt is guaranteed to run even if there is a panic before the func exits. I don't know if that was the intention, WDYT? But since it was the first statement in this clause, it kinda did not benefit from this property anyway.
pkg/kv/kvserver/replica_consistency.go
line 927 at r4 (raw file):
Previously, tbg (Tobias Grieger) wrote…
Now you're returning the error but the caller calls
log.Error
, doesn't it? The caller could avoid this logging iferrors.Is(err, stop.ErrUnavailable)
.
Aye. Done.
pkg/kv/kvserver/replica_consistency.go
line 550 at r11 (raw file):
Previously, tbg (Tobias Grieger) wrote…
The typical pattern is
type foo struct { notify <-chan struct{} result something }and
result
is written by whoever closesnotify
, before they do. So consumers canselect { case <-c.notify: // continue case <-ctx.Done return ctx.Err() } fmt.Println("hey look here: ", c.result)so you could probably reinterpret what is happening here under this pattern.
But since the result is only consumed once here we can also send it through the channel, that definitely avoids a class of data races.
Seeing that you're sending to the channel, looks good.
Ack.
pkg/kv/kvserver/replica_consistency.go
line 461 at r12 (raw file):
Previously, tbg (Tobias Grieger) wrote…
This sure seems convenient but also brittle. Now a call
getReplicaChecksum
has the opaque side effect of deleting the checksum from the map if it already exists. It's a very very subtle refcounting scheme. The alternative is actual refcounting, which with the channels you have there right now can't be implemented, and I don't think there's large benefit in adding more. But let's rename this method tocreateOrDetachReplicaChecksum
so that the semantics are more obvious.
Moved all the deletion into a single gcReplicaChecksum
method to avoid this side effect here, and explained the guarantees in detail in its comment.
ca84f00
to
f72d785
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@tbg ready for another review
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 8 of 8 files at r15, 1 of 1 files at r16, 2 of 2 files at r17, 1 of 1 files at r18, 3 of 3 files at r19, 4 of 4 files at r20, 2 of 2 files at r21, 2 of 2 files at r22, 4 of 4 files at r23, all commit messages.
Dismissed @erikgrinaker and @pavelkalinnikov from 2 discussions.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @erikgrinaker and @pavelkalinnikov)
pkg/kv/kvserver/stores_server.go
line 54 at r23 (raw file):
Previously, pavelkalinnikov (Pavel Kalinnikov) wrote…
Previously this was always non-nil. Does this have to be the case?
No, and it's more idiomatic to return nil
on error.
@erikgrinaker LMK if you would like to take a final look. I may add a couple of tests and merge this PR by EOW. |
I don't think Erik needs to take another look, esp since he's out sick. Merge away! |
PS test failure isn't yours, see here |
The replicaChecksum type helps bridging the checksum computation task and checksum collection request, for a certain computation ID. The code for the lifecycle of replicaChecksum is scattered across replica_consistency.go, and is difficult to understand. This commit simplifies the semantics of replicaChecksum by using Go channels, and stating the invariant on their state. It also minimizes the extensive Replica mutex locking (which is used by nearly every part of the system) by using channels for the communication between the task and the handler. It also resolves the TODO/bug in which some replicaChecksum entries could stay in the map forever if the request does not have a deadline. Release justification: part of a performance improvement PR Release note: None
Consistency checks are initiated by ComputeChecksum command in the Raft log, and run until completion under a background context. The result is collected by the initiator via the CollectChecksum long poll. The task is synchronized with the collection handler via the map of replicaChecksum structs. When the CollectChecksum handler exits due to a canceled context (for example, the request timed out, or the remote caller crashed), the background task continues to run. If it was not running, it may start in the future. In both cases, the consistency checks pool (which has a limited size and processing rate) spends resources on running dangling checks, and rejects useful ones. This commit makes sure that abandoned checksum computation tasks are: - stopped if the waiting collection request is canceled - never started if there was a recent collection request that gave up When starting, the checksum computation task first checks whether the corresponding collection request has previously been abandoned. If so, the task terminates early. Otherwise it starts and sends a cancel func through the channel that it used to notify the collection handler, so that it can abort the task when it abandons the request. Release justification: performance and stability improvement Release note (bug fix): A consistency check is now skipped/stopped when the collection request is canceled before/while running the check computation. Previously such checks would start and run until completion, and, due to the limited size of the worker pool, prevent the useful checks from running.
Currently, the replica initiating the consistency check sends a collection request to itself first, and only then to other replicas in parallel. This results in substantial asynchrony on the receiving replica, between the incoming CollectChecksum request and the checksum computation task started by the ComputeChecksum message. The current solution to that is keeping the checksum computation results in memory for replicaChecksumGCInterval to return them to late arriving requests. The reason why the first checksum collection blocks the others is that it computes the "master checksum", which is then added to all other requests. However, this field is only used by the receiving end to log an inconsistency error. The actual killing of this replica happens on the second phase of the protocol, after the initiating replica commits another Raft message with the Terminate field populated. So, there is no strong reason to keep this blocking behaviour. If the initiating replica fails to compute its local checksum, it does not send requests to other replicas. This is problematic because the checksum tasks will be run on all replicas, which opens the possibility for accumulating many such dangling checks. This commit makes all the checksum collection requests parallel. Benefits: - There is less asynchrony between the sender and receiver, so we can drop the GC (in follow-up commits), and require an incoming request before starting the checksum computation task. - All the outgoing collection requests are now explicitly canceled if the local computation fails. This way, the cancelation signal has more chance to propagate to all replicas and cancel the tasks that were started anyway. Release justification: performance and stability improvement Release note (bug fix): Consistency checks are now sent to all replicas in parallel, previously it would be blocked on processing the local replica first. This a) reduces the latency of one check 2x, and b) allows better propagation of the cancelation signal which results in fewer abandoned tasks on remote replicas, and more resources spent on useful checks.
f72d785
to
6d8ace3
Compare
Added a couple of checks in tests for extra niceness. No review needed. |
bors r=tbg |
Build succeeded: |
blathers backport 22.2 |
This PR increases chance of propagating cancelation signal to replicas to
prevent them from running abandoned consistency check tasks. Specifically:
able to cancel all the requests explicitly, instead of skipping some of them.
Background
Consistency checks are initiated by
ComputeChecksum
command in the Raft log,and run until completion under a background context. The result is collected by
the initiator via the
CollectChecksum
long poll. The task is synchronized withthe collection handler via the map of
replicaChecksum
structs.Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the request
handler and the computation task. The current solution to that is keeping the
checksum computation results in memory for
replicaChecksumGCInterval
to returnthem to late arriving requests. However, there is no symmetry here: if the
computation starts late instead, it doesn't learn about a previously failed request.
The reason why the initiator blocks on its local checksum first is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens on the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate
field populated. So, there is no strong reason to keep this blockingbehaviour.
When the
CollectChecksum
handler exits due to a canceled context (for example,the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.
If the initiating replica fails to compute its local checksum, it does not send
requests (or any indication to cancel) to other replicas. This is problematic
because the checksum tasks will be run on all replicas, which opens the
possibility for accumulating many such dangling checks.
Part of #77432
Release justification: performance and stability improvement
Release note(bug fix): A consistency check is now skipped/stopped when its
remote initiator gives up on it. Previously such checks would still be
attempted to run, and, due to the limited size of the worker pool, prevent the
useful checks from running. In addition, consistency check requests are now
sent in parallel, and cancelation signal propagates more reliably.