release-22.2: kvserver: cancel consistency checks more reliably #87841
Merged
Conversation
Currently, the replica initiating the consistency check sends a collection request to itself first, and only then to other replicas in parallel. This results in substantial asynchrony on the receiving replica, between the incoming CollectChecksum request and the checksum computation task started by the ComputeChecksum message. The current workaround is to keep the checksum computation results in memory for replicaChecksumGCInterval, so they can be returned to late-arriving requests.

The reason why the first checksum collection blocks the others is that it computes the "master checksum", which is then added to all other requests. However, this field is only used by the receiving end to log an inconsistency error. The actual killing of the inconsistent replica happens in the second phase of the protocol, after the initiating replica commits another Raft message with the Terminate field populated. So there is no strong reason to keep this blocking behaviour.

If the initiating replica fails to compute its local checksum, it does not send requests to other replicas at all. This is problematic because the checksum tasks will still be run on all replicas, which opens the possibility of accumulating many such dangling checks.

This commit makes all the checksum collection requests parallel. Benefits:

- There is less asynchrony between the sender and receiver, so we can drop the GC (in follow-up commits) and require an incoming request before starting the checksum computation task.
- All the outgoing collection requests are now explicitly canceled if the local computation fails. This way, the cancelation signal has a better chance to propagate to all replicas and cancel the tasks that were started anyway.

Release justification: performance and stability improvement

Release note (bug fix): Consistency checks are now sent to all replicas in parallel; previously the check was blocked on processing the local replica first. This (a) halves the latency of a single check, and (b) allows better propagation of the cancelation signal, which results in fewer abandoned tasks on remote replicas and more resources spent on useful checks.
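As an illustration of the fan-out described above, here is a minimal, self-contained Go sketch. The names (collectChecksum, the replica IDs) are hypothetical stand-ins, not CockroachDB's actual API; the point is only that all collection requests share one cancelable context, so a failure of the local computation cancels the outstanding remote requests instead of leaving them dangling.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
	"time"
)

// collectChecksum is a hypothetical stand-in for the per-replica
// CollectChecksum long poll. It returns a fake checksum after a delay,
// or gives up when the shared context is canceled. Replica 1 plays the
// role of the local replica and is made to fail.
func collectChecksum(ctx context.Context, replicaID int) (string, error) {
	select {
	case <-time.After(time.Duration(replicaID) * 10 * time.Millisecond):
		if replicaID == 1 {
			return "", errors.New("local checksum computation failed")
		}
		return fmt.Sprintf("checksum-from-r%d", replicaID), nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	replicas := []int{1, 2, 3}

	var wg sync.WaitGroup
	errs := make([]error, len(replicas))

	// Fan out to all replicas at once instead of blocking on the local one.
	for i, id := range replicas {
		wg.Add(1)
		go func(i, id int) {
			defer wg.Done()
			_, err := collectChecksum(ctx, id)
			errs[i] = err
			// If the local request fails, cancel the shared context so the
			// cancelation signal reaches the remote replicas too.
			if id == 1 && err != nil {
				cancel()
			}
		}(i, id)
	}
	wg.Wait()

	for i, id := range replicas {
		fmt.Printf("replica %d: err=%v\n", id, errs[i])
	}
}
```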
Control request cancelation in the event of store stopper quiescing higher up the stack, for convenience.

Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
The replicaChecksum type bridges the checksum computation task and the checksum collection request for a given computation ID. The code for the lifecycle of replicaChecksum is scattered across replica_consistency.go and is difficult to understand.

This commit simplifies the semantics of replicaChecksum by using Go channels and stating the invariants on their state. It also minimizes the extensive locking of the Replica mutex (which is used by nearly every part of the system) by using channels for the communication between the task and the handler. It also resolves the TODO/bug in which some replicaChecksum entries could stay in the map forever if the request does not have a deadline.

Release justification: part of a performance improvement PR
Release note: None
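The following is a rough, self-contained sketch of such a channel-based handshake between the computation task and the collection handler. The type and field names (replicaChecksumSketch, started, result) are invented for illustration and do not mirror the real implementation.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type checksumResult struct {
	checksum string
	err      error
}

// replicaChecksumSketch couples a computation task with its collector.
// started carries the task's cancel func once the task begins; result
// carries the final answer. Both channels are buffered so that neither
// side blocks if the other has already given up.
type replicaChecksumSketch struct {
	started chan context.CancelFunc
	result  chan checksumResult
}

func main() {
	c := replicaChecksumSketch{
		started: make(chan context.CancelFunc, 1),
		result:  make(chan checksumResult, 1),
	}

	taskCtx, taskCancel := context.WithCancel(context.Background())
	defer taskCancel()

	// The computation task announces that it has started (handing over its
	// cancel func), then computes unless it is canceled first.
	go func() {
		c.started <- taskCancel
		select {
		case <-time.After(10 * time.Millisecond): // pretend to compute
			c.result <- checksumResult{checksum: "abc123"}
		case <-taskCtx.Done():
			c.result <- checksumResult{err: taskCtx.Err()}
		}
	}()

	// The collection handler learns that the task is running (and how to
	// stop it), then waits for the result or for its own request to give up.
	handlerCtx, handlerDone := context.WithTimeout(context.Background(), time.Second)
	defer handlerDone()

	cancelTask := <-c.started
	select {
	case r := <-c.result:
		fmt.Println("checksum:", r.checksum, "err:", r.err)
	case <-handlerCtx.Done():
		cancelTask() // abandon the computation if the request is canceled
	}
}
```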
Consistency checks are initiated by the ComputeChecksum command in the Raft log, and run until completion under a background context. The result is collected by the initiator via the CollectChecksum long poll. The task is synchronized with the collection handler via the map of replicaChecksum structs.

When the CollectChecksum handler exits due to a canceled context (for example, the request timed out, or the remote caller crashed), the background task continues to run. If it was not yet running, it may still start in the future. In both cases, the consistency checks pool (which has a limited size and processing rate) spends resources on running dangling checks and rejects useful ones.

This commit makes sure that abandoned checksum computation tasks are:

- stopped if the waiting collection request is canceled
- never started if there was a recent collection request that gave up

When starting, the checksum computation task first checks whether the corresponding collection request has previously been abandoned. If so, the task terminates early. Otherwise it starts, and sends a cancel func through the channel that it uses to notify the collection handler, so that the handler can abort the task when it abandons the request.

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when the collection request is canceled before/while running the check computation. Previously such checks would start and run until completion and, due to the limited size of the worker pool, prevent useful checks from running.
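A minimal sketch of the "never start an abandoned check" idea, under the assumption of a hypothetical abandoned channel that the collector closes when it gives up:

```go
package main

import "fmt"

// checkState is a hypothetical per-check record; abandoned is closed by the
// collection handler when it gives up on the request.
type checkState struct {
	abandoned chan struct{}
}

func runChecksumTask(s *checkState) {
	select {
	case <-s.abandoned:
		// The collection request already gave up; skip the expensive work
		// instead of occupying a slot in the limited worker pool.
		fmt.Println("task skipped: collector abandoned the request")
		return
	default:
	}
	fmt.Println("task running")
}

func main() {
	s := &checkState{abandoned: make(chan struct{})}
	close(s.abandoned) // the collector timed out before the task started
	runChecksumTask(s)
}
```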
The checksum computation can take long, so if the store quiesces, it's better to cancel it.

Release justification: part of a performance improvement PR
Release note: None
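A tiny sketch of tying a long-running computation to a quiesce signal; the quiesce channel here is a stand-in for illustration, not the real Stopper API.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	quiesce := make(chan struct{})

	// Derive a context that is canceled when the store shuts down, so
	// in-flight checksum work does not outlive the store.
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		<-quiesce
		cancel()
	}()

	// Simulate the store quiescing shortly after startup.
	go func() {
		time.Sleep(20 * time.Millisecond)
		close(quiesce)
	}()

	select {
	case <-time.After(time.Second): // stand-in for a long checksum computation
		fmt.Println("checksum computed")
	case <-ctx.Done():
		fmt.Println("computation canceled:", ctx.Err())
	}
}
```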
Release justification: part of a performance improvement PR
Release note: None
Return an error from Replica.computeChecksumPostApply for better introspection in tests, and to avoid lengthy log messages in favor of composing them.

Release justification: part of a performance improvement PR
Release note: None
This refactoring makes the replica.checksums map store *replicaChecksum pointers instead of values. This way we can modify the entries directly, without doing another map roundtrip, which was error-prone and required handling situations where an entry was in the map and then disappeared. We still need to lock Replica.mu for reading/writing the entries, but we can avoid this too by putting sync primitives in the entry itself (see the follow-up commit).

Release justification: part of a performance improvement PR
Release note: None
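A small standalone illustration of why storing pointers in the map helps: with pointer values an entry can be mutated in place, whereas with value entries a modified copy must be written back into the map (the entry type below is invented for the example).

```go
package main

import "fmt"

type entry struct{ done bool }

func main() {
	// Value entries: mutating a retrieved entry only changes a copy; the
	// updated copy would have to be re-inserted into the map.
	byValue := map[int]entry{1: {}}
	e := byValue[1]
	e.done = true
	fmt.Println("value map:", byValue[1].done) // false

	// Pointer entries: the stored entry is modified directly in place.
	byPointer := map[int]*entry{1: {}}
	byPointer[1].done = true
	fmt.Println("pointer map:", byPointer[1].done) // true
}
```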
blathers-crl bot force-pushed the blathers/backport-release-22.2-86883 branch from bc625aa to c7e129f on September 12, 2022 15:51
Thanks for opening a backport. Please check the backport criteria before merging. If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within. Add a brief release justification to the body of your PR to justify this backport.
blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels on Sep 12, 2022
erikgrinaker approved these changes on Sep 12, 2022
Backport 9/9 commits from #86883 on behalf of @pavelkalinnikov.
/cc @cockroachdb/release
This PR increases the chance of propagating the cancelation signal to replicas, to prevent them from running abandoned consistency check tasks. Specifically, checksum collection requests are now sent to all replicas in parallel, so the initiator is able to cancel all the requests explicitly, instead of skipping some of them.
Background

Consistency checks are initiated by a ComputeChecksum command in the Raft log, and run until completion under a background context. The result is collected by the initiator via the CollectChecksum long poll. The task is synchronized with the collection handler via the map of replicaChecksum structs.

Currently, the replica initiating the consistency check sends a collection request to itself first, and only then to other replicas in parallel. This results in substantial asynchrony on the receiving replica, between the request handler and the computation task. The current solution to that is keeping the checksum computation results in memory for replicaChecksumGCInterval to return them to late-arriving requests. However, there is no symmetry here: if the computation starts late instead, it doesn't learn about a previously failed request.

The reason why the initiator blocks on its local checksum first is that it computes the "master checksum", which is then added to all other requests. However, this field is only used by the receiving end to log an inconsistency error. The actual killing of this replica happens in the second phase of the protocol, after the initiating replica commits another Raft message with the Terminate field populated. So, there is no strong reason to keep this blocking behaviour.

When the CollectChecksum handler exits due to a canceled context (for example, the request timed out, or the remote caller crashed), the background task continues to run. If it was not running, it may start in the future. In both cases, the consistency checks pool (which has a limited size and processing rate) spends resources on running dangling checks, and rejects useful ones.

If the initiating replica fails to compute its local checksum, it does not send requests (or any indication to cancel) to other replicas. This is problematic because the checksum tasks will be run on all replicas, which opens the possibility for accumulating many such dangling checks.
Part of #77432
Release justification: performance and stability improvement
Release note (bug fix): A consistency check is now skipped/stopped when its remote initiator gives up on it. Previously such checks would still attempt to run and, due to the limited size of the worker pool, prevent useful checks from running. In addition, consistency check requests are now sent in parallel, and the cancelation signal propagates more reliably.