
release-22.2: kvserver: cancel consistency checks more reliably #87841

Merged: 9 commits into release-22.2 on Sep 12, 2022

Conversation


@blathers-crl blathers-crl bot commented Sep 12, 2022

Backport 9/9 commits from #86883 on behalf of @pavelkalinnikov.

/cc @cockroachdb/release


This PR increases the chance of propagating the cancelation signal to replicas,
to prevent them from running abandoned consistency check tasks. Specifically:

  • The computation is aborted if the collection request is canceled.
  • The computation is not started if the collection request gave up recently.
  • The initiator runs all requests in parallel to reduce asynchrony, and to be
    able to cancel all the requests explicitly, instead of skipping some of them.

Background

Consistency checks are initiated by the ComputeChecksum command in the Raft log,
and run until completion under a background context. The result is collected by
the initiator via the CollectChecksum long poll. The task is synchronized with
the collection handler via the map of replicaChecksum structs.
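
To make the synchronization above concrete, here is a minimal runnable sketch of the long-poll pattern. It is not the CockroachDB implementation: the names replicaChecksum, ComputeChecksum, and CollectChecksum come from the description above, while the struct fields, the entry helper, and the function shapes are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// replicaChecksum is a stand-in for the per-computation entry: the task and
// the collection handler rendezvous through it, keyed by computation ID.
type replicaChecksum struct {
	result chan string // the task deposits the computed checksum here
}

type replica struct {
	mu        sync.Mutex
	checksums map[string]*replicaChecksum
}

// entry returns the entry for the given ID, creating it if needed, so the
// task and the handler can arrive in either order.
func (r *replica) entry(id string) *replicaChecksum {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.checksums[id]
	if !ok {
		c = &replicaChecksum{result: make(chan string, 1)}
		r.checksums[id] = c
	}
	return c
}

// computeChecksum models the task started by the ComputeChecksum command.
func (r *replica) computeChecksum(id string) {
	r.entry(id).result <- "deadbeef" // stand-in for the real computation
}

// collectChecksum models the CollectChecksum long poll: it waits for the
// result until the request's context is canceled.
func (r *replica) collectChecksum(ctx context.Context, id string) (string, error) {
	select {
	case sum := <-r.entry(id).result:
		return sum, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	r := &replica{checksums: map[string]*replicaChecksum{}}
	go r.computeChecksum("check-1")
	fmt.Println(r.collectChecksum(context.Background(), "check-1"))
}
```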

Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the request
handler and the computation task. The current solution to that is keeping the
checksum computation results in memory for replicaChecksumGCInterval, to return
them to late-arriving requests. However, there is no symmetry here: if the
computation starts late instead, it doesn't learn about a previously failed request.

The reason why the initiator blocks on its local checksum first is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens on the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate field populated. So, there is no strong reason to keep this blocking
behaviour.

When the CollectChecksum handler exits due to a canceled context (for example,
the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.

If the initiating replica fails to compute its local checksum, it does not send
requests (or any indication to cancel) to other replicas. This is problematic
because the checksum tasks will still run on all replicas, which opens the
possibility of accumulating many such dangling checks.


Part of #77432

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when its
remote initiator gives up on it. Previously, such checks would still attempt to
run and, due to the limited size of the worker pool, prevent useful checks from
running. In addition, consistency check requests are now sent in parallel, and
the cancelation signal propagates more reliably.



pav-kv added 9 commits August 23, 2022 11:40
Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the
incoming CollectChecksum request and the checksum computation task started by
the ComputeChecksum message. The current solution to that is keeping the
checksum computation results in memory for replicaChecksumGCInterval, to return
them to late-arriving requests.

The reason why the first checksum collection blocks the others is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens on the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate field populated. So, there is no strong reason to keep this blocking
behaviour.

If the initiating replica fails to compute its local checksum, it does not send
requests to other replicas. This is problematic because the checksum tasks will
be run on all replicas, which opens the possibility of accumulating many such
dangling checks.

This commit makes all the checksum collection requests parallel. Benefits:

- There is less asynchrony between the sender and receiver, so we can drop the
  GC (in follow-up commits), and require an incoming request before starting
  the checksum computation task.

- All the outgoing collection requests are now explicitly canceled if the local
  computation fails (see the sketch below). This way, the cancelation signal has
  a better chance of propagating to all replicas and canceling the tasks that
  were started anyway.
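
A rough sketch of the parallel fan-out with explicit cancelation follows. Everything here, including collectFromReplica, is invented for illustration (the real code sends RPCs under the store stopper), but it shows the shape of the change: all requests run in parallel, and one failure cancels the rest.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// collectFromReplica is a hypothetical stand-in for sending one
// CollectChecksum request; the local replica (id 0) is made to fail so the
// cancelation path is exercised.
func collectFromReplica(ctx context.Context, id int) error {
	if id == 0 {
		return errors.New("local checksum computation failed")
	}
	<-ctx.Done() // simulate a long poll that only ends via cancelation
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	for id := 0; id < 3; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := collectFromReplica(ctx, id); err != nil {
				cancel() // one failure explicitly cancels all sibling requests
				fmt.Printf("replica %d: %v\n", id, err)
			}
		}(id)
	}
	wg.Wait() // all requests run in parallel; none is skipped
}
```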

Release justification: performance and stability improvement

Release note (bug fix): Consistency checks are now sent to all replicas in
parallel; previously they would block on processing the local replica first.
This (a) halves the latency of a single check, and (b) allows better propagation
of the cancelation signal, which results in fewer abandoned tasks on remote
replicas and more resources spent on useful checks.
For convenience, control request cancelation higher up the stack in the event
of the store stopper quiescing.

Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
The replicaChecksum type helps bridge the checksum computation task and the
checksum collection request for a given computation ID. The code for the
lifecycle of replicaChecksum is scattered across replica_consistency.go, and is
difficult to understand.

This commit simplifies the semantics of replicaChecksum by using Go channels
and stating the invariants on their state. It also minimizes the extensive
Replica mutex locking (a mutex used by nearly every part of the system) by
using channels for the communication between the task and the handler.
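
A compile-only sketch of what such a channel-based entry could look like; the field names and invariants below paraphrase this description and are assumptions, not the actual struct:

```go
package main

import "context"

// replicaChecksum (sketch) synchronizes one computation task with one
// collection handler purely through channels, so the Replica mutex is only
// needed for the map lookup itself.
type replicaChecksum struct {
	// started carries the task's cancel func to the handler exactly once,
	// signaling that the computation has begun.
	started chan context.CancelFunc
	// result is buffered with capacity 1, so the task can deposit its
	// outcome and exit even if the handler has already given up.
	result chan checksumResult
}

type checksumResult struct {
	checksum []byte
	err      error
}

func newReplicaChecksum() *replicaChecksum {
	return &replicaChecksum{
		started: make(chan context.CancelFunc), // unbuffered: a true handshake
		result:  make(chan checksumResult, 1),  // buffered: fire-and-forget
	}
}

func main() { _ = newReplicaChecksum() }
```

The buffered result channel is what lets the task exit without a live handler, while the unbuffered started channel gives a true rendezvous, which the cancelation commit below builds on.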

It also resolves the TODO/bug in which some replicaChecksum entries could stay
in the map forever if the request does not have a deadline.

Release justification: part of a performance improvement PR
Release note: None
Consistency checks are initiated by the ComputeChecksum command in the Raft log,
and run until completion under a background context. The result is collected by
the initiator via the CollectChecksum long poll. The task is synchronized with
the collection handler via the map of replicaChecksum structs.

When the CollectChecksum handler exits due to a canceled context (for example,
the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.

This commit makes sure that abandoned checksum computation tasks are:
- stopped if the waiting collection request is canceled
- never started if there was a recent collection request that gave up

When starting, the checksum computation task first checks whether the
corresponding collection request has previously been abandoned. If so, the task
terminates early. Otherwise, it starts and sends a cancel func through the
channel that it uses to notify the collection handler, so that the handler can
abort the task when it abandons the request (see the sketch below).
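
The handshake might look roughly like this runnable sketch; the abandoned channel and all names here are assumptions made for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// entry is a sketch of the synchronization state; abandoned is closed by
// the collection handler when it gives up on the long poll.
type entry struct {
	abandoned chan struct{}
	started   chan context.CancelFunc
}

// runTask models the checksum computation task.
func runTask(ctx context.Context, e *entry) {
	// Never start if the collection request already gave up.
	select {
	case <-e.abandoned:
		fmt.Println("collector gave up earlier; not starting")
		return
	default:
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	// Hand the cancel func to the handler so it can abort us later.
	select {
	case e.started <- cancel:
	case <-e.abandoned:
		return // the handler gave up while we were starting
	}
	select { // stand-in for the actual checksum computation
	case <-time.After(time.Hour):
		fmt.Println("computation finished")
	case <-ctx.Done():
		fmt.Println("task canceled:", ctx.Err())
	}
}

func main() {
	e := &entry{
		abandoned: make(chan struct{}),
		started:   make(chan context.CancelFunc),
	}
	go runTask(context.Background(), e)
	cancel := <-e.started // the handler observes that the task started...
	cancel()              // ...and cancels it when abandoning the request
	time.Sleep(100 * time.Millisecond) // let the task print its message
}
```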

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when the
collection request is canceled before/while the check computation runs.
Previously, such checks would start and run until completion and, due to the
limited size of the worker pool, prevent useful checks from running.
The checksum computation can take a long time, so if the store quiesces, it is
better to cancel it.

Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
Return an error from Replica.computeChecksumPostApply for better introspection
in tests, and to avoid lengthy log messages in favor of composing them.

Release justification: part of a performance improvement PR
Release note: None
This refactoring makes the replica.checksums map store *replicaChecksum
pointers instead of values. This way, we can modify the entries directly
without doing another map roundtrip, which was error-prone and required
handling situations where an entry was in the map and then disappeared.

We still need to lock Replica.mu for reading/writing the entries, but we can
avoid this too by putting sync primitives in the entry itself (see the
follow-up commit).
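
The difference is easy to see in a toy example (nothing below is from the actual code):

```go
package main

import "fmt"

type entry struct{ done bool }

func main() {
	// With value entries, reads return a copy; mutating the copy does not
	// affect the map, so every update needs a read-modify-write roundtrip.
	byValue := map[string]entry{"a": {}}
	e := byValue["a"]
	e.done = true
	fmt.Println(byValue["a"].done) // false: the copy must be written back

	// With pointer entries, the entry can be mutated in place.
	byPtr := map[string]*entry{"a": {}}
	byPtr["a"].done = true
	fmt.Println(byPtr["a"].done) // true
}
```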

Release justification: part of a performance improvement PR
Release note: None
@blathers-crl blathers-crl bot requested a review from a team as a code owner September 12, 2022 15:51
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-22.2-86883 branch from bc625aa to c7e129f on September 12, 2022 15:51
blathers-crl bot (Author) commented Sep 12, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Sep 12, 2022
@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg removed their request for review September 12, 2022 17:57
@pav-kv pav-kv merged commit dce33f3 into release-22.2 Sep 12, 2022
@pav-kv pav-kv deleted the blathers/backport-release-22.2-86883 branch September 12, 2022 18:24