
release-22.2: kvserver: cancel consistency checks more reliably #87841

Merged: 9 commits into release-22.2 on Sep 12, 2022

Conversation


@blathers-crl blathers-crl bot commented Sep 12, 2022

Backport 9/9 commits from #86883 on behalf of @pavelkalinnikov.

/cc @cockroachdb/release


This PR increases the chance of propagating the cancelation signal to replicas,
to prevent them from running abandoned consistency check tasks. Specifically:

  • The computation is aborted if the collection request is canceled.
  • The computation is not started if the collection request gave up recently.
  • The initiator runs all requests in parallel to reduce asynchrony, and to be
    able to cancel all the requests explicitly, instead of skipping some of them.

Background

Consistency checks are initiated by the ComputeChecksum command in the Raft log,
and run until completion under a background context. The result is collected by
the initiator via the CollectChecksum long poll. The task is synchronized with
the collection handler via the map of replicaChecksum structs.
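
To make the synchronization above concrete, here is a minimal runnable sketch of the long-poll pattern. It is not the CockroachDB implementation: the names replicaChecksum, ComputeChecksum, and CollectChecksum come from the description above, while the struct fields, the entry helper, and the function shapes are assumptions made for illustration.

```go
package main

import (
	"context"
	"fmt"
	"sync"
)

// replicaChecksum is a stand-in for the per-computation entry: the task and
// the collection handler rendezvous through it, keyed by computation ID.
type replicaChecksum struct {
	result chan string // the task deposits the computed checksum here
}

type replica struct {
	mu        sync.Mutex
	checksums map[string]*replicaChecksum
}

// entry returns the entry for the given ID, creating it if needed, so the
// task and the handler can arrive in either order.
func (r *replica) entry(id string) *replicaChecksum {
	r.mu.Lock()
	defer r.mu.Unlock()
	c, ok := r.checksums[id]
	if !ok {
		c = &replicaChecksum{result: make(chan string, 1)}
		r.checksums[id] = c
	}
	return c
}

// computeChecksum models the task started by the ComputeChecksum command.
func (r *replica) computeChecksum(id string) {
	r.entry(id).result <- "deadbeef" // stand-in for the real computation
}

// collectChecksum models the CollectChecksum long poll: it waits for the
// result until the request's context is canceled.
func (r *replica) collectChecksum(ctx context.Context, id string) (string, error) {
	select {
	case sum := <-r.entry(id).result:
		return sum, nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func main() {
	r := &replica{checksums: map[string]*replicaChecksum{}}
	go r.computeChecksum("check-1")
	fmt.Println(r.collectChecksum(context.Background(), "check-1"))
}
```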

Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the request
handler and the computation task. The current solution to that is keeping the
checksum computation results in memory for replicaChecksumGCInterval, to return
them to late-arriving requests. However, there is no symmetry here: if the
computation starts late instead, it doesn't learn about a previously failed request.

The reason why the initiator blocks on its local checksum first is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens on the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate field populated. So, there is no strong reason to keep this blocking
behaviour.

When the CollectChecksum handler exits due to a canceled context (for example,
the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.

If the initiating replica fails to compute its local checksum, it does not send
requests (or any indication to cancel) to other replicas. This is problematic
because the checksum tasks will still run on all replicas, which opens the
possibility of accumulating many such dangling checks.


Part of #77432

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when its
remote initiator gives up on it. Previously, such checks would still attempt to
run and, due to the limited size of the worker pool, prevent useful checks from
running. In addition, consistency check requests are now sent in parallel, and
the cancelation signal propagates more reliably.



pav-kv added 9 commits August 23, 2022 11:40
Currently, the replica initiating the consistency check sends a collection
request to itself first, and only then to other replicas in parallel. This
results in substantial asynchrony on the receiving replica, between the
incoming CollectChecksum request and the checksum computation task started by
the ComputeChecksum message. The current solution to that is keeping the
checksum computation results in memory for replicaChecksumGCInterval, to return
them to late-arriving requests.

The reason why the first checksum collection blocks the others is that it
computes the "master checksum", which is then added to all other requests.
However, this field is only used by the receiving end to log an inconsistency
error. The actual killing of this replica happens on the second phase of the
protocol, after the initiating replica commits another Raft message with the
Terminate field populated. So, there is no strong reason to keep this blocking
behaviour.

If the initiating replica fails to compute its local checksum, it does not send
requests to other replicas. This is problematic because the checksum tasks will
be run on all replicas, which opens the possibility of accumulating many such
dangling checks.

This commit makes all the checksum collection requests parallel. Benefits:

- There is less asynchrony between the sender and receiver, so we can drop the
  GC (in follow-up commits), and require an incoming request before starting
  the checksum computation task.

- All the outgoing collection requests are now explicitly canceled if the local
  computation fails (see the sketch below). This way, the cancelation signal has
  a better chance of propagating to all replicas and canceling the tasks that
  were started anyway.
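
A rough sketch of the parallel fan-out with explicit cancelation follows. Everything here, including collectFromReplica, is invented for illustration (the real code sends RPCs under the store stopper), but it shows the shape of the change: all requests run in parallel, and one failure cancels the rest.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"sync"
)

// collectFromReplica is a hypothetical stand-in for sending one
// CollectChecksum request; the local replica (id 0) is made to fail so the
// cancelation path is exercised.
func collectFromReplica(ctx context.Context, id int) error {
	if id == 0 {
		return errors.New("local checksum computation failed")
	}
	<-ctx.Done() // simulate a long poll that only ends via cancelation
	return ctx.Err()
}

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	var wg sync.WaitGroup
	for id := 0; id < 3; id++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			if err := collectFromReplica(ctx, id); err != nil {
				cancel() // one failure explicitly cancels all sibling requests
				fmt.Printf("replica %d: %v\n", id, err)
			}
		}(id)
	}
	wg.Wait() // all requests run in parallel; none is skipped
}
```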

Release justification: performance and stability improvement

Release note (bug fix): Consistency checks are now sent to all replicas in
parallel; previously they would block on processing the local replica first.
This (a) halves the latency of a single check, and (b) allows better propagation
of the cancelation signal, which results in fewer abandoned tasks on remote
replicas and more resources spent on useful checks.
For convenience, control request cancelation higher up the stack in the event
of the store stopper quiescing.

Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
The replicaChecksum type helps bridge the checksum computation task and the
checksum collection request for a given computation ID. The code for the
lifecycle of replicaChecksum is scattered across replica_consistency.go, and is
difficult to understand.

This commit simplifies the semantics of replicaChecksum by using Go channels
and stating the invariants on their state. It also minimizes the extensive
Replica mutex locking (a mutex used by nearly every part of the system) by
using channels for the communication between the task and the handler.
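
A compile-only sketch of what such a channel-based entry could look like; the field names and invariants below paraphrase this description and are assumptions, not the actual struct:

```go
package main

import "context"

// replicaChecksum (sketch) synchronizes one computation task with one
// collection handler purely through channels, so the Replica mutex is only
// needed for the map lookup itself.
type replicaChecksum struct {
	// started carries the task's cancel func to the handler exactly once,
	// signaling that the computation has begun.
	started chan context.CancelFunc
	// result is buffered with capacity 1, so the task can deposit its
	// outcome and exit even if the handler has already given up.
	result chan checksumResult
}

type checksumResult struct {
	checksum []byte
	err      error
}

func newReplicaChecksum() *replicaChecksum {
	return &replicaChecksum{
		started: make(chan context.CancelFunc), // unbuffered: a true handshake
		result:  make(chan checksumResult, 1),  // buffered: fire-and-forget
	}
}

func main() { _ = newReplicaChecksum() }
```

The buffered result channel is what lets the task exit without a live handler, while the unbuffered started channel gives a true rendezvous, which the cancelation commit below builds on.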

It also resolves the TODO/bug in which some replicaChecksum entries could stay
in the map forever if the request does not have a deadline.

Release justification: part of a performance improvement PR
Release note: None
Consistency checks are initiated by the ComputeChecksum command in the Raft log,
and run until completion under a background context. The result is collected by
the initiator via the CollectChecksum long poll. The task is synchronized with
the collection handler via the map of replicaChecksum structs.

When the CollectChecksum handler exits due to a canceled context (for example,
the request timed out, or the remote caller crashed), the background task
continues to run. If it was not running, it may start in the future. In both
cases, the consistency checks pool (which has a limited size and processing
rate) spends resources on running dangling checks, and rejects useful ones.

This commit makes sure that abandoned checksum computation tasks are:
- stopped if the waiting collection request is canceled
- never started if there was a recent collection request that gave up

When starting, the checksum computation task first checks whether the
corresponding collection request has previously been abandoned. If so, the task
terminates early. Otherwise, it starts and sends a cancel func through the
channel that it uses to notify the collection handler, so that the handler can
abort the task when it abandons the request (see the sketch below).
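
The handshake might look roughly like this runnable sketch; the abandoned channel and all names here are assumptions made for illustration:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// entry is a sketch of the synchronization state; abandoned is closed by
// the collection handler when it gives up on the long poll.
type entry struct {
	abandoned chan struct{}
	started   chan context.CancelFunc
}

// runTask models the checksum computation task.
func runTask(ctx context.Context, e *entry) {
	// Never start if the collection request already gave up.
	select {
	case <-e.abandoned:
		fmt.Println("collector gave up earlier; not starting")
		return
	default:
	}
	ctx, cancel := context.WithCancel(ctx)
	defer cancel()
	// Hand the cancel func to the handler so it can abort us later.
	select {
	case e.started <- cancel:
	case <-e.abandoned:
		return // the handler gave up while we were starting
	}
	select { // stand-in for the actual checksum computation
	case <-time.After(time.Hour):
		fmt.Println("computation finished")
	case <-ctx.Done():
		fmt.Println("task canceled:", ctx.Err())
	}
}

func main() {
	e := &entry{
		abandoned: make(chan struct{}),
		started:   make(chan context.CancelFunc),
	}
	go runTask(context.Background(), e)
	cancel := <-e.started // the handler observes that the task started...
	cancel()              // ...and cancels it when abandoning the request
	time.Sleep(100 * time.Millisecond) // let the task print its message
}
```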

Release justification: performance and stability improvement

Release note (bug fix): A consistency check is now skipped/stopped when the
collection request is canceled before/while the check computation runs.
Previously, such checks would start and run until completion and, due to the
limited size of the worker pool, prevent useful checks from running.
The checksum computation can take a long time, so if the store quiesces, it is
better to cancel it.

Release justification: part of a performance improvement PR
Release note: None
Release justification: part of a performance improvement PR
Release note: None
Return an error from Replica.computeChecksumPostApply for better introspection
in tests, and to avoid lengthy log messages in favor of composing them.

Release justification: part of a performance improvement PR
Release note: None
This refactoring makes the replica.checksums map store *replicaChecksum
pointers instead of values. This way, we can modify the entries directly
without doing another map roundtrip, which was error-prone and required
handling situations where an entry was in the map and then disappeared.

We still need to lock Replica.mu for reading/writing the entries, but we can
avoid this too by putting sync primitives in the entry itself (see the
follow-up commit).
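
The difference is easy to see in a toy example (nothing below is from the actual code):

```go
package main

import "fmt"

type entry struct{ done bool }

func main() {
	// With value entries, reads return a copy; mutating the copy does not
	// affect the map, so every update needs a read-modify-write roundtrip.
	byValue := map[string]entry{"a": {}}
	e := byValue["a"]
	e.done = true
	fmt.Println(byValue["a"].done) // false: the copy must be written back

	// With pointer entries, the entry can be mutated in place.
	byPtr := map[string]*entry{"a": {}}
	byPtr["a"].done = true
	fmt.Println(byPtr["a"].done) // true
}
```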

Release justification: part of a performance improvement PR
Release note: None
@blathers-crl blathers-crl bot requested a review from a team as a code owner September 12, 2022 15:51
@blathers-crl blathers-crl bot force-pushed the blathers/backport-release-22.2-86883 branch from bc625aa to c7e129f on September 12, 2022 15:51
blathers-crl bot (Author) commented Sep 12, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@blathers-crl blathers-crl bot added blathers-backport This is a backport that Blathers created automatically. O-robot Originated from a bot. labels Sep 12, 2022
@cockroach-teamcity (Member)

This change is Reviewable

@tbg tbg removed their request for review September 12, 2022 17:57
@pav-kv pav-kv merged commit dce33f3 into release-22.2 Sep 12, 2022
@pav-kv pav-kv deleted the blathers/backport-release-22.2-86883 branch September 12, 2022 18:24