kvserver: use Background() in computeChecksumPostApply goroutine #75448

tbg · 2022-01-24T16:46:36Z

On the leaseholder, the `ctx` passed to `computeChecksumPostApply` is
that of the proposal. As of #71806, this context is canceled right after
the corresponding proposal is signaled (and the client goroutine returns
from `sendWithRangeID`). This effectively prevents most consistency
checks from succeeding. Previously they were not affected by
higher-level cancellation, because the consistency check is triggered
from a local queue that talks directly to the replica, so the context
had something like a minutes-long timeout.

This caused disastrous behavior in the `clearrange` suite of roachtests.
That test imports a large table. After the import, most ranges still
have estimated stats, because the ctx cancellation prevents the
consistency checks, which trigger stats adjustments as a byproduct.
Those stats claim that the ranges are very small. Before recent PR
#74674, `ClearRange` on such ranges would use individual point deletions
instead of the much more efficient pebble range deletions, effectively
writing a lot of data and running the nodes out of disk.

Failures of `clearrange` with #74674 were also observed, but they did
not involve out-of-disk situations, so they are possibly a separate
failure mode (that may still be related to the newly introduced presence
of context cancellation).

Touches #68303.

Release note: None

cockroach-teamcity · 2022-01-24T16:46:44Z

This change is

tbg · 2022-01-24T16:58:36Z

bors r=erikgrinaker
bors single on

TFTR!

craig · 2022-01-24T18:39:47Z

Build failed (retrying...):

GitHub CI (Cockroach)

craig · 2022-01-25T04:12:19Z

Build failed:

GitHub CI (Cockroach)

tbg · 2022-01-25T09:43:24Z

bors r=erikgrinaker

craig · 2022-01-25T13:45:29Z

Build succeeded:

GitHub CI (Cockroach)

We've seen in the events leading up to cockroachdb#75448 that a build-up of consistency check computations on a node can severely impact node performance. This commit attempts to address the main source of that, while reworking the code for easier maintainability.

The consistency checker works by replicating a command through Raft that, on each Replica, triggers an async checksum computation, the result of which the caller collects via `CollectChecksum` requests addressed to each `Replica`. If, for any reason, the caller does *not* wait to collect the checksums but instead moves on to run another consistency check (perhaps on another Range), these inflight computations can build up over time.

This was the main issue in cockroachdb#75448; we were accidentally canceling the context on the leaseholder "right away", failing the consistency check (but leaving it running on all other replicas), and moving on to the next Range. As a result, some (but with spread out leaseholders, ultimately all) Replicas ended up with dozens of consistency check computations, starving I/O and CPU. We "addressed" this by avoiding this errant ctx cancellation (cockroachdb#75448, and longer-term cockroachdb#75656), but this isn't a holistic fix yet.

In this commit, we make three main changes:

- give the inflight consistency check computations a clean API, which makes it much easier to understand "how it works".
- when returning from `CollectChecksum` (either on success or error, notably including context cancellation), cancel the corresponding consistency check. This solves the problem, *assuming* that `CollectChecksum` is reliably issued to each Replica.
- reliably issue `CollectChecksum` to each Replica on which a computation may have been triggered. When the caller's context is canceled, still do the call with a one-second-timeout one-off Context, which should be good enough to make it to the Replica and short-circuit the call.

Release note: None

tbg requested a review from a team as a code owner January 24, 2022 16:46

tbg force-pushed the cb-fixed branch from c0e182b to 71f0b34 Compare January 24, 2022 16:47

erikgrinaker approved these changes Jan 24, 2022

tbg mentioned this pull request Jan 24, 2022

roachtest: clearrange/zfs/checks=true failed #68303

Closed

craig bot merged commit 8eaf8d2 into cockroachdb:master Jan 25, 2022

tbg mentioned this pull request Feb 2, 2022

roachtest: tpccbench/nodes=6/cpu=16/multi-az failed #75071

Closed

tbg mentioned this pull request Feb 21, 2022

kvserver: prevent build-up of abandoned consistency checks #76855

Closed
