kvserver: remove consistency checker diffs #21128
In discussion today, @sumeerbhola was wondering how we compute the consistency checker diff. The answer is we buffer the entire Range data and send it in a single response. I tracked down this issue from this comment in the code:
Cc @ajwerner, @andy-kimball, and @johnrk as this throws a bit of a wrench into larger range sizes. |
Just in case this isn't clear, the diff is requested only after a first round of the check already failed. It's still bad. |
I suspect the solution here would be to add a streaming RPC interface. |
This isn't going to be a readily back-portable change. This makes me nervous. |
How big are the ranges in 20.1? We can always not send a range data snapshot if it would be too large. It would be unfortunate to see this in practice, but the data checkpoints would be there anyway, so we could reconstruct the diff manually later. |
|
This is a good point. Disabling sending range snapshot data for large ranges might be the only back-portable change. My suggestion was more about a long-term fix. We actually already have a facility for streaming range data that we use for sending snapshots. Doesn't seem like it could be easily adapted here, though. |
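For reference, a streaming approach would send the range data in bounded batches rather than one response. Below is a minimal sketch of the batching logic only; all names here are illustrative and not the actual snapshot/RPC code:

```go
package main

import "fmt"

// kv is a simplified key/value pair; the real code would stream MVCC data.
type kv struct {
	key, value string
}

// chunkKVs splits a range's data into batches no larger than maxBytes,
// so each batch could be sent as one message on a streaming RPC rather
// than buffering the whole range in a single response.
func chunkKVs(data []kv, maxBytes int) [][]kv {
	var chunks [][]kv
	var cur []kv
	size := 0
	for _, e := range data {
		n := len(e.key) + len(e.value)
		if size+n > maxBytes && len(cur) > 0 {
			chunks = append(chunks, cur)
			cur, size = nil, 0
		}
		cur = append(cur, e)
		size += n
	}
	if len(cur) > 0 {
		chunks = append(chunks, cur)
	}
	return chunks
}

func main() {
	data := []kv{{"a", "1234"}, {"b", "5678"}, {"c", "90"}}
	fmt.Println(len(chunkKVs(data, 6))) // → 3
}
```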
We're doing some adjacent work over in #77432, and plan to remove the consistency checker diffs while we're at it, since they're pretty unwieldy at 512 MB. This involves removing the following fields and functionality:
It might be possible to do this without requiring a version gate and keeping the old code around for mixed-version clusters, because the current code will gracefully ignore responses that don't have a snapshot: cockroach/pkg/kv/kvserver/replica_consistency.go Lines 160 to 168 in 7f61dfd
However, we'll need some mechanism to obtain such diffs for debugging. We already take Pebble checkpoints when inconsistencies are found, so we'll need to write e.g. a cockroach/pkg/kv/kvserver/replica_consistency.go Lines 785 to 798 in 7f61dfd
This could reuse the existing diff code: cockroach/pkg/kv/kvserver/replica_consistency_diff.go Lines 82 to 91 in ee57d72
|
Are we going to lose diff support? |
Likely yes, in its current form (an API that transfers the entire range). But we want diffs to be available in some form out-of-band, e.g. have tooling for constructing it from the checkpoints saved on replicas. We could also explore some forms of partial/sampled diffing, e.g. detect one sub-range with a mismatch, and diff only that part. |
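The sub-range idea could work by recursively comparing checksums of halves of the keyspace, so only O(log n) checksum exchanges are needed to localize a mismatch instead of shipping the whole range. A toy sketch, assuming both replicas' data can be checksummed per key span (all names invented for illustration):

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// checksum hashes a slice of key/value strings; it stands in for the
// real per-span checksum computation.
func checksum(kvs []string) uint64 {
	h := fnv.New64a()
	for _, kv := range kvs {
		h.Write([]byte(kv))
	}
	return h.Sum64()
}

// findMismatch bisects two equal-length listings and returns the index
// of the first mismatching entry, or -1 if they agree. Each step only
// compares checksums of halves, never the data itself.
func findMismatch(a, b []string) int {
	if checksum(a) == checksum(b) {
		return -1
	}
	lo, hi := 0, len(a)
	for hi-lo > 1 {
		mid := (lo + hi) / 2
		if checksum(a[lo:mid]) != checksum(b[lo:mid]) {
			hi = mid
		} else {
			lo = mid
		}
	}
	return lo
}

func main() {
	a := []string{"k1=v1", "k2=v2", "k3=v3", "k4=v4"}
	b := []string{"k1=v1", "k2=v2", "k3=XX", "k4=v4"}
	fmt.Println(findMismatch(a, b)) // → 2
}
```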
One option for a partial diff:
We could purposefully limit this to 1 mismatching key (it's the worst case anyway, if there is one large k/v entry), or to some reasonable byte-size worth of keys, and leave full analysis to the offline tooling. One advantage of retaining some inline diffing (instead of fully removing it) is in having somewhat more informative consistency check failures should they occur in tests like #87106. @tbg How would a human debug those things? I imagine that I would maybe look at the first few mismatching entries in this diff and try to understand what's up with them, but there is probably benefit in seeing larger patterns in it too. |
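A bounded inline diff as described could merge-walk the two replicas' sorted listings and stop after a limit, leaving exhaustive analysis to the offline tooling. A rough sketch, with types and names invented for illustration:

```go
package main

import "fmt"

// entry is a simplified sorted key/value pair.
type entry struct{ key, val string }

// mismatch records one difference between two replicas' listings:
// a key present on only one side, or present on both with differing values.
type mismatch struct {
	key      string
	inA, inB bool
}

// boundedDiff merge-walks two sorted listings and reports at most
// limit mismatches.
func boundedDiff(a, b []entry, limit int) []mismatch {
	var out []mismatch
	i, j := 0, 0
	for (i < len(a) || j < len(b)) && len(out) < limit {
		switch {
		case j >= len(b) || (i < len(a) && a[i].key < b[j].key):
			out = append(out, mismatch{key: a[i].key, inA: true})
			i++
		case i >= len(a) || b[j].key < a[i].key:
			out = append(out, mismatch{key: b[j].key, inB: true})
			j++
		default: // same key on both sides
			if a[i].val != b[j].val {
				out = append(out, mismatch{key: a[i].key, inA: true, inB: true})
			}
			i++
			j++
		}
	}
	return out
}

func main() {
	a := []entry{{"a", "1"}, {"b", "2"}, {"c", "3"}}
	b := []entry{{"a", "1"}, {"b", "9"}, {"d", "4"}}
	fmt.Println(boundedDiff(a, b, 1)) // stops after the first mismatch
}
```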
A human would always investigate the checkpoints. The diff would serve to guide them as to which keys are of interest, but there isn't a huge upshot in having the diff available upfront (and if there were, we could build the retrieval into roachtest). The diff alone isn't that useful; it would be helpful, but not at the expense of lots of engineering. |
SGTM. I’m thinking of the following flow when an inconsistency happens:
Issues:
|
We should clearly call this out in the crash message, so that admins know they'll have to remove this when they no longer need it. I think that should be sufficient for now. We may want to e.g. log a message on startup if there are any checkpoints too. It's worth pinging the SREs real quick about how they'd like this to be surfaced. @joshimhoff When a consistency check fails, we leave behind a checkpoint of the current Pebble state on the relevant nodes (with replicas for that range). These are needed to debug the inconsistency. These are never removed, so over time they'll essentially be an old copy of the node's database. How would SREs prefer to be notified about these checkpoints, which will need to be removed manually when no longer needed for debugging? Currently, the inconsistent nodes go into a panic loop with a log message, so the simplest solution is to just call this out in the crash message -- but we'll also keep checkpoints on the good nodes that will need to be removed. Is that sufficient?
We essentially never see these failures, so let's not optimize for it.
True, but they're cheap to take -- they're only expensive to keep for a long time. Let's leave the GC responsibility to admins for now, and not optimize for it. |
To answer your Q, I need to ask some Qs first: If a consistency check fails, will we definitely get a panic crash loop that doesn't end until some operator takes an action? Or could we get a small number of crashes that go away without some operator taking action? I see you say "panic loop" but want to make sure I get what that means exactly. If an operator will need to take an action, what actions might they take? |
The node will keep crashing forever, by design. The node is borked and must be replaced. However, only the node(s) with the inconsistencies will crash. We'll also leave checkpoints behind on the nodes with correct replicas.
|
@pavelkalinnikov One consideration here is that we may not want to have to copy the entire checkpoint around to run the diff tool. It may be better to have a two-stage process: one command to extract replica data from a checkpoint, then copy that replica data somewhere and run a tool to diff them. |
Ack! Then I'm confident SRE will notice the inconsistency, if the alerting stack functions as designed.
Here are some options, in order of how likely it is that CRL actually cleans up the checkpoints in a cloud context:
If we go with 3, I'm not confident we will clean up checkpoints, but maybe that is okay right now, given that a corruption crash loop has never happened in cloud according to my memory, and given that the checkpoints just increase disk utilization. |
A gauge seems doable and maybe a good way to solve it reliably.

```diff
diff --git a/pkg/kv/kvserver/store.go b/pkg/kv/kvserver/store.go
index fceda5df42..c6cd380766 100644
--- a/pkg/kv/kvserver/store.go
+++ b/pkg/kv/kvserver/store.go
@@ -3303,11 +3303,15 @@ func (s *Store) updateReplicationGauges(ctx context.Context) error {
 	return nil
 }
+
+func (s *Store) checkpointDir() string {
+	return filepath.Join(s.engine.GetAuxiliaryDir(), "checkpoints")
+}
+
 // checkpoint creates a RocksDB checkpoint in the auxiliary directory with the
 // provided tag used in the filepath. The filepath for the checkpoint directory
 // is returned.
 func (s *Store) checkpoint(ctx context.Context, tag string) (string, error) {
-	checkpointBase := filepath.Join(s.engine.GetAuxiliaryDir(), "checkpoints")
+	checkpointBase := s.checkpointDir()
 	_ = s.engine.MkdirAll(checkpointBase)
 	checkpointDir := filepath.Join(checkpointBase, tag)
@@ -3337,6 +3341,19 @@ func (s *Store) ComputeMetrics(ctx context.Context, tick int) error {
 	m := s.engine.GetMetrics()
 	s.metrics.updateEngineMetrics(m)
+	{
+		cpDirs, err := s.engine.List(s.checkpointDir())
+		var n int64
+		if err != nil {
+			// TODO: probably need to tolerate NotFound here.
+			n = math.MaxInt64
+		} else {
+			n = int64(len(cpDirs))
+		}
+
+		someGauge.Update(n)
+	}
+
 	// Get engine Env stats.
 	envStats, err := s.engine.GetEnvStats()
 	if err != nil {
```
|
Yeah, let's do a gauge and a log message (on crash and on node startup). GC is going to be more work than it's worth for something that's very rare. |
Works for me! Send me the PR adding the gauge, and I'll make sure we add an alert on the CC side. |
SGTM, thank you all for the input. A gauge seems strictly more reliable because of the issue that I pointed out above:
A gauge will be increased on all the alive replicas, so this signal will propagate. |
88080: stmtdiagnostics: save the bundle on a statement timeout r=yuzefovich a=yuzefovich

Previously, if the traced statement was canceled due to a statement timeout, the statement bundle would be created but would fail on insertion into the system table. This is suboptimal because we already did all the work to collect the bundle, and it might be desirable to see the partial trace, so this commit makes it so that the bundle is saved correctly. Fixes: #73477. Release note (bug fix): Previously, when a statement bundle was collected for a query that results in an error due to a `statement_timeout`, the bundle would not be saved; this is now fixed.

88307: sql: retry validation of unique constraint r=yuzefovich a=yuzefovich

Previously, when running the validation of a unique constraint we would execute the query only once, and if that failed for any reason, we'd return an error. This was suboptimal because we might be performing the validation after having done a lot of work (e.g. running an import), and the error could be a transient one (e.g. if a node used in the distributed validation query goes down). To improve the situation, this commit adds a retry loop with 5 attempts and an allowlist of errors that are swallowed. Fixes: #85711. Release note: None

88929: sql: add tracing spans for each statement executed within a UDF r=mgartner a=mgartner

This commit creates a separate tracing span for each statement executed during evaluation of a UDF. Release note: None

88933: opt: add tests for FK delete cascade fast-path with UDF r=mgartner a=mgartner

The FK delete cascade fast-path remaps the `DELETE` filter into a filter on the child table, avoiding buffering the parent's deleted rows as in the non-fast-path plan. It is possible to plan this fast-path if the filters contain UDF invocations, as long as they are immutable or stable. (Of course, if the user creates a UDF with mislabeled volatility, data corruption is possible. To mitigate this and other corruption from mislabeled UDF volatility in the future, we're considering performing some analysis during `CREATE FUNCTION` to prevent users from incorrectly assigning volatility.) This commit adds tests for FK delete cascades with UDFs and removes an unnecessary TODO. Release note: None

88978: roachtest: tighten boundaries of tpch_concurrency r=yuzefovich a=yuzefovich

This should remove one iteration, which should keep the test under the default 10 hour timeout. Release note: None

88991: changefeedccl: release overhead allocation r=miretskiy a=miretskiy

To account for the overhead during the encoding stage, we allocate more memory resources than needed (`changefeed.event_memory_multiplier`). Those resources are released when they are emitted by the sink. Some sinks, such as file-based sinks, batch many such events (to generate files of a target size). This batching, when combined with compression, could result in a situation where the maximum of allowed resources, `changefeed.memory.per_changefeed_limit`, are all batched, making it impossible to ingest additional events without forcing a sink flush. The only way to avoid a premature forced flush is to increase the memory limit for the changefeed -- but doing so is not without costs (i.e. more events batched, more pressure on the Go GC, etc). This PR adjusts event resources to match the final size of the data that will be emitted into the sink -- that is, we release the overhead back into the pool once the event processing is done. Release note: None

89035: ui: fix contention time tooltip on insights page r=ericharmeling a=ericharmeling

This commit updates the High Contention Time tooltip on the insights page to accurately reflect the event's contention duration. https://cockroachlabs.slack.com/archives/G01Q9D01NTU/p1664477246961729?thread_ts=1664473625.621409&cid=G01Q9D01NTU https://www.loom.com/share/a8ce63ece586468e870d67118c0d8d5a Release note: None. Release justification: bug fix

89036: eval: partial removal of usages of the stored context r=yuzefovich a=yuzefovich

This PR contains several commits from #85782, which attempts to remove the stored `context.Context` from the `eval.Context` object. I didn't finish that PR since I ran into two very thorny usages, so instead this PR takes some steps towards deprecating the usage of the stored context. Storing a `context.Context` is an anti-pattern that needs justification. In the case of `eval.Context`, I think it was introduced specifically for scalar function evaluations (which should use the same `context.Context`), and that was ok, but over time we introduced many more callsites out of convenience. Fixes: #46164

89066: kvserver: add storage checkpoints metric r=erikgrinaker a=pavelkalinnikov

When an inconsistency between a range's replicas is found, every replica's node leaves behind a storage checkpoint which can be used for finding the cause of the inconsistency. The checkpoints are cheap to make if hard links are used. However, the cost of a checkpoint on a live node increases over time because the state keeps diverging from it. Effectively it becomes a full copy of the old state, and consumes that much disk capacity. It is important to act on the checkpoints and reclaim the space. This commit adds a store metric which exposes the number of checkpoints found, so that admins/SREs can be alerted and act on it. Touches #21128. Added a test, and also tested manually: tricked one replica into thinking that there is an inconsistency; this triggered all replicas to store a checkpoint and one crash; the metric on the other 2 replicas increased to 1, and I saw the corresponding directories in storage. Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]> Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Yevgeniy Miretskiy <[email protected]> Co-authored-by: Eric Harmeling <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
88969: sql: address minor typos in recent overflow fix r=mgartner a=mgartner

This commit fixes minor typos introduced in #88199. Release note: None

89207: kvserver: rm consistency check diff report in tests r=erikgrinaker a=pavelkalinnikov

This change removes the `BadChecksumReportDiff` testing knob. Previously, a node that runs a consistency check would report any diffs through this callback to tests. However, the plan is now to move away from computing the diff, and instead leave storage engine checkpoints on each replica so that later they can be analysed/diffed by tooling. When this plan is implemented, the initiating node will no longer be able to report diffs into the testing knob. This commit proactively refactors a test in such a way that it doesn't need the knob, and verifies the diff by reading from storage directly. This approximates what we will do with tooling too. Touches #21128. Release note: None

Co-authored-by: Marcus Gartner <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
For the replica extraction step, it looks like one of cockroach/pkg/storage/sst_writer.go Lines 74 to 76 in fdbeda5 or cockroach/pkg/storage/sst_writer.go Lines 102 to 105 in fdbeda5 could be used. Are there any other options? How are range snapshots encoded?

Between "ingestion" and "backup", I will probably start with the latter. The former is interesting though if we imagine wanting to recover inconsistent replicas after the investigation. We can look at this option later; for now the recovery/upreplication will happen as a result of some nodes crashing. |
I'd go with |
Does the node crashing get triggered by a raft command on a replica? That would make that replica internally consistent in its state machine RaftAppliedIndex+state machine state, but not necessarily for other replicas that may be in the middle of doing a sequence of (atomic ingest of AddSSTable, atomic update of RaftAppliedIndex). Is the offline tool going to try to diff other replicas too? |
Yes.
No. As you point out, only this range's replicas are guaranteed to be consistent. |
88539: sql: add telemetry for statistics forecast usage r=rytaft a=michae2

Add a few fields to the sampled_query telemetry events that will help us measure how useful table statistics forecasting is in practice. Fixes: #86356. Release note (ops change): Add five new fields to the sampled_query telemetry events:
- `ScanCount`: Number of scans in the query plan.
- `ScanWithStatsCount`: Number of scans using statistics (including forecasted statistics) in the query plan.
- `ScanWithStatsForecastCount`: Number of scans using forecasted statistics in the query plan.
- `TotalScanRowsWithoutForecastsEstimate`: Total number of rows read by all scans in the query, as estimated by the optimizer without using forecasts.
- `NanosSinceStatsForecasted`: The greatest quantity of nanoseconds that have passed since the forecast time (or until the forecast time, if it is in the future, in which case it will be negative) for any table with forecasted stats scanned by this query.

89418: sqlstats: always enable tracing first time fingerprint is seen r=j82w a=j82w

Fixes: #89185. The first time a fingerprint is seen, tracing should be enabled. This is currently broken if sql.metrics.statement_details.plan_collection.enabled is set to false, which can cause crdb_internal.transaction_contention_events to be empty because tracing was never enabled, so the contention event was never recorded. To properly fix this, a new value needs to be returned from ShouldSample to tell whether it is the first time a fingerprint is seen. This will remove the dependency on the plan_collection feature switch. Release justification: Bug fixes and low-risk updates to new functionality. Release note (bug fix): Always enable tracing the first time a fingerprint is seen.

89502: kvserver: do not report diff in consistency checks r=erikgrinaker,tbg a=pavelkalinnikov

This commit removes reporting of the diff between replicas in case of consistency check failures. With the increased range sizes, the previous approach has become infeasible. One possible way to inspect an inconsistency after this change is: 1. Run the `cockroach debug range-data` tool to extract the range data from each replica's checkpoint. 2. Use standard OS tools like `diff` to analyse them. In the meantime, we are researching the UX of this alternative approach, and seeing if there can be better tooling support. Part of #21128. Epic: none. Release note (sql change): The `crdb_internal.check_consistency` function no longer includes the diff between inconsistent replicas, should they occur. If an inconsistency occurs, the storage engine checkpoints should be inspected. This change is made because the range size limit has been increased from 64 MiB to O(GiB), so inlining diffs in consistency checks does not scale.

89529: changefeedccl: update parallel consumer metrics r=jayshrivastava a=jayshrivastava

Previously, both changefeed.nprocs_flush_nanos and changefeed.nprocs_consume_event_nanos were counters that monotonically increased. This was not that useful when determining the average time it takes to consume or flush an event. Changing them to a histogram fixes this issue and allows for percentile values like p90, p99. This change also updates changefeed.nprocs_in_flight_count to sample values when incrementing the inFlight variable. Previously, it was showing up as 0 in the UI; this change makes it show the actual value. Fixes #89654. Release note: None. Epic: none

89660: tree: use FastIntSet during typechecking r=jordanlewis a=jordanlewis

Previously, the typecheck phase used several slices of ordinals into lists. This is a perfect use case for FastIntSet, because the ordinals tend to be quite small. This commit switches to use FastIntSet instead.

```
name          old time/op    new time/op    delta
TypeCheck-10  3.68µs ± 3%    3.52µs ± 2%    -4.52%   (p=0.000 n=9+10)

name          old alloc/op   new alloc/op   delta
TypeCheck-10  744B ± 0%      576B ± 0%      -22.58%  (p=0.000 n=10+10)

name          old allocs/op  new allocs/op  delta
TypeCheck-10  32.0 ± 0%      18.0 ± 0%      -43.75%  (p=0.000 n=10+10)
```

Issue: None. Epic: None. Release note: None

89662: scop, scdep: Rename `IndexValidator` to `Validator` r=Xiang-Gu a=Xiang-Gu

I've recently done work to enable adding/dropping check constraints in the declarative schema changer. It is a big PR with many commits, so I think it's nicer to separate them into multiple PRs. This is the first PR in that effort, which is merely renaming and should be easy to review: commit 1: ValidateCheckConstraint now uses ConstraintID instead of constraint name; commit 2: Rename `IndexValidator` to `Validator`. We previously had a file under scdep called `index_validator.go` where we implement logic for validating an index. Now that we are going to validate check constraints, we renamed them so they will also validate check constraints. Informs #89665. Release note: None

Co-authored-by: Michael Erickson <[email protected]> Co-authored-by: j82w <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]> Co-authored-by: Jayant Shrivastava <[email protected]> Co-authored-by: Jordan Lewis <[email protected]> Co-authored-by: Xiang Gu <[email protected]>
89747: restore: only read from manifests with startTime < AOST r=stevendanna a=msbutler

Previously, all backup manifests in the chain could be added to a restoreSpanEntry during makeSimpleImportSpans, even if the backup manifest's startTime was greater than the user's AOST. This caused an unnecessary amount of data to be read from external storage and filtered out in the restore processor. This small patch prevents these newer manifests from being considered in makeSimpleImportSpans and thus unnecessarily read from during the distsql flow. Release note: None

89888: tenantcostserver, filetable: replace some `*.executor`s with `InternalExecutorFactory` r=ajwerner a=ZhouXing19

This is part of migrating the existing usages of the internal executor to the new internal executor interfaces. The changes are based on this logic: trace the `txn` in the exec functions of the internal executor, find out where they come from, and ensure that an IE is created along with this txn via `planner.WithInternalExecutor` or `sqlutil.InternalExecutorFactory.TxnWithExecutor`. The new interfaces allow the internal executor to be bound to the outer txn, along with txn-related metadata. Release note: None. Epic CRDB-19135

89899: kvserver: improve consistency check fail message r=erikgrinaker a=pavelkalinnikov

Epic: none. Touches #21128. Release note (ops change): The consistency check failure message is made more informative, and suggests a few actions that operators should/may take in the unlikely event it occurs.

Co-authored-by: Michael Butler <[email protected]> Co-authored-by: Jane Xing <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
89813: kvserver: remove legacy snapshot and diff code r=erikgrinaker a=pavelkalinnikov

This commit removes the proto and code for `RaftDataSnapshot`. This type was used for reporting replica snapshots and computing diffs, but this functionality has now been removed in favor of storage checkpoints and offline tooling. Part of #21128. Epic: none. Release note: None

90006: roachtest: introduce a test to overwhelm nodes r=irfansharif a=andrewbaptist

This test checks the behavior of the system as the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaluate how this is handled, but today this will cause nodes to OOM. Informs #89142. Release note: None

90063: roachtest: de-flake admission-control/tpcc-olap r=irfansharif a=irfansharif

Fixes #89600. This test makes use of `workload querybench`, which is only available in the `workload` binary. This test started failing after #89482, where we attempted to replace all uses of `./workload` with `./cockroach workload`. Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]> Co-authored-by: Andrew Baptist <[email protected]> Co-authored-by: irfan sharif <[email protected]>
We are discussing supporting larger ranges and making first steps in their implementation, see for example #16954.
The consistency checker on failure currently requests a "diff", but this diff is implemented by sending a whole snapshot through memory (in `CollectChecksumResponse`). This won't work any more once ranges are allowed to be larger than a small fraction of memory.

gz#9424
Jira issue: CRDB-5898