kvserver: remove consistency checker diffs #21128

Closed
tbg opened this issue Dec 29, 2017 · 26 comments
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@tbg
Member

tbg commented Dec 29, 2017

We are discussing supporting larger ranges and taking first steps toward their implementation; see for example #16954.

The consistency checker on failure currently requests a "diff" but this diff is implemented by sending a whole snapshot through memory (in CollectChecksumResponse). This won't work any more once ranges are allowed to be larger than a small fraction of memory.

gz#9424

Jira issue: CRDB-5898

@tbg tbg added this to the 2.1 milestone Dec 29, 2017
@petermattis petermattis added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels Jul 21, 2018
@tbg tbg modified the milestones: 2.1, 2.2 Jul 31, 2018
@petermattis petermattis removed this from the 2.2 milestone Oct 5, 2018
@petermattis
Collaborator

In discussion today, @sumeerbhola was wondering how we compute the consistency checker diff. The answer is we buffer the entire Range data and send it in a single response. I tracked down this issue from this comment in the code:

  // TODO(tschottdorf): with larger ranges, this is no longer tenable.
  // See https://github.com/cockroachdb/cockroach/issues/21128.
  roachpb.RaftSnapshotData snapshot = 2;

Cc @ajwerner, @andy-kimball, and @johnrk as this throws a bit of a wrench into larger range sizes.

@tbg
Member Author

tbg commented Apr 14, 2020

Just in case this isn't clear: the diff is requested only after a first round of the check has already failed. It's still bad.

@petermattis
Collaborator

I suspect the solution here would be to add a streaming RPC interface.

@ajwerner
Contributor

I suspect the solution here would be to add a streaming RPC interface.

This isn't going to be a readily back-portable change. This makes me nervous.

@tbg
Member Author

tbg commented Apr 14, 2020

How big are the ranges in 20.1? We can always not send a range data snapshot if it would be too large. It would be unfortunate to see this in practice, but the data checkpoints would be there anyway, so we could reconstruct the diff manually later.
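To illustrate what I mean, something along these lines would do; everything here (the names, the threshold, the stand-in types) is made up for the sketch, not actual kvserver code:

```go
// Simplified stand-ins for the real roachpb types, for illustration only.
type RaftSnapshotData struct{ /* buffered range KVs */ }
type CollectChecksumResponse struct{ Snapshot *RaftSnapshotData }

// maxDiffSnapshotBytes is a hypothetical cutoff, not an actual cluster setting.
const maxDiffSnapshotBytes = 64 << 20 // 64 MiB

// maybeAttachSnapshot attaches the diff snapshot only when the range is small
// enough to buffer in memory; otherwise we rely on the on-disk checkpoints to
// reconstruct a diff later.
func maybeAttachSnapshot(resp *CollectChecksumResponse, snap *RaftSnapshotData, estimatedRangeBytes int64) {
	if estimatedRangeBytes > maxDiffSnapshotBytes {
		return
	}
	resp.Snapshot = snap
}
```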

@ajwerner
Contributor

How big are the ranges in 20.1?

range_max_size defaults to 512 MiB and range_min_size defaults to 128 MiB. This is likely to be bad on tiny deployments but fine on most deployments.

@petermattis
Collaborator

This isn't going to be a readily back-portable change. This makes me nervous.

This is a good point. Disabling sending range snapshot data for large ranges might be the only back-portable change. My suggestion was more about a long-term fix. We actually already have a facility for streaming range data that we use for sending snapshots. Doesn't seem like it could be easily adapted here, though.
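Roughly, a streaming interface could look like the sketch below; all names and types here are hypothetical, not existing CockroachDB RPCs:

```go
// ChecksumDiffChunk is a hypothetical bounded batch of KVs; a gRPC-style
// stream of these would replace the single buffered
// CollectChecksumResponse.Snapshot.
type ChecksumDiffChunk struct {
	Keys   [][]byte
	Values [][]byte
}

// ChecksumDiffStream mimics a grpc.ServerStream-backed sender.
type ChecksumDiffStream interface {
	Send(*ChecksumDiffChunk) error
}

// streamRangeData pulls KVs from an iterator-style closure and sends them in
// fixed-size chunks, keeping memory usage bounded regardless of range size.
func streamRangeData(next func() (key, value []byte, ok bool), stream ChecksumDiffStream, maxChunk int) error {
	chunk := &ChecksumDiffChunk{}
	for {
		k, v, ok := next()
		if !ok {
			break
		}
		chunk.Keys = append(chunk.Keys, k)
		chunk.Values = append(chunk.Values, v)
		if len(chunk.Keys) >= maxChunk {
			if err := stream.Send(chunk); err != nil {
				return err
			}
			chunk = &ChecksumDiffChunk{}
		}
	}
	if len(chunk.Keys) > 0 {
		return stream.Send(chunk)
	}
	return nil
}
```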

@erikgrinaker
Contributor

We're doing some adjacent work over in #77432, and plan to remove the consistency checker diffs while we're at it, since they're pretty unwieldy at 512 MB. This involves removing the following fields and functionality:

  • roachpb.CheckConsistencyRequest.WithDiff
  • roachpb.ComputeChecksumRequest.Snapshot
  • roachpb.CollectChecksumResponse.Snapshot
  • roachpb.RaftSnapshotData

It might be possible to do this without a version gate or keeping the old code around for mixed-version clusters, because the current code gracefully ignores responses that don't have a snapshot:

minoritySnap := results[shaToIdxs[minoritySHA][0]].Response.Snapshot
curSnap := results[shaToIdxs[sha][0]].Response.Snapshot
if sha != minoritySHA && minoritySnap != nil && curSnap != nil {
	diff := diffRange(curSnap, minoritySnap)
	if report := r.store.cfg.TestingKnobs.ConsistencyTestingKnobs.BadChecksumReportDiff; report != nil {
		report(*r.store.Ident, diff)
	}
	buf.Printf("====== diff(%x, [minority]) ======\n%v", redact.Safe(sha), diff)
}

However, we'll need some mechanism to obtain such diffs for debugging. We already take Pebble checkpoints when inconsistencies are found, so we'll need to write e.g. a cockroach debug diff-checkpoints CLI command to compare them between good/bad nodes:

if cc.Checkpoint {
	sl := stateloader.Make(r.RangeID)
	as, err := sl.LoadRangeAppliedState(ctx, snap)
	if err != nil {
		log.Warningf(ctx, "unable to load applied index, continuing anyway")
	}
	// NB: the names here will match on all nodes, which is nice for debugging.
	tag := fmt.Sprintf("r%d_at_%d", r.RangeID, as.RaftAppliedIndex)
	if dir, err := r.store.checkpoint(ctx, tag); err != nil {
		log.Warningf(ctx, "unable to create checkpoint %s: %+v", dir, err)
	} else {
		log.Warningf(ctx, "created checkpoint %s", dir)
	}
}

This could reuse the existing diff code:

// diffs the two kv dumps between the lease holder and the replica.
func diffRange(l, r *roachpb.RaftSnapshotData) ReplicaSnapshotDiffSlice {
	if l == nil || r == nil {
		return nil
	}
	var diff ReplicaSnapshotDiffSlice
	diff = append(diff, diffKVs(l.KV, r.KV)...)
	diff = append(diff, diffRangeKeys(l.RangeKV, r.RangeKV)...)
	return diff
}
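For reference, the offline comparison itself can be a straightforward lockstep iteration over the two Pebble checkpoint directories. A rough sketch of what such a diff-checkpoints tool could do (the function is hypothetical, not the actual `cockroach debug` implementation; a real tool would also need to pass CockroachDB's engine comparer and related Pebble options, which this sketch glosses over):

```go
package main

import (
	"bytes"
	"fmt"
	"log"
	"os"

	"github.com/cockroachdb/pebble"
)

// diffCheckpoints opens two Pebble checkpoints read-only and prints keys whose
// presence or value differs. Keys are compared as raw engine-encoded bytes.
// NB: opening a real CockroachDB checkpoint requires the engine's custom
// comparer/merger in pebble.Options; defaults are used here to keep it short.
func diffCheckpoints(dirA, dirB string) error {
	dbA, err := pebble.Open(dirA, &pebble.Options{ReadOnly: true})
	if err != nil {
		return err
	}
	defer dbA.Close()
	dbB, err := pebble.Open(dirB, &pebble.Options{ReadOnly: true})
	if err != nil {
		return err
	}
	defer dbB.Close()

	itA, err := dbA.NewIter(nil)
	if err != nil {
		return err
	}
	defer itA.Close()
	itB, err := dbB.NewIter(nil)
	if err != nil {
		return err
	}
	defer itB.Close()

	okA, okB := itA.First(), itB.First()
	for okA || okB {
		switch {
		case okA && (!okB || bytes.Compare(itA.Key(), itB.Key()) < 0):
			fmt.Printf("-%x\n", itA.Key()) // only in A
			okA = itA.Next()
		case okB && (!okA || bytes.Compare(itB.Key(), itA.Key()) < 0):
			fmt.Printf("+%x\n", itB.Key()) // only in B
			okB = itB.Next()
		default: // key present in both: compare values
			if !bytes.Equal(itA.Value(), itB.Value()) {
				fmt.Printf("!%x\n", itA.Key()) // differing value
			}
			okA, okB = itA.Next(), itB.Next()
		}
	}
	return nil
}

func main() {
	// Usage: diff-checkpoints <checkpoint-dir-A> <checkpoint-dir-B>
	if err := diffCheckpoints(os.Args[1], os.Args[2]); err != nil {
		log.Fatal(err)
	}
}
```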

@erikgrinaker erikgrinaker changed the title storage: consistency checker diffs won't work with increased range size kvserver: remove consistency checker diffs Sep 1, 2022
@sumeerbhola
Collaborator

and plan to remove the consistency checker diffs while we're at it, since they're pretty unwieldy at 512 MB

Are we going to lose diff support?

@pav-kv
Collaborator

pav-kv commented Sep 6, 2022

Likely yes, in its current form (an API that transfers the entire range). But we want diffs to be available in some form out-of-band, e.g. have tooling for constructing it from the checkpoints saved on replicas.

We could also explore some forms of partial/sampled diffing, e.g. detect one sub-range with a mismatch, and diff only that part.

@pav-kv
Collaborator

pav-kv commented Sep 6, 2022

One option for a partial diff:

  • Have each replica compute a sequence of hashes instead of one hash, spaced out (deterministically) across the range so that there aren't too many of them, and each hash spans a relatively small size. ~10-100 would do.
  • Compare these sequences of hashes in the first phase of the diff algo. (where we previously compared a single hash)
  • When there is a mismatch, drill down into the first hash that mismatches: exchange the corresponding spans of keys and diff them. (like we did before for the whole range)

We could purposefully limit this to 1 mismatching key (it's the worst case anyway, if there is one large k/v entry), or to some reasonable byte-size worth of keys, and leave full analysis to the offline tooling. A rough sketch of the span-hash computation follows.
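Minimal sketch of the idea; the kv type and the flat slice are simplified stand-ins for the real checksum iteration, and real span boundaries would have to be chosen deterministically so that they line up across replicas:

```go
import "crypto/sha256"

// kv is a simplified stand-in for an MVCC key/value pair.
type kv struct {
	key, value []byte
}

// spanHash pairs a hash with the first key of the span it covers.
type spanHash struct {
	startKey []byte
	sum      [sha256.Size]byte
}

// spanHashes splits the sorted KV stream into spans of roughly targetBytes
// each and returns one hash per span. Replicas exchange these short sequences
// instead of full range data; the first diverging entry narrows the
// inconsistency down to a small span that can then be fetched and diffed.
func spanHashes(kvs []kv, targetBytes int) []spanHash {
	var out []spanHash
	h := sha256.New()
	var start []byte
	var size int
	flush := func() {
		if size == 0 {
			return
		}
		var s spanHash
		s.startKey = start
		copy(s.sum[:], h.Sum(nil))
		out = append(out, s)
		h.Reset()
		start, size = nil, 0
	}
	for _, e := range kvs {
		if start == nil {
			start = e.key
		}
		h.Write(e.key)
		h.Write(e.value)
		size += len(e.key) + len(e.value)
		if size >= targetBytes {
			flush()
		}
	}
	flush()
	return out
}
```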

One advantage of retaining some inline diffing (instead of fully removing it) is in having somewhat more informative consistency check failures should they occur in tests like #87106. @tbg How would a human debug those things? I imagine that I would maybe look at the first few mismatching entries in this diff and try to understand what's up with them, but there is probably benefit in seeing larger patterns in it too.

@tbg
Member Author

tbg commented Sep 7, 2022

A human would always investigate the checkpoints. The diff would serve to guide them as to which keys are of interest, but there isn't a huge upshot to having the diff available upfront (and if there were, we could build the retrieval into roachtest). The diff alone would still be somewhat useful, but not at the expense of lots of engineering.

@pav-kv
Collaborator

pav-kv commented Sep 28, 2022

A human would always investigate the checkpoints.

SGTM.

I’m thinking of the following flow when an inconsistency happens:

  1. Like today, crash suspicious nodes (those in the minority when SHA mismatches).
    • This will get surfaced to admins via some crash reporting mechanism.
    • The crash message should have enough information to guide the investigation.
  2. Checkpoint all replicas (we need at least one good, for diffing and analysis; but it’s better to have all of them, because even the majority can be “wrong” - e.g. leader may pass its corrupted snapshot to some followers).
  3. Admin uses cockroach debug tool to analyse and eventually remove the checkpoints.

Issues:

  1. Checkpoints become expensive over time, so they should be deleted in a timely manner (on the good nodes).
  2. This may not scale if there are too many failures like this.
    • Could be solved with some automatic checkpoint GC.
  3. A minority node may never learn that it needs to crash because of an inconsistency (it may crash for another reason, get stuck, etc). So the admins may miss this event.
  4. The thing that I still don’t like about the checkpoint is that it’s not granular per range, so checkpointing good replicas is costly.
    • We may want to save "lightweight" checkpoints (specific to the range) on all the nodes that are not killed.

@erikgrinaker
Contributor

Checkpoints become expensive over time, so they should be deleted in a timely manner (on the good nodes).

We should clearly call this out in the crash message, so that admins know they'll have to remove the checkpoints when they no longer need them. I think that should be sufficient for now. We may want to e.g. log a message on startup if there are any checkpoints, too. It's worth pinging the SREs real quick about how they'd like this to be surfaced.

@joshimhoff When a consistency check fails, we leave behind a checkpoint of the current Pebble state on the relevant nodes (those with replicas for that range). These checkpoints are needed to debug the inconsistency, but they are never removed, so over time they'll essentially become an old copy of the node's database. How would SREs prefer to be notified about these checkpoints, which will need to be removed manually when no longer needed for debugging? Currently, the inconsistent nodes go into a panic loop with a log message, so the simplest solution is to just call this out in the crash message -- but we'll also keep checkpoints on the good nodes, and those will need to be removed too. Is that sufficient?

This may not scale if there are too many failures like this.

We essentially never see these failures, so let's not optimize for it.

The thing that I still don’t like about the checkpoint is that it’s not granular per range, so checkpointing good replicas is costly.

True, but they're cheap to take -- they're only expensive to keep for a long time. Let's leave the GC responsibility to admins for now, and not optimize for it.

@joshimhoff
Collaborator

To answer your Q, I need to ask some Qs first:

If a consistency check fails, will we definitely get a panic crash loop that doesn't end until some operator takes an action? Or could we get a small number of crashes that go away without some operator taking action? I see you say "panic loop" but want to make sure I get what that means exactly.

If an operator will need to take an action, what actions might they take?

@erikgrinaker
Contributor

erikgrinaker commented Sep 28, 2022

If a consistency check fails, will we definitely get a panic crash loop that doesn't end until some operator takes an action?

The node will keep crashing forever, by design. The node is borked and must be replaced.

However, only the node(s) with the inconsistencies will crash. We'll also leave checkpoints behind on the nodes with correct replicas.

If an operator will need to take an action, what actions might they take?

  1. Make copies of the checkpoints, and either give them to CRL staff for debugging, or run some tools that produce a report to give to CRL staff.
  2. Replace the crashing nodes that have inconsistencies.
  3. Remove the checkpoints from all healthy nodes that have them.

@erikgrinaker
Contributor

erikgrinaker commented Sep 28, 2022

The thing that I still don’t like about the checkpoint is that it’s not granular per range, so checkpointing good replicas is costly.

@pavelkalinnikov One consideration here is that we may not want to have to copy the entire checkpoint around to run the diff tool. It may be better to have a two-stage process: one command to extract replica data from a checkpoint, then copy that replica data somewhere and run a tool to diff them.

@joshimhoff
Collaborator

joshimhoff commented Sep 28, 2022

The node will keep crashing forever, by design. The node is borked and must be replaced.

Ack! Then I'm confident SRE will notice the inconsistency, if the alerting stack functions as designed.

How would SREs prefer to be notified about these checkpoints, which will need to be removed manually when no longer needed for debugging?

Here are some options, in order of how likely it is that CRL actually cleans up the checkpoints in a cloud context:

  1. Implement GC of checkpoints in CRDB itself.
  2. Increment a prometheus gauge or similar that indicates when there are checkpoints on disk to clean up. SRE adds an alert in case the gauge fires. The alert links to a playbook that says to clean up the checkpoints, if they are no longer needed.
  3. Just mention the checkpoints in CRDB logs. Also have an SRE playbook re: corruption that mentions the need to clean up the checkpoints.

If we go with 3, I'm not confident we will clean up checkpoints, but maybe that is okay right now, given that a corruption crash loop has never happened in cloud according to my memory, and given that the checkpoints just increase disk utilization.

@tbg
Member Author

tbg commented Sep 29, 2022

A gauge seems doable and maybe a good way to solve it reliably.

diff --git a/pkg/kv/kvserver/store.go b/pkg/kv/kvserver/store.go
index fceda5df42..c6cd380766 100644
--- a/pkg/kv/kvserver/store.go
+++ b/pkg/kv/kvserver/store.go
@@ -3303,11 +3303,15 @@ func (s *Store) updateReplicationGauges(ctx context.Context) error {
 	return nil
 }
 
+func (s *Store) checkpointDir() string {
+	return filepath.Join(s.engine.GetAuxiliaryDir(), "checkpoints")
+}
+
 // checkpoint creates a RocksDB checkpoint in the auxiliary directory with the
 // provided tag used in the filepath. The filepath for the checkpoint directory
 // is returned.
 func (s *Store) checkpoint(ctx context.Context, tag string) (string, error) {
-	checkpointBase := filepath.Join(s.engine.GetAuxiliaryDir(), "checkpoints")
+	checkpointBase := s.checkpointDir()
 	_ = s.engine.MkdirAll(checkpointBase)
 
 	checkpointDir := filepath.Join(checkpointBase, tag)
@@ -3337,6 +3341,19 @@ func (s *Store) ComputeMetrics(ctx context.Context, tick int) error {
 	m := s.engine.GetMetrics()
 	s.metrics.updateEngineMetrics(m)
 
+	{
+		cpDirs, err := s.engine.List(s.checkpointDir())
+		var n int64
+		if err != nil {
+			// TODO probably need to tolerate NotFound here,
+			n = math.MaxInt64
+		} else {
+			n = int64(len(cpDirs))
+		}
+
+		someGauge.Update(n)
+	}
+
 	// Get engine Env stats.
 	envStats, err := s.engine.GetEnvStats()
 	if err != nil {
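The gauge itself would presumably follow the usual pkg/util/metric pattern; a sketch, with placeholder name and help text rather than whatever the eventual PR uses:

```go
// Placeholder metadata for the new gauge; name and help text are illustrative.
var metaStorageCheckpoints = metric.Metadata{
	Name:        "storage.checkpoints",
	Help:        "Number of checkpoint directories found on the store; these are left behind by failed consistency checks and must be deleted manually",
	Measurement: "Directories",
	Unit:        metric.Unit_COUNT,
}

// In StoreMetrics: add `Checkpoints *metric.Gauge`, construct it with
// metric.NewGauge(metaStorageCheckpoints), and from ComputeMetrics update it
// with the directory count computed above: s.metrics.Checkpoints.Update(n).
```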

@erikgrinaker
Contributor

Yeah, let's do a gauge and a log message (on crash and on node startup). GC is going to be more work than it's worth for something that's very rare.

@joshimhoff
Collaborator

Works for me! Send me the PR adding the gauge, and I'll make sure we add an alert on the CC side.

@pav-kv
Collaborator

pav-kv commented Sep 29, 2022

SGTM, thank you all for the input. A gauge seems strictly more reliable because of the issue that I pointed out above:

A minority node may never learn that it needs to crash because of an inconsistency (it may crash for another reason, get stuck, etc). So the admins may miss this event.

A gauge will be increased on all the nodes with live replicas, so this signal will propagate.

craig bot pushed a commit that referenced this issue Sep 30, 2022
88080: stmtdiagnostics: save the bundle on a statement timeout r=yuzefovich a=yuzefovich

Previously, if the traced statement was canceled due to a statement timeout, the statement bundle would be created but would fail on insertion into the system table. This was suboptimal because we had already done all the work to collect the bundle, and it might be desirable to see the partial trace, so this commit makes sure that the bundle is saved correctly.

Fixes: #73477.

Release note (bug fix): Previously, when a statement bundle was collected for a query that results in an error due to a `statement_timeout`, the bundle would not be saved, and this is now fixed.

88307: sql: retry validation of unique constraint r=yuzefovich a=yuzefovich

Previously, when running the validation of a unique constraint we would execute the query only once, and if that failed for any reason, we'd return an error. This was suboptimal because we might be performing the validation after having performed a lot of work (e.g. running an import), and the error could be a transient one (e.g. if a node that is used in the distributed validation query goes down). In order to improve the situation, this commit adds a retry loop with 5 attempts and an allowlist of errors that are swallowed.

Fixes: #85711.

Release note: None

88929: sql: add tracing spans for each statement executed within a UDF r=mgartner a=mgartner

This commit creates a separate tracing span for each statement executed during evaluation of a UDF.

Release note: None

88933: opt: add tests for FK delete cascade fast-path with UDF r=mgartner a=mgartner

The FK delete cascade fast-path remaps the `DELETE` filter into a filter on child table, avoiding buffering the parent's deleted rows like in the non-fast-path plan. It is possible to plan this fast-path if the filters contain UDF invocations, as long as they are immutable or stable. (Of course, if the user creates a UDF with mislabeled volatility, data corruption is possible. To mitigate this and other corruption from mislabeled UDF volatility in the future, we're considering performing some analysis during `CREATE FUNCTION` to prevent users from incorrectly assigning volatility.)

This commit adds tests for FK delete cascades with UDFs and removes an unnecessary TODO.

Release note: None

88978: roachtest: tighten boundaries of tpch_concurrency r=yuzefovich a=yuzefovich

This should remove one iteration which should keep the test under the default 10 hour timeout.

Release note: None

88991: changefeedccl: Release overhead allocation r=miretskiy a=miretskiy

To account for the overhead during the encoding stage, we allocate more memory resources than needed
(`changefeed.event_memory_multiplier`).

Those resources are released when they are emitted by the sink. Some sinks, such as file-based sinks, batch many such events (to generate files of the target size). This batching, when combined with compression, could result in a situation where the maximum of allowed resources, `changefeed.memory.per_changefeed_limit`, is entirely tied up in batches, making it impossible to ingest additional events without forcing a sink flush.

The only way to avoid a premature forced flush is to increase the memory limit for changefeeds -- but doing so is not without costs (i.e. more events batched, more pressure on Go GC, etc).

This PR adjusts event resources to match the final size of the data that will be emitted into the sink -- that is, we release the overhead back into the pool once the event processing is done.

Release note: none

89035: ui: fix contention time tooltip on insights page r=ericharmeling a=ericharmeling

This commit updates the High Contention Time tooltip on the insights page to accurately reflect the event's contention duration.

https://cockroachlabs.slack.com/archives/G01Q9D01NTU/p1664477246961729?thread_ts=1664473625.621409&cid=G01Q9D01NTU

https://www.loom.com/share/a8ce63ece586468e870d67118c0d8d5a


Release note: None

Release justification: bug fix

89036: eval: partial removal of usages of the stored context r=yuzefovich a=yuzefovich

This PR contains several commits from #85782, which attempts to remove the stored `context.Context` from the `eval.Context` object. I didn't finish that PR since I ran into two very thorny usages, so instead this PR takes some steps towards deprecating the usage of the stored context.

Storing a `context.Context` is an anti-pattern that needs a justification, and in the case of `eval.Context` I think it was introduced that way specifically for scalar function evaluation (which should use the same `context.Context`), which was OK, but over time we introduced many more callsites out of convenience.

Fixes: #46164

89066: kvserver: add storage checkpoints metric r=erikgrinaker a=pavelkalinnikov

When an inconsistency between a range's replicas is found, every replica's node leaves behind a storage checkpoint which can be used for finding the cause of the inconsistency. The checkpoints are cheap to make if hard links are used. However, the cost of a checkpoint on a live node increases over time because the state keeps diverging from it. Effectively it becomes a full copy of the old state, and consumes that much disk capacity.

It is important to act on the checkpoints and reclaim the space. This commit adds a store metric which exposes the number of checkpoints found, so that admins/SREs can be alerted and act on it.

Touches #21128

Added a test, and also tested manually: tricked one replica into thinking that there is an inconsistency, which triggered all replicas to store a checkpoint and one node to crash; the metric on the other two replicas increased to 1, and I saw the corresponding directories in storage.

Release note: None

Co-authored-by: Yahor Yuzefovich <[email protected]>
Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Yevgeniy Miretskiy <[email protected]>
Co-authored-by: Eric Harmeling <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Oct 4, 2022
88969: sql: address minor typos in recent overflow fix r=mgartner a=mgartner

This commit fixes minor typos introduced in #88199.

Release note: None

89207: kvserver: rm consistency check diff report in tests r=erikgrinaker a=pavelkalinnikov

This change removes the `BadChecksumReportDiff` testing knob.

Previously, a node that runs a consistency check would report any diffs through this callback to tests. However, now the plan is to move away from computing the diff, and instead leave storage engine checkpoints on each replica so that later they can be analysed/diffed by tooling.

When this plan is implemented, the initiating node will no longer be able to report diffs into the testing knob. This commit proactively refactors a test in such a way that it doesn't need the knob, and verifies the diff by reading from the storage directly. This approximates what we will do with tooling too.

Touches #21128

Release note: None

Co-authored-by: Marcus Gartner <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
@pav-kv
Collaborator

pav-kv commented Oct 5, 2022

It may be better to have a two-stage process: one command to extract replica data from a checkpoint, then copy that replica data somewhere and run a tool to diff them.

For the replica extraction step, it looks like one of the SSTWriter variants would be reusable:

// MakeBackupSSTWriter creates a new SSTWriter tailored for backup SSTs which
// are typically only ever iterated in their entirety.
func MakeBackupSSTWriter(ctx context.Context, cs *cluster.Settings, f io.Writer) SSTWriter {

// MakeIngestionSSTWriter creates a new SSTWriter tailored for ingestion SSTs.
// These SSTs have bloom filters enabled (as set in DefaultPebbleOptions) and
// format set to RocksDBv2.
func MakeIngestionSSTWriter(

Are there any other options? How are range snapshots encoded? Is it via MakeIngestionSSTWriter or something else?

Between "ingestion" and "backup", I will probably start with the latter. The former is interesting though if we imagine wanting to recover inconsistent replicas after the investigation. We can look at this option later, as for now the recovery/upreplication will happen as a result of some nodes crashing.

@erikgrinaker
Contributor

I'd go with MakeIngestionSSTWriter, since it's more general purpose and uses typical options (it e.g. has bloom filters enabled and smaller blocks for faster seeks). Range snapshots use MakeIngestionSSTWriter.

@sumeerbhola
Collaborator

Does the node crashing get triggered by a raft command on a replica? That would make that replica internally consistent in its state machine (RaftAppliedIndex plus state machine state), but not necessarily other replicas that may be in the middle of a sequence of (atomic ingest of AddSSTable, atomic update of RaftAppliedIndex). Is the offline tool going to try to diff other replicas too?

@erikgrinaker
Contributor

Does the node crashing get triggered by a raft command on a replica

Yes.

Is the offline tool going to try to diff other replicas too?

No. As you point out, only this range's replicas are guaranteed to be consistent.

craig bot pushed a commit that referenced this issue Oct 10, 2022
88539: sql: add telemetry for statistics forecast usage r=rytaft a=michae2

Add a few fields to the sampled_query telemetry events that will help us
measure how useful table statistics forecasting is in practice.

Fixes: #86356

Release note (ops change): Add five new fields to the sampled_query
telemetry events:
- `ScanCount`: Number of scans in the query plan.
- `ScanWithStatsCount`: Number of scans using statistics (including
  forecasted statistics) in the query plan.
- `ScanWithStatsForecastCount`: Number of scans using forecasted
  statistics in the query plan.
- `TotalScanRowsWithoutForecastsEstimate`: Total number of rows read by
  all scans in the query, as estimated by the optimizer without using
  forecasts.
- `NanosSinceStatsForecasted`: The greatest quantity of nanoseconds that
  have passed since the forecast time (or until the forecast time, if it
  is in the future, in which case it will be negative) for any table
  with forecasted stats scanned by this query.

89418: sqlstats: always enable tracing first time fingerprint is seen r=j82w a=j82w

Fixes: #89185

The first time a fingerprint is seen, tracing should be enabled. This is currently broken if sql.metrics.statement_details.plan_collection.enabled is set to false. This can cause crdb_internal.transaction_contention_events to be empty because tracing was never enabled, so the contention event was never recorded.

To properly fix this, a new value needs to be returned from ShouldSample to tell whether it is the first time a fingerprint is seen. This will remove the dependency on the plan_collection feature switch.

Release justification: Bug fixes and low-risk updates to new functionality.

Release note (bug fix): Always enable tracing the first time a fingerprint is seen.

89502: kvserver: do not report diff in consistency checks r=erikgrinaker,tbg a=pavelkalinnikov

This commit removes reporting of the diff between replicas in case of consistency check failures. With the increased range sizes the previous approach has become infeasible.

One possible way to inspect an inconsistency after this change is:
1. Run `cockroach debug range-data` tool to extract the range data from each replica's checkpoint.
2. Use standard OS tools like `diff` to analyse them.

In the meantime, we are researching the UX of this alternative approach, and seeing if there can be better tooling support.

Part of #21128
Epic: none

Release note (sql change): The `crdb_internal.check_consistency` function no longer includes the diff between inconsistent replicas, should they occur. If an inconsistency occurs, the storage engine checkpoints should be inspected. This change is made because the range size limit has since been increased from 64 MiB to O(GiB), so inlining diffs in consistency checks does not scale.

89529: changefeedccl: update parallel consumer metrics r=jayshrivastava a=jayshrivastava

Previously, both changefeed.nprocs_flush_nanos and changefeed.nprocs_consume_event_nanos were counters that monotonically increased. This was not that useful when determining the average time it takes to consume or flush an event. Changing them to a histogram fixes this issue and allows for percentile values like p90, p99.

This change also updates changefeed.nprocs_in_flight_count to sample values when incrementing the inFlight variable. Previously, it was showing up as 0 in the UI. This change makes it show the actual value.

Fixes #89654

Release note: None
Epic: none

89660: tree: use FastIntSet during typechecking r=jordanlewis a=jordanlewis

Previously, the typecheck phase used several slices of ordinals into lists. This is a perfect use case for FastIntSet, because the ordinals tend to be quite small. This commit switches to use FastIntSet instead.

```
name          old time/op    new time/op    delta
TypeCheck-10    3.68µs ± 3%    3.52µs ± 2%   -4.52%  (p=0.000 n=9+10)

name          old alloc/op   new alloc/op   delta
TypeCheck-10      744B ± 0%      576B ± 0%  -22.58%  (p=0.000 n=10+10)

name          old allocs/op  new allocs/op  delta
TypeCheck-10      32.0 ± 0%      18.0 ± 0%  -43.75%  (p=0.000 n=10+10)
```
Issue: None
Epic: None
Release note: None

89662: scop, scdep: Rename `IndexValidator` to `Validator` r=Xiang-Gu a=Xiang-Gu

I've recently done work to enable adding/dropping check constraints in the declarative
schema changer. It is a big PR with many commits. I think it's nicer to separate them
further into multiple PRs. 

This is the first PR in that effort, which is merely renaming and should be easy to review:
commit 1: ValidateCheckConstraint now uses ConstraintID, instead of constraint name;

commit 2: Rename `IndexValidator` to `Validator`.
We previously had a file under scdep called `index_validator.go` where we implemented the logic for validating an index. Now that we are going to validate check constraints as well, these are renamed so they also cover check constraint validation.

Informs #89665
Release note: None

Co-authored-by: Michael Erickson <[email protected]>
Co-authored-by: j82w <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Jayant Shrivastava <[email protected]>
Co-authored-by: Jordan Lewis <[email protected]>
Co-authored-by: Xiang Gu <[email protected]>
craig bot pushed a commit that referenced this issue Oct 17, 2022
89747: restore: only read from manifests with startTime < AOST r=stevendanna a=msbutler

Previously, all backup manifests in the chain could be added to a restoreSpanEntry during makeSimpleImportSpans, even if the backup manifest's startTime was greater than the user's AOST. This caused an unnecessary amount of data to be read from external storage and then filtered out in the restore processor.

This small patch prevents these newer manifests from being considered in makeSimpleImportSpans and thus unnecessarily read from during the distsql flow.

Release note: None

89888: tenantcostserver, filetable: replace some `*.executor`s with `InternalExecutorFactory` r=ajwerner a=ZhouXing19

This is part of migrating the existing usages of the internal executor to the new internal executor interfaces.
The changes are based on this logic: trace up the `txn` in the exec functions of the internal executor, find out where 
they are from, and ensure that an IE is created along with this txn via `planner.WithInternalExecutor` or `sqlutil.InternalExecutorFactory.TxnWithExecutor`.

The new interfaces allow the internal executor to be bound to the outer txn, along with txn-related metadata.

Release note: None

link Epic CRDB-19135

89899: kvserver: improve consistency check fail message r=erikgrinaker a=pavelkalinnikov

Epic: none
Touches #21128

Release note (ops change): The consistency check failure message is made more informative, and suggests a few actions that operators should or may take in the unlikely event it occurs.

Co-authored-by: Michael Butler <[email protected]>
Co-authored-by: Jane Xing <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
craig bot pushed a commit that referenced this issue Oct 17, 2022
89813: kvserver: remove legacy snapshot and diff code r=erikgrinaker a=pavelkalinnikov

This commit removes the proto and code for `RaftSnapshotData`. This type was used for reporting replica snapshots and computing diffs, but this functionality has now been removed in favor of storage checkpoints and offline tooling.

Part of #21128
Epic: none

Release note: None

90006: roachtest: Introduce a test to overwhelm nodes r=irfansharif a=andrewbaptist

This test is here to check the behavior of the system when the SQL load greatly exceeds what the nodes are able to handle. In the future we want to evaluate how this is handled, but today this will cause nodes to OOM.

Informs #89142.

Release note: None

90063: roachtest: de-flake admission-control/tpcc-olap r=irfansharif a=irfansharif

Fixes #89600. This test makes use of `workload querybench`, which is only available in the `workload` binary. This test started failing after #89482, where we attempted to replace all uses of `./workload` with `./cockroach workload`.

Release note: None

Co-authored-by: Pavel Kalinnikov <[email protected]>
Co-authored-by: Andrew Baptist <[email protected]>
Co-authored-by: irfan sharif <[email protected]>