
kvserver: propagate GCHint for partial deletions #110078

Merged 9 commits on Sep 14, 2023

Conversation

pav-kv
Collaborator

@pav-kv pav-kv commented Sep 6, 2023

Previously, GCHint was used for 3 purposes:

  1. Instruct GC to run at specific times. This is used by SQL table/index drop jobs, to limit the amount of time they wait on data GC.
  2. Hint GC that a range is fully covered by range tombstones, so that GC can use ClearRange for faster deletes. This is useful for bulk data deletions as well.
  3. Hint GC that a range was fully covered by range tombstones at some recent point. GC prioritizes going through such ranges, and processes them together, because this makes Pebble compaction more efficient.

Use case 1 was broken. If a range tombstone did not fully cover the Range keyspace, no GCHint was written. Consequently, for small table/index deletions (spanning less than a single range), or deletions that did not perfectly match range bounds, some ranges ended up with no hint. For these ranges, GC could be arbitrarily delayed, and the schema change jobs would be stuck.

In addition, GCHint propagation during merges was lossy: the hint could be dropped if either the LHS or the RHS did not have a hint, or still contained some data.

This commit extends GCHint to fix use case 1. Two fields are added: the min and max deletion timestamps. These fields are populated even for partial range deletions, and are propagated unconditionally during splits and merges. The rationale for the min-max approach is explained in the comments on the corresponding GCHint fields.
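To make the min-max idea concrete, here is a minimal Go sketch. It is not the actual implementation: the gcHint type, the scheduleGCFor and merge helpers, and the plain int64 timestamps are illustrative stand-ins for the real proto fields and HLC timestamps.

package main

import "fmt"

// Toy stand-in for the GCHint discussed above. Field and method names are
// illustrative only; the real hint lives in pkg/roachpb/metadata.proto and
// uses hlc.Timestamp rather than int64.
type gcHint struct {
    minDeleted, maxDeleted int64 // zero means "no pending deletion recorded"
}

// scheduleGCFor records a (possibly partial) deletion at timestamp ts by
// widening the min-max window, so GC eventually covers every recorded deletion.
func (h *gcHint) scheduleGCFor(ts int64) {
    if h.minDeleted == 0 || ts < h.minDeleted {
        h.minDeleted = ts
    }
    if ts > h.maxDeleted {
        h.maxDeleted = ts
    }
}

// merge folds another range's hint into this one (e.g. RHS into LHS during a
// range merge) by taking the union of the two windows, so no deletion is lost.
func (h *gcHint) merge(o gcHint) {
    if o.minDeleted != 0 {
        h.scheduleGCFor(o.minDeleted)
        h.scheduleGCFor(o.maxDeleted)
    }
}

func main() {
    var lhs, rhs gcHint
    lhs.scheduleGCFor(100) // a small index dropped at t=100
    rhs.scheduleGCFor(300) // another partial deletion at t=300
    lhs.merge(rhs)
    fmt.Println(lhs) // {100 300}: GC must cover deletions between t=100 and t=300
}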

Fixes #107931

Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.

Epic: none


@pav-kv
Collaborator Author

pav-kv commented Sep 6, 2023

TestStatsAreDeletedForDroppedTables panics in the Raft apply stack: it sees an inconsistent GC threshold on disk vs. in memory.

F230906 18:21:09.717973 99124 kv/kvserver/replica.go:1600  [T1,Vsystem,n1,s1,r65/1:/Table/10{4-5},raft] 1  on-disk and in-memory state diverged: [GCThreshold.WallTime: 1694024462607440629 != 0]

Upd: fixed

@pav-kv pav-kv force-pushed the gc-hint branch 7 times, most recently from 5de4fbc to 86c2b93 Compare September 8, 2023 08:50
@pav-kv pav-kv changed the title kvserver: fix GCHint propagation kvserver: propagate GCHint for partial deletions Sep 8, 2023
@pav-kv pav-kv requested a review from erikgrinaker September 8, 2023 08:53
@pav-kv pav-kv marked this pull request as ready for review September 8, 2023 08:53
@pav-kv pav-kv requested a review from a team as a code owner September 8, 2023 08:53
@pav-kv
Collaborator Author

pav-kv commented Sep 8, 2023

@erikgrinaker PTAL. First 2 commits are comments/no-op. The real change is in the third commit.

Could add/extend some tests too. For now I've only manually verified that this fixes the repro in #107931.

@pav-kv
Collaborator Author

pav-kv commented Sep 8, 2023

@erikgrinaker For the last commit, I suggest reviewing the proto file first, to get a feel for the algorithm, then the metadata.go file to see how the GCHint is updated, and then the rest: the flow starts in cmd_delete_range.go (the hint is written or updated), goes through cmd_end_transaction.go (splits/merges), then via mvcc_gc_queue.go (where the GC decision is made), and ends in cmd_gc.go, where we finally GC and update the hint again.

Contributor

@erikgrinaker erikgrinaker left a comment


First pass looks good, mostly nits. Thanks for picking this up!

I need to do another pass and just make sure the merge/split handling and such is good, will revisit. Need unit tests for some of this logic for sure.

What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1? Or are there any other less risky mitigations we can do for 23.1 that don't involve protocol changes, with this change behind a version gate?

Collaborator Author

@pav-kv pav-kv left a comment


Addressed some comments, still working on others.

@pav-kv pav-kv force-pushed the gc-hint branch 3 times, most recently from 06bcc90 to 0159e2b Compare September 12, 2023 09:18
The GCHint is currently used for 3 purposes. Document the status quo
before making changes.

Epic: none
Release note: none
This is a cosmetic commit. It replaces a lower-level GCHint "reset"
method with a method that takes an effective GC threshold.

Epic: none
Release note: none
@pav-kv
Collaborator Author

pav-kv commented Sep 12, 2023

@erikgrinaker

I need to do another pass and just make sure the merge/split handling and such is good, will revisit. Need unit tests for some of this logic for sure.

Added unit tests and addressed most comments (still working on a couple minor ones). Ready for another review.

What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1? Or are there any other less risky mitigations we can do for 23.1 that don't involve protocol changes, with this change behind a version gate?

The new code handles old hints trivially: it respects the "whole range deleted" hint the same way as the old code, and does nothing for the "GC pending" hint because it's missing. The old code handles new hints by respecting the "whole range deleted" part of it, and simply ignoring the other fields.
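As a rough illustration of that fallback (a sketch with assumed field names, not the real roachpb.GCHint API), a new binary can key its decision off the pre-existing whole-range-deleted timestamp and treat an absent min/max window as a no-op:

package main

import "fmt"

// Assumed, simplified hint shape: latestRangeDelete is the pre-existing
// "whole range deleted" field; minDeleted is the new field, which is zero in
// hints written by old binaries.
type hint struct {
    latestRangeDelete int64
    minDeleted        int64
}

// shouldGC sketches the decision a new binary makes: the old behaviour is
// preserved, and the new min/max path is simply skipped for old-format hints.
func shouldGC(h hint, now, ttl int64) bool {
    if h.latestRangeDelete != 0 && now-h.latestRangeDelete > ttl {
        return true // old behaviour: the whole range was deleted and is past TTL
    }
    // New behaviour: any pending partial deletion past its TTL triggers GC.
    // For an old-format hint, minDeleted is zero and this clause is a no-op.
    return h.minDeleted != 0 && now-h.minDeleted > ttl
}

func main() {
    oldFormat := hint{latestRangeDelete: 100}                 // written by a vBefore binary
    newFormat := hint{latestRangeDelete: 0, minDeleted: 100}  // partial deletion, vAfter
    fmt.Println(shouldGC(oldFormat, 500, 200)) // true: old semantics still respected
    fmt.Println(shouldGC(newFormat, 500, 200)) // true: new path handles partial deletes
}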

We can only guarantee GC for deletions that happen in a cluster fully running the new version. In a mixed-version cluster, a hint may be rewritten and lose precision.

I think we don't need a version gate, and can backport to 23.1.

Alternative approaches:

  • Always run GC. E.g. if the last GC run was K * TTL ago, run it unconditionally. Obviously, this has a cost, especially on clusters with large cold datasets.
  • Make the schema change job (that waits on GC) actively tell GC which ranges to process, rather than doing it once and then sitting and hoping that it happens. This needs some plumbing, and doesn't seem like a better approach.

Contributor

@erikgrinaker erikgrinaker left a comment


Nice work! I'm going to do a pass over the merge/split and below-Raft handling of the hint, to make sure this is safe in mixed versions, but I think this is probably good to go.

Thanks again for picking this up.

Release note (bug fix): fixed a bug in GCHint causing unbounded wait for GC in schema change jobs, e.g. on table/index drop operations. After the change, the GCHint is populated and propagated properly, and eventual / timely (order of TTL) garbage collection is guaranteed.

Users don't know what GCHint is or why they should care. Try something like:

Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.

// now, however we should probably check a stronger condition that the GC
// threshold can be advanced to timestamp >= hint.GCTimestamp. The current
// way can result in false positives and wasteful GC runs.
r.ShouldQueue = canAdvanceGCThreshold
Contributor


Is it worth increasing the score here to give this range priority over other queued ranges, in the case of a backlog? Maybe not, since other ranges with higher scores have actual garbage we want to get rid of, but if we do we could consider bumping this by 2 or something (i.e. 2x mvccGCKeyScoreThreshold). I think it's probably ok to leave this as-is though.

Collaborator Author

@pav-kv pav-kv Sep 14, 2023


Leaving as is, because the testing went well. Can revisit the priorities later.

Contributor

@erikgrinaker erikgrinaker left a comment


Afaict, this should be good to go.

Another case we should test is sending a snapshot from a new node to an old node, and vice versa. Can be done with ALTER RANGE ... RELOCATE. Also, running consistency checks on the range.

@pav-kv
Collaborator Author

pav-kv commented Sep 12, 2023

Verified that after all the changes this PR still fixes the repro test from #107931.

@pav-kv pav-kv force-pushed the gc-hint branch 2 times, most recently from e1fbd35 to f3ed834 Compare September 12, 2023 14:56
@pav-kv pav-kv added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Sep 12, 2023
@pav-kv
Collaborator Author

pav-kv commented Sep 14, 2023

What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1?

I did a couple of mixed-version tests manually, with before/after this PR (let's call them vBefore and vAfter). Works as expected, and even better.

For example, one of the tests is a modification of #107931 repro:

  1. Start a vAfter cluster, create a table+index, write some values.
    • Observe that there are two ranges.
  2. Remove the index.
    • Observe that the schema change job starts waiting for MVCC GC.
    • Observe that there are still 2 ranges.
    • Observe (via some ad-hoc logging) that new type of GCHint is written for the index range.
  3. Before GC kicks in, restart the cluster with a vBefore binary.
  4. Reset the sticky bit, and trigger a merge of the two ranges.
    • Observe that the merge happened.
    • Observe that the GCHint was dropped/lost during this merge because the old code does not know about the new fields.
    • Observe that the MVCC GC job still waits, and is probably stuck.
  5. Restart the cluster with a vAfter binary.
    • Observe that eventually the schema change job does another MVCC range tombstone write to the index range (I was surprised to see this, in a good way).
    • Observe that new GCHint is written.
    • Observe that eventually MVCC GC removes the data, and the schema change job succeeds.

Given this experiment, I see that both vBefore and vAfter parse and modify each other's GCHint alright.

I suspect that there is an easier mitigation for stuck jobs that wrote vBefore hints. Best case: when the cluster is restarted on vAfter binaries, the job is restarted last - then all the ranges it talks to will write new hints, and we're good. Possible case: the job is restarted and issues the deletions again, but the ranges are still on vBefore. Then we will be stuck.
Mitigation: restart the job. Is that possible via the console or CLI?

@erikgrinaker
Contributor

erikgrinaker commented Sep 14, 2023

Mitigation: restart the job. Is that possible via console or CLI?

Yes, it can be paused and resumed: PAUSE JOB x; RESUME JOB x;

The test checks that all the GCHint invariants hold before/after the
Merge operation; that the Merge operation is commutative; and that Merge
returns true iff it modified the hint.

For now, the test only Merges an empty hint with a non-empty hint. Soon
adding tests for non-empty + non-empty cases.

The test helped to fix one bug in Merge commutativity.

Epic: none
Release note: none
The test checks that all the GCHint invariants hold before/after the
ScheduleGCFor operation, and that it returns true iff it modified the
hint.

The test helped to fix one bug in the method implementation.

Epic: none
Release note: none
Epic: none
Release note: none
Epic: none
Release note: none
@pav-kv
Collaborator Author

pav-kv commented Sep 14, 2023

Another case we should test is sending a snapshot from a new node to an old node, and vice versa. Can be done with ALTER RANGE ... RELOCATE. Also, running consistency checks on the range.

Done. Tested sending a snapshot both ways, and running consistency check afterwards.

Yes, it can be paused and resumed: PAUSE JOB x; RESUME JOB x;

Also checked that resuming a job rewrites GC hints, and fixes a stalled schema change job.

Some details of the test:

roachprod create -n4 local
roachprod stage local:1-3 release v23.1.9
roachprod put local:4 cockroach # master + this PR

cockroach sql --insecure -f create.sql --url=$PGURL  # create table+index
cockroach sql --insecure -f show.sql --url=$PGURL    # see where leaseholder is
# Make sure leaseholder is not on node 4. Move it otherwise.
cockroach sql --insecure -f dropidx.sql --url=$PGURL # drop index

cockroach debug send-kv-batch req.json --insecure --url=$PGURL # remove sticky bit
# Then trigger merge.
# Then write many keys to the table so that GC score is low.

# Now we are in a state where we don't have GC hints, and the
# schema change job is stuck waiting for GC.

# Migrate one replica to a new-version node 4, make it leaseholder.
ALTER RANGE 65 RELOCATE VOTERS FROM 1 TO 4;
ALTER RANGE 65 RELOCATE LEASE TO 4;

# Check that the new version is ok with an old snapshot.
SELECT * FROM crdb_internal.check_consistency(true, '', '') WHERE status != 'RANGE_CONSISTENT';                                                                                                
  range_id | start_key | start_key_pretty | status | detail | duration
-----------+-----------+------------------+--------+--------+-----------
(0 rows)

# Restart the schema change job, so that it writes range tombstones again.
PAUSE JOB <id>;
RESUME JOB <id>;

# The job will write range tombstones again, now via node 4 which
# is on the new version. This will write new GC hints. Verified this
# with ad-hoc logging, seeing a GCHint with a non-zero timestamp in
# the new field.

# Now we're sure GC will respect this hint, so we wait 10 minutes.

# I also moved the voters and leaseholder for this range back to
# node 1, and ran consistency checks successfully again, to make
# sure that the old nodes process a snapshot with the new hint ok.

# Moved the voters and leases back to node 4. The index was GC-ed,
# and the schema change job completed.

@erikgrinaker
Contributor

Cool, appreciate the additional testing!

@pav-kv
Collaborator Author

pav-kv commented Sep 14, 2023

bors r=erikgrinaker

@craig
Contributor

craig bot commented Sep 14, 2023

Build succeeded:

@craig craig bot merged commit 4518a99 into cockroachdb:master Sep 14, 2023
@pav-kv pav-kv deleted the gc-hint branch September 14, 2023 14:18
craig bot pushed a commit that referenced this pull request Oct 25, 2023
112903: streamingccl: use correct bounds for range key SSTs r=msbutler a=stevendanna

Previously, we were erroneously calling Next() on the end key of our SST even though the end key already represented an exclusive bound.

This Next() meant that the bounds of our SSTs were slightly incorrect. In most cases, this was harmless, but in the case of a range boundary that aligned directly with the _actual_ bounds of the SST, it could result in us attempting to split an SST into 2 with the RHS ending up empty.

Here, we fix that and also make our SST splitting a bit more robust to bad input. It now errors if the LHS or RHS ends up empty despite a split key inside the given bounds. It also handles split keys outside of the bounds. Neither of these things should happen unless our contract with other parts of the system changes.
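As a toy Go sketch of the failure mode described above (keys and spans simplified; this is not the streamingccl code), calling Next() on an already-exclusive end key widens the span by one key, so a split at a range boundary equal to the true end key leaves an empty right-hand side:

package main

import "fmt"

// next mimics roachpb-style key ordering: the smallest key greater than k is
// obtained by appending a zero byte. Everything else here is illustrative.
func next(k string) string { return k + "\x00" }

func main() {
    // An SST holding keys in ["b", "d"): "d" is already an exclusive end key.
    start, end := "b", "d"

    // The bug: treating "d" as inclusive and calling Next() on it.
    wrongEnd := next(end) // "d\x00"

    // A range boundary that aligns exactly with the SST's true end key.
    splitKey := "d"

    // Splitting ["b", "d\x00") at "d" produces an empty RHS ["d", "d\x00"),
    // since the SST contains no keys at or above "d".
    fmt.Printf("LHS [%q, %q) has the data; RHS [%q, %q) is empty\n",
        start, splitKey, splitKey, wrongEnd)
}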

Fixes #112846

Release note (bug fix): Fixed a bug that could prevent physical replication from advancing in the face of some range deletion operations.

112948: kvserver: gate writing the new sticky GCHint r=erikgrinaker a=pavelkalinnikov

This PR makes the new behaviour of `GCHint` introduced in #110078 conditional on a default-off cluster setting in 23.1, and enabled in 23.2 by a version gate.

Since it is too late to enable this behaviour in 23.1 (risk of backwards incompatibility), hide it behind a default-off cluster setting. In 23.2, it will be enabled by default, and the cluster setting will be deprecated.

The new `GCHint` behaviour is likely backwards compatible, but we are hiding it behind a setting for extra safety.

The safest moment to enable this cluster setting is when there is some confidence that the cluster binaries will not roll back to previous patch versions of 23.1. The risk exists only in a mixed-version state in which some 23.1 binaries don't know the new `GCHint` fields, and some do.

Epic: none

Release note (ops change): introduce a default-off cluster setting `kv.gc.sticky_hint.enabled` which helps expedite garbage collection after range deletions, such as when a SQL table or index is dropped.

Release note (general change): the `kv.gc.sticky_hint.enabled` cluster setting is enabled by default starting in 23.2, and is deprecated from 23.2 onward.

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1

Successfully merging this pull request may close these issues.

gc: eager GC of dropped tables/indexes in range subspans
3 participants