
kvserver: propagate GCHint for partial deletions #110078

Merged 9 commits on Sep 14, 2023

Conversation

pav-kv
Collaborator

@pav-kv pav-kv commented Sep 6, 2023

Previously, GCHint was used for 3 purposes:

  1. Instruct GC to run at specific times. This is used by SQL table/index drop jobs, to limit the amount of time they wait on data GC.
  2. Hint GC that a range is fully covered by range tombstones, so that GC can use ClearRange for faster deletes. This is useful for bulk data deletions as well.
  3. Hint GC that a range was fully covered by range tombstones at some recent point. GC prioritizes going through such ranges, and processes them together, because this makes Pebble compaction more efficient.

Use case 1 was broken. If a range tombstone did not fully cover the Range keyspace, no GCHint was written. Consequently, for small table/index deletions (spanning less than a single range), or deletions that did not perfectly match range bounds, some ranges ended up with no hint. For these ranges, GC could be arbitrarily delayed, and the schema change jobs would be stuck.

In addition, GCHint propagation during merges was lossy: the hint could be dropped if either the LHS or the RHS did not have a hint, or still contained some data.

This commit extends GCHint to fix use case 1. Two fields are added: the min and max deletion timestamps. These fields are populated even for partial range deletions, and are propagated unconditionally during splits and merges. The rationale for the min-max approach is explained in the comments on the corresponding GCHint fields.
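To make the min-max idea concrete, here is a minimal Go sketch. It is not the actual implementation: the gcHint type, the scheduleGCFor and merge helpers, and the plain int64 timestamps are illustrative stand-ins for the real proto fields and HLC timestamps.

package main

import "fmt"

// Toy stand-in for the GCHint discussed above. Field and method names are
// illustrative only; the real hint lives in pkg/roachpb/metadata.proto and
// uses hlc.Timestamp rather than int64.
type gcHint struct {
    minDeleted, maxDeleted int64 // zero means "no pending deletion recorded"
}

// scheduleGCFor records a (possibly partial) deletion at timestamp ts by
// widening the min-max window, so GC eventually covers every recorded deletion.
func (h *gcHint) scheduleGCFor(ts int64) {
    if h.minDeleted == 0 || ts < h.minDeleted {
        h.minDeleted = ts
    }
    if ts > h.maxDeleted {
        h.maxDeleted = ts
    }
}

// merge folds another range's hint into this one (e.g. RHS into LHS during a
// range merge) by taking the union of the two windows, so no deletion is lost.
func (h *gcHint) merge(o gcHint) {
    if o.minDeleted != 0 {
        h.scheduleGCFor(o.minDeleted)
        h.scheduleGCFor(o.maxDeleted)
    }
}

func main() {
    var lhs, rhs gcHint
    lhs.scheduleGCFor(100) // a small index dropped at t=100
    rhs.scheduleGCFor(300) // another partial deletion at t=300
    lhs.merge(rhs)
    fmt.Println(lhs) // {100 300}: GC must cover deletions between t=100 and t=300
}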

Fixes #107931

Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.

Epic: none


@pav-kv
Collaborator Author

pav-kv commented Sep 6, 2023

TestStatsAreDeletedForDroppedTables panics in the Raft apply stack: it sees an inconsistent GC threshold on disk vs. in memory.

F230906 18:21:09.717973 99124 kv/kvserver/replica.go:1600  [T1,Vsystem,n1,s1,r65/1:/Table/10{4-5},raft] 1  on-disk and in-memory state diverged: [GCThreshold.WallTime: 1694024462607440629 != 0]

Upd: fixed

@pav-kv pav-kv force-pushed the gc-hint branch 7 times, most recently from 5de4fbc to 86c2b93 Compare September 8, 2023 08:50
@pav-kv pav-kv changed the title kvserver: fix GCHint propagation kvserver: propagate GCHint for partial deletions Sep 8, 2023
@pav-kv pav-kv requested a review from erikgrinaker September 8, 2023 08:53
@pav-kv pav-kv marked this pull request as ready for review September 8, 2023 08:53
@pav-kv pav-kv requested a review from a team as a code owner September 8, 2023 08:53
@pav-kv
Collaborator Author

pav-kv commented Sep 8, 2023

@erikgrinaker PTAL. First 2 commits are comments/no-op. The real change is in the third commit.

Could add/extend some tests too. For now I've only manually verified that this fixes the repro in #107931.

@pav-kv
Collaborator Author

pav-kv commented Sep 8, 2023

@erikgrinaker For the last commit, I suggest reviewing the proto file first, to get a feel for the algorithm, then the metadata.go file to see how the GCHint is updated, and then the rest: the flow starts in cmd_delete_range.go (the hint is written or updated), goes through cmd_end_transaction.go (splits/merges), then via mvcc_gc_queue.go (where the GC decision is made), and ends in cmd_gc.go, where we finally GC and update the hint again.

Contributor

@erikgrinaker erikgrinaker left a comment


First pass looks good, mostly nits. Thanks for picking this up!

I need to do another pass and just make sure the merge/split handling and such is good, will revisit. Need unit tests for some of this logic for sure.

What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1? Or are there any other less risky mitigations we can do for 23.1 that don't involve protocol changes, with this change behind a version gate?

Collaborator Author

@pav-kv pav-kv left a comment


Addressed some comments, still working on others.

@pav-kv pav-kv force-pushed the gc-hint branch 3 times, most recently from 06bcc90 to 0159e2b Compare September 12, 2023 09:18
The GCHint is currently used for 3 purposes. Document the status quo
before making changes.

Epic: none
Release note: none
This is a cosmetic commit. It replaces a lower-level GCHint "reset"
method with a method that takes an effective GC threshold.

Epic: none
Release note: none
@pav-kv
Collaborator Author

pav-kv commented Sep 12, 2023

@erikgrinaker

I need to do another pass and just make sure the merge/split handling and such is good, will revisit. Need unit tests for some of this logic for sure.

Added unit tests and addressed most comments (still working on a couple minor ones). Ready for another review.

What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1? Or are there any other less risky mitigations we can do for 23.1 that don't involve protocol changes, with this change behind a version gate?

The new code handles old hints trivially: it respects the "whole range deleted" hint the same way as the old code, and does nothing for the "GC pending" hint because it's missing. The old code handles new hints by respecting the "whole range deleted" part of it, and simply ignoring the other fields.
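As a rough illustration of that fallback (a sketch with assumed field names, not the real roachpb.GCHint API), a new binary can key its decision off the pre-existing whole-range-deleted timestamp and treat an absent min/max window as a no-op:

package main

import "fmt"

// Assumed, simplified hint shape: latestRangeDelete is the pre-existing
// "whole range deleted" field; minDeleted is the new field, which is zero in
// hints written by old binaries.
type hint struct {
    latestRangeDelete int64
    minDeleted        int64
}

// shouldGC sketches the decision a new binary makes: the old behaviour is
// preserved, and the new min/max path is simply skipped for old-format hints.
func shouldGC(h hint, now, ttl int64) bool {
    if h.latestRangeDelete != 0 && now-h.latestRangeDelete > ttl {
        return true // old behaviour: the whole range was deleted and is past TTL
    }
    // New behaviour: any pending partial deletion past its TTL triggers GC.
    // For an old-format hint, minDeleted is zero and this clause is a no-op.
    return h.minDeleted != 0 && now-h.minDeleted > ttl
}

func main() {
    oldFormat := hint{latestRangeDelete: 100}                 // written by a vBefore binary
    newFormat := hint{latestRangeDelete: 0, minDeleted: 100}  // partial deletion, vAfter
    fmt.Println(shouldGC(oldFormat, 500, 200)) // true: old semantics still respected
    fmt.Println(shouldGC(newFormat, 500, 200)) // true: new path handles partial deletes
}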

We can only guarantee GC for deletions that happen in a cluster fully running the new version. In a mixed-version cluster, a hint may be rewritten and lose precision.

I think we don't need a version gate, and can backport to 23.1.

Alternative approaches:

  • Always run GC. E.g. if the last GC run was K * TTL ago, run it unconditionally. Obviously, this has a cost, especially on clusters with large cold datasets.
  • Make the schema change job (that waits on GC) actively tell GC which ranges to process, rather than doing it once and then sitting and hoping that it happens. This needs some plumbing, and doesn't seem like a better approach.

Contributor

@erikgrinaker erikgrinaker left a comment


Nice work! I'm going to do a pass over the merge/split and below-Raft handling of the hint, to make sure this is safe in mixed versions, but I think this is probably good to go.

Thanks again for picking this up.

Release note (bug fix): fixed a bug in GCHint causing unbounded wait for GC in schema change jobs, e.g. on table/index drop operations. After the change, the GCHint is populated and propagated properly, and eventual / timely (order of TTL) garbage collection is guaranteed.

Users don't know what GCHint is or why they should care. Try something like:

Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.

// now, however we should probably check a stronger condition that the GC
// threshold can be advanced to timestamp >= hint.GCTimestamp. The current
// way can result in false positives and wasteful GC runs.
r.ShouldQueue = canAdvanceGCThreshold
Contributor


Is it worth increasing the score here to give this range priority over other queued ranges, in the case of a backlog? Maybe not, since other ranges with higher scores have actual garbage we want to get rid of, but if we do we could consider bumping this by 2 or something (i.e. 2x mvccGCKeyScoreThreshold). I think it's probably ok to leave this as-is though.

Collaborator Author

@pav-kv pav-kv Sep 14, 2023


Leaving as is, because the testing went well. Can revisit the priorities later.

Contributor

@erikgrinaker erikgrinaker left a comment


Afaict, this should be good to go.

Another case we should test is sending a snapshot from a new node to an old node, and vice versa. Can be done with ALTER RANGE ... RELOCATE. Also, running consistency checks on the range.

@pav-kv
Collaborator Author

pav-kv commented Sep 12, 2023

Verified that after all the changes this PR still fixes the repro test from #107931.

@pav-kv pav-kv force-pushed the gc-hint branch 2 times, most recently from e1fbd35 to f3ed834 Compare September 12, 2023 14:56
@pav-kv pav-kv added the backport-23.1.x Flags PRs that need to be backported to 23.1 label Sep 12, 2023
@pav-kv
Collaborator Author

pav-kv commented Sep 14, 2023

What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1?

I did a couple of mixed-version tests manually, with before/after this PR (let's call them vBefore and vAfter). Works as expected, and even better.

For example, one of the tests is a modification of #107931 repro:

  1. Start a vAfter cluster, create a table+index, write some values.
    • Observe that there are two ranges.
  2. Remove the index.
    • Observe that the schema change job starts waiting for MVCC GC.
    • Observe that there are still 2 ranges.
    • Observe (via some ad-hoc logging) that new type of GCHint is written for the index range.
  3. Before GC kicks in, restart the cluster with a vBefore binary.
  4. Reset the sticky bit, and trigger a merge of the two ranges.
    • Observe that the merge happened.
    • Observe that the GCHint was dropped/lost during this merge because the old code does not know about the new fields.
    • Observe that the MVCC GC job still waits, and is probably stuck.
  5. Restart the cluster with a vAfter binary.
    • Observe that eventually the schema change job does another MVCC range tombstone write to the index range (I was surprised to see this, in a good way).
    • Observe that new GCHint is written.
    • Observe that eventually MVCC GC removes the data, and the schema change job succeeds.

Given this experiment, I see that both vBefore and vAfter parse and modify each other's GCHint alright.

I suspect that there is an easier mitigation for stuck jobs that wrote vBefore hints. Best case: when the cluster is restarted on vAfter binaries, the job is restarted last - then all the ranges it talks to will write new hints, and we're good. Possible case: the job is restarted and issues the deletions again, but the ranges are still on vBefore. Then we will be stuck.
Mitigation: restart the job. Is that possible via the console or CLI?

@erikgrinaker
Contributor

erikgrinaker commented Sep 14, 2023

Mitigation: restart the job. Is that possible via console or CLI?

Yes, it can be paused and resumed: PAUSE JOB x; RESUME JOB x;

The test checks that all the GCHint invariants hold before/after the
Merge operation; that the Merge operation is commutative; and that Merge
returns true iff it modified the hint.

For now, the test only Merges an empty hint with a non-empty hint. Soon
adding tests for non-empty + non-empty cases.

The test helped to fix one bug in Merge commutativity.

Epic: none
Release note: none
The test checks that all the GCHint invariants hold before/after the
ScheduleGCFor operation, and that it returns true iff it modified the
hint.

The test helped to fix one bug in the method implementation.

Epic: none
Release note: none
Epic: none
Release note: none
Epic: none
Release note: none
@pav-kv
Collaborator Author

pav-kv commented Sep 14, 2023

Another case we should test is sending a snapshot from a new node to an old node, and vice versa. Can be done with ALTER RANGE ... RELOCATE. Also, running consistency checks on the range.

Done. Tested sending a snapshot both ways, and running consistency check afterwards.

Yes, it can be paused and resumed: PAUSE JOB x; RESUME JOB x;

Also checked that resuming a job rewrites GC hints, and fixes a stalled schema change job.

Some details of the test:

roachprod create -n4 local
roachprod stage local:1-3 release v23.1.9
roachprod put local:4 cockroach # master + this PR

cockroach sql --insecure -f create.sql --url=$PGURL  # create table+index
cockroach sql --insecure -f show.sql --url=$PGURL    # see where leaseholder is
# Make sure leaseholder is not on node 4. Move it otherwise.
cockroach sql --insecure -f dropidx.sql --url=$PGURL # drop index

cockroach debug send-kv-batch req.json --insecure --url=$PGURL # remove sticky bit
# Then trigger merge.
# Then write many keys to the table so that GC score is low.

# Now we are in a state where we don't have GC hints, and the
# schema change job is stuck waiting for GC.

# Migrate one replica to a new-version node 4, make it leaseholder.
ALTER RANGE 65 RELOCATE VOTERS FROM 1 TO 4;
ALTER RANGE 65 RELOCATE LEASE TO 4;

# Check that the new version is ok with an old snapshot.
SELECT * FROM crdb_internal.check_consistency(true, '', '') WHERE status != 'RANGE_CONSISTENT';                                                                                                
  range_id | start_key | start_key_pretty | status | detail | duration
-----------+-----------+------------------+--------+--------+-----------
(0 rows)

# Restart the schema change job, so that it writes range tombstones again.
PAUSE JOB <id>;
RESUME JOB <id>;

# The job will write range tombstones again, now via node 4 which
# is on the new version. This will write new GC hints. Verified this
# with ad-hoc logging, seeing a GCHint with a non-zero timestamp in
# the new field.

# Now we're sure GC will respect this hint, so we wait 10 minutes.

# I also moved the voters and leaseholder for this range back to
# node 1, and ran consistency checks successfully again, to make
# sure that the old nodes process a snapshot with the new hint ok.

# Moved the voters and leases back to node 4. The index was GC-ed,
# and the schema change job completed.

@erikgrinaker
Contributor

Cool, appreciate the additional testing!

@pav-kv
Collaborator Author

pav-kv commented Sep 14, 2023

bors r=erikgrinaker

@craig
Contributor

craig bot commented Sep 14, 2023

Build succeeded:

@craig craig bot merged commit 4518a99 into cockroachdb:master Sep 14, 2023
@pav-kv pav-kv deleted the gc-hint branch September 14, 2023 14:18
craig bot pushed a commit that referenced this pull request Oct 25, 2023
112903: streamingccl: use correct bounds for range key SSTs r=msbutler a=stevendanna

Previously, we were erroneously calling Next() on the end key of our SST even though the end key already represented an exclusive bound.

This Next() meant that the bounds of our SSTs were slightly incorrect. In most cases, this was harmless, but in the case of a range boundary that aligned directly with the _actual_ bounds of the SST, it could result in us attempting to split an SST into 2 with the RHS ending up empty.

Here, we fix that and also make our SST splitting a bit more robust to bad input. It now errors if the LHS or RHS ends up empty despite a split key inside the given bounds. It also handles split keys outside of the bounds. Neither of these things should happen unless our contract with other parts of the system changes.
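As a toy Go sketch of the failure mode described above (keys and spans simplified; this is not the streamingccl code), calling Next() on an already-exclusive end key widens the span by one key, so a split at a range boundary equal to the true end key leaves an empty right-hand side:

package main

import "fmt"

// next mimics roachpb-style key ordering: the smallest key greater than k is
// obtained by appending a zero byte. Everything else here is illustrative.
func next(k string) string { return k + "\x00" }

func main() {
    // An SST holding keys in ["b", "d"): "d" is already an exclusive end key.
    start, end := "b", "d"

    // The bug: treating "d" as inclusive and calling Next() on it.
    wrongEnd := next(end) // "d\x00"

    // A range boundary that aligns exactly with the SST's true end key.
    splitKey := "d"

    // Splitting ["b", "d\x00") at "d" produces an empty RHS ["d", "d\x00"),
    // since the SST contains no keys at or above "d".
    fmt.Printf("LHS [%q, %q) has the data; RHS [%q, %q) is empty\n",
        start, splitKey, splitKey, wrongEnd)
}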

Fixes #112846

Release note (bug fix): Fixed a bug that could prevent physical replication from advancing in the face of some range deletion operations.

112948: kvserver: gate writing the new sticky GCHint r=erikgrinaker a=pavelkalinnikov

This PR makes the new behaviour of `GCHint` introduced in #110078 conditional on a default-off cluster setting in 23.1, and enabled in 23.2 by a version gate.

Since it is too late to enable this behaviour in 23.1 (risk of backwards incompatibility), hide it behind a default-off cluster setting. In 23.2, it will be enabled by default, and the cluster setting will be deprecated.

The new `GCHint` behaviour is likely backwards compatible, but we are hiding it behind a setting for extra safety.

The safest moment to enable this cluster setting is when there is some confidence that the cluster binaries will not roll back to previous patch versions of 23.1. The risk exists only in a mixed-version state in which some 23.1 binaries don't know the new `GCHint` fields, and some do.

Epic: none

Release note (ops change): introduce a default-off cluster setting `kv.gc.sticky_hint.enabled` which helps expedite garbage collection after range deletions, such as when a SQL table or index is dropped.

Release note (general change): the `kv.gc.sticky_hint.enabled` cluster setting is enabled by default starting in 23.2, and is deprecated from 23.2 onward.

Co-authored-by: Steven Danna <[email protected]>
Co-authored-by: Pavel Kalinnikov <[email protected]>
Labels
backport-23.1.x Flags PRs that need to be backported to 23.1

Successfully merging this pull request may close these issues.

gc: eager GC of dropped tables/indexes in range subspans
3 participants