-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: propagate GCHint for partial deletions #110078
Conversation
Upd: fixed |
5de4fbc
to
86c2b93
Compare
@erikgrinaker PTAL. First 2 commits are comments/no-op. The real change is in the third commit. Could add/extend some tests too. For now I've only manually verified that this fixes the repro in #107931. |
@erikgrinaker For the last commit, I suggest reviewing the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass looks good, mostly nits. Thanks for picking this up!
I need to do another pass and just make sure the merge/split handling and such is good, will revisit. Need unit tests for some of this logic for sure.
What's your sense on backwards compatibility and backporting here? Do we want to make sure this works in mixed-version states and backport it to 23.1? Or are there any other less-risky mitigations we can do for 23.1 that doesn't involve protocol changes, and we'll version gate this change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed some comments, still working on others.
06bcc90
to
0159e2b
Compare
The GCHint is currently used for 3 purposes. Document the status quo before making changes. Epic: none Release note: none
This is a cosmetic commit. It replaces a lower-level GCHint "reset" method with a method that takes an effective GC threshold. Epic: none Release note: none
Added unit tests and addressed most comments (still working on a couple minor ones). Ready for another review.
The new code handles old hints trivially: it respects the "whole range deleted" hint the same way as the old code, and does nothing for the "GC pending" hint because it's missing. The old code handles new hints by respecting the "whole range deleted" part of it, and simply ignoring the other fields. We can only say that the GC is guaranteed for deletions happened in a cluster under the new version. With mixed-version clusters, a hint may be rewritten and lose "precision". I think we don't need a version gate, and can backport to 23.1. Alternative approaches:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work! I'm going to do a pass over the merge/split and below-Raft handling of the hint, to make sure this is safe in mixed versions, but I think this is probably good to go.
Thanks again for picking this up.
Release note (bug fix): fixed a bug in GCHint causing unbounded wait for GC in schema change jobs, e.g. on table/index drop operations. After the change, the GCHint is populated and propagated properly, and eventual / timely (order of TTL) garbage collection is guaranteed.
Users don't know what GCHint is or why they should care. Try something like:
Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.
// now, however we should probably check a stronger condition that the GC | ||
// threshold can be advanced to timestamp >= hint.GCTimestamp. The current | ||
// way can result in false positives and wasteful GC runs. | ||
r.ShouldQueue = canAdvanceGCThreshold |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it worth increasing the score here to give this range priority over other queued ranges, in the case of a backlog? Maybe not, since other ranges with higher scores have actual garbage we want to get rid of, but if we do we could consider bumping this by 2 or something (i.e. 2x mvccGCKeyScoreThreshold
). I think it's probably ok to leave this as-is though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Leaving as is, because the testing went well. Can revisit the priorities later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Afaict, this should be good to go.
Another case we should test is sending a snapshot from a new node to an old node, and vice versa. Can be done with ALTER RANGE ... RELOCATE
. Also, running consistency checks on the range.
Verified that after all the changes this PR still fixes the repro test from #107931. |
e1fbd35
to
f3ed834
Compare
Previously, GCHint was used for 3 purposes: 1. Instruct GC to run at specific times. This is used by SQL table/index drop jobs, to limit the amount of time they wait on data GC. 2. Hint GC that a range is fully covered by range tombstones, so that GC can use ClearRange for faster deletes. This is useful for bulk data deletions as well. 3. Hint GC that a range was fully covered by range tombstones at some recent point. GC prioritizes going through such ranges, and processes them together, because this makes Pebble compaction more efficient. The use-case 1 was broken. If a range tombstone does not fully cover the Range keyspace, GCHint is not written. Correspondingly, for small table/index deletions (spanning less than a single range), or deletions that don't perfectly match range bounds, there will be some ranges with no hints. For these ranges, GC might be arbitrarily delayed, and the schema change jobs would be stuck. In addition, GCHint propagation during merges was lossy. It could drop the hint if either LHS or RHS doesn't have a hint, or has some data. This commit extends GCHint to fix the use-case 1. Two fields are added: min and max deletion timestamp. These fields are populated even for partial range deletions, and are propagated during splits and merges unconditionally. The choice of min-max approach is explained in the comments to the corresponding GCHint fields. Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page. Epic: none
Epic: none Release note: none
I did a couple of mixed-version tests manually, with before/after this PR (let's call them For example, one of the tests is a modification of #107931 repro:
Given this experiment, I see that both I suspect that there is an easier mitigation for stuck jobs which wrote |
Yes, it can be paused and resumed: |
The test checks that all the GCHint invariants hold before/after the Merge operation; that the Merge operation is commutative; and that Merge returns true iff it modified the hint. For now, the test only Merges an empty hint with a non-empty hint. Soon adding tests for non-empty + non-empty cases. The test helped to fix one bug in Merge commutativity. Epic: none Release note: none
The test checks that all the GCHint invariants hold before/after the ScheduleGCFor operation, and that it returns true iff it modified the hint. The test helped to fix one bug in the method implementation. Epic: none Release note: none
Epic: none Release note: none
Epic: none Release note: none
Epic: none Release note: none
Done. Tested sending a snapshot both ways, and running consistency check afterwards.
Also checked that resuming a job rewrites GC hints, and fixes a stalled schema change job. Some details of the test: roachprod create -n4 local
roachprod stage local:1-3 release v23.1.9
roachprod put local:4 cockroach # master + this PR
cockroach sql --insecure -f create.sql --url=$PGURL # create table+index
cockroach sql --insecure -f show.sql --url=$PGURL # see where leaseholder is
# Make sure leaseholder is not on node 4. Move it otherwise.
cockroach sql --insecure -f dropidx.sql --url=$PGURL # drop index
cockroach debug send-kv-batch req.json --insecure --url=$PGURL # remove sticky bit
# Then trigger merge.
# Then write many keys to the table so that GC score is low.
# Now we are in a state where we don't have GC hints, and the
# schema change job is stuck waiting for GC.
# Migrate one replica to a new-version node 4, make it leaseholder.
ALTER RANGE 65 RELOCATE VOTERS FROM 1 TO 4;
ALTER RANGE 65 RELOCATE LEASE TO 4;
# Check that the new version is ok with an old snapshot.
SELECT * FROM crdb_internal.check_consistency(true, '', '') WHERE status != 'RANGE_CONSISTENT';
range_id | start_key | start_key_pretty | status | detail | duration
-----------+-----------+------------------+--------+--------+-----------
(0 rows)
# Restart the schema change job, so that it writes range tombstones again.
PAUSE JOB <id>;
RESUME JOB <id>;
# The job will write range tombstones again, now via node 4 which
# is on the new version. This will write new GC hints. Verified this
# with ad-hoc logging, seeing a GCHint with a non-zero timestamp in
# the new field.
# Now we're sure GC will respect this hint, so we wait 10 minutes.
# I also moved the voters and leaseholder for this range back to
# node 1, and ran consistency checks successfully again, to make
# sure that the old nodes process a snapshot with the new hint ok.
# Moved the voters and leases back to node 4. The index was GC-ed,
# and the schema change job completed.
|
Cool, appreciate the additional testing! |
bors r=erikgrinaker |
Build succeeded: |
112903: streamingccl: use correct bounds for range key SSTs r=msbutler a=stevendanna Previously, we were erroneously calling Next() on the end key of our SST even though the end key already represented an exclusive bounds. This Next() meant that the bounds of our SSTs were slightly incorrect. In most cases, this was harmless, but in the case of a range boundary that aligned directly with the _actual_ bounds of the SST, it could result in us attempting to split an SST into 2 with the RHS ending up empty. Here, we fix that and also make our SST splitting a bit more robust to bad input. It now errors if the LHS or RHS ends up empty despite a split key inside the given bounds. It also handles split keys outside of the bounds. Neither of these things should happen unless our contract with other parts of the system changes. Fixes #112846 Release note (bug fix): Fix a bug that could prevent phyiscal replication from advancing in the face of some range deletion operations. 112948: kvserver: gate writing the new sticky GCHint r=erikgrinaker a=pavelkalinnikov This PR makes the new behaviour of `GCHint` introduced in #110078 conditional on a default-off cluster setting in 23.1, and enabled in 23.2 by a version gate. Since it is late to enable this behaviour in 23.1 (risk of backwards incompatibility), hide it behind a default-off cluster setting. In 23.2, it will be enabled by default, and the cluster setting will be deprecated. The new `GCHint` behaviour is likely backwards compatible, but we are hiding it behind a setting for extra safety. The safest moment to enable this cluster setting is when there is some confidence that the cluster binaries will not rollback to previous patch versions of 23.1. The risk exists only in mixed-version state in which some 23.1 binaries don't know the new `GCHint` fields, and some do. Epic: none Release note (ops change): introduce a default-off cluster setting `kv.gc.sticky_hint.enabled` which helps expediting garbage collection after range deletions, such as when a SQL table or index is dropped. Release note (general change): set the default behaviour for `kv.gc.sticky_hint.enabled` cluster setting to enabled since 23.2. The setting is deprecated in 23.2 going forward. Co-authored-by: Steven Danna <[email protected]> Co-authored-by: Pavel Kalinnikov <[email protected]>
Previously,
GCHint
was used for 3 purposes:ClearRange
for faster deletes. This is useful for bulk data deletions as well.The use-case 1 was broken. If a range tombstone does not fully cover the
Range
keyspace,GCHint
is not written. Correspondingly, for small table/index deletions (spanning less than a single range), or deletions that don't perfectly match range bounds, there will be some ranges with no hints. For these ranges, GC might be arbitrarily delayed, and the schema change jobs would be stuck.In addition,
GCHint
propagation during merges was lossy. It could drop the hint if either LHS or RHS doesn't have a hint, or has some data.This commit extends
GCHint
to fix the use-case 1. Two fields are added:min
andmax
deletion timestamp. These fields are populated even for partial range deletions, and are propagated during splits and merges unconditionally. The choice of min-max approach is explained in the comments to the correspondingGCHint
fields.Fixes #107931
Release note (bug fix): Fixed a bug that could occasionally cause schema change jobs (e.g. table/index drops) to appear stuck in state "waiting for MVCC GC" for much longer than expected. The fix only applies to future schema changes -- existing stuck jobs can be processed by manually force-enqueueing the relevant ranges in the MVCC GC queue under the DB Console's advanced debug page.
Epic: none