db: support range keys in delete-only compactions #1656

nicktrav · 2022-04-18T16:10:08Z

Two commits in this patch.

TODOs:

Avoid counting point key space on range key del only hints
Squash additional commits

db: add tableRangedDeletionIter for iteration of ranged deletion spans

Add a new keyspan.FragmentIterator implementation that iterates over
the range deletion and range key blocks of a single table. The iterator
returns merged spans from the table corresponding to "ranged deletions".
A "ranged deletion" span contains keys that fall into one of three
cases: the span contains only range deletions; the span contains only
range key deletions; the span contains a mixture of both range deletions
and range key deletions.

The new tableRangedDeletionIter will be used to construct spans
necessary for calculating deletion hints on a table that takes range
keys into account. Previously, only range deletion spans were
considered.

For a tables that do not contain range key deletes, the iterator returns
the same set of spans that the existing logic would have returned.

db: consider range keys in delete-only compactions

Currently, when constructing delete-only compactions hints, only range
delete spans are considered. With the introduction of range keys, the
current hint construction logic is incorrect. Specifically, a hint
constructed from a range delete may completely cover a table containing
range keys. Currently, this table would be marked as eligible for
deletion. However, the table should only be deleted if the range keys
are also deleted.

Make use of the tableRangedDeletionIter for construction of
delete-only compaction hints. A new deleteCompactionHintType type is
used to distinguish between hints generated by a range delete (which may
not delete tables containing range keys), hints generated from a range
key delete (which may not delete tables containing point keys), and
hints generated from a span with both a range delete and range key
delete (which may delete any covering table).

cockroach-teamcity · 2022-04-18T16:10:16Z

This change is

nicktrav · 2022-04-18T16:20:02Z

Best to review this one in commit order, as the second commit makes use changes made in the first.

nicktrav

Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @jbowens)

table_stats.go, line 596 at r2 (raw file):

	}
	if iter != nil {
		// FIXME(travers): Do we also need the truncation here too?

Note to self: truncation isn't required here, as range keys will not be split across tables.

jbowens

Flushing comments on the first commit

Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @nicktrav)

-- commits, line 17 at r1:
nit: s/a tables/tables/

table_stats.go, line 535 at r1 (raw file):

// tableRangedDeletionIter is a keyspan.FragmentIterator that returns "ranged
// deletion" spans for a single table, providing a combined view of both range
// deletion and range key deletion spans. The tableRAngedDeletionIter is

nit: tableRangedDeletionIter

table_stats.go, line 642 at r1 (raw file):

	// largest and smallest keys in the defragmented span. This maintains the
	// contract that the emitted slice is sorted by (SeqNum, Kind) descending.
	reducer := func(current, incoming []keyspan.Key) []keyspan.Key {

I think there's a complication here where the reducer removes knowledge that the equality function needs. I'm imagining a scenario like:

a-b:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
c-d:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
d-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

during forward iteration, if I understand it correctly, this iterator would yield

a-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

table_stats.go, line 674 at r1 (raw file):

	delIter := &tableRangedDeletionIter{
		FragmentIterator: dIter,
		closers:          closers,

is the closers slice necessary? wondering if we can let DefragmentIter.Close call MergingIter.Close which should fan out and close all the leaf iterators.

table_stats_test.go, line 168 at r1 (raw file):

					err = w.RangeKeyUnset(start, end, []byte(parts[3]))
				case "range-key-del":
					err = w.RangeKeyDelete(start, end)

how about taking the keyspan.ParseSpan format in this test, which allows customizing the sequence numbers? We'd have have to use the (*sstable.Writer).AddRangeKey method. We care quite a bit about this iterator surfacing the correct sequence numbers, since we might delete files with live keys if it doesn't.

nicktrav

Reviewable status: 0 of 7 files reviewed, 3 unresolved discussions (waiting on @itsbilal and @jbowens)

-- commits, line 17 at r1:

Previously, jbowens (Jackson Owens) wrote…

nit: s/a tables/tables/

Done.

table_stats.go, line 535 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: tableRangedDeletionIter

Done.

table_stats.go, line 642 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think there's a complication here where the reducer removes knowledge that the equality function needs. I'm imagining a scenario like:
a-b:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
c-d:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
d-e:{(#10,RANGEDEL) (#3,RANGEDEL)}
during forward iteration, if I understand it correctly, this iterator would yield
a-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

Good catch. I pushed a separate commit (which I will squash into the first before merge, provided we're happy with it).

We're now preserving just the first and last key for each emitted span, and modifying the key kinds to act as a "marker" in the case where there are mixed key kinds (i.e. range dels and range key dels). I believe this is fine as the key kinds are only used to infer the type of span and thus hint type (see the second commit).

Open to alternative approaches on encoding this range del only / range key only / mixed key span information.

table_stats.go, line 674 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is the closers slice necessary? wondering if we can let DefragmentIter.Close call MergingIter.Close which should fan out and close all the leaf iterators.

The problem I was running up against was that keyspan.Truncate returns a new *keyspan.Iter which has a no-op Close method. If we just close the merging iter, the original range del iter isn't closed, and we panic due to the leak.

We seem to do a similar thing here (adding to a closers slice before we lose the reference to Truncate). It seems in this case we don't want to propagate the close, as we want to reuse the underlying iter.

table_stats_test.go, line 168 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

how about taking the keyspan.ParseSpan format in this test, which allows customizing the sequence numbers? We'd have have to use the (*sstable.Writer).AddRangeKey method. We care quite a bit about this iterator surfacing the correct sequence numbers, since we might delete files with live keys if it doesn't.

Good call. Updated.

jbowens

Reviewed 1 of 4 files at r2, 1 of 4 files at r3, all commit messages.
Reviewable status: 2 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @nicktrav)

table_stats.go, line 642 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Good catch. I pushed a separate commit (which I will squash into the first before merge, provided we're happy with it).

We're now preserving just the first and last key for each emitted span, and modifying the key kinds to act as a "marker" in the case where there are mixed key kinds (i.e. range dels and range key dels). I believe this is fine as the key kinds are only used to infer the type of span and thus hint type (see the second commit).

Open to alternative approaches on encoding this range del only / range key only / mixed key span information.

Yeah, I think this mutation of the key kinds is fine for correctness, but I don't love the subtlety.

Throwing out another idea:

Wrap the range deletion and range key deletion iterators separately in a DefragmentingIter, using reduce to only keep the first and last Keys of each. The range key DefragmentIter would need to ignore and drop RangeKeySets, RangeKeyUnsets. I think these could still share code if configured with a key kind to 'keep'.
Wrap the resulting two iterators in a merging iterator.

The result would be just as defragmented, because the only fragmentation points are at defragmented start/end bounds of each of the kind kind. The resulting span would have at most 4 keys, and we can scan the keys list to check for the existence of multiple kinds.

table_stats.go, line 674 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

The problem I was running up against was that keyspan.Truncate returns a new *keyspan.Iter which has a no-op Close method. If we just close the merging iter, the original range del iter isn't closed, and we panic due to the leak.

We seem to do a similar thing here (adding to a closers slice before we lose the reference to Truncate). It seems in this case we don't want to propagate the close, as we want to reuse the underlying iter.

Ah, yeah, that makes sense. Looking forward to when we can drop that keyspan.Truncate.

table_stats.go, line 335 at r3 (raw file):

		}

		estimate, hintSeqNum, err := d.estimateSizeBeneath(v, level, meta, start, end)

there's some awkwardness with estimateSizeBeneath and the EstimateDiskUsage call above. These functions estimate without knowledge of range keys. They include a file's entire size if the file's bounds fall within range of [start, end). Otherwise they use the point index blocks to calculate an estimate at block-level granularity.

I don't think we can ignore a little loss of precision from excluding range keys blocks when calculating estimates, since range keys are expected to be rare, but we should avoid including a large swath of the point keys if the span only contains RANGEKEYDELs

testdata/compaction_delete_only_hints, line 307 at r3 (raw file):

#   +---+           +---+             +---+
# __________________________________________________________
#   a b c d e f g h i j k l m n o p q r s t u v w x y z

Love this test.

nicktrav

TFTRs so far, @jbowens. Thoughts on the approach second commit? Regardless, I'm going to hold off on this until we have a path forward around #1663.

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @jbowens)

table_stats.go, line 642 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I think this mutation of the key kinds is fine for correctness, but I don't love the subtlety.

Throwing out another idea:

Wrap the range deletion and range key deletion iterators separately in a DefragmentingIter, using reduce to only keep the first and last Keys of each. The range key DefragmentIter would need to ignore and drop RangeKeySets, RangeKeyUnsets. I think these could still share code if configured with a key kind to 'keep'.

Wrap the resulting two iterators in a merging iterator.

The result would be just as defragmented, because the only fragmentation points are at defragmented start/end bounds of each of the kind kind. The resulting span would have at most 4 keys, and we can scan the keys list to check for the existence of multiple kinds.

Latest commit implements this (which I shall also squash), building on the new keyspan.Filter. Thanks for the suggestion!

There's one wrinkle that came up around key stability with the merging iter that's causing some metamorphic test failures. I opened #1663, and will wait for the fix for that to go in. This wasn't an issue previously as the merging iter was at the bottom of the stack, and its inputs were stable.

table_stats.go, line 335 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

there's some awkwardness with estimateSizeBeneath and the EstimateDiskUsage call above. These functions estimate without knowledge of range keys. They include a file's entire size if the file's bounds fall within range of [start, end). Otherwise they use the point index blocks to calculate an estimate at block-level granularity.

I don't think we can ignore a little loss of precision from excluding range keys blocks when calculating estimates, since range keys are expected to be rare, but we should avoid including a large swath of the point keys if the span only contains RANGEKEYDELs

Good point. We have a TODO in the laundry list to update the size estimates. I'll look at that next. Depending on how it shakes out, I might put up a separate patch.

Add a new `keyspan.FragmentIterator` implementation that iterates over the range deletion and range key blocks of a single table. The iterator returns merged spans from the table corresponding to "ranged deletions". A "ranged deletion" span contains keys that fall into one of three cases: the span contains only range deletions; the span contains only range key deletions; the span contains a mixture of both range deletions and range key deletions. The new `tableRangedDeletionIter` will be used to construct spans necessary for calculating deletion hints on a table that takes range keys into account. Previously, only range deletion spans were considered. For a tables that do not contain range key deletes, the iterator returns the same set of spans that the existing logic would have returned.

Currently, when constructing delete-only compactions hints, only range delete spans are considered. With the introduction of range keys, the current hint construction logic is incorrect. Specifically, a hint constructed from a range delete may completely cover a table containing range keys. Currently, this table would be marked as eligible for deletion. However, the table should only be deleted if the range keys are also deleted. Make use of the `tableRangedDeletionIter` for construction of delete-only compaction hints. A new `deleteCompactionHintType` type is used to distinguish between hints generated by a range delete (which may not delete tables containing range keys), hints generated from a range key delete (which may not delete tables containing point keys), and hints generated from a span with both a range delete and range key delete (which may delete any covering table).

Fix the `tableRangedDeletionIter` implementation to preserve the spans "type" (i.e. range del keys only, range key del keys only, or mixed keys). The current implementation had a bug where a span with mixed key kinds could have its type switched when merged with an abutting span.

Defragment the range del and range key block separately. The `keyspan.Filter` is used to elide non-range key dels from the range key block. The two separate, defragmented spans are then merged wit the merging iter.

nicktrav · 2022-04-28T01:56:57Z

Pushed a rebased version that picks up b9e970a.

Still need to incorporate a tweak to the disk estimation to ensure we're not counting point key size under range key del deletion hints.

jbowens

Reviewed 2 of 4 files at r5, 2 of 4 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: 6 of 7 files reviewed, 6 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)

compaction.go line 1683 at r7 (raw file):

// deleteCompactionHintType indicates whether the deleteCompactionHint was
// generated from a span containing a range del (point key only), a range delete

nit: "a range key delete (range key only)"

We use the language "range delete" to refer to RANGEDELs in many places. I wish we had unambiguous names to differentiate these, but hard problems in computer science, amirite?

compaction.go line 1717 at r7 (raw file):

		switch k.Kind() {
		case base.InternalKeyKindRangeDelete:
			hintType |= 1 << 0

could this be hintType |= deleteCompactionHintTypePointKeyOnly and down below hintType |= deleteCompactionHintTypeRangeKeyOnly to make the relationship with the enum a little clearer?

compaction.go line 1721 at r7 (raw file):

			hintType |= 1 << 1
		default:
			panic(fmt.Sprintf("unsupported hint hind: %s", k.Kind()))

nit: s/hint hind/key kind/

table_stats.go line 583 at r4 (raw file):

		// Truncate tombstones to the containing file's bounds if necessary.
		// See docs/range_deletions.md for why this is necessary.
		closers = append(closers, iter)

With your change to Truncate, I think we could drop the closers slice now.

internal/keyspan/span.go line 73 at r7 (raw file):

func (k Key) Equal(b Key) bool {
	return k.Trailer == b.Trailer &&
		bytes.Equal(k.Suffix, b.Suffix) &&

bytes.Equal is okay for comparing values, but suffixes need to be compared with a user-provided base.Compare or base.Equal. At the moment, I believe bytes.Equal works for CockroachDB's use, but with cockroachdb/cockroach#77342 equal suffixes are not necessarily byte equal.

nicktrav

Reviewable status: 3 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @jbowens)

compaction.go line 1683 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: "a range key delete (range key only)"

We use the language "range delete" to refer to RANGEDELs in many places. I wish we had unambiguous names to differentiate these, but hard problems in computer science, amirite?

Yeah - bit of a mouthful. Fixed!

compaction.go line 1717 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

could this be hintType |= deleteCompactionHintTypePointKeyOnly and down below hintType |= deleteCompactionHintTypeRangeKeyOnly to make the relationship with the enum a little clearer?

Good idea. Done.

table_stats.go line 583 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

With your change to Truncate, I think we could drop the closers slice now.

Yeah - that disappeared in a20684e. Reviewable is probably not doing it justice.

internal/keyspan/span.go line 73 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

bytes.Equal is okay for comparing values, but suffixes need to be compared with a user-provided base.Compare or base.Equal. At the moment, I believe bytes.Equal works for CockroachDB's use, but with cockroachdb/cockroach#77342 equal suffixes are not necessarily byte equal.

Ah, thanks. Looks like we have a base.Compare wired up, so I'm plumbing that through now for the suffix comparisons.

In converting an existing function to a loop, a loop iteration could early `return`, which would skip subsequent hints. Use `continue` instead, to consider the remaining hints. Add a regression test. This commit will be squashed into the first commit.

Alter the existing test case to show that range del hints only take point key space into account, and range key del hints do not take point key space into account. This will be squashed into the second commit.

nicktrav · 2022-06-09T22:40:54Z

I'm going to close this out in favor of a new PR with all the squashing done and a rebase to pick up the latest changes from master.

With all the changes that have happened in the last month (e.g. Transformer and DefragmentMethod interface changes), it's probably worth reviewing this with fresh context.

Happy to re-open this one if that ends up being simpler.

nicktrav requested review from itsbilal and jbowens April 18, 2022 16:10

nicktrav commented Apr 18, 2022

View reviewed changes

jbowens reviewed Apr 18, 2022

View reviewed changes

nicktrav commented Apr 19, 2022

View reviewed changes

jbowens reviewed Apr 19, 2022

View reviewed changes

nicktrav force-pushed the nickt.deletion-only branch from 1228395 to 061b700 Compare April 20, 2022 21:47

nicktrav commented Apr 20, 2022

View reviewed changes

nicktrav added 4 commits April 27, 2022 18:49

db: defragment first, then merge

a20684e

Defragment the range del and range key block separately. The `keyspan.Filter` is used to elide non-range key dels from the range key block. The two separate, defragmented spans are then merged wit the merging iter.

nicktrav force-pushed the nickt.deletion-only branch from 061b700 to a20684e Compare April 28, 2022 01:54

jbowens reviewed Apr 28, 2022

View reviewed changes

nicktrav added 2 commits April 28, 2022 10:51

db: use user-provided comparator for suffix equality

8862140

db: use existing consts for bitwise operations

61a3bab

nicktrav commented Apr 28, 2022

View reviewed changes

nicktrav added 2 commits June 9, 2022 14:26

*: add test case rangedel / rangekeydel table stats interaction

489fbb5

Alter the existing test case to show that range del hints only take point key space into account, and range key del hints do not take point key space into account. This will be squashed into the second commit.

nicktrav closed this Jun 9, 2022

nicktrav deleted the nickt.deletion-only branch June 9, 2022 22:52

nicktrav mentioned this pull request Jun 9, 2022

db: support range keys in delete-only compactions #1750

Merged

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

db: support range keys in delete-only compactions #1656

db: support range keys in delete-only compactions #1656

nicktrav commented Apr 18, 2022 •

edited

Loading

cockroach-teamcity commented Apr 18, 2022

nicktrav commented Apr 18, 2022

nicktrav left a comment

jbowens left a comment

nicktrav left a comment

jbowens left a comment

nicktrav left a comment

nicktrav commented Apr 28, 2022

jbowens left a comment

nicktrav left a comment

nicktrav commented Jun 9, 2022 •

edited

Loading

db: support range keys in delete-only compactions #1656

db: support range keys in delete-only compactions #1656

Conversation

nicktrav commented Apr 18, 2022 • edited Loading

cockroach-teamcity commented Apr 18, 2022

nicktrav commented Apr 18, 2022

nicktrav left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

nicktrav commented Apr 28, 2022

jbowens left a comment

Choose a reason for hiding this comment

nicktrav left a comment

Choose a reason for hiding this comment

nicktrav commented Jun 9, 2022 • edited Loading

nicktrav commented Apr 18, 2022 •

edited

Loading

nicktrav commented Jun 9, 2022 •

edited

Loading