Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

db: support range keys in delete-only compactions #1656

Closed
wants to merge 8 commits into from

Conversation

nicktrav
Copy link
Contributor

@nicktrav nicktrav commented Apr 18, 2022

Two commits in this patch.

TODOs:

  • Avoid counting point key space on range key del only hints
  • Squash additional commits

db: add tableRangedDeletionIter for iteration of ranged deletion spans

Add a new keyspan.FragmentIterator implementation that iterates over
the range deletion and range key blocks of a single table. The iterator
returns merged spans from the table corresponding to "ranged deletions".
A "ranged deletion" span contains keys that fall into one of three
cases: the span contains only range deletions; the span contains only
range key deletions; the span contains a mixture of both range deletions
and range key deletions.

The new tableRangedDeletionIter will be used to construct spans
necessary for calculating deletion hints on a table that takes range
keys into account. Previously, only range deletion spans were
considered.

For a tables that do not contain range key deletes, the iterator returns
the same set of spans that the existing logic would have returned.

db: consider range keys in delete-only compactions

Currently, when constructing delete-only compactions hints, only range
delete spans are considered. With the introduction of range keys, the
current hint construction logic is incorrect. Specifically, a hint
constructed from a range delete may completely cover a table containing
range keys. Currently, this table would be marked as eligible for
deletion. However, the table should only be deleted if the range keys
are also deleted.

Make use of the tableRangedDeletionIter for construction of
delete-only compaction hints. A new deleteCompactionHintType type is
used to distinguish between hints generated by a range delete (which may
not delete tables containing range keys), hints generated from a range
key delete (which may not delete tables containing point keys), and
hints generated from a span with both a range delete and range key
delete (which may delete any covering table).

@nicktrav nicktrav requested review from itsbilal and jbowens April 18, 2022 16:10
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nicktrav
Copy link
Contributor Author

Best to review this one in commit order, as the second commit makes use changes made in the first.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 6 files reviewed, 1 unresolved discussion (waiting on @itsbilal and @jbowens)


table_stats.go, line 596 at r2 (raw file):

	}
	if iter != nil {
		// FIXME(travers): Do we also need the truncation here too?

Note to self: truncation isn't required here, as range keys will not be split across tables.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flushing comments on the first commit

Reviewable status: 0 of 6 files reviewed, 6 unresolved discussions (waiting on @itsbilal and @nicktrav)


-- commits, line 17 at r1:
nit: s/a tables/tables/


table_stats.go, line 535 at r1 (raw file):

// tableRangedDeletionIter is a keyspan.FragmentIterator that returns "ranged
// deletion" spans for a single table, providing a combined view of both range
// deletion and range key deletion spans. The tableRAngedDeletionIter is

nit: tableRangedDeletionIter


table_stats.go, line 642 at r1 (raw file):

	// largest and smallest keys in the defragmented span. This maintains the
	// contract that the emitted slice is sorted by (SeqNum, Kind) descending.
	reducer := func(current, incoming []keyspan.Key) []keyspan.Key {

I think there's a complication here where the reducer removes knowledge that the equality function needs. I'm imagining a scenario like:

a-b:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
c-d:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
d-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

during forward iteration, if I understand it correctly, this iterator would yield

a-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

table_stats.go, line 674 at r1 (raw file):

	delIter := &tableRangedDeletionIter{
		FragmentIterator: dIter,
		closers:          closers,

is the closers slice necessary? wondering if we can let DefragmentIter.Close call MergingIter.Close which should fan out and close all the leaf iterators.


table_stats_test.go, line 168 at r1 (raw file):

					err = w.RangeKeyUnset(start, end, []byte(parts[3]))
				case "range-key-del":
					err = w.RangeKeyDelete(start, end)

how about taking the keyspan.ParseSpan format in this test, which allows customizing the sequence numbers? We'd have have to use the (*sstable.Writer).AddRangeKey method. We care quite a bit about this iterator surfacing the correct sequence numbers, since we might delete files with live keys if it doesn't.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 0 of 7 files reviewed, 3 unresolved discussions (waiting on @itsbilal and @jbowens)


-- commits, line 17 at r1:

Previously, jbowens (Jackson Owens) wrote…

nit: s/a tables/tables/

Done.


table_stats.go, line 535 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: tableRangedDeletionIter

Done.


table_stats.go, line 642 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I think there's a complication here where the reducer removes knowledge that the equality function needs. I'm imagining a scenario like:

a-b:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
c-d:{(#10,RANGEDEL) (#5,RANGEKEYDEL) (#3,RANGEDEL)}
d-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

during forward iteration, if I understand it correctly, this iterator would yield

a-e:{(#10,RANGEDEL) (#3,RANGEDEL)}

Good catch. I pushed a separate commit (which I will squash into the first before merge, provided we're happy with it).

We're now preserving just the first and last key for each emitted span, and modifying the key kinds to act as a "marker" in the case where there are mixed key kinds (i.e. range dels and range key dels). I believe this is fine as the key kinds are only used to infer the type of span and thus hint type (see the second commit).

Open to alternative approaches on encoding this range del only / range key only / mixed key span information.


table_stats.go, line 674 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

is the closers slice necessary? wondering if we can let DefragmentIter.Close call MergingIter.Close which should fan out and close all the leaf iterators.

The problem I was running up against was that keyspan.Truncate returns a new *keyspan.Iter which has a no-op Close method. If we just close the merging iter, the original range del iter isn't closed, and we panic due to the leak.

We seem to do a similar thing here (adding to a closers slice before we lose the reference to Truncate). It seems in this case we don't want to propagate the close, as we want to reuse the underlying iter.


table_stats_test.go, line 168 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

how about taking the keyspan.ParseSpan format in this test, which allows customizing the sequence numbers? We'd have have to use the (*sstable.Writer).AddRangeKey method. We care quite a bit about this iterator surfacing the correct sequence numbers, since we might delete files with live keys if it doesn't.

Good call. Updated.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 4 files at r2, 1 of 4 files at r3, all commit messages.
Reviewable status: 2 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @nicktrav)


table_stats.go, line 642 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

Good catch. I pushed a separate commit (which I will squash into the first before merge, provided we're happy with it).

We're now preserving just the first and last key for each emitted span, and modifying the key kinds to act as a "marker" in the case where there are mixed key kinds (i.e. range dels and range key dels). I believe this is fine as the key kinds are only used to infer the type of span and thus hint type (see the second commit).

Open to alternative approaches on encoding this range del only / range key only / mixed key span information.

Yeah, I think this mutation of the key kinds is fine for correctness, but I don't love the subtlety.

Throwing out another idea:

  • Wrap the range deletion and range key deletion iterators separately in a DefragmentingIter, using reduce to only keep the first and last Keys of each. The range key DefragmentIter would need to ignore and drop RangeKeySets, RangeKeyUnsets. I think these could still share code if configured with a key kind to 'keep'.
  • Wrap the resulting two iterators in a merging iterator.

The result would be just as defragmented, because the only fragmentation points are at defragmented start/end bounds of each of the kind kind. The resulting span would have at most 4 keys, and we can scan the keys list to check for the existence of multiple kinds.


table_stats.go, line 674 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

The problem I was running up against was that keyspan.Truncate returns a new *keyspan.Iter which has a no-op Close method. If we just close the merging iter, the original range del iter isn't closed, and we panic due to the leak.

We seem to do a similar thing here (adding to a closers slice before we lose the reference to Truncate). It seems in this case we don't want to propagate the close, as we want to reuse the underlying iter.

Ah, yeah, that makes sense. Looking forward to when we can drop that keyspan.Truncate.


table_stats.go, line 335 at r3 (raw file):

		}

		estimate, hintSeqNum, err := d.estimateSizeBeneath(v, level, meta, start, end)

there's some awkwardness with estimateSizeBeneath and the EstimateDiskUsage call above. These functions estimate without knowledge of range keys. They include a file's entire size if the file's bounds fall within range of [start, end). Otherwise they use the point index blocks to calculate an estimate at block-level granularity.

I don't think we can ignore a little loss of precision from excluding range keys blocks when calculating estimates, since range keys are expected to be rare, but we should avoid including a large swath of the point keys if the span only contains RANGEKEYDELs


testdata/compaction_delete_only_hints, line 307 at r3 (raw file):

#   +---+           +---+             +---+
# __________________________________________________________
#   a b c d e f g h i j k l m n o p q r s t u v w x y z

Love this test.

@nicktrav nicktrav force-pushed the nickt.deletion-only branch from 1228395 to 061b700 Compare April 20, 2022 21:47
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs so far, @jbowens. Thoughts on the approach second commit? Regardless, I'm going to hold off on this until we have a path forward around #1663.

Reviewable status: 0 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @jbowens)


table_stats.go, line 642 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

Yeah, I think this mutation of the key kinds is fine for correctness, but I don't love the subtlety.

Throwing out another idea:

  • Wrap the range deletion and range key deletion iterators separately in a DefragmentingIter, using reduce to only keep the first and last Keys of each. The range key DefragmentIter would need to ignore and drop RangeKeySets, RangeKeyUnsets. I think these could still share code if configured with a key kind to 'keep'.
  • Wrap the resulting two iterators in a merging iterator.

The result would be just as defragmented, because the only fragmentation points are at defragmented start/end bounds of each of the kind kind. The resulting span would have at most 4 keys, and we can scan the keys list to check for the existence of multiple kinds.

Latest commit implements this (which I shall also squash), building on the new keyspan.Filter. Thanks for the suggestion!

There's one wrinkle that came up around key stability with the merging iter that's causing some metamorphic test failures. I opened #1663, and will wait for the fix for that to go in. This wasn't an issue previously as the merging iter was at the bottom of the stack, and its inputs were stable.


table_stats.go, line 335 at r3 (raw file):

Previously, jbowens (Jackson Owens) wrote…

there's some awkwardness with estimateSizeBeneath and the EstimateDiskUsage call above. These functions estimate without knowledge of range keys. They include a file's entire size if the file's bounds fall within range of [start, end). Otherwise they use the point index blocks to calculate an estimate at block-level granularity.

I don't think we can ignore a little loss of precision from excluding range keys blocks when calculating estimates, since range keys are expected to be rare, but we should avoid including a large swath of the point keys if the span only contains RANGEKEYDELs

Good point. We have a TODO in the laundry list to update the size estimates. I'll look at that next. Depending on how it shakes out, I might put up a separate patch.

Add a new `keyspan.FragmentIterator` implementation that iterates over
the range deletion and range key blocks of a single table. The iterator
returns merged spans from the table corresponding to "ranged deletions".
A "ranged deletion" span contains keys that fall into one of three
cases: the span contains only range deletions; the span contains only
range key deletions; the span contains a mixture of both range deletions
and range key deletions.

The new `tableRangedDeletionIter` will be used to construct spans
necessary for calculating deletion hints on a table that takes range
keys into account. Previously, only range deletion spans were
considered.

For a tables that do not contain range key deletes, the iterator returns
the same set of spans that the existing logic would have returned.
Currently, when constructing delete-only compactions hints, only range
delete spans are considered. With the introduction of range keys, the
current hint construction logic is incorrect. Specifically, a hint
constructed from a range delete may completely cover a table containing
range keys. Currently, this table would be marked as eligible for
deletion. However, the table should only be deleted if the range keys
are also deleted.

Make use of the `tableRangedDeletionIter` for construction of
delete-only compaction hints. A new `deleteCompactionHintType` type is
used to distinguish between hints generated by a range delete (which may
not delete tables containing range keys), hints generated from a range
key delete (which may not delete tables containing point keys), and
hints generated from a span with both a range delete and range key
delete (which may delete any covering table).
Fix the `tableRangedDeletionIter` implementation to preserve the spans
"type" (i.e. range del keys only, range key del keys only, or mixed
keys). The current implementation had a bug where a span with mixed key
kinds could have its type switched when merged with an abutting span.
Defragment the range del and range key block separately. The
`keyspan.Filter` is used to elide non-range key dels from the range key
block. The two separate, defragmented spans are then merged wit the
merging iter.
@nicktrav nicktrav force-pushed the nickt.deletion-only branch from 061b700 to a20684e Compare April 28, 2022 01:54
@nicktrav
Copy link
Contributor Author

Pushed a rebased version that picks up b9e970a.

Still need to incorporate a tweak to the disk estimation to ensure we're not counting point key size under range key del deletion hints.

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 4 files at r5, 2 of 4 files at r6, 2 of 2 files at r7, all commit messages.
Reviewable status: 6 of 7 files reviewed, 6 unresolved discussions (waiting on @itsbilal, @jbowens, and @nicktrav)


compaction.go line 1683 at r7 (raw file):

// deleteCompactionHintType indicates whether the deleteCompactionHint was
// generated from a span containing a range del (point key only), a range delete

nit: "a range key delete (range key only)"

We use the language "range delete" to refer to RANGEDELs in many places. I wish we had unambiguous names to differentiate these, but hard problems in computer science, amirite?


compaction.go line 1717 at r7 (raw file):

		switch k.Kind() {
		case base.InternalKeyKindRangeDelete:
			hintType |= 1 << 0

could this be hintType |= deleteCompactionHintTypePointKeyOnly and down below hintType |= deleteCompactionHintTypeRangeKeyOnly to make the relationship with the enum a little clearer?


compaction.go line 1721 at r7 (raw file):

			hintType |= 1 << 1
		default:
			panic(fmt.Sprintf("unsupported hint hind: %s", k.Kind()))

nit: s/hint hind/key kind/


table_stats.go line 583 at r4 (raw file):

		// Truncate tombstones to the containing file's bounds if necessary.
		// See docs/range_deletions.md for why this is necessary.
		closers = append(closers, iter)

With your change to Truncate, I think we could drop the closers slice now.


internal/keyspan/span.go line 73 at r7 (raw file):

func (k Key) Equal(b Key) bool {
	return k.Trailer == b.Trailer &&
		bytes.Equal(k.Suffix, b.Suffix) &&

bytes.Equal is okay for comparing values, but suffixes need to be compared with a user-provided base.Compare or base.Equal. At the moment, I believe bytes.Equal works for CockroachDB's use, but with cockroachdb/cockroach#77342 equal suffixes are not necessarily byte equal.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 3 of 7 files reviewed, 4 unresolved discussions (waiting on @itsbilal and @jbowens)


compaction.go line 1683 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: "a range key delete (range key only)"

We use the language "range delete" to refer to RANGEDELs in many places. I wish we had unambiguous names to differentiate these, but hard problems in computer science, amirite?

Yeah - bit of a mouthful. Fixed!


compaction.go line 1717 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

could this be hintType |= deleteCompactionHintTypePointKeyOnly and down below hintType |= deleteCompactionHintTypeRangeKeyOnly to make the relationship with the enum a little clearer?

Good idea. Done.


table_stats.go line 583 at r4 (raw file):

Previously, jbowens (Jackson Owens) wrote…

With your change to Truncate, I think we could drop the closers slice now.

Yeah - that disappeared in a20684e. Reviewable is probably not doing it justice.


internal/keyspan/span.go line 73 at r7 (raw file):

Previously, jbowens (Jackson Owens) wrote…

bytes.Equal is okay for comparing values, but suffixes need to be compared with a user-provided base.Compare or base.Equal. At the moment, I believe bytes.Equal works for CockroachDB's use, but with cockroachdb/cockroach#77342 equal suffixes are not necessarily byte equal.

Ah, thanks. Looks like we have a base.Compare wired up, so I'm plumbing that through now for the suffix comparisons.

nicktrav added 2 commits June 9, 2022 14:26
In converting an existing function to a loop, a loop iteration could
early `return`, which would skip subsequent hints. Use `continue`
instead, to consider the remaining hints.

Add a regression test.

This commit will be squashed into the first commit.
Alter the existing test case to show that range del hints only take
point key space into account, and range key del hints do not take point
key space into account.

This will be squashed into the second commit.
@nicktrav
Copy link
Contributor Author

nicktrav commented Jun 9, 2022

I'm going to close this out in favor of a new PR with all the squashing done and a rebase to pick up the latest changes from master.

With all the changes that have happened in the last month (e.g. Transformer and DefragmentMethod interface changes), it's probably worth reviewing this with fresh context.

Happy to re-open this one if that ends up being simpler.

@nicktrav nicktrav closed this Jun 9, 2022
@nicktrav nicktrav deleted the nickt.deletion-only branch June 9, 2022 22:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants