Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sstable: add range key support to BlockIntervalCollector #1452

Merged

Conversation

nicktrav
Copy link
Contributor

Currently, the sstable.BlockIntervalCollector exists as a helper
struct to which an implementation of
sstable.DataBlockIntervalCollector is passed. The latter contains the
user-defined logic for converting a key and value in an associated
interval that can be maintained.

The BlockIntervalCollector struct handles the logic required to
implement sstable.BlockPropertyCollector, encoding the interval as a
property when the data block is completed, and maintaining index-block-
and table-level intervals.

Update sstable.Writer to pass through range keys to each block
property collector as they are added to the writer.

As range and point keys can be thought of as existing in separate
keyspaces within the same LSM, range keys only contribute to table level
intervals (unioned with the point key interval), rather than block- and
index-level intervals.

Another way to rationalize this decision is to consider that range keys
are contained in a single, dedicated block in the SSTable, while the
point keys can span multiple blocks and have an associated index entry
into which the properties for the specific block are encoded. Block
level filtering applies to point keys only, whereas table-level
filtering takes the union of the point and range key intervals for the
table.

One downside of taking the union of the range and point keys for the
table level interval is an increased rate of false positives when
filtering tables based on an interval and the range key interval is
wider than the point key interval.

This change alters the NewBlockIntervalCollector function to take in
two DataBlockIntervalCollectors, one for point and range keys. This is
required to track the intervals separately within the
BlockIntervalCollector. The caller has the flexibility of passing in
nil for one (or both) of the collectors, in which case either point or
range keys (or both) will be ignored by the collector. This can be used,
for example, to construct a collectors that apply exclusively to either
point or range keys.

@cockroach-teamcity
Copy link
Member

This change is Reviewable

@nicktrav
Copy link
Contributor Author

nicktrav commented Jan 14, 2022

This is an alternative implementation to the one in #1426 which feels both simpler and more flexible.

I made a simplifying assumption that range keys should never affect point key data block properties. Put another way, I couldn't think of a case where one would want a collector that encodes range key properties into a point key data block, given that the range keys will never appear in the point key data blocks (they have their own have a single, dedicated block).

There's a minor API change that requires callers to specify a collector for both point and range keys separately. Given implementations tend to maintain two uints for upper / lower bounds, there's only a small associated cost.

One benefit is that this implementation provides the flexibility to construct a collector that:

  • only applies to point keys (passing a nil range key collector)
  • only applies to range keys (passing a nil point key collector)
  • applies to both point and range keys

For the cases above, this implementation seems better than the alternative of constructing two separate collectors for point and range keys, which would double up on the space used required to store the collector name and encode up to two properties instead of just the one.

@nicktrav nicktrav force-pushed the nickt.range-key-block-prop-collectors branch from 11c9fa2 to 445976e Compare January 14, 2022 00:45
@nicktrav nicktrav mentioned this pull request Jan 14, 2022
29 tasks
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One downside of taking the union of the range and point keys for the
table level interval is an increased rate of false positives when
filtering tables based on an interval and the range key interval is
wider than the point key interval.

In CockroachDB, the practical concern here is that you may have a broad swath of old data in the span [a, z) at timestamps <10. Then you write a MVCCDeleteRange range key [a,z)@100 and its compacted into the same table(s) as the point keys. A time-bound iterator with a read timestamp ≤ @100 will construct a point iterator levelIter that still needs to visit every table, even though there's zero points within the interval. [The rangekey levelIter for the same time-bound iterator does need to visit every table in order to read the rangekey].

It's not a huge deal, because the index-level block properties will still allow us to skip over all the point blocks. But broadly if we stick with a split iterator implementation such that point and range key iterators belonging to the same pebble.Iterator may be looking at different tables, I wonder if we should lean towards recording separate properties. [I think we should also do this with the FileMetadata.{Smallest,Largest}]

Reviewed 1 of 5 files at r1, all commit messages.
Reviewable status: 1 of 5 files reviewed, 3 unresolved discussions (waiting on @nicktrav and @sumeerbhola)


sstable/block_property.go, line 132 at r1 (raw file):

// NewBlockIntervalCollector constructs a BlockIntervalCollector, with the
// given name and data block collector.

nit: expand the comment to describe the role of the two collectors and that either is permitted to be nil?


sstable/writer.go, line 625 at r1 (raw file):

			return err
		}
	}

I'm wondering whether we should even bother add range keys to the table-property collectors. None of the existing implementations will correctly extract suffixes from range-key internal values, and I think the table-property collectors interface is made obsolete by block-property collectors, which also record a table level property. In CockroachDB, we won't write range keys until a CockroachDB cluster version is finalized that knows about block-property collectors. Then after the next release, it seems like we could remove table-property collectors altogether.


sstable/testdata/block_properties, line 122 at r1 (raw file):

point:    [a@5#1,1,c@15#3,1]
rangedel: [#0,0,#0,0]
rangekey: [d@20#4,21,z#72057594037927935,21]

This test collector is a little confusing, because in the context of CockroachDB MVCC timestamps, the suffix on the start key isn't the one that matters*. We care about the suffix(es) encoded within the values.

*The suffix on the start key eg, d@20 only matters for the purpose of specifying the start bound user key. In CockroachDB, a rangekey [email protected]:d@15 [(@10=foo)] takes effect in the user key range [d@20,d@15) but has a MVCC timestamp of @10, which is what we should record to the interval collector. In practice, a CockroachDB MVCCDeleteRange range key with those bounds and suffix value has no real effect because there exist no keys within the range [d@20,d@15) also at timestamps <@10.

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should lean towards recording separate properties.

This implementation would support that, via registering two interval collectors: one with a non-nil point key collector and nil range key collector; and a second with a nil point key collector and a non-nil range key collector. They would have two similar, but different names.

What's not clear to me is whether there may exist a use-case in the future where an interval collector is constructed with both a non-nil point and range key collector with the one name. This implementation is flexible enough to support that, if we ever wanted something like that.

[I think we should also do this with the FileMetadata.{Smallest,Largest}]

Agree. This is on our TODO list.

I wonder if using these bounds for point keys would be enough to filter at the table level in the case that you mention?

Another option is to move the existing DataBlockIntervalCollector interface, BlockIntervalCollector struct, and "interval" concept up into Cockroach, and let the caller have full control over how a block property collector works - that could be a combined collector, or it could be two separate collectors.

Reviewable status: 1 of 5 files reviewed, 3 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/writer.go, line 625 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

I'm wondering whether we should even bother add range keys to the table-property collectors. None of the existing implementations will correctly extract suffixes from range-key internal values, and I think the table-property collectors interface is made obsolete by block-property collectors, which also record a table level property. In CockroachDB, we won't write range keys until a CockroachDB cluster version is finalized that knows about block-property collectors. Then after the next release, it seems like we could remove table-property collectors altogether.

Sounds good. Removed.


sstable/testdata/block_properties, line 122 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

This test collector is a little confusing, because in the context of CockroachDB MVCC timestamps, the suffix on the start key isn't the one that matters*. We care about the suffix(es) encoded within the values.

*The suffix on the start key eg, d@20 only matters for the purpose of specifying the start bound user key. In CockroachDB, a rangekey [email protected]:d@15 [(@10=foo)] takes effect in the user key range [d@20,d@15) but has a MVCC timestamp of @10, which is what we should record to the interval collector. In practice, a CockroachDB MVCCDeleteRange range key with those bounds and suffix value has no real effect because there exist no keys within the range [d@20,d@15) also at timestamps <@10.

I'll rework this to use the suffixes present in the values.

@nicktrav nicktrav force-pushed the nickt.range-key-block-prop-collectors branch from 445976e to 136b1e0 Compare January 18, 2022 17:40
Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They would have two similar, but different names.

I don't think this could be used to avoid unnecessary sstable reads in the scenario described above, at least without additional changes to the block property filter interface, because the combined pebble.Iterator over point and range keys would have one set of block property filters. It'd have to supply both filters.

I wonder if using these bounds for point keys would be enough to filter at the table level in the case that you mention?

It may be enough if the user key span over which range keys are written is much wider than the point keys', but it doesn't help in the case that they have similar user key spans but very different mvcc timestamp ranges.

Maybe the interface simplicity of this approach is worth the slightly reduced filtering effectiveness. Curious if @sumeerbhola has thoughts

Reviewable status: 1 of 5 files reviewed, 3 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)

@nicktrav nicktrav force-pushed the nickt.range-key-block-prop-collectors branch from 136b1e0 to 4c24c85 Compare January 18, 2022 19:28
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: 1 of 5 files reviewed, 3 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/testdata/block_properties, line 122 at r1 (raw file):

Previously, nicktrav (Nick Travers) wrote…

I'll rework this to use the suffixes present in the values.

Updated to pluck the suffix from the value in the case that it's a range key set or unset. I kept the cases for the point keys as this illustrates how the same collector can be used for point-, range-keys, or both.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • I agree that the integration with CockroachDB should have different block property collectors for point and range keys. I don't mind the generality here of the BlockIntervalCollector. I could also live with a BlockIntervalCollector that can do one or the other but not both if there are strong opinions against this current approach.
  • I am not keen on moving BlockIntervalCollector to the CockroachDB code. The way we did TBI in CockroachDB seems hacky from a code cleanliness perspective, and we have a clean separation of concerns here that we can unit test thoroughly.
  • I didn't understand the concern about the combined pebble.Iterator. It would be configured with IterOptions.{PointFilters,RangeKeyFilters}.

Reviewed 3 of 5 files at r1, 1 of 1 files at r2, 1 of 2 files at r3.
Reviewable status: 4 of 5 files reviewed, 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/block_property.go, line 108 at r2 (raw file):

	name   string
	points DataBlockIntervalCollector
	ranges DataBlockIntervalCollector

Consider adding a comment like:
A BlockIntervalCollector that collects over point and range keys needs to have both of these DataBlockIntervalCollectors since range keys are fed to the BlockIntervalCollector as they get fed to the Writer, which means they are interleaved with point keys.

(since someone may wonder why we can't reuse one DataBlockIntervalCollector the way we reuse it for different point data blocks)

Copy link
Collaborator

@jbowens jbowens left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't understand the concern about the combined pebble.Iterator. It would be configured with IterOptions.{PointFilters,RangeKeyFilters}.

That sounds good to me.

Reviewable status: 4 of 5 files reviewed, 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)

Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summarizing a few conversations we had internally. tldr: I think the approach in this PR is sufficient.

I agree that the integration with CockroachDB should have different block property collectors for point and range keys. I don't mind the generality here of the BlockIntervalCollector.

There's a subtle point here in that even though the implementation in this PR provides the ability to specify both a point and range interval collector with the same name, we don't have an immediate use-case for such a collector. Instead, the typical use case (at least in Cockroach), will be to instantiate two interval collectors with different names - one for point keys only, the other for range keys only. The instantiation (in Cockroach) will look like:

const (
  pointKeyCollectorName  = "foo"
  rangeKeyCollectorName = "bar"
)

var PebbleBlockPropertyCollectors = []func() pebble.BlockPropertyCollector{
	func() pebble.BlockPropertyCollector {
		return sstable.NewBlockIntervalCollector(
                        pointKeyCollectorName,
			&pebbleDataBlockMVCCTimeIntervalCollector{}, // point key collector
                        nil, // range key collector
                )
	},
	func() pebble.BlockPropertyCollector {
		return sstable.NewBlockIntervalCollector(
                        rangeKeyCollectorName,
                        nil, // point key collector
			&rangeKeyCollector{}, // range key collector
                )
	},
}

I didn't understand the concern about the combined pebble.Iterator. It would be configured with IterOptions.{PointFilters,RangeKeyFilters}.

Building on from the previous point, as we'll typically have two collectors (one each for points and ranges exclusively), PointFilters will filter only for point keys (with the same pointKeyCollectorName name as above), and RangeKeyFilters will filter only for range keys (with the same rangeKeyCollectorName as above).

This will allow for the independent point and range key filtering that we require.

The filtering code is still a TODO for this PR.

I am not keen on moving BlockIntervalCollector to the CockroachDB code.

Will keep things as they are in Pebble, so as to avoid exposing too much of the internals to Cockroach.

Reviewable status: 4 of 5 files reviewed, 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)

@nicktrav nicktrav force-pushed the nickt.range-key-block-prop-collectors branch 2 times, most recently from ad64f1b to 609c9de Compare January 19, 2022 03:01
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added another series of test cases to demonstrate how filtering works and verify that the point / range filters will work independently.

I might punt the API change to the top-level iterator options (to take in a slice each for point and range keys) to a second PR, so as to keep this one focused on just the collector changes.

All other feedback addressed, so this should be good for another pass.

Reviewable status: 2 of 5 files reviewed, 3 unresolved discussions (waiting on @jbowens and @sumeerbhola)


sstable/block_property.go, line 132 at r1 (raw file):

Previously, jbowens (Jackson Owens) wrote…

nit: expand the comment to describe the role of the two collectors and that either is permitted to be nil?

Done.


sstable/block_property.go, line 108 at r2 (raw file):

Previously, sumeerbhola wrote…

Consider adding a comment like:
A BlockIntervalCollector that collects over point and range keys needs to have both of these DataBlockIntervalCollectors since range keys are fed to the BlockIntervalCollector as they get fed to the Writer, which means they are interleaved with point keys.

(since someone may wonder why we can't reuse one DataBlockIntervalCollector the way we reuse it for different point data blocks)

Done.

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewed 3 of 3 files at r4.
Reviewable status: all files reviewed (commit messages unreviewed), 4 unresolved discussions (waiting on @jbowens, @nicktrav, and @sumeerbhola)


sstable/block_property.go, line 144 at r4 (raw file):

// respectively.
//
// The caller may pass a nil DataBlockIntervalCollector for either (or both) of

or both? what purpose would both being nil serve?

Currently, the `sstable.BlockIntervalCollector` exists as a helper
struct to which an implementation of
`sstable.DataBlockIntervalCollector` is passed. The latter contains the
user-defined logic for converting a key and value in an associated
interval that can be maintained.

The `BlockIntervalCollector` struct handles the logic required to
implement `sstable.BlockPropertyCollector`, encoding the interval as a
property when the data block is completed, and maintaining index-block-
and table-level intervals.

Update `sstable.Writer` to pass through range keys to each block
property collector as they are added to the writer.

As range and point keys can be thought of as existing in separate
keyspaces within the same LSM, range keys only contribute to table level
intervals (unioned with the point key interval), rather than block- and
index-level intervals.

Another way to rationalize this decision is to consider that range keys
are contained in a single, dedicated block in the SSTable, while the
point keys can span multiple blocks and have an associated index entry
into which the properties for the specific block are encoded. Block
level filtering applies to point keys only, whereas table-level
filtering takes the union of the point and range key intervals for the
table.

One downside of taking the union of the range and point keys for the
table level interval is an increased rate of false positives when
filtering tables based on an interval and the range key interval is
wider than the point key interval.

This change alters the `NewBlockIntervalCollector` function to take in
two `DataBlockIntervalCollector`s, one for point and range keys. This is
required to track the intervals separately within the
`BlockIntervalCollector`. The caller has the flexibility of passing in
`nil` for one (or both) of the collectors, in which case either point or
range keys (or both) will be ignored by the collector. This can be used,
for example, to construct a collectors that apply exclusively to either
point or range keys.
@nicktrav nicktrav force-pushed the nickt.range-key-block-prop-collectors branch from 609c9de to 67be653 Compare January 20, 2022 21:14
Copy link
Contributor Author

@nicktrav nicktrav left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TFTRs!

Dismissed @jbowens and @sumeerbhola from 4 discussions.
Reviewable status: 2 of 5 files reviewed, all discussions resolved (waiting on @sumeerbhola)


sstable/block_property.go, line 144 at r4 (raw file):

Previously, sumeerbhola wrote…

or both? what purpose would both being nil serve?

No purpose. The comment was more for completeness. Given it's a non-sensical input, we can just panic if it's misconfigured, as that's user-error. Updated and removed parenthetical.

@nicktrav nicktrav merged commit ec6c216 into cockroachdb:master Jan 20, 2022
@nicktrav nicktrav deleted the nickt.range-key-block-prop-collectors branch January 20, 2022 21:30
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants