Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: emit MVCC range tombstones over rangefeeds #82718

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented Jun 10, 2022

This patch adds MVCC range tombstone support in rangefeeds. Whenever an
MVCC range tombstone is written, a new MVCCDeleteRangeOp logical op
is recorded and emitted across the rangefeed as a RangeFeedDeleteRange
event. MVCC range tombstones will only be written when the
MVCCRangeTombstones version gate has been enabled.

Changefeeds will emit an error for these events. We do not expect to see
these in online spans with changefeeds, since they are initially only
planned for use with schema GC and import rollbacks.

The rangefeed client library has been extended with support for these
events, but no existing callers handle them for the same reason as
changefeeds. Initial scans do not emit regular tombstones, and thus not
range tombstones either, but catchup scans will emit them if
encountered.

This patch has rudimentary testing of MVCC range tombstones in
rangefeeds. A later patch will add a data-driven test harness for
rangefeeds with more exhaustive tests.

Resolves #82449.
Touches #70433.

Release note: None

@erikgrinaker erikgrinaker self-assigned this Jun 10, 2022
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@erikgrinaker
Copy link
Contributor Author

This has a pretty hefty performance regression even in the no-range-keys case, mostly due to known performance issues with Pebble for this case. There is also some known overhead with HasPointAndRange(). I'm inclined to wait until the optimization work for #82045 lands, which will likely address most of this, and then address the rest of the (then hopefully minor) regression post-merge -- that PR is also required for WithDiff correctness.

name                                                 old time/op  new time/op  delta
CatchUpScan/mixed-case/withDiff=true/perc=0.00-24     517ms ± 1%   543ms ± 1%  +4.98%  (p=0.000 n=10+10)
CatchUpScan/mixed-case/withDiff=true/perc=50.00-24    1.03s ± 0%   1.13s ± 0%  +9.28%  (p=0.000 n=10+9)
CatchUpScan/mixed-case/withDiff=true/perc=75.00-24    901ms ± 1%   984ms ± 0%  +9.20%  (p=0.000 n=10+9)
CatchUpScan/mixed-case/withDiff=true/perc=95.00-24    186ms ± 1%   202ms ± 1%  +8.79%  (p=0.000 n=10+10)
CatchUpScan/mixed-case/withDiff=true/perc=99.00-24    157ms ± 1%   170ms ± 1%  +8.29%  (p=0.000 n=10+9)
CatchUpScan/mixed-case/withDiff=false/perc=0.00-24    771ms ± 1%   825ms ± 1%  +6.98%  (p=0.000 n=10+10)
CatchUpScan/mixed-case/withDiff=false/perc=50.00-24   659ms ± 1%   706ms ± 1%  +7.07%  (p=0.000 n=10+10)
CatchUpScan/mixed-case/withDiff=false/perc=75.00-24   590ms ± 1%   632ms ± 1%  +7.17%  (p=0.000 n=10+10)
CatchUpScan/mixed-case/withDiff=false/perc=95.00-24   170ms ± 1%   184ms ± 0%  +8.29%  (p=0.000 n=9+10)
CatchUpScan/mixed-case/withDiff=false/perc=99.00-24   153ms ± 1%   167ms ± 1%  +9.01%  (p=0.000 n=10+10)

@erikgrinaker
Copy link
Contributor Author

@miretskiy @stevendanna Feel free to have an early look. This is blocked on a few other PRs, but should otherwise be ~ready.

@@ -472,6 +472,7 @@ func (s *eventStream) streamLoop(ctx context.Context, frontier *span.Frontier) e
return err
}
default:
// TODO(erikgrinaker): Handle DeleteRange events (MVCC range tombstones).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We would want to forward those, right? Do we need to trim it? Perhaps file an issue to describe how these these should be handled. Pr

Copy link
Contributor Author

@erikgrinaker erikgrinaker Jun 10, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we need to handle them in CDC currently, because we'll only drop these across offline spans. They're trimmed to the registration bounds. For C2C, they'll need to be ingested into the target cluster as MVCC range tombstones.

If we're going to use these in online spans, we'd probably want to emit them as distinct events over CDC. Alternatively, we could synthesize point deletion events for any deleted points, but a single range deletion can expand into an unbounded number of point events, which seems bad.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Issue here fwiw: #70433

//
// MVCC range tombstones are emitted at their start key (in chronological
// order), and may thus be output before subsequent keys that they cover. For
// example, the range key [a-f)@3 will be emitted before b@2, even though the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's strange seeing this since above we say events emitted in chronological order...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Keeping strict chronological ordering below range tombstones is going to kill performance. Consider the following example:

5   a5  b5  c5
4   o-----------o
3   a3  b3  c3
2   o-----------o
1   a1  b1  c1
    a   b   c   d

If we were to keep strict ordering, then we would have to emit a1,b1,c1 first, then [a-d)@2, then a3,b3,c3, and so on. Going back and forth to scan these slices is essentially going to be random access, which is extremely inefficient. Instead, range keys are emitted before the point keys they cover, which is the optimally efficient way to process them -- e.g. [a-d)@2,[a-d)@4,a1,a3,a5,b1,b3,b5,....

rangeKeysStart = append(rangeKeysStart[:0], rangeBounds.Key...)
rangeKeys = rangeKeys[:0]
for _, rkv := range i.RangeKeys() {
rangeKeys = append(rangeKeys, rkv.Clone())
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a possibility of this array being huge (memory blowup?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's an implicit assumption that the number of overlapping range keys will be small. Pebble pulls them all into memory at a given key position, as does the higher layers.

@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-rangefeed branch 3 times, most recently from 0c80c8c to 0e40997 Compare June 18, 2022 10:13
@erikgrinaker erikgrinaker marked this pull request as ready for review June 18, 2022 10:13
@erikgrinaker erikgrinaker requested review from a team as code owners June 18, 2022 10:13
@erikgrinaker erikgrinaker requested review from gh-casper and removed request for a team June 18, 2022 10:13
@erikgrinaker
Copy link
Contributor Author

erikgrinaker commented Jun 18, 2022

This should be ready for final review now.

@erikgrinaker erikgrinaker requested review from gh-casper and aliher1911 and removed request for gh-casper June 18, 2022 10:15
@shermanCRL shermanCRL requested a review from HonoreDB June 18, 2022 16:34
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-rangefeed branch from 0e40997 to a7c1e1f Compare June 26, 2022 17:15
Copy link
Contributor

@aliher1911 aliher1911 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some minor nits, otherwise :lgtm:

if len(rangeKeys) == 0 {
return false
}
if upper.Less(lower) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it a good idea to reverse here? I think it may hide some logical problem on level above where you use boundaries incorrectly. Alternatively they shouldn't be called upper and lower and it should be mentioned in the comment that they could be in any order.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good point. I was thinking that given the literal meaning of HasRangeKeyBetween(), it doesn't really matter which order the timestamps are in, but it very much does for e.g. range tombstones. I've changed it now so that it simply returns false if they're in the wrong order, and added a panic assertion when running under race.

This patch adds MVCC range tombstone support in rangefeeds. Whenever an
MVCC range tombstone is written, a new `MVCCDeleteRangeOp` logical op
is recorded and emitted across the rangefeed as a `RangeFeedDeleteRange`
event. MVCC range tombstones will only be written when the
`MVCCRangeTombstones` version gate has been enabled.

Changefeeds will emit an error for these events. We do not expect to see
these in online spans with changefeeds, since they are initially only
planned for use with schema GC and import rollbacks.

The rangefeed client library has been extended with support for these
events, but no existing callers handle them for the same reason as
changefeeds. Initial scans do not emit regular tombstones, and thus not
range tombstones either, but catchup scans will emit them if
encountered.

This patch has rudimentary testing of MVCC range tombstones in
rangefeeds. A later patch will add a data-driven test harness for
rangefeeds with more exhaustive tests.

Release note: None
@erikgrinaker erikgrinaker force-pushed the mvcc-range-tombstones-rangefeed branch from a7c1e1f to 566dff2 Compare June 28, 2022 12:10
@erikgrinaker
Copy link
Contributor Author

CI failure appears unrelated. TFTR!

bors r=aliher1911

@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build failed (retrying...):

@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build failed (retrying...):

@craig craig bot merged commit 5541cf8 into cockroachdb:master Jun 28, 2022
@craig
Copy link
Contributor

craig bot commented Jun 28, 2022

Build succeeded:

@erikgrinaker erikgrinaker deleted the mvcc-range-tombstones-rangefeed branch June 30, 2022 10:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

kvserver: emit MVCC range tombstones across rangefeeds
4 participants