kvserver: emit MVCC range tombstones over rangefeeds #82718
Conversation
This has a pretty hefty performance regression even in the no-range-keys case, mostly due to known performance issues with Pebble for this case. There is also some known overhead with …
@miretskiy @stevendanna Feel free to have an early look. This is blocked on a few other PRs, but should otherwise be ~ready.
Force-pushed from f3dbf24 to be4f0ec
@@ -472,6 +472,7 @@ func (s *eventStream) streamLoop(ctx context.Context, frontier *span.Frontier) e
		return err
	}
default:
	// TODO(erikgrinaker): Handle DeleteRange events (MVCC range tombstones).
We would want to forward those, right? Do we need to trim it? Perhaps file an issue to describe how these should be handled.
I don't believe we need to handle them in CDC currently, because we'll only drop these across offline spans. They're trimmed to the registration bounds. For C2C, they'll need to be ingested into the target cluster as MVCC range tombstones.
If we're going to use these in online spans, we'd probably want to emit them as distinct events over CDC. Alternatively, we could synthesize point deletion events for any deleted points, but a single range deletion can expand into an unbounded number of point events, which seems bad.
Issue here fwiw: #70433
//
// MVCC range tombstones are emitted at their start key (in chronological
// order), and may thus be output before subsequent keys that they cover. For
// example, the range key [a-f)@3 will be emitted before b@2, even though the
It's strange seeing this, since above we say events are emitted in chronological order...
Keeping strict chronological ordering below range tombstones is going to kill performance. Consider the following example:
5   a5      b5      c5
4   o-------------------o
3   a3      b3      c3
2   o-------------------o
1   a1      b1      c1
    a       b       c       d

If we were to keep strict ordering, then we would have to emit a1,b1,c1 first, then [a-d)@2, then a3,b3,c3, and so on. Going back and forth to scan these slices is essentially going to be random access, which is extremely inefficient. Instead, range keys are emitted before the point keys they cover, which is the optimally efficient way to process them -- e.g. [a-d)@2,[a-d)@4,a1,a3,a5,b1,b3,b5,...
rangeKeysStart = append(rangeKeysStart[:0], rangeBounds.Key...)
rangeKeys = rangeKeys[:0]
for _, rkv := range i.RangeKeys() {
	rangeKeys = append(rangeKeys, rkv.Clone())
Is there a possibility of this slice being huge (memory blowup)?
There's an implicit assumption that the number of overlapping range keys will be small. Pebble pulls them all into memory at a given key position, as do the higher layers.
Force-pushed from 0c80c8c to 0e40997
This should be ready for final review now.
Force-pushed from 0e40997 to a7c1e1f
pkg/storage/mvcc_key.go (outdated)
if len(rangeKeys) == 0 {
	return false
}
if upper.Less(lower) {
Is it a good idea to reverse here? I think it may hide a logical problem at the level above, where the boundaries are used incorrectly. Alternatively, they shouldn't be called upper and lower, and the comment should mention that they can be in any order.
Yeah, good point. I was thinking that given the literal meaning of HasRangeKeyBetween(), it doesn't really matter which order the timestamps are in, but it very much does for e.g. range tombstones. I've changed it now so that it simply returns false if they're in the wrong order, and added a panic assertion when running under race.
Force-pushed from a7c1e1f to 566dff2
CI failure appears unrelated. TFTR! bors r=aliher1911
Build failed (retrying...)
Build failed (retrying...)
Build failed (retrying...)
Build succeeded
This patch adds MVCC range tombstone support in rangefeeds. Whenever an MVCC range tombstone is written, a new `MVCCDeleteRangeOp` logical op is recorded and emitted across the rangefeed as a `RangeFeedDeleteRange` event. MVCC range tombstones will only be written when the `MVCCRangeTombstones` version gate has been enabled.
Changefeeds will emit an error for these events. We do not expect to see these in online spans with changefeeds, since they are initially only planned for use with schema GC and import rollbacks.
The rangefeed client library has been extended with support for these events, but no existing callers handle them, for the same reason as changefeeds. Initial scans do not emit regular tombstones, and thus no range tombstones either, but catchup scans will emit them if encountered.
This patch has rudimentary testing of MVCC range tombstones in
rangefeeds. A later patch will add a data-driven test harness for
rangefeeds with more exhaustive tests.
Resolves #82449.
Touches #70433.
Release note: None