-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver/gc: remove range tombstones during GC #83265
Conversation
751c539
to
3602d49
Compare
bba01d8
to
57b87fb
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing some initial comments.
d0a20ae
to
ba3955d
Compare
3df1259
to
476a3d4
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Flushing some more comments, will continue tomorrow.
It would be nice if the commit/PR messages were a bit more descriptive. For example, outlining the overall mechanics, protocol changes, backwards compatibility concerns, performance implications, etc.
@@ -4073,6 +4071,7 @@ func MVCCGarbageCollect( | |||
iter := rw.NewMVCCIterator(MVCCKeyAndIntentsIterKind, IterOptions{ | |||
LowerBound: keys[0].Key, | |||
UpperBound: keys[len(keys)-1].Key.Next(), | |||
KeyTypes: IterKeyTypePointsAndRanges, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's not possible to comment on non-changed lines, so I'm commenting here.
We need to be careful to handle bare range keys whenever we reposition the iterator. In particular, this bit here:
Lines 4210 to 4213 in 476a3d4
// Use the previous version's timestamp if it's for this key. | |
if iter.UnsafeKey().Key.Equal(gcKey.Key) { | |
prevNanos = iter.UnsafeKey().Timestamp.WallTime | |
} |
If this lands on a bare range key with the same start key as the GC key, then Key.Equal()
will return true
but Timestamp
will be 0.
I think the Next()
scan above saves us, in that it ensures there are in fact point key versions above the GC key. However, the code still checks that the keys are equal just to be sure, and I think it'd be worthwhile to confirm that we're on a point key via HasPointAndRange()
too.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see your point. I think this is the case for metadata check as well. The check against the unsafekey could only be done if we are coming from metadata reads that leaves on valid points e.g. ones that have a non zero keyLastSeen timestamp. It comes from either point, or a range key that is covering a point.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep. I mentioned this in a different comment, but essentially, if mvccGetMetadata
is positioned on a bare range key then there's no point in continuing, so we should skip to the nest gcKey
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we return from metadata is insufficient to make a decision. When we synthesize the metadata for range key in absence of keys we leave iterator at either end or at next key which could be next metadata, next value or next range. This is unfortunate as following checks could be incorrect (e.g. if timestamp is below gc threshold we will count meta as deleted, we could also try to clear implicit data from the next key...).
This wasn't caught by our tests as those are cases where we have gc requests not matching data removed, so I added some tests for that. Also we don't have tests for cross talk between keys like that.
I'm not keen on changing mvccGetMetadata as it is used in other places and seem to rely on point synthesis, but maybe not on iterator position after. Maybe it is worth revisiting to simplify later as the code is hard to understand.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, it's a bit iffy that we have to step the iterator in mvccGetMetadata
. I should go back and review all of the other call sites, to make sure that isn't causing any problems, and document it.
I think we can detect that we were on a bare range key and stepped by checking that all of the following hold:
ok == true
: there was existing data (point or range) at the keyrealKeyChanged.IsEmpty() == true
: either there was no real point key, or it was an inline value with a timestamp of 0 (intents return the intent's non-zero timestamp)meta.IsInline() == false
: there was no inline point key
I agree that this is pretty unsatisfactory, but if the above hold then mvccGetMetadata
found a bare range tombstone and we can do continue
to move to the next GC key. In all other cases, the iterator will be positioned on a point key that matches the GC key.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The only other user of mvccGetMetadata
is mvccPutInternal
and it is only using iterator in combination with SeekX
or maybeGetValue
which is also doing Seek
before anything else. So it should be safe Range/Point wise.
I think checking that timestamp of the first version vs realKeyChanged will give an indication that we have a range tombstone covering all versions and then we can apply check for the gc timestamp vs first version to see if we need to GC everything or not. That would eliminate the need for explicit checks and passes all tests.
d70eec6
to
f1725b0
Compare
3da4ff0
to
e52141e
Compare
Sorry for the hold-up here! I'll do another review pass once I'm back home, but I think we're getting there. |
212a1e7
to
b62dd2f
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be good to go once we address the last remaining bug that I mentioned, so approving once that's fixed.
I am a bit worried that we keep finding issues that aren't caught by tests though, so consider extending the randomized test coverage until we're confident in it. We should also do further end-to-end testing of GC and stats as part of the broader range tombstone testing (e.g. #84042).
pkg/storage/ranges.go
Outdated
// ExperimentalMVCCDeleteRangeUsingTombstone can read in order to detect | ||
// adjacent range tombstones to merge with or fragment. The bounds will be | ||
// truncated to the Raft range bounds if given. | ||
func RangeTombstonePeekBounds( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Personally I'd just keep them in batcheval
, since that's where they're used. In storage tests we can usually just pass nil
peek bounds, with integration tests in batcheval
that do spanset assertions (since batcheval
is what cares about latches).
@@ -990,7 +990,7 @@ func mvccGetMetadata( | |||
if !hasPoint || !unsafeKey.Key.Equal(metaKey.Key) { | |||
meta.Deleted = true | |||
meta.Timestamp = rkTimestamp.ToLegacyTimestamp() | |||
return true, int64(EncodedMVCCKeyPrefixLength(metaKey.Key)), 0, hlc.Timestamp{}, nil | |||
return true, 0, 0, hlc.Timestamp{}, nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good idea, I think this makes sense. We should mention this in the function and branch comments.
Do we make use of this new behavior anywhere, though? I suppose we could use it as a clearer way to detect bare range keys.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We don't use it, but after reading comments for the Nth time I realized that this value is meant to represent extra data that needs to be deleted if all keys go. In case of bare range, there's nothing to delete. We do skip the whole key if we detect that realKeyChanged is empty and meta is not inlined so we don't need to look at it or use it.
pkg/storage/mvcc.go
Outdated
// First, check whether all values of the key are being deleted. | ||
// | ||
// We also catch cases of meta synthesized from range tombstone where the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't think we correctly handle this. If we have two range tombstones above a point key, both of which are below gcKey.Timestamp
, then meta.Timestamp
will have the timestamp of the upper range tombstone. However, the key was actually deleted at the lower range tombstone, so this will give the wrong GCBytesAge
. It needs to use realKeyChanged
instead of meta.Timestamp
in those cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I now think that the initial code that just looked on points as they were removed to figure out if all versions are deleted was cleaner and much easier to understand. There are maybe 14 meaningful combinations of how we can put range keys and meta and subsequent keys and what the function returns. Then I just see which combinations of values match which cases, but adding another range above adds a couple more. I'll put a TODO to make it more reasonable.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the initial code had the same bug, since it didn't modify this branch (or am I misremembering, and it skipped it when range keys were present?).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It did, I think usage of metadata is flawed. I think having a metadata object where those if conditions are exposed as boolean methods would be better. Then you can just ask what it is and do action instead of trying to combine values every time you need to make decision. So I moved meta-range tombstone handling to the top so that it doesn't leak to subsequent logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think having a metadata object where those if conditions are exposed as boolean methods would be better.
Yeah, I'd agree with this. I feel like the use of synthetic metadata in general is a bit iffy, but at least centralizing the logic in easy-to-read methods would be a good start. Feel free to attach a commit or submit a follow-up PR.
So I moved meta-range tombstone handling to the top so that it doesn't leak to subsequent logic.
Good idea, that's clearer.
pkg/kv/kvserver/batcheval/cmd_gc.go
Outdated
// computations correctly account for adjacent ranges if then need to be | ||
// split. | ||
l, r := rangeTombstonePeekBounds(rKey.StartKey, rKey.EndKey, rs.GetStartKey().AsRawKey(), nil) | ||
latchSpans.AddMVCC(spanset.SpanReadWrite, roachpb.Span{Key: l, EndKey: r}, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think what we do here for now is fine, but let's create an issue to optimize the latching here (once #83213 lands).
Previously GC didn't take range tombstones into account when removing old point data. This patch adds support for removal of data hidden by range tombstones. This change doesn't change the protocol that GC is using but will look on range tombstones when processing replica in GC queue. Point keys that are covered by range tombstones will be included in the GCRequest same way as keys covered by newer versions or point tombstones. Release note: None
The problem with swapping range tombstones with point tombstones or the other way round is that it wouldn't allow comparing stats, only "read at timestamp" operations outcomes. And actual GC didn't have any failures even before mvcc stats were added. That part is pretty simple actually. |
5fd0f43
to
21ba728
Compare
Previously range tombstones were taken into account when doing point key GC, but were never removed themselves. This PR is adding support for removal of old range keys. This PR is extending GCRequest to include range tombstones. Range tombstones are populated by GC in requests sent after all the point keys under the GC threshold are deleted thus guaranteeing that point keys are not accidentally exposed. When processing GC range tombstone requests, replica does an addtional step to validate these assumptions and fail deletions if there's data still covered by range tombstones. Release note: None
bors r=erikgrinaker |
Build succeeded: |
First commit adds support for range tombstones when removing point keys. Range tombstone will have the same effect as a point tombstone within the mvcc key history.
Second commit adds support for removal of range tombstones below GC threshold.