
kvserver/gc: remove range tombstones during GC #83265

Merged 2 commits into cockroachdb:master on Jul 18, 2022

Conversation

@aliher1911 (Contributor) commented Jun 23, 2022

The first commit adds support for range tombstones when removing point keys. A range tombstone has the same effect as a point tombstone within the MVCC key history.

The second commit adds support for removal of range tombstones below the GC threshold.

@cockroach-teamcity (Member): This change is Reviewable

@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 6 times, most recently from 751c539 to 3602d49 on June 24, 2022 10:05
@aliher1911 aliher1911 self-assigned this Jun 24, 2022
@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 4 times, most recently from bba01d8 to 57b87fb on June 27, 2022 13:58
@aliher1911 aliher1911 marked this pull request as ready for review June 27, 2022 14:25
@aliher1911 aliher1911 requested a review from a team June 27, 2022 14:25
@aliher1911 aliher1911 requested review from a team as code owners June 27, 2022 14:25
@aliher1911 aliher1911 requested a review from erikgrinaker June 27, 2022 14:25
@erikgrinaker (Contributor) left a comment

Flushing some initial comments.

Resolved review threads: pkg/kv/kvserver/gc/data_distribution_test.go (×3), pkg/kv/kvserver/gc/gc_iterator_test.go, pkg/roachpb/api.proto, pkg/storage/mvcc.go (×5)
@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 5 times, most recently from d0a20ae to ba3955d on July 1, 2022 14:19
@aliher1911 aliher1911 requested a review from erikgrinaker July 1, 2022 14:21
@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 4 times, most recently from 3df1259 to 476a3d4 on July 1, 2022 17:45
@erikgrinaker (Contributor) left a comment

Flushing some more comments, will continue tomorrow.

It would be nice if the commit/PR messages were a bit more descriptive. For example, outlining the overall mechanics, protocol changes, backwards compatibility concerns, performance implications, etc.

Resolved review thread: pkg/storage/mvcc.go
@@ -4073,6 +4071,7 @@ func MVCCGarbageCollect(
iter := rw.NewMVCCIterator(MVCCKeyAndIntentsIterKind, IterOptions{
LowerBound: keys[0].Key,
UpperBound: keys[len(keys)-1].Key.Next(),
KeyTypes: IterKeyTypePointsAndRanges,
@erikgrinaker (Contributor) commented

It's not possible to comment on non-changed lines, so I'm commenting here.

We need to be careful to handle bare range keys whenever we reposition the iterator. In particular, this bit here:

cockroach/pkg/storage/mvcc.go

Lines 4210 to 4213 in 476a3d4

// Use the previous version's timestamp if it's for this key.
if iter.UnsafeKey().Key.Equal(gcKey.Key) {
prevNanos = iter.UnsafeKey().Timestamp.WallTime
}

If this lands on a bare range key with the same start key as the GC key, then Key.Equal() will return true but Timestamp will be 0.

I think the Next() scan above saves us, in that it ensures there are in fact point key versions above the GC key. However, the code still checks that the keys are equal just to be sure, and I think it'd be worthwhile to confirm that we're on a point key via HasPointAndRange() too.
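For illustration, the guard suggested above might look roughly like this (a sketch against the quoted snippet, not the merged code):

// Use the previous version's timestamp if it's for this key.
// Checking HasPointAndRange() ensures we don't read a zero timestamp off a
// bare range key that merely shares the GC key's start key.
if hasPoint, _ := iter.HasPointAndRange(); hasPoint && iter.UnsafeKey().Key.Equal(gcKey.Key) {
	prevNanos = iter.UnsafeKey().Timestamp.WallTime
}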

@aliher1911 (Contributor Author) commented

I see your point. I think this is the case for the metadata check as well. The check against the unsafe key can only be done if we are coming from metadata reads that leave the iterator on valid points, e.g. ones that have a non-zero keyLastSeen timestamp. That comes from either a point key, or a range key that is covering a point.

@erikgrinaker (Contributor) commented

Yep. I mentioned this in a different comment, but essentially, if mvccGetMetadata is positioned on a bare range key then there's no point in continuing, so we should skip to the next gcKey.

@aliher1911 (Contributor Author) commented

I think what we return from metadata is insufficient to make a decision. When we synthesize the metadata for a range key in the absence of point keys, we leave the iterator either at the end or at the next key, which could be the next metadata, the next value, or the next range key. This is unfortunate, as the following checks could be incorrect (e.g. if the timestamp is below the GC threshold we will count the meta as deleted; we could also try to clear implicit data from the next key...).
This wasn't caught by our tests because those are cases where the GC request doesn't match the data being removed, so I added some tests for that. We also don't have tests for cross-talk between keys like that.
I'm not keen on changing mvccGetMetadata as it is used in other places and seems to rely on point synthesis, though maybe not on the iterator position afterwards. Maybe it is worth revisiting to simplify later, as the code is hard to understand.

@erikgrinaker (Contributor) commented Jul 15, 2022

Yeah, it's a bit iffy that we have to step the iterator in mvccGetMetadata. I should go back and review all of the other call sites, to make sure that isn't causing any problems, and document it.

I think we can detect that we were on a bare range key and stepped by checking that all of the following hold:

  • ok == true: there was existing data (point or range) at the key
  • realKeyChanged.IsEmpty() == true: either there was no real point key, or it was an inline value with a timestamp of 0 (intents return the intent's non-zero timestamp)
  • meta.IsInline() == false: there was no inline point key

I agree that this is pretty unsatisfactory, but if the above hold then mvccGetMetadata found a bare range tombstone and we can continue to the next GC key, as sketched below. In all other cases, the iterator will be positioned on a point key that matches the GC key.
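A minimal sketch of that detection (variable names and the mvccGetMetadata signature are assumed from this thread, not necessarily the merged code):

ok, _, _, realKeyChanged, err := mvccGetMetadata(iter, MVCCKey{Key: gcKey.Key}, meta)
if err != nil {
	return err
}
if ok && realKeyChanged.IsEmpty() && !meta.IsInline() {
	// mvccGetMetadata synthesized this metadata from a bare range tombstone:
	// there are no point versions left for this key, so move on.
	continue
}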

@aliher1911 (Contributor Author) commented

The only other user of mvccGetMetadata is mvccPutInternal, and it only uses the iterator in combination with SeekX or maybeGetValue, which also does a Seek before anything else. So it should be safe range/point-wise.

I think comparing the timestamp of the first version against realKeyChanged will indicate that we have a range tombstone covering all versions, and then we can check the GC timestamp against the first version to see whether we need to GC everything or not. That eliminates the need for explicit checks and passes all tests.

Resolved review threads: pkg/storage/mvcc.go (×6), pkg/storage/mvcc_stats_test.go
@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 9 times, most recently from d70eec6 to f1725b0 on July 13, 2022 13:15
@aliher1911 aliher1911 requested a review from erikgrinaker July 13, 2022 13:23
@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 3 times, most recently from 3da4ff0 to e52141e on July 15, 2022 13:36
@erikgrinaker (Contributor) commented

Sorry for the hold-up here! I'll do another review pass once I'm back home, but I think we're getting there.

@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 4 times, most recently from 212a1e7 to b62dd2f on July 18, 2022 09:11
@erikgrinaker (Contributor) left a comment

I think this should be good to go once we address the last remaining bug that I mentioned, so approving once that's fixed.

I am a bit worried that we keep finding issues that aren't caught by tests though, so consider extending the randomized test coverage until we're confident in it. We should also do further end-to-end testing of GC and stats as part of the broader range tombstone testing (e.g. #84042).

Resolved review threads: pkg/storage/mvcc.go, pkg/kv/kvserver/gc/gc_iterator.go
// ExperimentalMVCCDeleteRangeUsingTombstone can read in order to detect
// adjacent range tombstones to merge with or fragment. The bounds will be
// truncated to the Raft range bounds if given.
func RangeTombstonePeekBounds(
@erikgrinaker (Contributor) commented

Personally I'd just keep them in batcheval, since that's where they're used. In storage tests we can usually just pass nil peek bounds, with integration tests in batcheval that do spanset assertions (since batcheval is what cares about latches).
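For context, the truncation described in the quoted doc comment amounts to roughly the following (a sketch that assumes the helper widens the span by one key on each side and clamps to the Raft range bounds; the shipped signature and constants may differ):

func RangeTombstonePeekBounds(
	startKey, endKey, rangeStart, rangeEnd roachpb.Key,
) (roachpb.Key, roachpb.Key) {
	// Peek slightly to the left and right of the written span so the caller
	// can see abutting range keys that may need to merge or fragment.
	leftPeekBound := startKey.Prevish(roachpb.PrevishKeyLength)
	rightPeekBound := endKey.Next()
	// Never peek outside the Raft range bounds when they are given.
	if len(rangeStart) > 0 && leftPeekBound.Compare(rangeStart) <= 0 {
		leftPeekBound = rangeStart
	}
	if len(rangeEnd) > 0 && rightPeekBound.Compare(rangeEnd) >= 0 {
		rightPeekBound = rangeEnd
	}
	return leftPeekBound, rightPeekBound
}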

@@ -990,7 +990,7 @@ func mvccGetMetadata(
if !hasPoint || !unsafeKey.Key.Equal(metaKey.Key) {
meta.Deleted = true
meta.Timestamp = rkTimestamp.ToLegacyTimestamp()
return true, int64(EncodedMVCCKeyPrefixLength(metaKey.Key)), 0, hlc.Timestamp{}, nil
return true, 0, 0, hlc.Timestamp{}, nil
@erikgrinaker (Contributor) commented

Good idea, I think this makes sense. We should mention this in the function and branch comments.

Do we make use of this new behavior anywhere, though? I suppose we could use it as a clearer way to detect bare range keys.

@aliher1911 (Contributor Author) commented

We don't use it, but after reading the comments for the Nth time I realized that this value is meant to represent the extra data that needs to be deleted if all keys go. In the case of a bare range key, there's nothing to delete. We skip the whole key if we detect that realKeyChanged is empty and meta is not inline, so we don't need to look at or use it.

// First, check whether all values of the key are being deleted.
//
// We also catch cases of meta synthesized from range tombstone where the
@erikgrinaker (Contributor) commented

I still don't think we correctly handle this. If we have two range tombstones above a point key, both of which are below gcKey.Timestamp, then meta.Timestamp will have the timestamp of the upper range tombstone. However, the key was actually deleted at the lower range tombstone, so this will give the wrong GCBytesAge. It needs to use realKeyChanged instead of meta.Timestamp in those cases.
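A sketch of the suggested fix in terms of values already available in this branch (assumed names, not the merged code):

// With stacked range tombstones, meta.Timestamp reflects the upper tombstone,
// while realKeyChanged carries the timestamp of the lower one that actually
// deleted the key, so prefer it when accounting for GCBytesAge.
deletionTS := meta.Timestamp.ToTimestamp()
if !realKeyChanged.IsEmpty() && realKeyChanged.Less(deletionTS) {
	deletionTS = realKeyChanged
}
// ... accrue GCBytesAge for the removed versions from deletionTS rather than
// from meta.Timestamp.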

@aliher1911 (Contributor Author) commented

I now think the initial code, which just looked at points as they were removed to figure out whether all versions were deleted, was cleaner and much easier to understand. There are maybe 14 meaningful combinations of how range keys, meta, and subsequent keys can be arranged and what the function returns. I then see which combinations of values match which cases, but adding another range key above adds a couple more. I'll put a TODO to make this more reasonable.

@erikgrinaker (Contributor) commented Jul 18, 2022

I believe the initial code had the same bug, since it didn't modify this branch (or am I misremembering, and it skipped it when range keys were present?).

@aliher1911 (Contributor Author) commented

It did; I think the usage of metadata is flawed. I think having a metadata object where those if conditions are exposed as boolean methods would be better. Then you can just ask what it is and act on it, instead of trying to combine values every time you need to make a decision. So I moved the meta range tombstone handling to the top so that it doesn't leak into the subsequent logic.

@erikgrinaker (Contributor) commented

> I think having a metadata object where those if conditions are exposed as boolean methods would be better.

Yeah, I'd agree with this. I feel like the use of synthetic metadata in general is a bit iffy, but at least centralizing the logic in easy-to-read methods would be a good start. Feel free to attach a commit or submit a follow-up PR.

> So I moved meta-range tombstone handling to the top so that it doesn't leak to subsequent logic.

Good idea, that's clearer.
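To illustrate the refactor being discussed, such a wrapper might look roughly like this (hypothetical type and method names, not part of the PR):

// gcKeyMetadata bundles what mvccGetMetadata returns so that call sites can
// ask direct questions instead of re-deriving answers from field combinations.
type gcKeyMetadata struct {
	meta           enginepb.MVCCMetadata
	realKeyChanged hlc.Timestamp
	found          bool
}

// isBareRangeTombstone reports whether the metadata was synthesized from a
// range tombstone with no point versions underneath, in which case point-key
// GC has nothing to do for this key.
func (m *gcKeyMetadata) isBareRangeTombstone() bool {
	return m.found && m.realKeyChanged.IsEmpty() && !m.meta.IsInline()
}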

// computations correctly account for adjacent ranges if they need to be
// split.
l, r := rangeTombstonePeekBounds(rKey.StartKey, rKey.EndKey, rs.GetStartKey().AsRawKey(), nil)
latchSpans.AddMVCC(spanset.SpanReadWrite, roachpb.Span{Key: l, EndKey: r},
@erikgrinaker (Contributor) commented

I think what we do here for now is fine, but let's create an issue to optimize the latching here (once #83213 lands).

Previously GC didn't take range tombstones into account
when removing old point data. This patch adds support for
removal of data hidden by range tombstones.

This change doesn't alter the protocol that GC uses, but the GC queue will
now look at range tombstones when processing a replica. Point keys that are
covered by range tombstones will be included in the GCRequest the same way
as keys covered by newer versions or point tombstones.

Release note: None
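For readers skimming the thread, the rule described in this commit message boils down to something like the following (purely illustrative helper, not the gc package API):

// A version is collectable once something newer shadows it (a newer point
// version, a point tombstone, or now a range tombstone) and that shadowing
// write sits at or below the GC threshold.
func isGarbage(shadowed bool, shadowedAt, gcThreshold hlc.Timestamp) bool {
	return shadowed && shadowedAt.LessEq(gcThreshold)
}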
@aliher1911 (Contributor Author) commented

The problem with swapping range tombstones for point tombstones (or the other way round) is that it wouldn't allow comparing stats, only the outcomes of "read at timestamp" operations. And actual GC didn't have any failures even before MVCC stats were added; that part is actually pretty simple.

@aliher1911 aliher1911 force-pushed the rts_gc_2 branch 2 times, most recently from 5fd0f43 to 21ba728 on July 18, 2022 16:48
Previously range tombstones were taken into account when doing point key
GC, but were never removed themselves. This PR adds support for removal of
old range keys.

This PR extends GCRequest to include range tombstones. Range tombstones are
populated by GC in requests sent after all the point keys under the GC
threshold are deleted, thus guaranteeing that point keys are not accidentally
exposed.
When processing GC range tombstone requests, the replica does an additional
step to validate these assumptions and fails the deletion if there's data
still covered by range tombstones.

Release note: None
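To make the validation step concrete, it amounts to roughly the following check over the span of the range tombstone being GC'ed (hypothetical helper with assumed names; the merged batcheval code may differ):

// checkNoCoveredPointKeys returns an error if any point version at or below
// the range tombstone's timestamp is still present in its span, since
// removing the tombstone before those versions would expose them to readers.
func checkNoCoveredPointKeys(
	reader storage.Reader, start, end roachpb.Key, rangeKeyTS hlc.Timestamp,
) error {
	iter := reader.NewMVCCIterator(storage.MVCCKeyIterKind, storage.IterOptions{
		LowerBound: start,
		UpperBound: end,
	})
	defer iter.Close()
	for iter.SeekGE(storage.MVCCKey{Key: start}); ; iter.Next() {
		if ok, err := iter.Valid(); err != nil || !ok {
			return err
		}
		if k := iter.UnsafeKey(); k.Timestamp.LessEq(rangeKeyTS) {
			return errors.Errorf("GC of range tombstone would expose %s", k)
		}
	}
}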
@aliher1911 (Contributor Author) commented

bors r=erikgrinaker

@craig craig bot merged commit eb28a3e into cockroachdb:master Jul 18, 2022
@craig (craig bot) commented Jul 18, 2022

Build succeeded.
