roachpb,gc,mvcc: add UseClearRange option to GCRequest_GCKey #51230
Conversation
@nvanbenschoten I'm aware that this could benefit from commentary and more targeted testing, but I'm curious to hear your first impression.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/roachpb/api.proto, line 873 at r4 (raw file):
```proto
bytes key = 1 [(gogoproto.casttype) = "Key"];
util.hlc.Timestamp timestamp = 2 [(gogoproto.nullable) = false];
bool use_clear_range = 3;
```
It is nifty that this field can be safely ignored. That is what allows us to avoid a cluster version gate for this functionality. Perhaps worth a comment.
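To make the discussion concrete, here is a minimal, hypothetical sketch (not code from the PR) of how a GC queue could populate the new field when building a request. The generated Go names `GCRequest_GCKey` and `UseClearRange` follow from the proto above and the PR title; the function name and the 32-version / 256 KiB thresholds are assumptions borrowed from the commit message at the bottom of this thread.

```go
package gcexample

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// makeGCKey builds a GCRequest_GCKey and sets UseClearRange when clearing the
// key's versions with a range tombstone looks cheaper than point deletions.
// The thresholds echo the heuristic described in the commit message; they are
// illustrative here, not the PR's exact code.
func makeGCKey(key roachpb.Key, ts hlc.Timestamp, numVersions int) roachpb.GCRequest_GCKey {
	gcKey := roachpb.GCRequest_GCKey{Key: key, Timestamp: ts}
	if numVersions > 32 && len(key)*numVersions > 256<<10 {
		// Servers that predate the field simply ignore it, which is what
		// makes it safe to send in a mixed-version cluster (though the PR
		// still adds a version gate; see below).
		gcKey.UseClearRange = true
	}
	return gcKey
}
```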
pkg/storage/mvcc.go, line 3348 at r4 (raw file):
```go
// Clear all versions of gcKey.Key at or below gcKey.Timestamp with a single
// range tombstone.
start := MVCCKey{Key: gcKey.Key, Timestamp: gcKey.Timestamp}
end := MVCCKey{Key: gcKey.Key, Timestamp: hlc.Timestamp{WallTime: 1}}
if err := rw.ClearRange(start, end); err != nil {
	return err
}
```
I keep wondering if there is something we could do at compaction time. @sumeerbhola you were thinking about this. Does it help at all to divide up the work? That is, synchronously update MVCCStats and then asynchronously allow the keys to be deleted via compactions. Perhaps using range tombstones like this is exactly what we'd want to do.
@petermattis see how you feel about it now that it's gated behind a cluster setting.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)
pkg/roachpb/api.proto, line 873 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It is nifty that this field can be safely ignored. That is what allows us to avoid a cluster version gate for this functionality. Perhaps worth a comment.
We still do want to avoid setting it if we don't think that the server can support it. I've added both commentary and a version gate.
pkg/storage/mvcc.go, line 3348 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I keep wondering if there is something we could do at compaction time. @sumeerbhola you were thinking about this. Does it help at all to divide up the work? That is, synchronously update MVCCStats and then asynchronously allow the keys to be deleted via compactions. Perhaps using range tombstones like this is exactly what we'd want to do.
Here was @sumeerbhola's prototype: #42514
My more recent thinking on the topic is that most of the time, the background cost of GC isn't killing us. Sometimes, when there are loads of versions, it's a big problem and this patch is likely to help a lot.
My sense is that if we do this PR and we deal with the direct interactions between GC and foreground traffic (#24237), we won't hear about GC again for a long time. I do think that #24237 warrants …
In particular, the cost will primarily just become additional scans of the data. Sure, this sucks, but it already happens for the purpose of, say, consistency checking. Once we start counting the data scans we need to do and how those scans interact with the block cache, we'll probably want to revisit this.
@petermattis any chance I can get an updated perspective on this? I get that ideally we'd love to have a solution to the GC problem such that we read all the data once and then write some constant-size raft command, but we're certainly not going to get to that in this release. This seems like reasonably low complexity for what amounts to a big win in specific situations.
My concern about large numbers of range tombstones still stands. How often do you expect the …
It will be invoked whenever the …

I hear you. The way cockroach is today, there are ways in which one can create clear ranges over no keys whatsoever, in particular when you repeatedly drop or truncate empty tables. For example, today … Either way, I've added a cluster setting to make the change more palatable. This will be a life-saver for the rare cases where people accumulate hundreds of megabytes of versions of the same key.
I'm not worried about those range tombstones as they are created in response to a rare user operation. My concern is over creating 10s or 100s of range tombstones per range during GC. Heck, this might actually be fine and better than the alternative. If the range tombstones all end up in a small set of sstables, they will be compacted out of existence fairly quickly. The concern is really that this stresses range tombstones more than we've done previously, so there may be problems.
👍 on the cluster setting. The other thought is to stress this in some fashion, though I don't have good thoughts on how to do so.
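For readers following along, the gate being discussed is an experimental cluster setting. A minimal sketch of what such a setting could look like is below, assuming the `settings.RegisterBoolSetting` helper as it existed in CockroachDB around this time; the setting name is hypothetical and not necessarily what the PR uses.

```go
package gc

import "github.com/cockroachdb/cockroach/pkg/settings"

// useClearRange gates the ClearRange-based GC optimization. It defaults to
// off so operators must opt in while the feature is experimental.
// NOTE: the setting name below is hypothetical, for illustration only.
var useClearRange = settings.RegisterBoolSetting(
	"kv.gc.use_clear_range.enabled",
	"if enabled, GC may use ClearRange to remove many versions of a single key at once",
	false,
)
```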
Given the above, I decided to hold off on merging this until the 21.1 cycle. In that cycle, I'm interested in merging this and ideally going through the process of validating it in Pebble.
Ack. Sounds like a worthwhile thing to merge earlier in the 21.1 cycle.
Sounds like the 21.1 cycle has started now.
Alright, I've rebased this on master. I'm thinking that this change deserves some metrics. My other concern is that it is too willing to issue clear range operations because it doesn't want to do too much look-ahead buffering. Ideally we'd have some number of bytes that we'd certainly be removing per clear range rather than the existing row counts. I can rework it to do that. If we do that and we ensure that that number of bytes is, by default, somewhat large, perhaps 1 MB, then maybe we can enable this by default.
Superseded by #90830.
This commit adds an optimization to massively reduce the overhead of
garbage collecting large numbers of versions. When running garbage
collection, we currently iterate through the store to send (Key, Timestamp)
pairs in a GCRequest to the store to be evaluated and applied. This can
be rather slow when deleting large numbers of keys, particularly due to the
need to paginate. The primary motivation for pagination is to ensure that
we do not create raft commands with deletions that are too large.
In practice, for the tiny values in a sequence, we find that we
can GC around 1800 versions/s with #51184 and around 900 without it (though
note that in that PR the more versions exist, the worse the throughput will
be). This remains abysmally slow. I imagine that using larger batches could
be one approach to increase the throughput, but, as it stands, 256 KiB is
not a tiny raft command.
This instead turns to the ClearRange operation, which can delete all versions
of a key with the replication overhead of just two copies. This approach is
somewhat controversial because, as @petermattis puts it:

> We need to be careful about this. Historically, adding many range tombstones
> was very bad for performance. I think we resolved most (all?) of those
> issues, but I'm still nervous about embracing using range tombstones below
> the level of a Range.

Nevertheless, the results are enticing. Rather than pinning a core at full
utilization for minutes just to clear the versions written to a sequence over
the course of a bit more than an hour, we can clear that in ~2 seconds.
The decision to use a ClearRange is controlled by whether an entire batch would
be used to clear versions of a single key. This means that we'll only send a
clear range if `<key size> * <num versions> > 256 KiB`. There's an additional
sanity check that `<num versions>` is greater than 32 in order to prevent
issuing lots of `ClearRange` operations when the cluster has gigantic keys.
This new functionality is gated behind both a version and an experimental
cluster setting. I'm skipping the release note because of the experimental
flag.
Release note: None
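As an illustration of the batching heuristic described in the commit message, here is a minimal, hypothetical sketch; the function and constant names are invented for this example, and the 256 KiB and 32-version thresholds are taken from the text above.

```go
package gcexample

// Thresholds taken from the commit message; treat them as illustrative.
const (
	maxBatchBytes       = 256 << 10 // 256 KiB, roughly one raft command's worth of deletions
	minVersionsForClear = 32        // avoid ClearRange storms when individual keys are huge
)

// shouldUseClearRange reports whether GC should clear all versions of a key
// with a single range tombstone instead of enumerating point deletions.
// keyBytes is the encoded key size and numVersions the count of garbage
// versions of that key.
func shouldUseClearRange(keyBytes, numVersions int) bool {
	// Only fall back to ClearRange when point deletions for this one key
	// would fill an entire batch on their own, and only when the key has
	// enough versions that the batch size isn't just due to a gigantic key.
	return numVersions > minVersionsForClear && keyBytes*numVersions > maxBatchBytes
}
```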