roachpb,gc,mvcc: add UseClearRange option to GCRequest_GCKey #51230
Conversation
@nvanbenschoten I'm aware that this could benefit from commentary and more targeted testing, but I'm curious to hear your first impression.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)
pkg/roachpb/api.proto, line 873 at r4 (raw file):
```proto
bytes key = 1 [(gogoproto.casttype) = "Key"];
util.hlc.Timestamp timestamp = 2 [(gogoproto.nullable) = false];
bool use_clear_range = 3;
```
It is nifty that this field can be safely ignored. That is what allows us to avoid a cluster version gate for this functionality. Perhaps worth a comment.
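To make the discussion concrete, here is a minimal, hypothetical sketch (not code from the PR) of how a GC queue could populate the new field when building a request. The generated Go names `GCRequest_GCKey` and `UseClearRange` follow from the proto above and the PR title; the function name and the 32-version / 256 KiB thresholds are assumptions borrowed from the commit message at the bottom of this thread.

```go
package gcexample

import (
	"github.com/cockroachdb/cockroach/pkg/roachpb"
	"github.com/cockroachdb/cockroach/pkg/util/hlc"
)

// makeGCKey builds a GCRequest_GCKey and sets UseClearRange when clearing the
// key's versions with a range tombstone looks cheaper than point deletions.
// The thresholds echo the heuristic described in the commit message; they are
// illustrative here, not the PR's exact code.
func makeGCKey(key roachpb.Key, ts hlc.Timestamp, numVersions int) roachpb.GCRequest_GCKey {
	gcKey := roachpb.GCRequest_GCKey{Key: key, Timestamp: ts}
	if numVersions > 32 && len(key)*numVersions > 256<<10 {
		// Servers that predate the field simply ignore it, which is what
		// makes it safe to send in a mixed-version cluster (though the PR
		// still adds a version gate; see below).
		gcKey.UseClearRange = true
	}
	return gcKey
}
```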
pkg/storage/mvcc.go, line 3348 at r4 (raw file):
```go
// Clear all versions of gcKey.Key at or below gcKey.Timestamp with a single
// range tombstone.
start := MVCCKey{Key: gcKey.Key, Timestamp: gcKey.Timestamp}
end := MVCCKey{Key: gcKey.Key, Timestamp: hlc.Timestamp{WallTime: 1}}
if err := rw.ClearRange(start, end); err != nil {
	return err
}
```
I keep wondering if there is something we could do at compaction time. @sumeerbhola you were thinking about this. Does it help at all to divide up the work? That is, synchronously update MVCCStats and then asynchronously allow the keys to be deleted via compactions. Perhaps using range tombstones like this is exactly what we'd want to do.
@petermattis see how you feel about it now that it's gated behind a cluster setting.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)
pkg/roachpb/api.proto, line 873 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
It is nifty that this field can be safely ignored. That is what allows us to avoid a cluster version gate for this functionality. Perhaps worth a comment.
We still do want to avoid setting it if we don't think that the server can support it. I've added both commentary and a version gate.
pkg/storage/mvcc.go, line 3348 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I keep wondering if there is something we could do at compaction time. @sumeerbhola you were thinking about this. Does it help at all to divide up the work? That is, synchronously update MVCCStats and then asynchronously allow the keys to be deleted via compactions. Perhaps using range tombstones like this is exactly what we'd want to do.
Here was @sumeerbhola's prototype: #42514
My more recent thinking on the topic is that most of the time, the background cost of GC isn't killing us. Sometimes, when there are loads of versions, it's a big problem and this patch is likely to help a lot.
My sense is that if we do this PR and we deal with the direct interactions between GC and foreground traffic (#24237), we won't hear about GC again for a long time. I do think that #24237 warrants …
In particular, the cost will primarily just become additional scans of the data. Sure, this sucks, but it already happens for the purpose of, say, consistency checking. Once we start counting the data scans we need to do and how those scans interact with the block cache, we'll probably want to revisit this.
@petermattis any chance I can get an updated perspective on this? I get that ideally we'd love to have a solution to the GC problem such that we read all the data once and then write some constant-size raft command, but we're certainly not going to get to that in this release. This seems like reasonably low complexity for what amounts to a big win in specific situations.
My concern about large numbers of range tombstones still stands. How often do you expect the …
It will be invoked whenever the …

I hear you. The way cockroach is today, there are ways in which one can create clear ranges over no keys whatsoever, in particular when you repeatedly drop or truncate empty tables. For example, today … Either way, I've added a cluster setting to make the change more palatable. This will be a life-saver for the rare cases where people accumulate hundreds of megabytes of versions of the same key.
I'm not worried about those range tombstones as they are created in response to a rare user operation. My concern is over creating 10s or 100s of range tombstones per range during GC. Heck, this might actually be fine and better than the alternative. If the range tombstones all end up in a small set of sstables, they will be compacted out of existence fairly quickly. The concern is really that this stresses range tombstones more than we've done previously, so there may be problems.
👍 on the cluster setting. The other thought is to stress this in some fashion, though I don't have good thoughts on how to do so.
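For readers following along, the gate being discussed is an experimental cluster setting. A minimal sketch of what such a setting could look like is below, assuming the `settings.RegisterBoolSetting` helper as it existed in CockroachDB around this time; the setting name is hypothetical and not necessarily what the PR uses.

```go
package gc

import "github.com/cockroachdb/cockroach/pkg/settings"

// useClearRange gates the ClearRange-based GC optimization. It defaults to
// off so operators must opt in while the feature is experimental.
// NOTE: the setting name below is hypothetical, for illustration only.
var useClearRange = settings.RegisterBoolSetting(
	"kv.gc.use_clear_range.enabled",
	"if enabled, GC may use ClearRange to remove many versions of a single key at once",
	false,
)
```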
Given the above, I decided to hold off on merging this until the 21.1 cycle. In that cycle, I'm interested in merging this and ideally going through the process of validating it in Pebble.
Ack. Sounds like a worthwhile thing to merge earlier in the 21.1 cycle.
Sounds like the 21.1 cycle has started now.
Alright, I've rebased this on master. I'm thinking that this change deserves some metrics. My other concern is that it is too willing to issue clear range operations because it doesn't want to do too much look-ahead buffering. Ideally we'd have some number of bytes that we'd certainly be removing per clear range rather than the existing row counts. I can rework it to do that. If we do that and we ensure that that number of bytes is, by default, somewhat large, perhaps 1 MB, then maybe we can enable this by default.
Superseded by #90830.
This commit adds an optimization to massively reduce the overhead of
garbage collecting large numbers of versions. When running garbage
collection, we currently iterate through the store to send (Key, Timestamp)
pairs in a GCRequest to the store to be evaluated and applied. This can
be rather slow when deleting large numbers of keys, particularly due to the
need to paginate. The primary motivation for pagination is to ensure that
we do not create raft commands with deletions that are too large.
In practice, for the tiny values in a sequence, we find that we
can GC around 1800 versions/s with #51184 and around 900 without it (though
note that in that PR the more versions exist, the worse the throughput will
be). This remains abysmally slow. I imagine that using larger batches could
be one approach to increase the throughput, but, as it stands, 256 KiB is
not a tiny raft command.
This instead turns to the ClearRange operation, which can delete all versions
of a key with the replication overhead of just two copies. This approach is
somewhat controversial because, as @petermattis puts it:

> We need to be careful about this. Historically, adding many range tombstones
> was very bad for performance. I think we resolved most (all?) of those
> issues, but I'm still nervous about embracing using range tombstones below
> the level of a Range.

Nevertheless, the results are enticing. Rather than pinning a core at full
utilization for minutes just to clear the versions written to a sequence over
the course of a bit more than an hour, we can clear that in ~2 seconds.
The decision to use a ClearRange is controlled by whether an entire batch would
be used to clear versions of a single key. This means that we'll only send a
clear range if `<key size> * <num versions> > 256 KiB`. There's an additional
sanity check that `<num versions>` is greater than 32 in order to prevent
issuing lots of `ClearRange` operations when the cluster has gigantic keys.
This new functionality is gated behind both a version and an experimental
cluster setting. I'm skipping the release note because of the experimental
flag.
Release note: None
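As an illustration of the batching heuristic described in the commit message, here is a minimal, hypothetical sketch; the function and constant names are invented for this example, and the 256 KiB and 32-version thresholds are taken from the text above.

```go
package gcexample

// Thresholds taken from the commit message; treat them as illustrative.
const (
	maxBatchBytes       = 256 << 10 // 256 KiB, roughly one raft command's worth of deletions
	minVersionsForClear = 32        // avoid ClearRange storms when individual keys are huge
)

// shouldUseClearRange reports whether GC should clear all versions of a key
// with a single range tombstone instead of enumerating point deletions.
// keyBytes is the encoded key size and numVersions the count of garbage
// versions of that key.
func shouldUseClearRange(keyBytes, numVersions int) bool {
	// Only fall back to ClearRange when point deletions for this one key
	// would fill an entire batch on their own, and only when the key has
	// enough versions that the batch size isn't just due to a gigantic key.
	return numVersions > minVersionsForClear && keyBytes*numVersions > maxBatchBytes
}
```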