roachpb,gc,mvcc: add UseClearRange option to GCRequest_GCKey #51230

Conversation

ajwerner
Contributor

@ajwerner ajwerner commented Jul 9, 2020

This commit adds an optimization to massively reduce the overhead of
garbage collecting large numbers of versions. When running garbage
collection, we currently iterate through the data to collect (Key, Timestamp)
pairs, which are sent in a GCRequest to the store to be evaluated and applied.
This can be rather slow when deleting large numbers of keys, particularly due
to the need to paginate. The primary motivation for pagination is to ensure
that we do not create raft commands with deletions that are too large.
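
Roughly, that pagination looks something like the following (a simplified sketch, not the actual GC queue code; the byte budget and the `flush` callback are illustrative assumptions):

```go
package gcsketch

import "github.com/cockroachdb/cockroach/pkg/roachpb"

// gcKeyBatchBytes is an illustrative per-batch byte budget, standing in for
// the 256 KiB target batch size mentioned below.
const gcKeyBatchBytes = 256 << 10

// paginateGCKeys accumulates (Key, Timestamp) pairs and flushes a
// GCRequest's worth of keys whenever the byte budget is exceeded, so that no
// single raft command carries too many deletions.
func paginateGCKeys(
	keys []roachpb.GCRequest_GCKey, flush func([]roachpb.GCRequest_GCKey) error,
) error {
	var batch []roachpb.GCRequest_GCKey
	var batchBytes int
	for _, k := range keys {
		batch = append(batch, k)
		batchBytes += len(k.Key)
		if batchBytes >= gcKeyBatchBytes {
			if err := flush(batch); err != nil {
				return err
			}
			batch, batchBytes = nil, 0
		}
	}
	if len(batch) == 0 {
		return nil
	}
	return flush(batch)
}
```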

In practice, for the tiny values in a sequence, we find that we can GC around
1800 versions/s with #51184 and around 900 versions/s without it (though note
that in that PR, the more versions exist, the worse the throughput will be).
This remains abysmally slow. I imagine that using larger batches could be one
approach to increase the throughput, but, as it stands, 256 KiB is not a tiny
raft command.

This instead turns to the ClearRange operation, which can delete all versions
of a key with the replication overhead of just two copies. This approach is
somewhat controversial because, as @petermattis puts it:

> We need to be careful about this. Historically, adding many range tombstones
> was very bad for performance. I think we resolved most (all?) of those issues,
> but I'm still nervous about embracing using range tombstones below the level
> of a Range.

Nevertheless, the results are enticing. Rather than pinning a core at full
utilization for minutes just to clear the versions written to a sequence over
the course of a bit more than an hour, we can clear that in ~2 seconds.

The decision to use a ClearRange is controlled by whether an entire batch would
be used to clear versions of a single key. This means we'll only send a
ClearRange for a key if `<key size> * <num versions> > 256 KiB`. There's an
additional sanity check that `<num versions>` is greater than 32 in order to
prevent issuing lots of ClearRange operations when the cluster has gigantic
keys.
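
In rough pseudocode, the gating condition looks something like this (the constant names are illustrative; the real change is additionally gated behind the cluster setting and version described below):

```go
// Minimal sketch of the heuristic: only use ClearRange when the versions of
// a single key would fill an entire batch, with a version-count sanity check
// so that gigantic keys don't trigger a flood of ClearRange operations.
const (
	clearRangeBatchBytes  = 256 << 10 // target GC batch size
	clearRangeMinVersions = 32        // sanity check for huge keys
)

func useClearRangeForKey(keyBytes, numVersions int) bool {
	return keyBytes*numVersions > clearRangeBatchBytes &&
		numVersions > clearRangeMinVersions
}
```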

This new functionality is gated behind both a version and an experimental
cluster setting. I'm skipping the release note because of the experimental
flag.

Release note: None

@cockroach-teamcity
Member

This change is Reviewable

@ajwerner
Contributor Author

ajwerner commented Jul 9, 2020

@nvanbenschoten I'm aware that this could benefit from commentary and more targeted testing, but I'm curious to hear your first impressions.

Collaborator

@petermattis petermattis left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @ajwerner)


pkg/roachpb/api.proto, line 873 at r4 (raw file):

    bytes key = 1 [(gogoproto.casttype) = "Key"];
    util.hlc.Timestamp timestamp = 2 [(gogoproto.nullable) = false];
    bool use_clear_range = 3;

It is nifty that this field can be safely ignored. That is what allows us to avoid a cluster version gate for this functionality. Perhaps worth a comment.


pkg/storage/mvcc.go, line 3348 at r4 (raw file):

			start := MVCCKey{Key: gcKey.Key, Timestamp: gcKey.Timestamp}
			end := MVCCKey{Key: gcKey.Key, Timestamp: hlc.Timestamp{WallTime: 1}}
			if err := rw.ClearRange(start, end); err != nil {

I keep wondering if there is something we could do at compaction time. @sumeerbhola you were thinking about this. Does it help at all to divide up the work? That is, synchronously update MVCCStats and then asynchronously allow the keys to be deleted via compactions. Perhaps using range tombstones like this is exactly what we'd want to do.

@ajwerner ajwerner mentioned this pull request Jul 10, 2020
@ajwerner ajwerner force-pushed the ajwerner/clear-range-for-gc-of-many-versions branch from 38441b9 to c6b040a on July 13, 2020 19:38
@ajwerner ajwerner requested a review from a team as a code owner July 13, 2020 19:38
@ajwerner ajwerner force-pushed the ajwerner/clear-range-for-gc-of-many-versions branch 4 times, most recently from 8387d18 to 6801349 on July 19, 2020 01:32
@ajwerner ajwerner changed the title [DNM] roachpb,gc,mvcc: add UseClearRange option to GCRequest_GCKey roachpb,gc,mvcc: add UseClearRange option to GCRequest_GCKey Jul 19, 2020
Contributor Author

@ajwerner ajwerner left a comment


@petermattis see how you feel about it now that it's gated behind a cluster setting.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @petermattis and @sumeerbhola)


pkg/roachpb/api.proto, line 873 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

It is nifty that this field can be safely ignored. That is what allows us to avoid a cluster version gate for this functionality. Perhaps worth a comment.

We still do want to avoid setting it if we don't think that the server can support it. I've added both commentary and a version gate.


pkg/storage/mvcc.go, line 3348 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I keep wondering if there is something we could do at compaction time. @sumeerbhola you were thinking about this. Does it help at all to divide up the work? That is, synchronously update MVCCStats and then asynchronously allow the keys to be deleted via compactions. Perhaps using range tombstones like this is exactly what we'd want to do.

Here was @sumeerbhola's prototype: #42514

My more recent thinking on the topic is that most of the time, the background cost of GC isn't killing us. Sometimes, when there are loads of versions, it's a big problem and this patch is likely to help a lot.

My sense is that if we do this PR and we deal with the direct interactions between GC and foreground traffic (#24237), we won't hear about GC again for a long time. I do think that #24237 warrants

In particular, the cost will primarily just become additional scans of the data. Sure, this sucks, but it already happens for the purpose of, say, consistency checking. Once we start accounting for the data scans we need to do and how those scans interact with the block cache, we'll probably want to revisit this.

@ajwerner
Contributor Author

@petermattis any chance I can get an updated perspective on this? I get that ideally we'd love to have a solution to the GC problem such that we read all the data once and then write some constant-size raft command, but we're certainly not going to get to that in this release. This seems like reasonably low complexity for what amounts to a big win in specific situations.

@petermattis
Collaborator

> @petermattis any chance I can get an updated perspective on this? I get that ideally we'd love to have a solution to the GC problem such that we read all the data once and then write some constant-size raft command, but we're certainly not going to get to that in this release. This seems like reasonably low complexity for what amounts to a big win in specific situations.

My concern about large numbers of range tombstones still stands. How often do you expect the UseClearRange option to be invoked? Something else to be aware of is that Pebble will force compaction of sstables containing range tombstones in order to recover space, even in the absence of normal compaction signals (i.e. a level being too large). This might be desirable for the use case here. If you're laying down a range tombstone over a large chunk of obsolete keys, we'll recover the space fairly quickly. On the other hand, this could increase compaction pressure. As usual: "it depends".

@ajwerner
Contributor Author

ajwerner commented Aug 13, 2020

> How often do you expect the UseClearRange option to be invoked?

It will be invoked whenever `keyByteSize*numVersions > 256kB && keyByteSize < 4kB`, i.e. if there are at least 64 versions of 4kB keys or 4k versions of 64-byte keys. This heuristic could be tuned further; it was just convenient given the 256kB target batch size I was already working with.

I hear you. The way cockroach is today, there are ways in which one can create clear ranges over no keys whatsoever, in particular when dropping or truncating empty tables in a tight loop. For example, today `cockroach sql --insecure --watch 1ms -e 'TRUNCATE foo'` will create a clear range over zero keys for each truncate. Is the concern big enough that we should actively fix that by adding a heuristic?

Either way, I've added a cluster setting to make the change more palatable. This will be a life-saver for the rare cases where people accumulate hundreds of megabytes of versions of the same key.

@petermattis
Collaborator

> I hear you. The way cockroach is today, there are ways in which one can create clear ranges over no keys whatsoever, in particular when dropping or truncating empty tables in a tight loop. For example, today `cockroach sql --insecure --watch 1ms -e 'TRUNCATE foo'` will create a clear range over zero keys for each truncate. Is the concern big enough that we should actively fix that by adding a heuristic?

I'm not worried about those range tombstones as they are created in response to a rare user operation. My concern is over creating 10s or 100s of range tombstones per range during GC. Heck, this might actually be fine and better than the alternative. If the range tombstones all end up in a small set of sstables, they will be compacted out of existence fairly quickly. The concern is really that this stresses range tombstones more than we've done previously, so there may be problems.

> Either way, I've added a cluster setting to make the change more palatable. This will be a life-saver for the rare cases where people accumulate hundreds of megabytes of versions of the same key.

👍 on the cluster setting. The other thought is stressing this in some fashion, though I don't have good thoughts on how to do so.

@ajwerner
Contributor Author

Given:

> The concern is really that this stresses range tombstones more than we've done previously, so there may be problems.

I decided to hold off on merging this until the 21.1 cycle. In that cycle, I'm interested in merging this and ideally going through the process of validating it in Pebble.

@petermattis
Collaborator

> The concern is really that this stresses range tombstones more than we've done previously, so there may be problems.

> I decided to hold off on merging this until the 21.1 cycle. In that cycle, I'm interested in merging this and ideally going through the process of validating it in Pebble.

Ack. Sounds like a worthwhile thing to merge earlier in the 21.1 cycle.

@knz
Contributor

knz commented Oct 23, 2020

sounds like the 21.1 cycle has started now

@ajwerner ajwerner force-pushed the ajwerner/clear-range-for-gc-of-many-versions branch from 6801349 to 2b80295 on October 27, 2020 21:18
@ajwerner ajwerner force-pushed the ajwerner/clear-range-for-gc-of-many-versions branch from 2b80295 to 39fabe4 on October 28, 2020 13:48
@ajwerner
Contributor Author

Alright, I've rebased this on top of master. I'm thinking that this change deserves some metrics. My other concern is that it is too willing to issue ClearRange operations, because it doesn't want to do too much look-ahead buffering. Ideally we'd gate on some number of bytes that we'd certainly be removing per ClearRange, rather than the existing row counts. I can rework it to do that. If we do that, and we ensure that that number of bytes defaults to something fairly large, perhaps 1MB, then maybe we can enable this by default.
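
Something along these lines is the shape I have in mind (hypothetical names; the 1 MiB default just stands in for the ~1MB floated above):

```go
// Hypothetical bytes-based gate: only fall back to ClearRange once the bytes
// known to be reclaimed for a single key exceed a large threshold, rather
// than relying on version counts.
const clearRangeMinReclaimedBytes = 1 << 20 // ~1 MiB

func shouldUseClearRange(reclaimableBytes int64) bool {
	return reclaimableBytes >= clearRangeMinReclaimedBytes
}
```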

@erikgrinaker
Contributor

Superseded by #90830.
