gc: eager GC of dropped tables/indexes in range subspans #107931
cc @cockroachdb/replication
TODO: Also dig out a formula for GC scoring and make it human-readable.
Wrote it up in https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/3132325894/MVCC+GC+Queue+Scoring.
Discussed with @aliher1911 some quick options. Let's explore the approach mentioned in the description: when merging two ranges, merge their two GC hints into one. The technicality is how exactly to merge them.

A GC hint is a timestamp: after this timestamp + TTL, there is a GC run for this range (modulo cases when there isn't). The goal of this hint is to unblock the schema change jobs waiting on GC to happen, so we can't afford an infinite wait. Thus we must at least store max(hint1, hint2), so that if we merge two cold ranges, both are guaranteed to be GC-scanned. We can implement such a strategy easily, and also backport it.

One flaw in this approach: it is possible to live-lock. Scenario: a range gets merged repeatedly at inconvenient times, so the GC hint keeps being bumped up and never ends up below now() - TTL. To combat this flaw, we need another timestamp which never gets above some upper bound. A simple way is to add a second field to the GC hint protobuf and store min(hint1, hint2) in it. So we'll have t1 = min(of all hints), t2 = max(of all hints). When GC is run over this range for t > t1, assign t1 = t2 and t2 = 0 ("promote" t2). Any hint that we omitted is always between t1 and t2, and both t1 and t2 are guaranteed to be GCed (because the "promote" is guaranteed, and then the timestamp is bounded), so all intermediate timestamps are guaranteed to be GCed too. This is similar to the 2-bucket approach that the closed timestamp tracker uses [doc].

Instead of t1 = min(hint1, hint2), we could also allow bumping t1 up as long as it stays under the "bucket" upper bound, i.e. allow t1 = min(upper_bound, max(t1, t2)), where upper_bound is stored in the GC hint protobuf too and updated when the buckets swap.

This could be generalized to more than 2 buckets/timestamps (extreme case: keep all the timestamps in an unbounded list), but both 2-timestamp approaches above seem enough to guarantee deletion within 2 * TTL + epsilon. The second approach could allow 1.5 * TTL + epsilon. Both seem fine; I would lean towards the simpler min & max way.
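The min/max merge and "promote" steps can be sketched roughly as follows. This is an illustrative toy, not CockroachDB's implementation: the type and method names are made up, and timestamps are plain int64 walltimes rather than hlc.Timestamp values stored in the real GC hint protobuf.

```go
package main

import "fmt"

// GCHint is a hypothetical two-timestamp hint: Min is the earliest
// outstanding drop timestamp (t1), Max is the latest (t2).
type GCHint struct {
	Min int64
	Max int64
}

// Merge combines the hints of two ranges being merged. Any buffered hint
// lies between Min and Max, so GC at Min+TTL and (after promotion) at
// Max+TTL covers all intermediate drops.
func (h GCHint) Merge(o GCHint) GCHint {
	return GCHint{Min: min64(h.Min, o.Min), Max: max64(h.Max, o.Max)}
}

// Promote is applied after a GC run at t > Min: the pending Max becomes
// the new Min, which bounds the wait for any hint that was buffered.
func (h GCHint) Promote() GCHint {
	return GCHint{Min: h.Max, Max: 0}
}

func min64(a, b int64) int64 {
	if a < b {
		return a
	}
	return b
}

func max64(a, b int64) int64 {
	if a > b {
		return a
	}
	return b
}

func main() {
	a := GCHint{Min: 100, Max: 100} // cold range dropped at t=100
	b := GCHint{Min: 250, Max: 250} // neighbour dropped at t=250
	m := a.Merge(b)
	fmt.Println(m)           // {100 250}
	fmt.Println(m.Promote()) // {250 0}
}
```

A real version would also need to handle empty hints (here a zero Min would incorrectly win the min) and the pacing guard discussed below.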
This depends on workloads. Also, multiple tables can be in the same range, right? We might expect that neighbours belonging to the same table are small, but I don't know whether that holds for other tables that happen to be in the same ranges. We could add some built-in protection against busy-looping on GC hints. For example, with the min & max approach above we could enforce that max > min + TTL/2 (or some other fraction of TTL that we're ok to wait in the worst case). Then we have a guarantee that GC-hint-driven GC doesn't run more frequently than once per TTL/2 per range. Of course we still have the "normal" GC, which can run anytime and is not affected by this pacing.
@erikgrinaker @aliher1911 Do we have a setting/env var to ignore GC hints? I.e. a mechanism to disable the extra GC runs they bring, in case it spins too much.
From a UX point of view, when users drop a table, they expect it to be GCed once it drops below the GC TTL. The schema change GC job makes it very visible when it isn't. Your min/max suggestion will do this in the common case when there's a single schema change, or when a few schema changes happen around the same time. In the worst case, is a 2x TTL wait ok? Maybe?

Is there a strong reason why we wouldn't just trigger GC at the exact time when it drops below the threshold? Consider the limit, where a user repeatedly creates and drops a table every second (we see this happen with e.g. CI-type workloads). With a 25-hour TTL, this will produce 90,000 GC hints. Let's throw in some additional tables and merges, and say we end up with 1 million GC hints. Each hlc.Timestamp is 13 bytes, so that's 12 MB of GC hints (for one range). Not terrible, but it's pretty hefty to carry around in the range state. Not to mention that GC would spin, and we'd now have to add pacing.

If we do end up with 1 million dropped tables in a single range, the regular GC heuristic is probably sufficient to GC it within reasonable time anyway, and we don't necessarily need to rely on the hint. Using min/max will act as a natural pacing mechanism, in that it buffers all intermediate schema changes or merges, and avoids unbounded hint growth. Considering our usual GC heuristic takes 2x TTL to remove garbage, e.g. in the case of continual writes to a single key, it seems ok to wait for 2x TTL in the worst case. I think the problem right now is that the worst case is basically "infinity" (we've seen 90,000-year waits in production clusters).

TL;DR: min/max seems fine.
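As a back-of-the-envelope check of the numbers in the comment above (using only the figures already quoted there: one CREATE/DROP per second, a 25-hour TTL, 1 million hints at 13 bytes each):

```go
package main

import "fmt"

func main() {
	// One CREATE/DROP per second over a 25-hour TTL window:
	hintsPerTTL := 25 * 3600
	fmt.Println(hintsPerTTL) // 90000

	// Size of 1M buffered hints at 13 bytes per hlc.Timestamp; this is
	// where the rough "12 MB" figure comes from (12.4 MiB).
	const hintBytes = 13
	sizeMiB := float64(1_000_000*hintBytes) / (1 << 20)
	fmt.Printf("%.1f MiB\n", sizeMiB) // 12.4 MiB
}
```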
They can; there are several scenarios:
In case it isn't clear, this problem isn't just about tweaking the existing GC hint behavior with merges: we don't write the current GC hint at all when dropping a table that doesn't span an entire range. So the min/max hint needs to be generalized. We may additionally need something similar to the current hint to ask the GC queue to process adjacent ranges, to make Pebble compactions more efficient.
Here? cockroach/pkg/kv/kvserver/batcheval/cmd_delete_range.go, lines 126 to 132 in d80a271
Can we unconditionally write the GC hint there? It may be wasteful to run GC over this range, though (it may have very little garbage if the table was small). But such runs can be paced.
Is this to amortize the wasteful GC runs? How do we know that running it over adjacent ranges is good? They can be static/low-garbage too.
Yep, see: https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/3132325894/MVCC+GC+Queue+Scoring
Prototyping the fix in #110078. With commit 380755b, the schema change job completes within 15 minutes (1.5 TTL from index drop). |
@aliher1911 writes:
Problem
Starting from v22.2, schema GC was changed to rely on MVCC GC to clean up data in dropped objects (i.e. tables, indexes) instead of directly using MVCC-history-destructive ops like clear range.
Internally, the MVCC GC queue relies on a heuristic to determine whether a range contains enough garbage to justify GC. Since scanning all data is an expensive operation, the heuristic tries to balance the cost of the collection operation against the amount of reclaimed space. We usually see a 2x TTL delay before collection if deleted data is evenly spaced within the TTL window. [1]
To reduce this delay for schema GC, we use hints on ranges to mark them as empty (i.e. all data was deleted and no more data will be written). When MVCC GC sees such a hint, it expedites the GC operation as soon as the TTL is reached.
Unfortunately, this doesn't work very well for ranges with little or no data, because of range merges. Once such a range is merged with a neighbour sharing the same zone configuration, its GC hint is dropped, since data could be written to another portion of the merged range. Moreover, the MVCC GC heuristic would then be dominated by the new data being written to the range, which could delay removal of the dropped object for a long time, as there isn't enough garbage in that object to justify an expensive scan over its merged neighbour's data.
This scenario could be reproduced locally.
Start cockroach
Create test data
Check ranges in the database
This will show two ranges: one for the table and one for the index.
Drop the index
This creates a GC job, which can be seen in the jobs table.
At this point we will still have two ranges, because the index range has a sticky bit set to prevent immediate merging.
We can clear it out using the debug send request CLI
and providing the request JSON:
The Key field is the start key of the index range in base64, which can be obtained from the range info endpoint using the index range ID from the query above.
Once the sticky bit is removed, the range will either be merged automatically or can be put through the merge queue. That removes the GC hint. If more data is then added to that range, it won't qualify for GC for a long time, leaving the schema GC job stuck for much longer than the table TTL.
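Producing the base64 Key field for that request JSON can be sketched like this. The key bytes are a made-up placeholder, not a real CockroachDB range start key; in practice they come from the range info endpoint as described above.

```go
package main

import (
	"encoding/base64"
	"fmt"
)

// encodeKey renders raw range start-key bytes as the base64 string
// expected in the Key field of the request JSON.
func encodeKey(startKey []byte) string {
	return base64.StdEncoding.EncodeToString(startKey)
}

func main() {
	// Hypothetical key bytes; substitute the real start key of the
	// index range obtained from the range info endpoint.
	fmt.Println(encodeKey([]byte{0xf2, 0x8a, 0x89})) // 8oqJ
}
```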
Potential solutions
We can potentially keep GC hints when merging ranges, to trigger GC earlier even if other data is being added to the range. That can be expensive if the neighbouring range has a lot of data which has to be scanned. In reality it may not be a big problem, since we would only see this at the boundaries of tables or indexes that contain no data, so neighbouring tables could also have little data, as the problem mostly affects inactive key regions. We don't need to change any protobufs, and we can include this as a patch.
Alternatively we could make hints more elaborate and keep multiple spans for GC to analyse.
There's also the possibility that we don't need to wait for this data to be deleted if it is small enough; we could keep it as a leftover in the range until it is removed as part of regular GC later.
[1] GC Heuristic
cockroach/pkg/kv/kvserver/mvcc_gc_queue.go, lines 291 to 293 in d87a04b
In brief: the score is the fraction of deletable data. We can't find the exact value for the range without scanning it, so we use estimates.
First, we compute the dead fraction from the ratio of live size to total size of the range (dead = total - live). This dead data is not necessarily deletable, because it may not have reached the TTL yet.
Second, we need to estimate how much of the dead fraction is actually deletable. To do that, each range maintains a GCBytesAge value, which is the sum of BytesAge over every deleted (or overwritten) entry. BytesAge is the size of the entry multiplied by its age relative to now.
When calculating the GC score, the GC queue computes what GCBytesAge would be if all garbage data had been deleted at TTL time, by multiplying the total garbage size by the TTL. The ratio of the actual GCBytesAge to this computed value gives an estimate of how much garbage would be removed.
So, multiplying the dead fraction from the first step by the deletable fraction from the second, we get an estimate of how much of the total range data could be freed up.
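The scoring estimate described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the actual queue code: the function name, parameters, and numbers are all made up, and the real implementation works on MVCCStats and applies further adjustments (see the linked MVCC GC Queue Scoring doc).

```go
package main

import "fmt"

// gcScore estimates the freeable fraction of a range's data:
// (dead fraction) * (deletable fraction of the dead data).
func gcScore(liveBytes, totalBytes, gcBytesAge, ttlSeconds int64) float64 {
	garbage := totalBytes - liveBytes
	if totalBytes == 0 || garbage <= 0 {
		return 0
	}
	// Step 1: dead fraction, from live vs total size.
	deadFraction := float64(garbage) / float64(totalBytes)
	// Step 2: GCBytesAge if all garbage had been deleted exactly TTL ago.
	maxAge := float64(garbage) * float64(ttlSeconds)
	// The ratio of actual to maximal age estimates the deletable share.
	deletableFraction := float64(gcBytesAge) / maxAge
	if deletableFraction > 1 {
		deletableFraction = 1
	}
	return deadFraction * deletableFraction
}

func main() {
	// Illustrative numbers: a 100 MiB range, half garbage, where the
	// garbage has accumulated half the age it would have at full TTL.
	score := gcScore(50<<20, 100<<20, (50<<20)*12*3600, 24*3600)
	fmt.Println(score) // 0.25
}
```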
Jira issue: CRDB-30262