gc: eager GC of dropped tables/indexes in range subspans #107931

Closed
erikgrinaker opened this issue Aug 1, 2023 · 11 comments · Fixed by #110078

@erikgrinaker
Contributor

erikgrinaker commented Aug 1, 2023

@aliher1911 writes:

Problem

Starting from v22.2, schema GC was changed to rely on MVCC GC to clean up data in dropped objects (i.e. tables and indexes), instead of directly using MVCC-history-destructive operations like ClearRange.

Internally, the MVCC GC queue relies on a heuristic to determine whether a range contains enough garbage to justify GC. Since scanning all data is an expensive operation, the heuristic tries to balance the cost of the collection operation against the amount of reclaimed space. We usually see up to a 2x TTL delay before collection if deleted data is evenly spread within the TTL window. [1]

To reduce this delay for schema GC, we use hints on ranges to mark them as empty (i.e. all data was deleted and no more data will be written). When MVCC GC sees such a hint, it expedites the GC operation as soon as the TTL is reached.

Unfortunately, this doesn't work very well for ranges with little or no data, because of range merges. Once such a range is merged with a neighbour sharing the same zone configuration, its GC hint is dropped, since data could be written to another portion of the merged range. Moreover, the MVCC GC heuristic would then be dominated by the new data being written to the range, which can delay removal of the dropped object for a long time, as there isn't enough garbage in that object to justify an expensive scan over its merged neighbour's data.

This scenario can be reproduced locally.

Start cockroach

cockroach start-single-node --insecure

Create test data

root@localhost:26257/defaultdb> create database test;
root@localhost:26257/defaultdb> use test;
root@localhost:26257/test> alter database test configure zone using gc.ttlseconds=600;
root@localhost:26257/test> create table test (id string not null, value string not null, primary key (id));
root@localhost:26257/test> create index val_index on test (value);
root@localhost:26257/test> insert into test (id, value) values ('asdf', '1');

Check ranges in the database

root@localhost:26257/test> show ranges from database test;

This will show two ranges: one for the table and one for the index.

Drop the index

root@localhost:26257/test> drop index val_index;

That creates a GC job, which can be seen in the jobs table.

At this point we will still have two ranges, because the index range has a stickyBit present to prevent immediate merging.
We can clear it out using the debug send-kv-batch CLI:

cockroach debug send-kv-batch req.json --insecure

and providing the request JSON:

{"requests":[{"adminUnsplit":{"header":{"key":"8oo="}}}]}

The key field is the start key of the index range in base64, which can be obtained from the range info endpoint using the index range ID from the query above.
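
For illustration, here's a minimal sketch of how that base64 value can be produced, assuming the index range starts at /Table/106/2 as implied by the JSON above (the actual table ID depends on the cluster):

package main

import (
	"encoding/base64"
	"fmt"
)

func main() {
	// Assumed raw start key of the index range, /Table/106/2: table ID 106
	// encodes to the single byte 0xf2 and index ID 2 to 0x8a.
	rawKey := []byte{0xf2, 0x8a}
	fmt.Println(base64.StdEncoding.EncodeToString(rawKey)) // prints "8oo="
}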

Once the sticky bit is removed, the range will either be merged automatically or can be put through the merge queue. That removes the GC hint. If more data is then added to the merged range, it won't qualify for GC for a long time, leaving the schema GC job stuck for much longer than the table TTL.

Potential solutions

We can potentially keep GC hints when merging ranges to trigger GC earlier, even if other data is being added to the range. That can be expensive if the neighbouring range has a lot of data which has to be scanned. In reality this may not be a big problem, as we would only see it on the boundaries of tables or indexes which contain no data, so the neighbouring tables could also have little data, since the problem mostly touches inactive key regions. We don't need to change any protobufs, and we can include this as a patch.

Alternatively we could make hints more elaborate and keep multiple spans for GC to analyse.

There's also the possibility that we don't need to wait for this data to be deleted at all: if it's small enough, we can keep it as leftover in the range until it is removed as part of regular GC later.

[1] GC Heuristic

// makeMVCCGCQueueScoreImpl is used to compute when to trigger the MVCC GC
// Queue. It's important that we don't queue a replica before a relevant amount
// of data is actually deletable, or the queue might run in a tight loop. To

In brief: the score is the fraction of deletable data. We can't find the exact value for a range without scanning it, so we use estimates.
First, we compute the dead fraction from the ratio of live to total size of the range. This dead data is not necessarily deletable, because it may not have reached the TTL yet.
Second, we need to estimate how much of the dead fraction is actually deletable. To do that, the range maintains a GCBytesAge value, which is the sum of BytesAge over every deleted (or overwritten) entry. BytesAge is the size of an entry multiplied by its age relative to now.
When calculating the GC score, the GC queue computes what GCBytesAge would be if all garbage data had been deleted exactly TTL ago, by multiplying the total garbage size by the TTL. The ratio of the actual GCBytesAge to this value estimates how much of the garbage can be removed.
Multiplying the dead fraction from the first step by this deletable fraction gives an estimate of how much of the total range data could be freed up.
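
As a rough illustrative sketch of that estimate (not the actual makeMVCCGCQueueScoreImpl implementation; names and simplifications are assumptions):

// gcScore sketches the scoring idea above: the dead fraction of the range,
// scaled by an estimate of how much of that dead data has already passed the TTL.
func gcScore(liveBytes, totalBytes, gcBytesAge, ttlSeconds int64) float64 {
	deadBytes := totalBytes - liveBytes
	if totalBytes == 0 || deadBytes <= 0 {
		return 0
	}
	deadFraction := float64(deadBytes) / float64(totalBytes)
	// GCBytesAge would equal deadBytes*TTL if all garbage were exactly TTL old;
	// the ratio against that value estimates the deletable share of the garbage.
	deletableFraction := float64(gcBytesAge) / (float64(deadBytes) * float64(ttlSeconds))
	if deletableFraction > 1 {
		deletableFraction = 1
	}
	return deadFraction * deletableFraction
}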

Jira issue: CRDB-30262

@erikgrinaker erikgrinaker added C-bug Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior. T-kv-replication labels Aug 1, 2023
@blathers-crl

blathers-crl bot commented Aug 1, 2023

cc @cockroachdb/replication

@irfansharif
Contributor

TODO: Also dig out a formula for GC scoring and make it human readable.

@erikgrinaker
Contributor Author

TODO: Also dig out a formula for GC scoring and make it human readable.

Wrote it up in https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/3132325894/MVCC+GC+Queue+Scoring.

@erikgrinaker erikgrinaker added the O-support Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs label Aug 17, 2023
@pav-kv
Collaborator

pav-kv commented Sep 4, 2023

Discussed with @aliher1911 some quick options.

Let's explore the approach mentioned in the description: when merging two ranges, merge their 2 GC hints into one. The technicality is how exactly to merge them.

GC hint == timestamp. After this timestamp+TTL, there is a GC run for this range (modulo cases when there isn't). The goal of this hint is to unblock the schema change jobs waiting on GC to happen. So we can't afford infinite wait. Thus we must at least store the max(hint1, hint2), so that if we merge two cold ranges, both are guaranteed to be GC-scanned. We can implement such a strategy easily, and also backport it.

One flaw in this approach: it is possible to live-lock. Scenario: a range gets merged repeatedly at inconvenient times. The GC hint keeps being bumped up, and never ends up below now()-TTL.

To combat this flaw, we need another timestamp which never gets above some upper bound. A simple way is to add a second field into the GC hint protobuf, and store min(hint1, hint2) in it. So we'll have t1 = min(of all hints), t2 = max(of all hints). When GC is run over this range for t > t1, assign t1 = t2 and t2 = 0 ("promote" t2). Any hint that we omitted is always between t1 and t2, and both t1 and t2 are guaranteed to be GCed (because "promote" is guaranteed, and then the timestamp is bounded), so all intermediate timestamps are guaranteed to be GCed too.

This is similar to the 2-bucket approach that the closed timestamp tracker uses [doc]. Instead of t1=min(hint1, hint2), we could also allow bumping t1 up as long as it stays under the "bucket" upper bound, i.e. allow t1=min(upper_bound, max(t1, t2)) where upper_bound is stored in the GC hint protobuf too, and updated when the buckets swap.

This could be generalized to more than 2 buckets/timestamps (extreme case: keep all the timestamps in an unbounded list), but both approaches above with 2 timestamps seem enough to guarantee deletion within 2 * TTL + epsilon. The second approach could allow 1.5 * TTL + epsilon. Both seem fine, I would lean towards the simpler min&max way.
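
A minimal sketch of the min/max merge and the "promote" step described above, with assumed field names (the real GCHint proto and its handling of unset hints would be more involved):

// gcHint is an illustrative two-timestamp hint (wall nanos; zero means unset).
type gcHint struct {
	t1 int64 // min of all merged hint timestamps
	t2 int64 // max of all merged hint timestamps
}

// mergeHints combines the hints of two ranges being merged, keeping the
// overall min and max (handling of zero/unset hints elided for brevity).
func mergeHints(a, b gcHint) gcHint {
	h := a
	if b.t1 < h.t1 {
		h.t1 = b.t1
	}
	if b.t2 > h.t2 {
		h.t2 = b.t2
	}
	return h
}

// afterGC "promotes" t2 once a GC run has covered garbage up to t1. Every
// timestamp dropped during merging lies between t1 and t2, so it will be
// covered by the run that the promoted hint eventually triggers.
func afterGC(h gcHint, gcThreshold int64) gcHint {
	if gcThreshold > h.t1 {
		return gcHint{t1: h.t2, t2: 0}
	}
	return h
}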

@pav-kv
Collaborator

pav-kv commented Sep 4, 2023

We can potentially keep GC hints when merging ranges to trigger GC earlier, even if other data is being added to the range. That can be expensive if the neighbouring range has a lot of data which has to be scanned. In reality this may not be a big problem, as we would only see it on the boundaries of tables or indexes which contain no data, so the neighbouring tables could also have little data, since the problem mostly touches inactive key regions.

This depends on the workload. Also, multiple tables can be in the same range, right? We might expect that neighbours belonging to the same table are small, but I don't know how true that is for other tables if they happen to be in the same ranges.

We could have some built-in protection against busy looping on GC hints. For example, with the min&max approach above we could enforce that max > min + TTL/2 (or some other fraction of TTL that we're ok to wait in the worst case). Then, we have a guarantee that GC-hint-driven GC doesn't run more frequently than once in TTL/2 per range. Of course we have the "normal" GC too, which can run anytime and is not affected by this pacing.
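
One possible reading of that constraint, as an illustrative sketch on top of the two-timestamp hint above: delay the max hint so that it is at least TTL/2 ahead of the min, which keeps hint-driven GC runs at least TTL/2 apart per range.

// pacedMax delays the max hint if it would fire too soon after the min hint's
// run (names and the exact enforcement point are assumptions).
func pacedMax(minHint, maxHint, ttlNanos int64) int64 {
	if floor := minHint + ttlNanos/2; maxHint < floor {
		return floor
	}
	return maxHint
}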

@pav-kv
Collaborator

pav-kv commented Sep 4, 2023

@erikgrinaker @aliher1911 Do we have a setting/env to ignore GC hints? I.e. a mechanism to disable the extra GC runs that they bring, in case this spins too much.

@erikgrinaker
Contributor Author

From a UX point of view, when users drop a table, they expect it to be GCed once it drops below the GC TTL. The schema change GC job makes it very visible when it isn't.

Your min/max suggestion will do this in the common case when there's a single schema change, or when a few schema changes happen around the same time. In the worst case, is a 2x TTL wait ok? Maybe? Is there a strong reason why we wouldn't just trigger GC at the exact time when it drops below the threshold?

Consider the limit, where a user repeatedly creates and drops a table every second (we see this happen with e.g. CI-type workloads). With a 25-hour TTL, this will produce 90,000 GC hints. Let's throw in some additional tables and merges, and say we end up with 1 million GC hints. Each hlc.Timestamp is 13 bytes, so that's roughly 13 MB of GC hints (for one range). Not terrible, but it's pretty hefty to carry this around in the range state. Not to mention that GC would spin, and we now have to add pacing.

If we do end up with 1 million dropped tables in a single range, the regular GC heuristic is probably sufficient to GC it within reasonable time anyway, and we don't necessarily need to rely on the hint. Using min/max will act as a natural pacing mechanism, in that it will buffer all intermediate schema changes or merges, and avoids unbounded hint growth.

Considering our usual GC heuristic will take 2x TTL to remove garbage, e.g. in the case of continual writes to a single key, it seems ok to wait for 2x TTL in the worst case. I think the problem right now is that the worst case is basically "infinity" (we've seen 90,000-year waits in production clusters).

TL;DR min/max seems fine.

Do we have a setting/env to ignore GC hints? I.e. a mechanism to disable the extra GC runs that they bring, in case this spins too much.

There's kv.mvcc_gc.queue_high_priority_interval.

multiple tables can be in the same range, right?

They can; there are several scenarios:

  • We don't enforce range splits for dropped tables.
  • Serverless doesn't enforce range splits for tables at all.
  • 23.2 with span coalescing doesn't enforce range splits.

In case it isn't clear, this problem isn't just about tweaking the existing GC hint behavior with merges -- we don't write the current GC hint at all when dropping a table that doesn't span an entire range. So the min/max hint needs to be generalized -- we may additionally need something similar to the current hint to ask the GC queue to process adjacent ranges, to make Pebble compactions more efficient.

@pav-kv pav-kv self-assigned this Sep 5, 2023
@pav-kv
Collaborator

pav-kv commented Sep 5, 2023

In case it isn't clear, this problem isn't just about tweaking the existing GC hint behavior with merges -- we don't write the current GC hint at all when dropping a table that doesn't span an entire range.

Here?

// If GCHint was provided, then we need to check if this request meets
// range gc criteria of removing all data. This is not an error as range
// might have merged since request was sent and we don't want to fail
// deletion.
if !args.Key.Equal(desc.StartKey.AsRawKey()) || !args.EndKey.Equal(desc.EndKey.AsRawKey()) {
return nil
}

Can we unconditionally write the GC hint there? It may be wasteful to run GC over this range though (it may have very little garbage if the table was small). But such runs can be paced.

So the min/max hint needs to be generalized -- we may additionally need something similar to the current hint to ask the GC queue to process adjacent ranges, to make Pebble compactions more efficient.

Is this to amortize the wasteful GC runs? How do we know that running it over adjacent ranges is good? They can be static / low-garbage too.

@pav-kv
Collaborator

pav-kv commented Sep 6, 2023

Followed the repro steps. Had to disable kv.range_split.by_load_enabled to unblock the merge that was being prevented by load checks.

With a GC TTL of 10 minutes, GC still hasn't run on the merged range after about 1h, and the GC job for the DROP INDEX doesn't go away. The repro works.

[Screenshot 2023-09-06 at 15:29:43]

The low priority indicates that the score is low, and GC won't be run for a long time?

Manually running the range through the queue with forced shouldQueue=true unblocked the schema change job.

@erikgrinaker
Contributor Author

The low priority indicates that the score is low, and GC won't be run for a long time?

Yep, see: https://cockroachlabs.atlassian.net/wiki/spaces/CKB/pages/3132325894/MVCC+GC+Queue+Scoring

@pav-kv
Collaborator

pav-kv commented Sep 6, 2023

Prototyping the fix in #110078. With commit 380755b, the schema change job completes within 15 minutes (1.5 TTL from index drop).
