*: lower default GC TTL #89233

Closed
irfansharif opened this issue Oct 3, 2022 · 3 comments · Fixed by #93836
Labels
C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

irfansharif commented Oct 3, 2022

This is a tracking issue to lower the default GC TTL from 25h. There's an internal document that more fully spells out the motivations and requisite steps; its text is copied below. There's also an internal Slack thread.

Is your feature request related to a problem? Please describe.

The default GC TTL today is 25h, which means values overwritten within the last 25h are retained. This translates to higher storage use and, for outbox-like workloads where rows are deleted frequently, can make reads costlier CPU-wise, since they have to scan over a larger number of overwritten values to get to the one of interest (this is not fundamental). Since all versions of a key are stored within a single range (we split along key boundaries, not between different versions of a key), we’ve seen incidents where large (>> 512 MB) unsplittable ranges cause cluster instability: snapshot timeouts get triggered, other snapshots get queued behind the stuck one, and the range starts backpressuring writes (which can be indistinguishable from an outage).
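
To make the retention cost concrete, here is a rough back-of-the-envelope sketch. It assumes a steady overwrite rate on a single key and that a version becomes collectible exactly when it ages past the TTL; the function name and the 60 writes/hour figure are illustrative assumptions, not measurements from any cluster.

```python
def retained_versions(overwrites_per_hour: float, gc_ttl_hours: float) -> int:
    """Estimate how many overwritten versions of one key are still retained.

    Assumption: versions are written at a steady rate and become eligible
    for GC as soon as they are older than the TTL. Real GC runs lazily, so
    a cluster may hold on to more than this.
    """
    return int(overwrites_per_hour * gc_ttl_hours)


# Hypothetical key that is overwritten once a minute.
for ttl_hours in (25, 4, 1.5):
    print(f"TTL {ttl_hours}h -> ~{retained_versions(60, ttl_hours)} old versions")
```

Under these assumptions, a read of that key has roughly 6x fewer stale versions to skip at 4h than at 25h.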

We chose a default of 25h originally to accommodate daily incremental backups with revision history. They fail (informatively) if the data you are trying to back up was GC-ed, which happens when incremental backups are taken less frequently than the GC periods for any of the objects in the base backup. In 22.2, however, scheduled backups “chain together” protected timestamp records, which lets a scheduled backup protect only what’s needed and ensures coverage of revision history across each incremental backup. In short, we no longer need a 25h default, hence this issue.
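
The relationship between backup cadence and GC TTL described above can be sketched as a simple check. This is a simplified model that deliberately ignores protected timestamps (which is what scheduled backups in 22.2 use to decouple the two); the function name is hypothetical.

```python
from datetime import timedelta


def incremental_backup_covered(backup_interval: timedelta, gc_ttl: timedelta) -> bool:
    """Return True if revision history survives between incremental backups.

    Simplified assumption: nothing other than the GC TTL protects old
    versions, so a backup taken less often than the TTL finds data GC-ed.
    """
    return backup_interval <= gc_ttl


daily = timedelta(hours=24)
print(incremental_backup_covered(daily, timedelta(hours=25)))  # daily backups, 25h TTL
print(incremental_backup_covered(daily, timedelta(hours=4)))   # daily backups, 4h TTL
```

This is why 25h was a natural default for daily incrementals, and why a 4h default only became viable once backups could protect the data they need by other means.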

Describe the solution you'd like

Lower it to something like 4h or 1.5h.
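
For reference, the GC TTL is set through the zone config field `gc.ttlseconds`, so the candidate values translate to 14400s (4h) and 5400s (1.5h). A minimal sketch of that conversion, where the helper name and the `outbox` table are illustrative assumptions:

```python
from datetime import timedelta


def gc_ttl_statement(ttl: timedelta, target: str = "RANGE default") -> str:
    """Render a CONFIGURE ZONE statement that sets gc.ttlseconds.

    `target` is whatever follows ALTER, e.g. "RANGE default" or
    "TABLE outbox" (hypothetical table name).
    """
    seconds = int(ttl.total_seconds())
    if seconds <= 0:
        raise ValueError("GC TTL must be positive")
    return f"ALTER {target} CONFIGURE ZONE USING gc.ttlseconds = {seconds};"


print(gc_ttl_statement(timedelta(hours=4)))
print(gc_ttl_statement(timedelta(hours=1.5), target="TABLE outbox"))
```

The same helper gives 90000 for the current 25h default, which is a handy sanity check when comparing zone configs before and after a change.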

Jira issue: CRDB-20144

@irfansharif irfansharif added the C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) label Oct 3, 2022
@blathers-crl blathers-crl bot added the T-kv KV Team label Oct 3, 2022

ajwerner commented Oct 3, 2022

@fqazi is actively working on #84911. It should block any change to the GC TTL. A fix should come together soon.

@miretskiy

CDC would need to address #89450

@irfansharif irfansharif self-assigned this Dec 15, 2022
irfansharif added a commit to irfansharif/cockroach that referenced this issue Dec 19, 2022
And add an upgrade for existing clusters to explicitly retain whatever
value they were using pre-upgrade. Fixes cockroachdb#89233.

Release note (general change): The GC TTL previously defaulted to 25h.
This value was configurable using `ALTER RANGE DEFAULT CONFIGURE ZONE
USING gc.ttlseconds = <whatever>`, and could also be scoped to
specific schema objects using `ALTER {DATABASE,TABLE,INDEX} CONFIGURE
ZONE USING ...`. This value determines how long overwritten values are
retained. The `RANGE DEFAULT` value is now lowered to 4h, but only for
freshly created clusters. When existing clusters upgrade onto this
release, we will respect whatever value they were using before the
upgrade for all their schema objects. This will be 25h if the GC TTL was
never altered, or some specific value if set explicitly. We've found the
25h value to translate to higher-than-necessary storage costs,
especially for workloads where rows are deleted frequently. It can also
make for costlier reads with respect to CPU since we currently have to
scan over overwritten values to get to the one of interest. Finally,
we've also observed cluster instability due to large unsplittable ranges
that have accumulated an excessive amount of MVCC garbage.

We chose a default of 25h originally to accommodate daily incremental
backups with revision history. But with the introduction of scheduled
backups in 22.2, we no longer need a large GC TTL. Scheduled
backups "chain together" and prevent garbage collection of relevant
data to ensure coverage of revision history across backups, decoupling
it from whatever value is used for GC TTL. So we no longer need a 25h
default, hence this change.

The GC TTL determines how far back AS OF SYSTEM TIME queries can go;
queries that go past `now()-4h` will now start failing informatively. To
support larger windows for AS OF SYSTEM TIME queries, users are
encouraged to pick a more appropriate GC TTL and set it using `ALTER ...
CONFIGURE ZONE USING gc.ttlseconds = <whatever>`. The earlier
considerations around storage use, read costs, and stability still
apply.

Release note (backward-incompatible change): See release note above.
Technically this is not a backwards-incompatible change since we're only
changing the default value that new clusters are initialized with --
existing clusters will remain unaffected. But it might be worth
highlighting this change more prominently to our users for added
scrutiny.
irfansharif added a commit to irfansharif/cockroach that referenced this issue Jan 5, 2023
Fixes cockroachdb#89233.

Release note (general change): The GC TTL previously defaulted to 25h.
This value was configurable using `ALTER RANGE DEFAULT CONFIGURE ZONE
USING gc.ttlseconds = <whatever>`, and could also be scoped to
specific schema objects using `ALTER {DATABASE,TABLE,INDEX} CONFIGURE
ZONE USING ...`. This value determines how long overwritten values are
retained. The `RANGE DEFAULT` value is now lowered to 4h, but only for
freshly created clusters. When existing clusters upgrade onto this
release, we will respect whatever value they were using before the
upgrade for all their schema objects. This will be 25h if the GC TTL was
never altered, or some specific value if set explicitly. Full cluster
backups taken on earlier version clusters, when restored to clusters
that started off at v23.1, will use the GC TTL recorded in the backup
image.

We've found the 25h value to translate to higher-than-necessary storage
costs, especially for workloads where rows are deleted frequently. It
can also make for costlier reads with respect to CPU since we currently
have to scan over overwritten values to get to the one of interest.
Finally, we've also observed cluster instability due to large
unsplittable ranges that have accumulated an excessive amount of MVCC
garbage. We chose a default of 25h originally to accommodate daily
incremental backups with revision history. But with the introduction of
scheduled backups in 22.2, we no longer need a large GC TTL.
Scheduled backups "chain together" and prevent garbage collection of
relevant data to ensure coverage of revision history across backups,
decoupling it from whatever value is used for GC TTL. So we no longer
need a 25h default, hence this change.

The GC TTL determines how far back AS OF SYSTEM TIME queries can go;
queries that go past `now()-4h` will now start failing informatively. To
support larger windows for AS OF SYSTEM TIME queries, users are
encouraged to pick a more appropriate GC TTL and set it using `ALTER ...
CONFIGURE ZONE USING gc.ttlseconds = <whatever>`. The earlier
considerations around storage use, read costs, and stability still
apply.

Release note (backward-incompatible change): See release note above.
Technically this is not a backwards-incompatible change since we're only
changing the default value that new clusters are initialized with --
existing clusters will remain unaffected. But it might be worth
highlighting this change more prominently to our users for added
scrutiny.
craig bot pushed a commit that referenced this issue Jan 6, 2023
93836: *: lower default GC TTL to 4h r=irfansharif a=irfansharif

Fixes #89233.

Release note (general change): The GC TTL previously defaulted to 25h. This value was configurable using `ALTER RANGE DEFAULT CONFIGURE ZONE USING gc.ttlseconds = <whatever>`, and could also be scoped to specific schema objects using `ALTER {DATABASE,TABLE,INDEX} CONFIGURE ZONE USING ...`. This value determines how long overwritten values are retained. The `RANGE DEFAULT` value is now lowered to 4h, but only for freshly created clusters. When existing clusters upgrade onto this release, we will respect whatever value they were using before the upgrade for all their schema objects. This will be 25h if the GC TTL was never altered, or some specific value if set explicitly. Full cluster backups taken on earlier version clusters, when restored to clusters that started off at v23.1, will use the GC TTL recorded in the backup image.

We've found the 25h value to translate to higher-than-necessary storage costs, especially for workloads where rows are deleted frequently. It can also make for costlier reads with respect to CPU since we currently have to scan over overwritten values to get to the one of interest. Finally, we've also observed cluster instability due to large unsplittable ranges that have accumulated an excessive amount of MVCC garbage. We chose a default of 25h originally to accommodate daily incremental backups with revision history. But with the introduction of scheduled backups in 22.2, we no longer need a large GC TTL. Scheduled backups "chain together" and prevent garbage collection of relevant data to ensure coverage of revision history across backups, decoupling it from whatever value is used for GC TTL. So we no longer need a 25h default, hence this change.

The GC TTL determines how far back AS OF SYSTEM TIME queries can go; queries that go past `now()-4h` will now start failing informatively. To support larger windows for AS OF SYSTEM TIME queries, users are encouraged to pick a more appropriate GC TTL and set it using `ALTER ... CONFIGURE ZONE USING gc.ttlseconds = <whatever>`. The earlier considerations around storage use, read costs, and stability still apply.

Release note (backward-incompatible change): See release note above. Technically this is not a backwards-incompatible change since we're only changing the default value that new clusters are initialized with -- existing clusters will remain unaffected. But it might be worth highlighting this change more prominently to our users for added scrutiny.

Co-authored-by: irfan sharif <[email protected]>
@craig craig bot closed this as completed in 1d01816 Jan 6, 2023
vroldanbet added a commit to authzed/spicedb that referenced this issue Feb 10, 2023
see #1158
see cockroachdb/cockroach#89233

Cockroach is changing their default to 4h in dedicated and
1.25h in Serverless.

This commit changes the lowest common denominator, sans
some padding