Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sql: use row-level TTL for rangelog, eventlog and web_sessions #89453

Open
aadityasondhi opened this issue Oct 5, 2022 · 1 comment
Open
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability

Comments

@aadityasondhi
Copy link
Collaborator

aadityasondhi commented Oct 5, 2022

The system.web_session table is periodically purged. This is handled in purge_auth_seession.go and is controlled using cluster settings. An alternative (possibly better) would be to have the table configured with row-level ttl and have that manage these web_sessions.

Things to consider

  • handling deleteOldExpiredSessionsStmt vs deleteOldRevokedSessionsStmt vs deleteSessionsAutoLogoutStmt
  • we want the TTL to be configurable by the user, without a schema change on the table. This means possibly that the TTL expression should be able to refer to a cluster setting.

related: #88766

Jira issue: CRDB-20259

Epic CRDB-21265

@aadityasondhi aadityasondhi added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability-inf labels Oct 5, 2022
@rafiss
Copy link
Collaborator

rafiss commented Oct 6, 2022

It would probably benefit from doing a bit of up-front investigation, since there might be some complexity with getting the jobs framework to operate on a system table.

@knz knz changed the title server: consider row-level ttl for system.web_sessions table sql: use row-level TTL for rangelog, eventlog and web_sessions Oct 27, 2022
knz added a commit to knz/cockroach that referenced this issue Oct 27, 2022
Prior to this patch, we had two problems:

- the code to GC system.eventlog was not running on secondary tenants.

  This was the original motivation for the work in this commit.

- there were 2 separate implementations of GC, one for
  system.eventlog/rangelog and another one for system.web_sessions.

  This was unfortunate because the first one is generic and was meant
  to be reused. Its implementation is just better all around: it is
  actually able to remove all rows when there are more than 1000
  entries to delete at a time (this was a known limitation of the
  web_session cleanup code), and is optimized to avoid hitting
  tombstones when iterating.

This commit improves the situation by using the common, better
implementation for both cases, and enabling it for secondary tenants.

We acknowledge that with the advent of row-level TTL in SQL, all
this would be better served by a conversion of these tables to use
row-level TTL. However, the fix here (run GC in secondary tenants)
needs to be backported, so we are shy to rely on the row-level TTL
which didn't work as well in previous versions. The remainder of this
work is described in cockroachdb#89453.

Release note (ops change): The cluster settings
`server.web_session.purge.period` and
`server.web_session.purge.max_deletions_per_cycle`, which were
specific to the cleanup function for `system.web_sessions`, have been
replaced by `server.log_gc.period` and
`server.log_gc.max_deletions_per_cycle` which apply to the cleanup
function for `system.eventlog`, `system.rangelog` and
`system.web_sessions` equally.

Release note (ops change): The cluster setting
`server.web_session.auto_logout.timeout` has been removed. It had
never been effective.
knz added a commit to knz/cockroach that referenced this issue Oct 27, 2022
Prior to this patch, we had two problems:

- the code to GC system.eventlog was not running on secondary tenants.

  This was the original motivation for the work in this commit.

- there were 2 separate implementations of GC, one for
  system.eventlog/rangelog and another one for system.web_sessions.

  This was unfortunate because the first one is generic and was meant
  to be reused. Its implementation is just better all around: it is
  actually able to remove all rows when there are more than 1000
  entries to delete at a time (this was a known limitation of the
  web_session cleanup code), and is optimized to avoid hitting
  tombstones when iterating.

This commit improves the situation by using the common, better
implementation for both cases, and enabling it for secondary tenants.

We acknowledge that with the advent of row-level TTL in SQL, all
this would be better served by a conversion of these tables to use
row-level TTL. However, the fix here (run GC in secondary tenants)
needs to be backported, so we are shy to rely on the row-level TTL
which didn't work as well in previous versions. The remainder of this
work is described in cockroachdb#89453.

Release note (ops change): The cluster settings
`server.web_session.purge.period` and
`server.web_session.purge.max_deletions_per_cycle`, which were
specific to the cleanup function for `system.web_sessions`, have been
replaced by `server.log_gc.period` and
`server.log_gc.max_deletions_per_cycle` which apply to the cleanup
function for `system.eventlog`, `system.rangelog` and
`system.web_sessions` equally.

Release note (ops change): The cluster setting
`server.web_session.auto_logout.timeout` has been removed. It had
never been effective.
craig bot pushed a commit that referenced this issue Nov 17, 2022
90789: server: use a common logic for system log gc r=dhartunian a=knz

Fixes #90521.
Subsumes #90523.
Subsumes #88766.
Informs #89453.
Informs #90790.

Prior to this patch, we had two problems:

- the code to GC system.eventlog was not running on secondary tenants.

  This was the original motivation for the work in this commit.

- there were 2 separate implementations of GC, one for system.eventlog/rangelog and another one for system.web_sessions.

  This was unfortunate because the first one is generic and was meant to be reused. Its implementation is just better all around: it is actually able to remove all rows when there are more than 1000 entries to delete at a time (this was a known limitation of the web_session cleanup code), and is optimized to avoid hitting tombstones when iterating.

This commit improves the situation by using the common, better implementation for both cases.

We acknowledge that with the advent of row-level TTL in SQL, all this would be better served by a conversion of these tables to use row-level TTL. However, the fix here (run GC in secondary tenants) needs to be backported, so we are shy to rely on the row-level TTL which didn't work as well in previous versions. The remainder of this work is described in #89453.

Release note (ops change): The cluster settings
`server.web_session.purge.period` and
`server.web_session.purge.max_deletions_per_cycle`, which were specific to the cleanup function for `system.web_sessions`, have been replaced by `server.log_gc.period` and
`server.log_gc.max_deletions_per_cycle` which apply to the cleanup function for `system.eventlog`, `system.rangelog` and `system.web_sessions` equally.

Release note (ops change): The cluster setting `server.web_session.auto_logout.timeout` has been removed. It had never been effective.


Co-authored-by: Raphael 'kena' Poss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-observability-inf C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-observability
Projects
None yet
Development

No branches or pull requests

2 participants