Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: start purging web_sessions in secondary tenants #90523

Merged
merged 2 commits into from
Oct 25, 2022

Conversation

knz
Copy link
Contributor

@knz knz commented Oct 23, 2022

Fixes #90522.

The next PR #90384 will ensure this code lands in the right place.

@knz knz requested a review from a team October 23, 2022 18:38
@knz knz requested review from a team as code owners October 23, 2022 18:38
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@knz
Copy link
Contributor Author

knz commented Oct 24, 2022

cc @dhartunian - who in your team is best to review this?

@rafiss
Copy link
Collaborator

rafiss commented Oct 24, 2022

drive-by to link this related issue, which might help eliminate some complexity here in the long term: #89453

Copy link
Collaborator

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained

@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

TFYR!

bors r=aadityasondhi

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

Build failed:

@knz knz force-pushed the 20221023-web-session-purge branch from 7e42516 to 4053cd4 Compare October 25, 2022 11:55
@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

bors r=aadityasondhi,stevendanna

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

Build failed:

@knz knz force-pushed the 20221023-web-session-purge branch from 4053cd4 to 20fb7e2 Compare October 25, 2022 12:38
@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

@rafiss This PR started failing on TestComposeGSSPython since yesterday, without any change in the code. Do you have an inkling of what changed?

@rafiss
Copy link
Collaborator

rafiss commented Oct 25, 2022

No I don't know anything that changed. I'll try looking into this.

Also, does this need a backport?

@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

I'll try looking into this.

we have an internal slack thread

Also, does this need a backport?

not that i know

@rafiss
Copy link
Collaborator

rafiss commented Oct 25, 2022

My concern was that our serverless customers (all of which are going to be on v22.2) could potentially see unbounded growth of this table. See https://github.com/cockroachlabs/support/issues/1812 for an example of how that led to issues on a self-hosted cluster. I'm worried that if it leads to CPU usage issues in serverless, then we would charge a customer for RUs that they cannot control.

@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

My concern was that our serverless customers (all of which are going to be on v22.2) could potentially see unbounded growth of this table.

oh yeah I can see this too. So your recommendation would be to backport?

(We have the same problem with system.eventlog btw. There's no PR for that yet.)

@rafiss
Copy link
Collaborator

rafiss commented Oct 25, 2022

Yes, I vote for backporting - v22.2 seems important since we'll soon upgrade serverless to that version, but might as well do v22.1 if it cherry-picks cleanly since that's what serverless is using right now.

Is there an issue for the eventlog one? If not, would you mind filing?

@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

it's this issue #90521

@knz
Copy link
Contributor Author

knz commented Oct 25, 2022

bors r=aadityasondhi,stevendanna

@craig
Copy link
Contributor

craig bot commented Oct 25, 2022

Build succeeded:

@craig craig bot merged commit 66beb98 into cockroachdb:master Oct 25, 2022
@knz knz deleted the 20221023-web-session-purge branch October 25, 2022 17:34
craig bot pushed a commit that referenced this pull request Oct 26, 2022
90608: server: add comments and various cleanups r=andreimatei a=knz

This PR is extracted from #90384 for ease of review.

First two commits from #90523.

Fixes #90550. 
Fixes #90607.

Co-authored-by: Raphael 'kena' Poss <[email protected]>
craig bot pushed a commit that referenced this pull request Oct 26, 2022
90384: server: split the tenant server creation into 3 stages r=dhartunian a=knz

Needed as prerequisite to #90176, towards fixing #89974.
First two commits from #90523.

The original motivation (and ultimate goal) for this PR is to
split the tenant server initialization into three phases: `New()`,
`PreStart()`, `AcceptClients()`, so as to reuse a common process startup
logic in a separate PR (#90176).

To achieve this goal, this PR re-orders the initialization steps
in `server.NewTenantServer` (previously known as
`server.StartTenant`), and extracts many of them into a new method
`(*SQLServerWrapper).PreStart()`.

The specific order of the code in `NewTenantServer()` and
`(*SQLServerWrapper).PreStart()` was chosen to mirror the order of
things in `NewServer()` and `(*Server).PreStart()`.

Reason for using the same order:

- it makes the review of this change easier: the reviewer can pull
  `server.go` and `tenant.go` and read them side-by-side, to satisfy
  themselves that the two implementations of
  `NewServer`/`NewTenantServer` and `PreStart` are equivalent.

- it will make it easier for future maintainers to keep them in sync.

- it helps us discover the fact that both sides share a lot of code.
  This opens an opportunity to merge them to a common implementation
  at a later stage.

While doing this work, care was also taken to discover which steps of
`(*Server).PreStart()` were *missing* from the tenant server
initialization. We found the following:

- the Sentry context enhancement (to report cluster ID, etc)
  was missing. This commit fixes that.

- several log entries that report the server configuration
  to the OPS channel were not emitted. This commit fixes that.

- the Graphite metric reporting was never enabled, even when
  configured. This commit fixes that.

- the Obs Service testing knobs (TestingKnobs.EventExporter) were
  not configured on the ObsServer instance. This commit fixes that.

- the `go.scheduler_latency` metric was not being measured.
  This commit fixes that.

Additionally, two followup issues were filed for the following
missing steps:

- missing support for the special file that blocks background jobs. (#90524)
- missing support for the system.eventlog cleanup loop. (#90521)

Epic: CRDB-14537

Co-authored-by: Raphael 'kena' Poss <[email protected]>
craig bot pushed a commit that referenced this pull request Nov 17, 2022
90789: server: use a common logic for system log gc r=dhartunian a=knz

Fixes #90521.
Subsumes #90523.
Subsumes #88766.
Informs #89453.
Informs #90790.

Prior to this patch, we had two problems:

- the code to GC system.eventlog was not running on secondary tenants.

  This was the original motivation for the work in this commit.

- there were 2 separate implementations of GC, one for system.eventlog/rangelog and another one for system.web_sessions.

  This was unfortunate because the first one is generic and was meant to be reused. Its implementation is just better all around: it is actually able to remove all rows when there are more than 1000 entries to delete at a time (this was a known limitation of the web_session cleanup code), and is optimized to avoid hitting tombstones when iterating.

This commit improves the situation by using the common, better implementation for both cases.

We acknowledge that with the advent of row-level TTL in SQL, all this would be better served by a conversion of these tables to use row-level TTL. However, the fix here (run GC in secondary tenants) needs to be backported, so we are shy to rely on the row-level TTL which didn't work as well in previous versions. The remainder of this work is described in #89453.

Release note (ops change): The cluster settings
`server.web_session.purge.period` and
`server.web_session.purge.max_deletions_per_cycle`, which were specific to the cleanup function for `system.web_sessions`, have been replaced by `server.log_gc.period` and
`server.log_gc.max_deletions_per_cycle` which apply to the cleanup function for `system.eventlog`, `system.rangelog` and `system.web_sessions` equally.

Release note (ops change): The cluster setting `server.web_session.auto_logout.timeout` has been removed. It had never been effective.


Co-authored-by: Raphael 'kena' Poss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: old entries in server.web_sessions not deleted for secondary tenants
5 participants