-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: start purging web_sessions in secondary tenants #90523
Conversation
cc @dhartunian - who in your team is best to review this? |
drive-by to link this related issue, which might help eliminate some complexity here in the long term: #89453 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: complete! 1 of 0 LGTMs obtained
TFYR! bors r=aadityasondhi |
Build failed: |
7e42516
to
4053cd4
Compare
bors r=aadityasondhi,stevendanna |
Build failed: |
Release note: None
Release note: None
4053cd4
to
20fb7e2
Compare
@rafiss This PR started failing on TestComposeGSSPython since yesterday, without any change in the code. Do you have an inkling of what changed? |
No I don't know anything that changed. I'll try looking into this. Also, does this need a backport? |
we have an internal slack thread
not that i know |
My concern was that our serverless customers (all of which are going to be on v22.2) could potentially see unbounded growth of this table. See https://github.com/cockroachlabs/support/issues/1812 for an example of how that led to issues on a self-hosted cluster. I'm worried that if it leads to CPU usage issues in serverless, then we would charge a customer for RUs that they cannot control. |
oh yeah I can see this too. So your recommendation would be to backport? (We have the same problem with system.eventlog btw. There's no PR for that yet.) |
Yes, I vote for backporting - v22.2 seems important since we'll soon upgrade serverless to that version, but might as well do v22.1 if it cherry-picks cleanly since that's what serverless is using right now. Is there an issue for the eventlog one? If not, would you mind filing? |
it's this issue #90521 |
bors r=aadityasondhi,stevendanna |
Build succeeded: |
90384: server: split the tenant server creation into 3 stages r=dhartunian a=knz Needed as prerequisite to #90176, towards fixing #89974. First two commits from #90523. The original motivation (and ultimate goal) for this PR is to split the tenant server initialization into three phases: `New()`, `PreStart()`, `AcceptClients()`, so as to reuse a common process startup logic in a separate PR (#90176). To achieve this goal, this PR re-orders the initialization steps in `server.NewTenantServer` (previously known as `server.StartTenant`), and extracts many of them into a new method `(*SQLServerWrapper).PreStart()`. The specific order of the code in `NewTenantServer()` and `(*SQLServerWrapper).PreStart()` was chosen to mirror the order of things in `NewServer()` and `(*Server).PreStart()`. Reason for using the same order: - it makes the review of this change easier: the reviewer can pull `server.go` and `tenant.go` and read them side-by-side, to satisfy themselves that the two implementations of `NewServer`/`NewTenantServer` and `PreStart` are equivalent. - it will make it easier for future maintainers to keep them in sync. - it helps us discover the fact that both sides share a lot of code. This opens an opportunity to merge them to a common implementation at a later stage. While doing this work, care was also taken to discover which steps of `(*Server).PreStart()` were *missing* from the tenant server initialization. We found the following: - the Sentry context enhancement (to report cluster ID, etc) was missing. This commit fixes that. - several log entries that report the server configuration to the OPS channel were not emitted. This commit fixes that. - the Graphite metric reporting was never enabled, even when configured. This commit fixes that. - the Obs Service testing knobs (TestingKnobs.EventExporter) were not configured on the ObsServer instance. This commit fixes that. - the `go.scheduler_latency` metric was not being measured. This commit fixes that. Additionally, two followup issues were filed for the following missing steps: - missing support for the special file that blocks background jobs. (#90524) - missing support for the system.eventlog cleanup loop. (#90521) Epic: CRDB-14537 Co-authored-by: Raphael 'kena' Poss <[email protected]>
90789: server: use a common logic for system log gc r=dhartunian a=knz Fixes #90521. Subsumes #90523. Subsumes #88766. Informs #89453. Informs #90790. Prior to this patch, we had two problems: - the code to GC system.eventlog was not running on secondary tenants. This was the original motivation for the work in this commit. - there were 2 separate implementations of GC, one for system.eventlog/rangelog and another one for system.web_sessions. This was unfortunate because the first one is generic and was meant to be reused. Its implementation is just better all around: it is actually able to remove all rows when there are more than 1000 entries to delete at a time (this was a known limitation of the web_session cleanup code), and is optimized to avoid hitting tombstones when iterating. This commit improves the situation by using the common, better implementation for both cases. We acknowledge that with the advent of row-level TTL in SQL, all this would be better served by a conversion of these tables to use row-level TTL. However, the fix here (run GC in secondary tenants) needs to be backported, so we are shy to rely on the row-level TTL which didn't work as well in previous versions. The remainder of this work is described in #89453. Release note (ops change): The cluster settings `server.web_session.purge.period` and `server.web_session.purge.max_deletions_per_cycle`, which were specific to the cleanup function for `system.web_sessions`, have been replaced by `server.log_gc.period` and `server.log_gc.max_deletions_per_cycle` which apply to the cleanup function for `system.eventlog`, `system.rangelog` and `system.web_sessions` equally. Release note (ops change): The cluster setting `server.web_session.auto_logout.timeout` has been removed. It had never been effective. Co-authored-by: Raphael 'kena' Poss <[email protected]>
Fixes #90522.
The next PR #90384 will ensure this code lands in the right place.