Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

server: fix bottlenecks when purging system.web_sessions #88766

Merged

Conversation

aadityasondhi
Copy link
Collaborator

@aadityasondhi aadityasondhi commented Sep 26, 2022

The previous purging limit was found to be too low which resulted in the
system.web_sessions table growing rapidly. This happened when a
customer-side automation script was generating a large number of
sessions. Additionally, we were using ORDER BY random() when deleting
expired sessions from the table which caused a full table scan of a large
table. This resulted in high cpu usage on every purge attempt with only very
few rows actually being deleted. Over time, this turns into a nonideal
feedback loop.

This change fixes the two issues by:

  1. Increasing the defauilt purging limit to 1000 from 10.
  2. Remove the ORDER BY random() when purging the table to avoid full
    table scans on every purge.

support ticket: https://github.com/cockroachlabs/support/issues/1812

resolves #88622

Release note: None

@aadityasondhi aadityasondhi requested a review from a team as a code owner September 26, 2022 20:21
@cockroach-teamcity
Copy link
Member

This change is Reviewable

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cameronnunez mind taking a look at this since you wrote this code. Wondering if there's something we should know about the default limit of 10 on all the purge queries.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @dhartunian)


-- commits line 5 at r1:
Can you add a note here that this was due to automation/scripts on the customer-side. It's not immediately obvious why this table would grow rapidly.


pkg/server/purge_auth_session.go line 92 at r1 (raw file):

		deleteOldExpiredSessionsStmt = `
DELETE FROM system.web_sessions
WHERE "expiresAt" < $1

we always insert an expiration, and we index on the expiresAt column, so let's replace the order by random() with order by "expiresAt" asc on all 3 statements, to drop from the "end" of the list. that seems the most user-friendly. we might otherwise constantly drop the most recent ones.

@aadityasondhi aadityasondhi force-pushed the aadityas/fix-web-session-table-purging branch from 13e8f4e to de5b131 Compare September 29, 2022 14:57
Copy link
Collaborator Author

@aadityasondhi aadityasondhi left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)


-- commits line 5 at r1:

Previously, dhartunian (David Hartunian) wrote…

Can you add a note here that this was due to automation/scripts on the customer-side. It's not immediately obvious why this table would grow rapidly.

Done.


pkg/server/purge_auth_session.go line 92 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

we always insert an expiration, and we index on the expiresAt column, so let's replace the order by random() with order by "expiresAt" asc on all 3 statements, to drop from the "end" of the list. that seems the most user-friendly. we might otherwise constantly drop the most recent ones.

Done.

Copy link
Collaborator

@dhartunian dhartunian left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: just a formatting nit. thanks.

Reviewed 1 of 1 files at r2.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @cameronnunez, and @dhartunian)


pkg/server/purge_auth_session.go line 93 at r2 (raw file):

DELETE FROM system.web_sessions
WHERE "expiresAt" < $1
order by "expiresAt" asc

nit: maintain capitalized style of SQL keywords in query.

@aadityasondhi aadityasondhi force-pushed the aadityas/fix-web-session-table-purging branch 3 times, most recently from a542cdb to 0d06d08 Compare October 5, 2022 14:36
@rafiss
Copy link
Collaborator

rafiss commented Oct 5, 2022

random related thought: should we have a future goal of using row-level TTL on this table?

@aadityasondhi
Copy link
Collaborator Author

TFTR!

bors r=dhartunian

@aadityasondhi
Copy link
Collaborator Author

bors r-

@craig
Copy link
Contributor

craig bot commented Oct 5, 2022

Canceled.

@aadityasondhi
Copy link
Collaborator Author

random related thought: should we have a future goal of using row-level TTL on this table?

I don't see why not. Something we can look into for this table and perhaps other tables (if any) that we purge periodically. Created an issue to track it here: #89453

The previous purging limit was found to be too low which resulted in the
`system.web_sessions` table growing rapidly. This happened when a
customer-side automation script was generating a large number of
sessions. Additionally, we were using `ORDER BY random()` when deleting
expired sessions from the table which caused a full table scan of a large
table. This resulted in high cpu usage on every purge attempt with only very
few rows actually being deleted. Over time, this turns into a nonideal
feedback loop.

This change fixes the two issues by:
1. Increasing the defauilt purging limit to 1000 from 10.
2. Remove the `ORDER BY random()` when purging the table to avoid full
   table scans on every purge.

support ticket: cockroachlabs/support#1812

resolves cockroachdb#88622

Release note: None
@aadityasondhi aadityasondhi force-pushed the aadityas/fix-web-session-table-purging branch from 0d06d08 to 58f2b9f Compare October 12, 2022 20:01
@aadityasondhi
Copy link
Collaborator Author

bors r=dhartunian

@craig
Copy link
Contributor

craig bot commented Oct 13, 2022

Build succeeded:

@craig craig bot merged commit 708b242 into cockroachdb:master Oct 13, 2022
@aadityasondhi aadityasondhi deleted the aadityas/fix-web-session-table-purging branch October 13, 2022 19:09
craig bot pushed a commit that referenced this pull request Nov 17, 2022
90789: server: use a common logic for system log gc r=dhartunian a=knz

Fixes #90521.
Subsumes #90523.
Subsumes #88766.
Informs #89453.
Informs #90790.

Prior to this patch, we had two problems:

- the code to GC system.eventlog was not running on secondary tenants.

  This was the original motivation for the work in this commit.

- there were 2 separate implementations of GC, one for system.eventlog/rangelog and another one for system.web_sessions.

  This was unfortunate because the first one is generic and was meant to be reused. Its implementation is just better all around: it is actually able to remove all rows when there are more than 1000 entries to delete at a time (this was a known limitation of the web_session cleanup code), and is optimized to avoid hitting tombstones when iterating.

This commit improves the situation by using the common, better implementation for both cases.

We acknowledge that with the advent of row-level TTL in SQL, all this would be better served by a conversion of these tables to use row-level TTL. However, the fix here (run GC in secondary tenants) needs to be backported, so we are shy to rely on the row-level TTL which didn't work as well in previous versions. The remainder of this work is described in #89453.

Release note (ops change): The cluster settings
`server.web_session.purge.period` and
`server.web_session.purge.max_deletions_per_cycle`, which were specific to the cleanup function for `system.web_sessions`, have been replaced by `server.log_gc.period` and
`server.log_gc.max_deletions_per_cycle` which apply to the cleanup function for `system.eventlog`, `system.rangelog` and `system.web_sessions` equally.

Release note (ops change): The cluster setting `server.web_session.auto_logout.timeout` has been removed. It had never been effective.


Co-authored-by: Raphael 'kena' Poss <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

server: perf issues when purging system.web_sessions
4 participants