-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
server: fix bottlenecks when purging system.web_sessions #88766
server: fix bottlenecks when purging system.web_sessions #88766
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@cameronnunez mind taking a look at this since you wrote this code. Wondering if there's something we should know about the default limit of 10
on all the purge queries.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @dhartunian)
-- commits
line 5 at r1:
Can you add a note here that this was due to automation/scripts on the customer-side. It's not immediately obvious why this table would grow rapidly.
pkg/server/purge_auth_session.go
line 92 at r1 (raw file):
deleteOldExpiredSessionsStmt = ` DELETE FROM system.web_sessions WHERE "expiresAt" < $1
we always insert an expiration, and we index on the expiresAt
column, so let's replace the order by random()
with order by "expiresAt" asc
on all 3 statements, to drop from the "end" of the list. that seems the most user-friendly. we might otherwise constantly drop the most recent ones.
13e8f4e
to
de5b131
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status:
complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)
Previously, dhartunian (David Hartunian) wrote…
Can you add a note here that this was due to automation/scripts on the customer-side. It's not immediately obvious why this table would grow rapidly.
Done.
pkg/server/purge_auth_session.go
line 92 at r1 (raw file):
Previously, dhartunian (David Hartunian) wrote…
we always insert an expiration, and we index on the
expiresAt
column, so let's replace theorder by random()
withorder by "expiresAt" asc
on all 3 statements, to drop from the "end" of the list. that seems the most user-friendly. we might otherwise constantly drop the most recent ones.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
just a formatting nit. thanks.
Reviewed 1 of 1 files at r2.
Reviewable status:complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @cameronnunez, and @dhartunian)
pkg/server/purge_auth_session.go
line 93 at r2 (raw file):
DELETE FROM system.web_sessions WHERE "expiresAt" < $1 order by "expiresAt" asc
nit: maintain capitalized style of SQL keywords in query.
a542cdb
to
0d06d08
Compare
random related thought: should we have a future goal of using row-level TTL on this table? |
TFTR! bors r=dhartunian |
bors r- |
Canceled. |
I don't see why not. Something we can look into for this table and perhaps other tables (if any) that we purge periodically. Created an issue to track it here: #89453 |
The previous purging limit was found to be too low which resulted in the `system.web_sessions` table growing rapidly. This happened when a customer-side automation script was generating a large number of sessions. Additionally, we were using `ORDER BY random()` when deleting expired sessions from the table which caused a full table scan of a large table. This resulted in high cpu usage on every purge attempt with only very few rows actually being deleted. Over time, this turns into a nonideal feedback loop. This change fixes the two issues by: 1. Increasing the defauilt purging limit to 1000 from 10. 2. Remove the `ORDER BY random()` when purging the table to avoid full table scans on every purge. support ticket: cockroachlabs/support#1812 resolves cockroachdb#88622 Release note: None
0d06d08
to
58f2b9f
Compare
bors r=dhartunian |
Build succeeded: |
90789: server: use a common logic for system log gc r=dhartunian a=knz Fixes #90521. Subsumes #90523. Subsumes #88766. Informs #89453. Informs #90790. Prior to this patch, we had two problems: - the code to GC system.eventlog was not running on secondary tenants. This was the original motivation for the work in this commit. - there were 2 separate implementations of GC, one for system.eventlog/rangelog and another one for system.web_sessions. This was unfortunate because the first one is generic and was meant to be reused. Its implementation is just better all around: it is actually able to remove all rows when there are more than 1000 entries to delete at a time (this was a known limitation of the web_session cleanup code), and is optimized to avoid hitting tombstones when iterating. This commit improves the situation by using the common, better implementation for both cases. We acknowledge that with the advent of row-level TTL in SQL, all this would be better served by a conversion of these tables to use row-level TTL. However, the fix here (run GC in secondary tenants) needs to be backported, so we are shy to rely on the row-level TTL which didn't work as well in previous versions. The remainder of this work is described in #89453. Release note (ops change): The cluster settings `server.web_session.purge.period` and `server.web_session.purge.max_deletions_per_cycle`, which were specific to the cleanup function for `system.web_sessions`, have been replaced by `server.log_gc.period` and `server.log_gc.max_deletions_per_cycle` which apply to the cleanup function for `system.eventlog`, `system.rangelog` and `system.web_sessions` equally. Release note (ops change): The cluster setting `server.web_session.auto_logout.timeout` has been removed. It had never been effective. Co-authored-by: Raphael 'kena' Poss <[email protected]>
The previous purging limit was found to be too low which resulted in the
system.web_sessions
table growing rapidly. This happened when acustomer-side automation script was generating a large number of
sessions. Additionally, we were using
ORDER BY random()
when deletingexpired sessions from the table which caused a full table scan of a large
table. This resulted in high cpu usage on every purge attempt with only very
few rows actually being deleted. Over time, this turns into a nonideal
feedback loop.
This change fixes the two issues by:
ORDER BY random()
when purging the table to avoid fulltable scans on every purge.
support ticket: https://github.com/cockroachlabs/support/issues/1812
resolves #88622
Release note: None