server: fix bottlenecks when purging system.web_sessions #88766

aadityasondhi · 2022-09-26T20:21:05Z

The previous purging limit was found to be too low which resulted in the
system.web_sessions table growing rapidly. This happened when a
customer-side automation script was generating a large number of
sessions. Additionally, we were using ORDER BY random() when deleting
expired sessions from the table which caused a full table scan of a large
table. This resulted in high cpu usage on every purge attempt with only very
few rows actually being deleted. Over time, this turns into a nonideal
feedback loop.

This change fixes the two issues by:

Increasing the defauilt purging limit to 1000 from 10.
Remove the ORDER BY random() when purging the table to avoid full
table scans on every purge.

support ticket: https://github.com/cockroachlabs/support/issues/1812

resolves #88622

Release note: None

cockroach-teamcity · 2022-09-26T20:21:13Z

This change is

dhartunian

@cameronnunez mind taking a look at this since you wrote this code. Wondering if there's something we should know about the default limit of 10 on all the purge queries.

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @dhartunian)

-- commits line 5 at r1:
Can you add a note here that this was due to automation/scripts on the customer-side. It's not immediately obvious why this table would grow rapidly.

pkg/server/purge_auth_session.go line 92 at r1 (raw file):

		deleteOldExpiredSessionsStmt = `
DELETE FROM system.web_sessions
WHERE "expiresAt" < $1

we always insert an expiration, and we index on the expiresAt column, so let's replace the order by random() with order by "expiresAt" asc on all 3 statements, to drop from the "end" of the list. that seems the most user-friendly. we might otherwise constantly drop the most recent ones.

aadityasondhi

Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @dhartunian)

-- commits line 5 at r1:

Previously, dhartunian (David Hartunian) wrote…

Can you add a note here that this was due to automation/scripts on the customer-side. It's not immediately obvious why this table would grow rapidly.

Done.

pkg/server/purge_auth_session.go line 92 at r1 (raw file):

Previously, dhartunian (David Hartunian) wrote…

we always insert an expiration, and we index on the expiresAt column, so let's replace the order by random() with order by "expiresAt" asc on all 3 statements, to drop from the "end" of the list. that seems the most user-friendly. we might otherwise constantly drop the most recent ones.

Done.

dhartunian

just a formatting nit. thanks.

Reviewed 1 of 1 files at r2.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi, @cameronnunez, and @dhartunian)

pkg/server/purge_auth_session.go line 93 at r2 (raw file):

DELETE FROM system.web_sessions
WHERE "expiresAt" < $1
order by "expiresAt" asc

nit: maintain capitalized style of SQL keywords in query.

rafiss · 2022-10-05T20:28:59Z

random related thought: should we have a future goal of using row-level TTL on this table?

aadityasondhi · 2022-10-05T21:18:33Z

TFTR!

bors r=dhartunian

aadityasondhi · 2022-10-05T21:20:07Z

bors r-

craig · 2022-10-05T21:20:08Z

Canceled.

aadityasondhi · 2022-10-05T23:07:23Z

random related thought: should we have a future goal of using row-level TTL on this table?

I don't see why not. Something we can look into for this table and perhaps other tables (if any) that we purge periodically. Created an issue to track it here: #89453

The previous purging limit was found to be too low which resulted in the `system.web_sessions` table growing rapidly. This happened when a customer-side automation script was generating a large number of sessions. Additionally, we were using `ORDER BY random()` when deleting expired sessions from the table which caused a full table scan of a large table. This resulted in high cpu usage on every purge attempt with only very few rows actually being deleted. Over time, this turns into a nonideal feedback loop. This change fixes the two issues by: 1. Increasing the defauilt purging limit to 1000 from 10. 2. Remove the `ORDER BY random()` when purging the table to avoid full table scans on every purge. support ticket: cockroachlabs/support#1812 resolves cockroachdb#88622 Release note: None

aadityasondhi · 2022-10-13T14:23:42Z

bors r=dhartunian

craig · 2022-10-13T15:55:28Z

Build succeeded:

Bazel Essential CI (Cockroach)

90789: server: use a common logic for system log gc r=dhartunian a=knz Fixes #90521. Subsumes #90523. Subsumes #88766. Informs #89453. Informs #90790. Prior to this patch, we had two problems: - the code to GC system.eventlog was not running on secondary tenants. This was the original motivation for the work in this commit. - there were 2 separate implementations of GC, one for system.eventlog/rangelog and another one for system.web_sessions. This was unfortunate because the first one is generic and was meant to be reused. Its implementation is just better all around: it is actually able to remove all rows when there are more than 1000 entries to delete at a time (this was a known limitation of the web_session cleanup code), and is optimized to avoid hitting tombstones when iterating. This commit improves the situation by using the common, better implementation for both cases. We acknowledge that with the advent of row-level TTL in SQL, all this would be better served by a conversion of these tables to use row-level TTL. However, the fix here (run GC in secondary tenants) needs to be backported, so we are shy to rely on the row-level TTL which didn't work as well in previous versions. The remainder of this work is described in #89453. Release note (ops change): The cluster settings `server.web_session.purge.period` and `server.web_session.purge.max_deletions_per_cycle`, which were specific to the cleanup function for `system.web_sessions`, have been replaced by `server.log_gc.period` and `server.log_gc.max_deletions_per_cycle` which apply to the cleanup function for `system.eventlog`, `system.rangelog` and `system.web_sessions` equally. Release note (ops change): The cluster setting `server.web_session.auto_logout.timeout` has been removed. It had never been effective. Co-authored-by: Raphael 'kena' Poss <[email protected]>

aadityasondhi requested a review from dhartunian September 26, 2022 20:21

aadityasondhi requested a review from a team as a code owner September 26, 2022 20:21

dhartunian reviewed Sep 27, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/fix-web-session-table-purging branch from 13e8f4e to de5b131 Compare September 29, 2022 14:57

aadityasondhi commented Sep 29, 2022

View reviewed changes

aadityasondhi requested review from cameronnunez and dhartunian October 3, 2022 14:43

dhartunian approved these changes Oct 4, 2022

View reviewed changes

aadityasondhi force-pushed the aadityas/fix-web-session-table-purging branch 3 times, most recently from a542cdb to 0d06d08 Compare October 5, 2022 14:36

aadityasondhi mentioned this pull request Oct 5, 2022

sql: use row-level TTL for rangelog, eventlog and web_sessions #89453

Open

aadityasondhi force-pushed the aadityas/fix-web-session-table-purging branch from 0d06d08 to 58f2b9f Compare October 12, 2022 20:01

craig bot merged commit 708b242 into cockroachdb:master Oct 13, 2022

aadityasondhi deleted the aadityas/fix-web-session-table-purging branch October 13, 2022 19:09

knz mentioned this pull request Oct 27, 2022

server: use a common logic for system log gc #90789

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

server: fix bottlenecks when purging system.web_sessions #88766

server: fix bottlenecks when purging system.web_sessions #88766

aadityasondhi commented Sep 26, 2022 •

edited

Loading

cockroach-teamcity commented Sep 26, 2022

dhartunian left a comment

aadityasondhi left a comment

dhartunian left a comment

rafiss commented Oct 5, 2022

aadityasondhi commented Oct 5, 2022

aadityasondhi commented Oct 5, 2022

craig bot commented Oct 5, 2022

aadityasondhi commented Oct 5, 2022

aadityasondhi commented Oct 13, 2022

craig bot commented Oct 13, 2022

server: fix bottlenecks when purging system.web_sessions #88766

server: fix bottlenecks when purging system.web_sessions #88766

Conversation

aadityasondhi commented Sep 26, 2022 • edited Loading

cockroach-teamcity commented Sep 26, 2022

dhartunian left a comment

Choose a reason for hiding this comment

aadityasondhi left a comment

Choose a reason for hiding this comment

dhartunian left a comment

Choose a reason for hiding this comment

rafiss commented Oct 5, 2022

aadityasondhi commented Oct 5, 2022

aadityasondhi commented Oct 5, 2022

craig bot commented Oct 5, 2022

aadityasondhi commented Oct 5, 2022

aadityasondhi commented Oct 13, 2022

craig bot commented Oct 13, 2022

aadityasondhi commented Sep 26, 2022 •

edited

Loading