Partition JobImpression and UserEvent tables #309
Comments
Engine Yard has a nice blog post on PostgreSQL partitioning, including a trigger that creates child tables automatically.
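For reference, a minimal sketch of that trigger-based approach (table and column names assumed from this issue; the Engine Yard post's actual code may differ): an insert trigger routes each row into a monthly child table, creating it on first use.

```sql
-- Sketch, assuming a job_impression parent table with a datetime column.
CREATE OR REPLACE FUNCTION job_impression_insert_trigger()
RETURNS trigger AS $$
DECLARE
    child text := 'job_impression_' || to_char(NEW.datetime, 'YYYY_MM');
BEGIN
    -- Create this month's child table if it doesn't exist yet
    EXECUTE format(
        'CREATE TABLE IF NOT EXISTS %I (
             CHECK (datetime >= %L AND datetime < %L)
         ) INHERITS (job_impression)',
        child,
        date_trunc('month', NEW.datetime),
        date_trunc('month', NEW.datetime) + interval '1 month'
    );
    -- Route the row into the child table instead of the parent
    EXECUTE format('INSERT INTO %I VALUES ($1.*)', child) USING NEW;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER job_impression_partition
    BEFORE INSERT ON job_impression
    FOR EACH ROW EXECUTE PROCEDURE job_impression_insert_trigger();
```

The `CREATE TABLE IF NOT EXISTS` inside the trigger avoids a separate cron job for making partitions, at the cost of a lock race if two inserts hit a new month simultaneously.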
One expectation with partitioning is that old data that is no longer needed can be exported to a backup and removed from the database. In our case we still read old data occasionally from permalinks, particularly for viewcounts. The only answer to that problem is to cache this data permanently in another table (rather than in Redis).
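A permanent cache table along those lines could look like this (a sketch with assumed table and column names, not a settled schema; `ON CONFLICT` needs PostgreSQL 9.5+):

```sql
-- Hypothetical summary table, refreshed before old partitions are
-- archived and dropped, so permalink viewcounts survive.
CREATE TABLE IF NOT EXISTS jobpost_viewcounts (
    jobpost_id   integer PRIMARY KEY,
    impressions  bigint NOT NULL,
    last_updated timestamp NOT NULL DEFAULT now()
);

INSERT INTO jobpost_viewcounts (jobpost_id, impressions)
SELECT jobpost_id, count(*)
FROM job_impression
GROUP BY jobpost_id
ON CONFLICT (jobpost_id)
DO UPDATE SET impressions = EXCLUDED.impressions,
              last_updated = now();
```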
Can we find out the maximum interval between two job impressions after a job has expired? That is, how long was a job stale before someone opened it via a permalink again, and how often does that occur with, say, an interval of more than 90 days, giving two months of grace period once a job has expired? If it's not that frequent, we could move those records to a separate table periodically, and when querying by permalink, check that other table for impression data only if the job's expiry date is older than three months, not every time. Also, how important is the viewcount for a job that's been expired for more than three months? Do we really need to show it at the cost of query time? Until we figure out a nice way to deal with this, we could disable showing view counts for those jobs. Every job view must be suffering from querying this large a table.
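A query to answer that question might look like the following (a sketch assuming `jobpost.datetime + interval '30 days'` marks expiry, and assumed column names; `FILTER` needs PostgreSQL 9.4+):

```sql
-- How long after expiry do impressions still arrive, and how many
-- jobs get one beyond a 90-day grace period after expiry?
SELECT max(ji.datetime - (jp.datetime + interval '30 days')) AS max_gap,
       count(DISTINCT jp.id) FILTER (
           WHERE ji.datetime > jp.datetime + interval '120 days'
       ) AS jobs_impressed_after_grace
FROM job_impression ji
JOIN jobpost jp ON jp.id = ji.jobpost_id
WHERE ji.datetime > jp.datetime + interval '30 days';
```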
The pg_partman extension for PostgreSQL moves all the hard logic into PostgreSQL itself. We should use it.
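With pg_partman, setup reduces to roughly this (a sketch; the `create_parent` argument names and accepted values vary between pg_partman versions, so check the installed version's docs):

```sql
-- Install the extension, then hand partition maintenance to pg_partman.
CREATE SCHEMA IF NOT EXISTS partman;
CREATE EXTENSION IF NOT EXISTS pg_partman SCHEMA partman;

SELECT partman.create_parent(
    p_parent_table := 'public.job_impression',
    p_control      := 'datetime',
    p_type         := 'native',
    p_interval     := 'monthly'
);
```

pg_partman then pre-creates future partitions and can drop or detach old ones on a retention schedule, which fits the export-and-remove expectation above.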
Current stats:
Hasjob's JobImpression table has become a bottleneck in production. It, along with the UserEvent table, is now the biggest consumer of disk space. From production (using the query from this article):
(`user_event` would have been larger but is not indexed for now.)

PostgreSQL's documentation recommends range partitioning in such timestamp-sensitive scenarios so that indexes are smaller and inserts are more efficient. This will be somewhat cumbersome to implement as we'll no longer have SQLAlchemy's elegant abstractions, but it shouldn't be too hard to manage and will be more or less transparent from within code.
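The article's exact query isn't reproduced above; a common variant for per-table disk usage looks like this:

```sql
-- Largest tables by total size (table + indexes + TOAST).
SELECT c.relname,
       pg_size_pretty(pg_total_relation_size(c.oid)) AS total_size
FROM pg_class c
JOIN pg_namespace n ON n.oid = c.relnamespace
WHERE n.nspname = 'public' AND c.relkind = 'r'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 10;
```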
One consequence of using this mechanism is that (a) we can no longer use `session_id` in a unique constraint, as that may span partition borders and will slow down inserts, and (b) all read queries for a given session will need to add `job_impression.datetime >= session.created_at AND job_impression.datetime <= COALESCE(session.ended_at, NOW())` so that the query planner limits which partitions are read from. Session sweeping as described in #221 becomes important now.

However, the JobImpression table is typically queried by `jobpost_id` and not `session_id`, which makes partition-specific queries somewhat trickier. Since jobs are supposed to expire after 30 days and will not be impressed thereafter (except in the unpublicised archive mode), these queries could be bounded to between `jobpost.created_at` and `jobpost.datetime + interval '30 days'` (for a total period that may exceed 30 days, depending on how long it took for the draft to be published and on manual updates to the `datetime` column, as occasionally done for customer service).
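Putting the two together, a sketch of the partitioned table and a bounded read (assumed column names; declarative `PARTITION BY RANGE` needs PostgreSQL 10+, and the partition key must be part of any unique constraint):

```sql
CREATE TABLE job_impression (
    jobpost_id integer NOT NULL,
    session_id integer NOT NULL,
    datetime   timestamp NOT NULL
) PARTITION BY RANGE (datetime);

CREATE TABLE job_impression_2016_01 PARTITION OF job_impression
    FOR VALUES FROM ('2016-01-01') TO ('2016-02-01');

-- Viewcount for one job, bounded to the job's active window so the
-- scan touches only the relevant partitions.
SELECT count(*)
FROM job_impression ji
JOIN jobpost jp ON jp.id = ji.jobpost_id
WHERE jp.id = :jobpost_id
  AND ji.datetime >= jp.created_at
  AND ji.datetime <  jp.datetime + interval '30 days';
```

Because the bounds come from a joined row rather than constants, pruning happens at execution time rather than plan time, but the date restriction still keeps old partitions out of the scan.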