Adding a scaled sort, to boost smaller communities. #3907
@@ -100,6 +100,8 @@ pub struct PostAggregates {
  pub community_id: CommunityId,
  pub creator_id: PersonId,
  pub controversy_rank: f64,
  /// A rank that amplifies smaller communities
  pub scaled_rank: i32,
I think this value should be a float? hot_rank, for example, really quickly goes to zero (after just ~2 days mostly), and all information of values between 0 and 1, and between 1 and 2, is lost. Really, hot_rank should also be a float imo, but that's maybe out of scope here.

That's true, these should all probably be made into floats (not sure why I didn't want to use floats originally), could you open an issue for that? Let's do that as a separate issue. Also, being a "news" type sort, the time decay is supposed to go to zero after ~2 days, after which published descending takes over.

Mh, that's interesting. Then maybe we can just keep hot_rank as is, but make the new scaled_rank a float? I think it would make sense to make the new scaled_rank function and column a float from the start, instead of migrating later. There's no code change required for making this a float while keeping hot_rank an int.

Maybe I should just change hot_rank to a float then as a part of this, since scaled_rank depends on it.

Sounds good. Note it will reduce the performance of the scheduled task a fair bit, since afterwards almost no post will have a hot rank of 0 and be filtered out (ideas @sunaurus?).

That's def a concern... We could either use the published timestamp as a filter, or a minimum hot_rank threshold.

I haven't had a chance to look at the implementation yet, so sorry if this is a dumb idea, but can we not just update the rank functions to set the rank to 0 after it decays to some threshold, to guarantee that the majority of posts will still have a 0 value for rank and thus get filtered out during updates?

I mean something like: if the rank is below 1.0 (or perhaps even higher!), then just set it to 0.

Yes, that should work. But it would need careful consideration of what the threshold should be, both for hot_rank and scaled_rank. According to dess above, the published sort is supposed to take over after a few days. So maybe it would be better to make the function return zero if published is more than 7 days in the past? The hot_rank function gets ..., so that change would just be to replace ...
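For illustration only, here is a minimal sketch of the cutoff idea floated at the end of this thread, assuming the existing hot_rank(score, published) function; the function name, return type, and 7-day window below are hypothetical, not part of this PR:

```sql
-- Hypothetical sketch: return zero once a post is older than some cutoff, so the
-- vast majority of rows keep a rank of 0 and stay filtered out of the scheduled
-- batch updates, even if the rank columns become floats.
-- Note: because it calls now(), this variant can only be STABLE, not IMMUTABLE
-- like the existing rank functions.
CREATE OR REPLACE FUNCTION hot_rank_with_cutoff (score numeric, published timestamp without time zone)
    RETURNS double precision
    AS $$
BEGIN
    IF published < now()::timestamp - interval '7 days' THEN
        RETURN 0;
    END IF;
    RETURN hot_rank (score, published);
END;
$$
LANGUAGE plpgsql STABLE PARALLEL SAFE;
```

The simpler alternative from the thread, clamping any rank below some threshold like 1.0 to 0 inside the existing functions, would keep them IMMUTABLE.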
}

#[derive(PartialEq, Eq, Debug, Serialize, Deserialize, Clone)]
@@ -0,0 +1,52 @@
DROP FUNCTION scaled_rank;

ALTER TABLE post_aggregates
    DROP COLUMN scaled_rank;

-- The following code is necessary because postgres can't remove
-- a single enum value.
ALTER TABLE local_user
    ALTER default_sort_type DROP DEFAULT;

UPDATE
    local_user
SET
    default_sort_type = 'Hot'
WHERE
    default_sort_type = 'Scaled';

-- rename the old enum
ALTER TYPE sort_type_enum RENAME TO sort_type_enum__;

-- create the new enum
CREATE TYPE sort_type_enum AS ENUM (
    'Active',
    'Hot',
    'New',
    'Old',
    'TopDay',
    'TopWeek',
    'TopMonth',
    'TopYear',
    'TopAll',
    'MostComments',
    'NewComments',
    'TopHour',
    'TopSixHour',
    'TopTwelveHour',
    'TopThreeMonths',
    'TopSixMonths',
    'TopNineMonths'
);

-- alter all your enum columns
ALTER TABLE local_user
    ALTER COLUMN default_sort_type TYPE sort_type_enum
    USING default_sort_type::text::sort_type_enum;

ALTER TABLE local_user
    ALTER default_sort_type SET DEFAULT 'Active';

-- drop the old enum
DROP TYPE sort_type_enum__;
@@ -0,0 +1,28 @@
CREATE OR REPLACE FUNCTION scaled_rank (score numeric, published timestamp without time zone, users_active_month numeric)
    RETURNS integer
    AS $$
BEGIN
    -- Add 2 to avoid divide by zero errors
    -- Use 0.1 to lessen the initial sharp decline at a hot_rank ~ 300
I tried factors of 1, 0.1, and 0.01 using a graphing calculator and some regular hot ranks, and compared them across various community sizes. Using smaller numbers lessens the initial sharp decline sensitivity.
    -- Default for score = 1, active users = 1, and now, is 742
    RETURN (hot_rank (score, published) / log(2 + 0.1 * users_active_month))::integer;
END;
$$
LANGUAGE plpgsql
IMMUTABLE PARALLEL SAFE;

ALTER TABLE post_aggregates
    ADD COLUMN scaled_rank integer NOT NULL DEFAULT 742;

CREATE INDEX idx_post_aggregates_featured_community_scaled ON post_aggregates (featured_community DESC, scaled_rank DESC, published DESC);

CREATE INDEX idx_post_aggregates_featured_local_scaled ON post_aggregates (featured_local DESC, scaled_rank DESC, published DESC);

-- We forgot to add the controversial sort type
ALTER TYPE sort_type_enum
    ADD VALUE 'Controversial';

-- Add the Scaled enum
ALTER TYPE sort_type_enum
    ADD VALUE 'Scaled';
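To get a rough feel for the 0.1 factor discussed in the review comment above, here are a few hypothetical spot checks (the community sizes are made up; Postgres's log() is base 10):

```sql
-- The denominator grows very slowly with community size, which is what gives
-- small communities their boost.
SELECT log(2 + 0.1 * 1);      -- ~0.32
SELECT log(2 + 0.1 * 1000);   -- ~2.01
SELECT log(2 + 0.1 * 50000);  -- ~3.70

-- The rank itself for a brand-new post with score = 1; the first case is the
-- score = 1, users = 1 scenario behind the 742 column default above.
SELECT scaled_rank (1, now()::timestamp, 1);
SELECT scaled_rank (1, now()::timestamp, 50000);
```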
@@ -154,22 +154,16 @@ fn startup_jobs(db_url: &str) {
fn update_hot_ranks(conn: &mut PgConnection) {
  info!("Updating hot ranks for all history...");

  process_hot_ranks_in_batches(
    conn,
    "post_aggregates",
    "a.hot_rank != 0 OR a.hot_rank_active != 0",
    "SET hot_rank = hot_rank(a.score, a.published),
     hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)",
  );
  process_post_aggregates_ranks_in_batches(conn);

  process_hot_ranks_in_batches(
  process_ranks_in_batches(
    conn,
    "comment_aggregates",
    "a.hot_rank != 0",
    "SET hot_rank = hot_rank(a.score, a.published)",
  );

  process_hot_ranks_in_batches(
  process_ranks_in_batches(
    conn,
    "community_aggregates",
    "a.hot_rank != 0",
@@ -189,7 +183,7 @@ struct HotRanksUpdateResult {
/// In `where_clause` and `set_clause`, "a" will refer to the current aggregates table.
/// Locked rows are skipped in order to prevent deadlocks (they will likely get updated on the next
/// run)
fn process_hot_ranks_in_batches(
fn process_ranks_in_batches(
  conn: &mut PgConnection,
  table_name: &str,
  where_clause: &str,
@@ -238,6 +232,52 @@ fn process_hot_ranks_in_batches(
  );
}

/// Post aggregates is a special case, since it needs to join to the community_aggregates
/// table, to get the active monthly user counts.
fn process_post_aggregates_ranks_in_batches(conn: &mut PgConnection) {
  let process_start_time = NaiveDateTime::from_timestamp_opt(0, 0).expect("0 timestamp creation");

  let update_batch_size = 1000; // Bigger batches than this tend to cause seq scans
  let mut processed_rows_count = 0;
  let mut previous_batch_result = Some(process_start_time);
  while let Some(previous_batch_last_published) = previous_batch_result {
    let result = sql_query(
      r#"WITH batch AS (SELECT pa.id
             FROM post_aggregates pa
             WHERE pa.published > $1
             AND (pa.hot_rank != 0 OR pa.hot_rank_active != 0 OR pa.scaled_rank != 0)
For performance, either the idx_post_aggregates_nonzero_hotrank index should be replaced and conditioned on this condition, or the scaled_rank != 0 check should be removed. Removing it should not impact the output, I think, because with these being floats, scaled_rank will only be 0 if hot_rank is also 0.

That sounds right, and you're correct that it does seem pointless to change that index.
             ORDER BY pa.published
             LIMIT $2
             FOR UPDATE SKIP LOCKED)
         UPDATE post_aggregates pa
         SET hot_rank = hot_rank(pa.score, pa.published),
             hot_rank_active = hot_rank(pa.score, pa.newest_comment_time_necro),
             scaled_rank = scaled_rank(pa.score, pa.published, ca.users_active_month)
         FROM batch, community_aggregates ca
         WHERE pa.id = batch.id and pa.community_id = ca.community_id RETURNING pa.published;
Just as a note: with this change it might make sense to do the post ordering / batch selection with ORDER BY (community_id, published), so that the join on community_aggregates is less expensive. But it's probably not worth the extra pagination complexity if we assume that the whole communities table and its indexes will be in memory in any case.
      "#,
    )
    .bind::<Timestamp, _>(previous_batch_last_published)
    .bind::<Integer, _>(update_batch_size)
    .get_results::<HotRanksUpdateResult>(conn);

    match result {
      Ok(updated_rows) => {
        processed_rows_count += updated_rows.len();
        previous_batch_result = updated_rows.last().map(|row| row.published);
      }
      Err(e) => {
        error!("Failed to update {} hot_ranks: {}", "post_aggregates", e);
        break;
      }
    }
  }
  info!(
    "Finished process_hot_ranks_in_batches execution for {} (processed {} rows)",
    "post_aggregates", processed_rows_count
  );
}
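On the index question raised in the comment above, one option is a partial index whose predicate mirrors the batch query's filter. A rough sketch, assuming idx_post_aggregates_nonzero_hotrank is currently a partial index on published over the non-zero hot ranks (its existing definition is not shown in this diff):

```sql
-- Hypothetical replacement for idx_post_aggregates_nonzero_hotrank, conditioned
-- on the same predicate the batch query filters on, so the planner can keep
-- using an index scan for the batch selection.
DROP INDEX IF EXISTS idx_post_aggregates_nonzero_hotrank;

CREATE INDEX idx_post_aggregates_nonzero_hotrank ON post_aggregates (published)
WHERE
    hot_rank != 0 OR hot_rank_active != 0 OR scaled_rank != 0;
```

The other option from the comment, dropping the scaled_rank != 0 check from the query, avoids touching the index at all.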
fn delete_expired_captcha_answers(conn: &mut PgConnection) {
  diesel::delete(
    captcha_answer::table.filter(captcha_answer::published.lt(now - IntervalDsl::minutes(10))),
I can probably move this query down into the update statement, to avoid the round-trip cost.
edit: tried and failed at this.