Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Adding a scaled sort, to boost smaller communities. #3907

Merged
merged 11 commits into from
Sep 6, 2023
2 changes: 1 addition & 1 deletion crates/apub/src/activities/create_or_update/post.rs
Original file line number Diff line number Diff line change
Expand Up @@ -150,7 +150,7 @@ impl ActivityHandler for CreateOrUpdatePage {
PostLike::like(&mut context.pool(), &like_form).await?;

// Calculate initial hot_rank for post
PostAggregates::update_hot_rank(&mut context.pool(), post.id).await?;
PostAggregates::update_ranks(&mut context.pool(), post.id).await?;

Ok(())
}
Expand Down
27 changes: 23 additions & 4 deletions crates/db_schema/src/aggregates/post_aggregates.rs
Original file line number Diff line number Diff line change
@@ -1,10 +1,14 @@
use crate::{
aggregates::structs::PostAggregates,
newtypes::PostId,
schema::post_aggregates,
utils::{functions::hot_rank, get_conn, DbPool},
schema::{community_aggregates, post, post_aggregates},
utils::{
functions::{hot_rank, scaled_rank},
get_conn,
DbPool,
},
};
use diesel::{result::Error, ExpressionMethods, QueryDsl};
use diesel::{result::Error, ExpressionMethods, JoinOnDsl, QueryDsl};
use diesel_async::RunQueryDsl;

impl PostAggregates {
Expand All @@ -16,9 +20,19 @@ impl PostAggregates {
.await
}

pub async fn update_hot_rank(pool: &mut DbPool<'_>, post_id: PostId) -> Result<Self, Error> {
pub async fn update_ranks(pool: &mut DbPool<'_>, post_id: PostId) -> Result<Self, Error> {
let conn = &mut get_conn(pool).await?;

// Diesel can't update based on a join, which is necessary for the scaled_rank
// https://github.com/diesel-rs/diesel/issues/1478
// Just select the users_active_month manually for now, since its a single post anyway
let users_active_month = community_aggregates::table
Copy link
Member Author

@dessalines dessalines Aug 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can probably move this query down into the update statement, to avoid the round-trip cost.

edit: tried and failed at this.

.select(community_aggregates::users_active_month)
.inner_join(post::table.on(community_aggregates::community_id.eq(post::community_id)))
.filter(post::id.eq(post_id))
.first::<i64>(conn)
.await?;

diesel::update(post_aggregates::table)
.filter(post_aggregates::post_id.eq(post_id))
.set((
Expand All @@ -27,6 +41,11 @@ impl PostAggregates {
post_aggregates::score,
post_aggregates::newest_comment_time_necro,
)),
post_aggregates::scaled_rank.eq(scaled_rank(
post_aggregates::score,
post_aggregates::published,
users_active_month,
)),
))
.get_result::<Self>(conn)
.await
Expand Down
2 changes: 2 additions & 0 deletions crates/db_schema/src/aggregates/structs.rs
Original file line number Diff line number Diff line change
Expand Up @@ -100,6 +100,8 @@ pub struct PostAggregates {
pub community_id: CommunityId,
pub creator_id: PersonId,
pub controversy_rank: f64,
/// A rank that amplifies smaller communities
pub scaled_rank: i32,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this value should be a float? Hot_rank for example really quickly goes to zero (after just ~ 2 days mostly) and all information of values between 0 and 1 and 1 and 2 is lost.

For example, right now select scaled_rank(80, '2023-08-20', 10000); and select scaled_rank(1000, '2023-08-20', 10000); both have the exact same rank even though one has > 10 times the votes!

Really hot rank should also be a float imo but that's maybe out of scope here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true, these all should probably be made into floats (not sure why I didn't want to use floats originally), could you open an issue for that? Lets do that as a separate issue.

Also being a "news" type sort, the time decay is supposed to go to zero after ~ 2 days, after which published descending takes over.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Mh, that's interesting. Then maybe we can just keep hot_rank as is but make the new scaled_rank a float? I think it would make sense to make the new scaled_rank function and column float from the start instead of migration later.

There's no code change required for making this a float but keeping hot_rank an int

Copy link
Member Author

@dessalines dessalines Aug 24, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe I should just change hot_rank to a float then as a part of this, since scaled_rank depends on it.

Copy link
Collaborator

@phiresky phiresky Aug 25, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. Note it will reduce the performance of the scheduled task a fair bit since afterwards almost no post will have a hot rank of 0 and be filtered out (ideas @sunaurus? )

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's def a concern... We could either use the published timestamp as a filter, or minimum hot_rank threshold.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't had a chance to look at the implementation yet, so sorry if this is a dumb idea, but can we not just update the rank functions to set the rank 0 after it decays to some threshold, to guarantee that the majority of posts will still have a 0 value for rank and thus get filtered out during updates?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean something like if the rank is below 1.0 (or perhaps even higher!) then just set it to 0

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that should work. But it would need careful consideration what the threshold should be both for hot_rank and scaled_rank. According to dess above, the published sort is supposed to take over after a few days. So maybe it would be better to make the function return zero if published is more than 7 days in the past? The hot_rank function gets published as a parameter anyways.

So that change would just be to replace IF (hours_diff > 0) THEN with IF (hours_diff > 0 AND hours_diff < 24 * 7) THEN

}

#[derive(PartialEq, Eq, Debug, Serialize, Deserialize, Clone)]
Expand Down
2 changes: 2 additions & 0 deletions crates/db_schema/src/lib.rs
Original file line number Diff line number Diff line change
Expand Up @@ -52,6 +52,7 @@ use ts_rs::TS;
)]
#[cfg_attr(feature = "full", DbValueStyle = "verbatim")]
#[cfg_attr(feature = "full", ts(export))]
// TODO add the controversial and scaled rankings to the doc below
/// The post sort types. See here for descriptions: https://join-lemmy.org/docs/en/users/03-votes-and-ranking.html
pub enum SortType {
Active,
Expand All @@ -72,6 +73,7 @@ pub enum SortType {
TopSixMonths,
TopNineMonths,
Controversial,
Scaled,
}
dessalines marked this conversation as resolved.
Show resolved Hide resolved

#[derive(EnumString, Display, Debug, Serialize, Deserialize, Clone, Copy)]
Expand Down
1 change: 1 addition & 0 deletions crates/db_schema/src/schema.rs
Original file line number Diff line number Diff line change
Expand Up @@ -678,6 +678,7 @@ diesel::table! {
community_id -> Int4,
creator_id -> Int4,
controversy_rank -> Float8,
scaled_rank -> Int4,
}
}

Expand Down
6 changes: 5 additions & 1 deletion crates/db_schema/src/utils.rs
Original file line number Diff line number Diff line change
Expand Up @@ -346,7 +346,7 @@ pub fn naive_now() -> NaiveDateTime {

pub fn post_to_comment_sort_type(sort: SortType) -> CommentSortType {
match sort {
SortType::Active | SortType::Hot => CommentSortType::Hot,
SortType::Active | SortType::Hot | SortType::Scaled => CommentSortType::Hot,
SortType::New | SortType::NewComments | SortType::MostComments => CommentSortType::New,
SortType::Old => CommentSortType::Old,
SortType::Controversial => CommentSortType::Controversial,
Expand Down Expand Up @@ -386,6 +386,10 @@ pub mod functions {
fn hot_rank(score: BigInt, time: Timestamp) -> Integer;
}

sql_function! {
fn scaled_rank(score: BigInt, time: Timestamp, users_active_month: BigInt) -> Integer;
}

sql_function! {
fn controversy_rank(upvotes: BigInt, downvotes: BigInt, score: BigInt) -> Double;
}
Expand Down
1 change: 1 addition & 0 deletions crates/db_views/src/post_report_view.rs
Original file line number Diff line number Diff line change
Expand Up @@ -423,6 +423,7 @@ mod tests {
featured_local: false,
hot_rank: 1728,
hot_rank_active: 1728,
scaled_rank: 742,
controversy_rank: 0.0,
community_id: inserted_post.community_id,
creator_id: inserted_post.creator_id,
Expand Down
4 changes: 4 additions & 0 deletions crates/db_views/src/post_view.rs
Original file line number Diff line number Diff line change
Expand Up @@ -332,6 +332,9 @@ fn queries<'a>() -> Queries<
SortType::Hot => query
.then_order_by(post_aggregates::hot_rank.desc())
.then_order_by(post_aggregates::published.desc()),
SortType::Scaled => query
.then_order_by(post_aggregates::scaled_rank.desc())
.then_order_by(post_aggregates::published.desc()),
SortType::Controversial => query.then_order_by(post_aggregates::controversy_rank.desc()),
SortType::New => query.then_order_by(post_aggregates::published.desc()),
SortType::Old => query.then_order_by(post_aggregates::published.asc()),
Expand Down Expand Up @@ -1129,6 +1132,7 @@ mod tests {
hot_rank: 1728,
hot_rank_active: 1728,
controversy_rank: 0.0,
scaled_rank: 742,
community_id: inserted_post.community_id,
creator_id: inserted_post.creator_id,
},
Expand Down
2 changes: 1 addition & 1 deletion crates/db_views_actor/src/community_view.rs
Original file line number Diff line number Diff line change
Expand Up @@ -113,7 +113,7 @@ fn queries<'a>() -> Queries<
}

match options.sort.unwrap_or(Hot) {
Hot | Active => query = query.order_by(community_aggregates::hot_rank.desc()),
Hot | Active | Scaled => query = query.order_by(community_aggregates::hot_rank.desc()),
NewComments | TopDay | TopTwelveHour | TopSixHour | TopHour => {
query = query.order_by(community_aggregates::users_active_day.desc())
}
Expand Down
51 changes: 51 additions & 0 deletions migrations/2023-08-23-182533_scaled_rank/down.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,51 @@
DROP FUNCTION scaled_rank;

ALTER TABLE post_aggregates
DROP COLUMN scaled_rank;

ALTER TABLE local_user
ALTER default_sort_type DROP DEFAULT;

-- Remove the 'Scaled' sort enum
UPDATE
local_user
SET
default_sort_type = 'Hot'
WHERE
default_sort_type = 'Scaled';

-- rename the old enum
ALTER TYPE sort_type_enum RENAME TO sort_type_enum__;

-- create the new enum
CREATE TYPE sort_type_enum AS ENUM (
'Active',
'Hot',
'New',
'Old',
'TopDay',
'TopWeek',
'TopMonth',
'TopYear',
'TopAll',
'MostComments',
'NewComments',
'TopHour',
'TopSixHour',
'TopTwelveHour',
'TopThreeMonths',
'TopSixMonths',
'TopNineMonths'
);

-- alter all your enum columns
ALTER TABLE local_user
ALTER COLUMN default_sort_type TYPE sort_type_enum
USING default_sort_type::text::sort_type_enum;

ALTER TABLE local_user
ALTER default_sort_type SET DEFAULT 'Active';

-- drop the old enum
DROP TYPE sort_type_enum__;

28 changes: 28 additions & 0 deletions migrations/2023-08-23-182533_scaled_rank/up.sql
Original file line number Diff line number Diff line change
@@ -0,0 +1,28 @@
CREATE OR REPLACE FUNCTION scaled_rank (score numeric, published timestamp without time zone, users_active_month numeric)
RETURNS integer
AS $$
BEGIN
-- Add 2 to avoid divide by zero errors
-- Use 0.1 to lessen the initial sharp decline at a hot_rank ~ 300
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This 0.1 scale factor is the "I made it up on the spot" part of this PR, and I don't know if it makes any sense.

I tried factors of 1, 0.1, 0.01, using a graphing calculator and some regular hot ranks, and compared them with various community sizes. Using smaller numbers lessens the initial sharp decline sensitivity.

-- Default for score = 1, active users = 1, and now, is 742
RETURN (hot_rank (score, published) / log(2 + 0.1 * users_active_month))::integer;
END;
$$
LANGUAGE plpgsql
IMMUTABLE PARALLEL SAFE;

ALTER TABLE post_aggregates
ADD COLUMN scaled_rank integer NOT NULL DEFAULT 742;

CREATE INDEX idx_post_aggregates_featured_community_scaled ON post_aggregates (featured_community DESC, scaled_rank DESC, published DESC);

CREATE INDEX idx_post_aggregates_featured_local_scaled ON post_aggregates (featured_local DESC, scaled_rank DESC, published DESC);

-- We forgot to add the controversial sort type
ALTER TYPE sort_type_enum
ADD VALUE 'Controversial';

-- Add the Scaled enum
ALTER TYPE sort_type_enum
ADD VALUE 'Scaled';

60 changes: 50 additions & 10 deletions src/scheduled_tasks.rs
Original file line number Diff line number Diff line change
Expand Up @@ -154,22 +154,16 @@ fn startup_jobs(db_url: &str) {
fn update_hot_ranks(conn: &mut PgConnection) {
info!("Updating hot ranks for all history...");

process_hot_ranks_in_batches(
conn,
"post_aggregates",
"a.hot_rank != 0 OR a.hot_rank_active != 0",
"SET hot_rank = hot_rank(a.score, a.published),
hot_rank_active = hot_rank(a.score, a.newest_comment_time_necro)",
);
process_post_aggregates_ranks_in_batches(conn);

process_hot_ranks_in_batches(
process_ranks_in_batches(
conn,
"comment_aggregates",
"a.hot_rank != 0",
"SET hot_rank = hot_rank(a.score, a.published)",
);

process_hot_ranks_in_batches(
process_ranks_in_batches(
conn,
"community_aggregates",
"a.hot_rank != 0",
Expand All @@ -189,7 +183,7 @@ struct HotRanksUpdateResult {
/// In `where_clause` and `set_clause`, "a" will refer to the current aggregates table.
/// Locked rows are skipped in order to prevent deadlocks (they will likely get updated on the next
/// run)
fn process_hot_ranks_in_batches(
fn process_ranks_in_batches(
conn: &mut PgConnection,
table_name: &str,
where_clause: &str,
Expand Down Expand Up @@ -238,6 +232,52 @@ fn process_hot_ranks_in_batches(
);
}

/// Post aggregates is a special case, since it needs to join to the community_aggregates
/// table, to get the active monthly user counts.
fn process_post_aggregates_ranks_in_batches(conn: &mut PgConnection) {
let process_start_time = NaiveDateTime::from_timestamp_opt(0, 0).expect("0 timestamp creation");

let update_batch_size = 1000; // Bigger batches than this tend to cause seq scans
let mut processed_rows_count = 0;
let mut previous_batch_result = Some(process_start_time);
while let Some(previous_batch_last_published) = previous_batch_result {
let result = sql_query(
r#"WITH batch AS (SELECT a.id
FROM post_aggregates pa
WHERE pa.published > $1
AND (pa.hot_rank != 0 OR pa.hot_rank_active != 0 OR pa.scaled_rank != 0)
Copy link
Collaborator

@phiresky phiresky Aug 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for performance, either the index on idx_post_aggregates_nonzero_hotrank should be replaced and conditioned on this condition, or the OR pa.scaled_rank != 0 be removed

removing the scaled_rank != 0 check should not impact the output I think because scaled_rank with it being floats scaled_rank will only be 0 if hot_rank is also 0

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds right, and you're correct that it does seem pointless to change that index.

"idx_post_aggregates_nonzero_hotrank" btree (published DESC) WHERE hot_rank::double precision <> 0::double precision OR hot_rank_active::double precision <> 0::double precision

ORDER BY a.published
LIMIT $2
FOR UPDATE SKIP LOCKED)
UPDATE post_aggregates pa
SET hot_rank = hot_rank(pa.score, pa.published),
hot_rank_active = hot_rank(pa.score, pa.newest_comment_time_necro),
scaled_rank = scaled_rank(pa.score, pa.published, ca.users_active_month),
FROM batch, community_aggregates ca
WHERE pa.id = batch.id and pa.community_id = ca.community_id RETURNING a.published;
"#,
)
.bind::<Timestamp, _>(previous_batch_last_published)
.bind::<Integer, _>(update_batch_size)
.get_results::<HotRanksUpdateResult>(conn);

match result {
Ok(updated_rows) => {
processed_rows_count += updated_rows.len();
previous_batch_result = updated_rows.last().map(|row| row.published);
}
Err(e) => {
error!("Failed to update {} hot_ranks: {}", "post_aggregates", e);
break;
}
}
}
info!(
"Finished process_hot_ranks_in_batches execution for {} (processed {} rows)",
"post_aggregates", processed_rows_count
);
}

fn delete_expired_captcha_answers(conn: &mut PgConnection) {
diesel::delete(
captcha_answer::table.filter(captcha_answer::published.lt(now - IntervalDsl::minutes(10))),
Expand Down