-
-
Notifications
You must be signed in to change notification settings - Fork 886
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adding a scaled sort, to boost smaller communities. #3907
Conversation
- Previously referred to as *best* . - Fixes #3622
// Diesel can't update based on a join, which is necessary for the scaled_rank | ||
// https://github.com/diesel-rs/diesel/issues/1478 | ||
// Just select the users_active_month manually for now, since its a single post anyway | ||
let users_active_month = community_aggregates::table |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can probably move this query down into the update statement, to avoid the round-trip cost.
edit: tried and failed at this.
@@ -100,6 +100,8 @@ pub struct PostAggregates { | |||
pub community_id: CommunityId, | |||
pub creator_id: PersonId, | |||
pub controversy_rank: f64, | |||
/// A rank that amplifies smaller communities | |||
pub scaled_rank: i32, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this value should be a float? Hot_rank for example really quickly goes to zero (after just ~ 2 days mostly) and all information of values between 0 and 1 and 1 and 2 is lost.
For example, right now select scaled_rank(80, '2023-08-20', 10000);
and select scaled_rank(1000, '2023-08-20', 10000);
both have the exact same rank even though one has > 10 times the votes!
Really hot rank should also be a float imo but that's maybe out of scope here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's true, these all should probably be made into floats (not sure why I didn't want to use floats originally), could you open an issue for that? Lets do that as a separate issue.
Also being a "news" type sort, the time decay is supposed to go to zero after ~ 2 days, after which published descending takes over.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mh, that's interesting. Then maybe we can just keep hot_rank as is but make the new scaled_rank a float? I think it would make sense to make the new scaled_rank function and column float from the start instead of migration later.
There's no code change required for making this a float but keeping hot_rank an int
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe I should just change hot_rank to a float then as a part of this, since scaled_rank depends on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good. Note it will reduce the performance of the scheduled task a fair bit since afterwards almost no post will have a hot rank of 0 and be filtered out (ideas @sunaurus? )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's def a concern... We could either use the published timestamp as a filter, or minimum hot_rank threshold.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I haven't had a chance to look at the implementation yet, so sorry if this is a dumb idea, but can we not just update the rank functions to set the rank 0 after it decays to some threshold, to guarantee that the majority of posts will still have a 0 value for rank and thus get filtered out during updates?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean something like if the rank is below 1.0 (or perhaps even higher!) then just set it to 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that should work. But it would need careful consideration what the threshold should be both for hot_rank and scaled_rank. According to dess above, the published sort is supposed to take over after a few days. So maybe it would be better to make the function return zero if published is more than 7 days in the past? The hot_rank function gets published
as a parameter anyways.
So that change would just be to replace IF (hours_diff > 0) THEN
with IF (hours_diff > 0 AND hours_diff < 24 * 7) THEN
AS $$ | ||
BEGIN | ||
-- Add 2 to avoid divide by zero errors | ||
-- Use 0.1 to lessen the initial sharp decline at a hot_rank ~ 300 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This 0.1
scale factor is the "I made it up on the spot" part of this PR, and I don't know if it makes any sense.
I tried factors of 1, 0.1, 0.01, using a graphing calculator and some regular hot ranks, and compared them with various community sizes. Using smaller numbers lessens the initial sharp decline sensitivity.
I can do the int-> float convert for hot_rank and scaled_rank if you don't have time @dessalines |
@phiresky No that's okay, I'll work on that today. |
Okay that's updated now. I also removed the |
Will this mean that bot communities like Lemmit will dominate my feed? I like Lemmit but I wouldn't want it to be all of the top posts. |
This has nothing to do with bots, but communities with few active users. You can already block bot accounts universally in your user profile settings, as well as block their users or communities if you wish. |
Yes but I enjoy the Lemmit bot, I just don't want the top 100 posts on my feed to be entirely Lemmit. I'm glad to know it's based on subscribers and not active users, but I still feel like posts from bot-only communities might be strongly favored by this sorting. Maybe I'll just be toggling the show bots option on and off each day. |
Sry I misspoke above, this is absolutely based on active monthly users, not subscribers, as subscribers is really a pointless metric, considering how communities can have a ton of subscribers but little to no activity. Bot communities have nothing to do with this, this will boost any community with few active users. This won't do a spread either (IE picking a single post from a lot of communities, as that's too slow to do performance-wise). |
Right it'll be something to get used to. Lemmit communities are mostly just a single active user (the bot itself) since they don't allow posts from other users. |
My suggestion to both improve the performance and to go in the direction of "published sort should take over after a few days" would be to replace |
src/scheduled_tasks.rs
Outdated
r#"WITH batch AS (SELECT pa.id | ||
FROM post_aggregates pa | ||
WHERE pa.published > $1 | ||
AND (pa.hot_rank != 0 OR pa.hot_rank_active != 0 OR pa.scaled_rank != 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for performance, either the index on idx_post_aggregates_nonzero_hotrank should be replaced and conditioned on this condition, or the OR pa.scaled_rank != 0
be removed
removing the scaled_rank != 0 check should not impact the output I think because scaled_rank with it being floats scaled_rank will only be 0 if hot_rank is also 0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds right, and you're correct that it does seem pointless to change that index.
"idx_post_aggregates_nonzero_hotrank" btree (published DESC) WHERE hot_rank::double precision <> 0::double precision OR hot_rank_active::double precision <> 0::double precision
Good call, I'll do that now. |
Instead of adding another sort option I would rather use this scaling logic for the existing Active and Hot sorts. That way users will actually notice an improvement immediately, instead having to tell everyone to change sorts manually. If this works well it will be a clear improvement and there should be no need for the current Active/Hot sorts. If it doesnt work, we can finetune during rc process. |
I don't feel too comfortable with replacing them, because they are completely different sorts. One will show posts from popular communities, the other will show sorts from unpopular communities. I'm open to making it the new default tho, if it ends up looking good with production data. EDIT: I'm converting this to a draft, so I can test with some local production data, to make sure things look okay. |
Okay its ready for review. The only part I'm unsure about, is a scale factor added to I tested this with production data, and the |
hot_rank_active = hot_rank(pa.score, pa.newest_comment_time_necro), | ||
scaled_rank = scaled_rank(pa.score, pa.published, ca.users_active_month) | ||
FROM batch, community_aggregates ca | ||
WHERE pa.id = batch.id and pa.community_id = ca.community_id RETURNING pa.published; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just as a note: with this change it might make sense to do the post ordering / batch selection with ORDER BY (community_id, published) so that the join on community_aggregates is less expensive. But it's probably not a worth the extra pagination complexity if we assume that the whole communities table+indexes will be in memory in any case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. This will require a new index as well if #3872 is merged.
Still needs lots of testing with prod data, as well as verifying the scheduled_tasks changes are working correctly.
An example of some ranks: