Speed up person-related trends queries for large users #7548
@macobo I have another potential "approach" (I won't call it a solution, yet): #7537 https://metabase.posthog.net/question/204-optimized-dau-query-for-2635
Some thoughts: Which table is larger isn't as clear-cut as we originally assumed (and may still be assuming). There are cases where the person table is larger (100 persons, but only 20 events have been fired and are being considered in the join). There are cases where persons are smaller (10 persons, each of whom performed the event in question 10 times). At scale, this could cause the memory issues mentioned above.
Something I was trying was guaranteeing that the persons joined in would always be smaller than the events table: https://metabase.posthog.net/question/205-dau-query-that-ensures-limit-persons-table This query filters events and then only retrieves the person distinct ids that are associated with the events in consideration. It's not particularly fast (weekly aggregation for 5 weeks returns in 15 seconds), but it could be useful to think about. I also need to check it for correctness.
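To make the "filter events first, then only pull the matching distinct ids" idea concrete, here is a minimal Python sketch. It is not the Metabase query itself: the in-memory tables, the column layout and the function name `dau_with_limited_persons` are all hypothetical stand-ins, and it only illustrates why the person side of the join can never be larger than the set of distinct ids in the filtered events.

```python
from collections import defaultdict

# Hypothetical stand-ins for the events and person_distinct_id tables.
EVENTS = [
    # (distinct_id, day)
    ("d1", "2021-12-01"),
    ("d2", "2021-12-01"),
    ("d1", "2021-12-02"),
]
PERSON_DISTINCT_IDS = {"d1": "p1", "d2": "p1", "d3": "p2", "d4": "p3"}


def dau_with_limited_persons(events, pdis, start, end):
    # Step 1: filter events down to the date range under consideration.
    filtered = [(d, day) for d, day in events if start <= day <= end]
    # Step 2: keep only the mappings for distinct ids seen in those events,
    # so the person side is bounded by the filtered events.
    needed = {d for d, _ in filtered}
    limited = {d: p for d, p in pdis.items() if d in needed}
    # Step 3: join and count unique persons per day.
    persons_per_day = defaultdict(set)
    for d, day in filtered:
        if d in limited:
            persons_per_day[day].add(limited[d])
    counts = {day: len(persons) for day, persons in persons_per_day.items()}
    return counts, len(limited)


counts, joined_rows = dau_with_limited_persons(
    EVENTS, PERSON_DISTINCT_IDS, "2021-12-01", "2021-12-02"
)
print(counts, joined_rows)
```

In this toy data only 2 of the 4 distinct id mappings take part in the join, which is the property the query is after. Whether it pays off on a real cluster is exactly what still needs measuring.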
Gonna dump some raw debugging notes here, will be editing this post as I continue
Talked with marcus and ran a few experiments together:
-- Current pdis subquery, takes 13s
SELECT
    count(1) as data
FROM
(
    SELECT
        distinct_id,
        argMax(person_id, _timestamp) as person_id
    FROM
    (
        SELECT
            distinct_id,
            person_id,
            max(_timestamp) as _timestamp
        FROM
            person_distinct_id
        WHERE
            team_id = 2635
        GROUP BY
            person_id,
            distinct_id
        HAVING
            max(is_deleted) = 0
    )
    GROUP BY
        distinct_id
) pdi
-- Attempt at optimizing. Isn't quite correct.
SELECT count(1) FROM (
    SELECT
        distinct_id,
        argMax(person_id, _timestamp) as person_id
    FROM
        person_distinct_id
    WHERE
        team_id = 2635
    GROUP BY distinct_id
    HAVING argMax(is_deleted, _timestamp) = 0
)
Action steps from this:
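Side note on the two pdis queries above: the thread does not spell out why the optimization attempt "isn't quite correct", so the following is only one possible divergence, sketched in Python under assumptions. The rows are hypothetical, and the two functions are hand-written imitations of the GROUP BY / argMax / HAVING logic, not ClickHouse itself. The assumed scenario is a distinct_id that gets re-pointed to a new person, with the deletion row for the old pair carrying the latest timestamp.

```python
# Hypothetical rows: (distinct_id, person_id, _timestamp, is_deleted)
ROWS = [
    ("d1", "A", 1, 0),  # d1 first points at person A
    ("d1", "B", 2, 0),  # d1 is re-pointed at person B
    ("d1", "A", 3, 1),  # the old (d1, A) pair is marked deleted afterwards
]


def current_query(rows):
    # Inner query: per (distinct_id, person_id) pair, take max(_timestamp)
    # and max(is_deleted).
    latest_ts, deleted = {}, {}
    for distinct_id, person_id, ts, is_deleted in rows:
        key = (distinct_id, person_id)
        latest_ts[key] = max(latest_ts.get(key, ts), ts)
        deleted[key] = max(deleted.get(key, 0), is_deleted)
    # Outer query: among pairs with max(is_deleted) = 0, pick the person
    # with the latest timestamp per distinct_id (argMax).
    result, best_ts = {}, {}
    for (distinct_id, person_id), ts in latest_ts.items():
        if deleted[(distinct_id, person_id)] != 0:
            continue
        if distinct_id not in best_ts or ts > best_ts[distinct_id]:
            best_ts[distinct_id] = ts
            result[distinct_id] = person_id
    return result


def attempted_query(rows):
    # Single pass: take the latest row per distinct_id, then keep it only
    # if that latest row is not a deletion.
    latest = {}
    for distinct_id, person_id, ts, is_deleted in rows:
        if distinct_id not in latest or ts > latest[distinct_id][0]:
            latest[distinct_id] = (ts, person_id, is_deleted)
    return {d: p for d, (_, p, is_del) in latest.items() if is_del == 0}


print(current_query(ROWS))    # {'d1': 'B'}
print(attempted_query(ROWS))  # {}
```

Under these assumed rows the current subquery still maps d1 to person B, while the single-pass version drops d1 entirely because its most recent row is the deletion of the old pair. Whether real data contains this ordering would need checking against the table.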
Plan: new distinct_ids table
Talked with Yakko. We ended up at a different schema for the person_distinct_ids table, basically:
with version starting at 0 and increasing by 1 every time a distinct_id should point at a new row. We wouldn't emit "deletions" per (distinct_id, person_id) pair anymore, but only when a distinct_id gets deleted. This should allow for more optimized queries. The optimized query would then look something like this:
This requires:
I'll be tackling 1-4 right now and will hand off 5 and 6 to team platform when it's in. This way they don't need to drop their current tasks and we don't block the release on 5 and 6.
Benchmarking new person_distinct_ids table:
I ran 3 different daily active users queries, measuring each 4 times to account for variance in cluster load.
Taking the 2nd fastest of each:
Note that the (4) one uses ~5GB of memory compared to the rest using ~8GB. (5) uses slightly more memory but might be easier to implement. Given this I'd:
Benchmarks on the benchmarking server (before -> after):
track_trends_dau 4906.0±52 -> 4463.5±77
These wins are significant, but less so than on cloud for the large team. The issue is the test data: we have ~700k users in there instead of >10M. To be improved!
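For reference, the arithmetic behind these numbers can be sketched in a few lines of Python. The helper names and the sample timings passed to `second_fastest` are made up for illustration (the actual per-run measurements are not in this thread); only the track_trends_dau figures come from the benchmark above.

```python
def second_fastest(timings):
    # "Taking the 2nd fastest" of repeated runs: sort ascending and take
    # the second entry, which discards the single luckiest run.
    return sorted(timings)[1]


def relative_improvement(before, after):
    # Fraction of the original time that was saved.
    return (before - after) / before


# Hypothetical timings for one query measured 4 times.
print(second_fastest([13.2, 12.1, 15.0, 12.8]))  # 12.8

# track_trends_dau on the benchmarking server, before -> after.
print(f"{relative_improvement(4906.0, 4463.5):.1%}")  # 9.0%
```

So the benchmarking server shows roughly a 9% win for track_trends_dau, with the gap between before and after well outside the reported ± spread.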
I attempted to optimize the PDI query by converting the outer argMax query into a window function. The query is shorter but just as slow. There's something I can't quite work out about ordering the MAX(_timestamp):
SELECT
    count(1) as data
FROM (
    SELECT
        distinct_id,
        FIRST_VALUE(person_id) OVER (PARTITION BY distinct_id, person_id ORDER BY MAX(_timestamp) DESC) as person_id_2
    FROM
        person_distinct_id
    WHERE
        team_id = 2635
        AND _timestamp < today()
    GROUP BY
        distinct_id,
        person_id
    HAVING
        max(is_deleted) = 0
)
Status update:
Technical:
For the client this means: We had an issue with how we were storing people in our codebase, causing it not to scale well beyond 10M users.
Over the next 4 weeks, through the holidays, getting phantom performance even better is going to be one of the main focuses of my team.
Closing this, following up with #7663
In what situation are you experiencing subpar performance?
Some of our largest users have run into both memory and time limit issues in insights
Example queries affected
Potential solution(s)
person table select as the left-most query.
Steps to solve
Environment
Additional context
Relevant slack thread: https://posthog.slack.com/archives/C01MM7VT7MG/p1638272862302100
Additional potential optimization: #7537 (comment)
Thank you for your performance issue report – we want PostHog to go supersonic!