Lifecycle query is slow #7382
Comments
@paolodamico and @marcushyett-ph given this is one of the goals next sprint, let's solve the business question: are you in favor of using the person.created_at column for discovering resurrected users?
Looping in @hazzadous to follow this in preparation for next sprint. person.created_at makes sense as long as we validate the data is trustworthy.
In a general sense, product-wise, I strongly suggest keeping the base functionality, as it's a key part of the lifecycle insight (even though the feature has low usage). The proposed alternatives make sense to me conceptually, but let's dive a little deeper.
Perhaps we can do a quick sanity-check query to see if there are any major discrepancies between the first event seen for a user and this date? We could also do a one-off sync job.
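Something like this might work as a first pass (a rough sketch only; the table and column names and the `team_id = 2` literal are assumptions about the ClickHouse schema, not the exact production query):

```sql
-- Rough sanity check: how many persons have a created_at that disagrees
-- with their earliest observed event by more than a day?
-- All table/column names and team_id = 2 are illustrative assumptions.
SELECT
    countIf(abs(dateDiff('day', firsts.first_event, p.created_at)) > 1) AS suspicious,
    count() AS total
FROM person AS p
INNER JOIN
(
    SELECT pdi.person_id AS person_id, min(e.timestamp) AS first_event
    FROM events AS e
    INNER JOIN person_distinct_id AS pdi ON pdi.distinct_id = e.distinct_id
    WHERE e.team_id = 2 AND pdi.team_id = 2
    GROUP BY pdi.person_id
) AS firsts ON firsts.person_id = p.id
WHERE p.team_id = 2
```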
+1 to @paolodamico and @EDsCODE, do we have any anecdotal reasons not to trust created_at?
I just checked, and I think created_at was an original field on the model, so we should have good data on it. The one thing with created_at is that it will lock us in to only being able to do $pageview lifecycles. We're making the assumption that a created person always hits a $pageview, which is pretty fair, but this won't be true for other events. (Though this isn't a blocker, because we could eventually materialize tables around events to precalculate the earliest timestamp per person and event.)
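For what it's worth, one hypothetical shape for that precalculation in ClickHouse would be an AggregatingMergeTree materialized view (purely a sketch of the idea; no such table exists today, and every name here is made up):

```sql
-- Hypothetical: keep a rolling minimum of the first time each distinct_id
-- was seen per event type, so "first activity" never scans all partitions.
CREATE MATERIALIZED VIEW first_seen_by_distinct_id
ENGINE = AggregatingMergeTree()
ORDER BY (team_id, event, distinct_id)
AS SELECT
    team_id,
    event,
    distinct_id,
    minState(timestamp) AS first_seen
FROM events
GROUP BY team_id, event, distinct_id

-- Reading it back requires merging the aggregate states:
-- SELECT distinct_id, minMerge(first_seen)
-- FROM first_seen_by_distinct_id
-- WHERE team_id = 2 AND event = '$pageview'
-- GROUP BY distinct_id
```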
Sitrep
Either we switch the order, or put in a join table. I wonder if using a smaller join key might help too. At any rate, filtering the pdis by person created date should reduce the size of the join considerably. I'll follow up with more details and actions.
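As a rough illustration of that join-reduction idea (names, the team id, and the date literal are assumptions; the real query is linked from the issue body below):

```sql
-- Sketch: when classifying "new" users, only pull distinct_ids for persons
-- created inside the queried window, shrinking the join's right-hand side.
SELECT e.distinct_id, pdi.person_id
FROM events AS e
INNER JOIN
(
    SELECT distinct_id, person_id
    FROM person_distinct_id
    WHERE team_id = 2
      AND person_id IN (
          SELECT id FROM person
          WHERE team_id = 2 AND created_at >= toDateTime('2021-11-01 00:00:00')
      )
) AS pdi ON pdi.distinct_id = e.distinct_id
WHERE e.team_id = 2
  AND e.timestamp >= toDateTime('2021-11-01 00:00:00')
```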
Nice! @hazzadous, totally anecdotal and probably not useful context: if I remove the filter for internal users, lifecycle queries are way, way faster. Not sure if there's anything in that statement to help identify any bottlenecks.
Sitrep: @macobo is working on #7663, which should considerably speed up teams/projects with a large number of distinct_ids. This at least makes the query in isolation faster. A couple of other opportunities:
I'll do some experimentation to see what works, but we have the materialisation option as an immediate tool. The join speedup looks more involved.
After speaking to @macobo, it seems email is materialized for us as pmat_email, but lifecycle isn't using it.
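For context on what that buys (the `pmat_email` column name is from the comment above; the table, team id, and internal-users-style filter are illustrative assumptions):

```sql
-- Slow: parses the properties JSON blob for every scanned row.
SELECT count() FROM person
WHERE team_id = 2
  AND JSONExtractString(properties, 'email') LIKE '%@posthog.com'

-- Fast: reads the already-materialized pmat_email column directly.
SELECT count() FROM person
WHERE team_id = 2
  AND pmat_email LIKE '%@posthog.com'
```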
Previously we were using the first event that matched the filtering parameters. This could be expensive if there are lots of events/users, and event filtering doesn't utilize sorting or index skipping much. Instead we use created_at as the date of first activity, regardless of any filtering that may have been applied to events. Note that this may not be as selective as the query on events, but fingers crossed this is an outlier. Note also that this change diverges from the current functionality: previously we would consider the first activity for a specific event type, but now created_at is implicitly the earliest of any event. This PR doesn't handle the optimisation of further filtering the persons by any person filters that may be applied, to ensure the right-hand side of the earliest-activity JOIN is as small as it could be. Refers to #7382
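Conceptually the change swaps an events scan for a persons lookup, something like this (a simplified sketch under assumed names, not the PR's literal SQL):

```sql
-- Before: earliest matching event per person (reads lots of event data).
--   SELECT pdi.person_id, min(e.timestamp)
--   FROM events e JOIN person_distinct_id pdi ON pdi.distinct_id = e.distinct_id
--   GROUP BY pdi.person_id
-- After: person.created_at stands in for the date of first activity.
SELECT pdi.person_id, any(p.created_at) AS first_activity
FROM person_distinct_id AS pdi
INNER JOIN person AS p ON p.id = pdi.person_id
WHERE pdi.team_id = 2 AND p.team_id = 2
GROUP BY pdi.person_id
```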
This updates the SQL to be comprised of two queries: one for getting new, returning, and resurrecting periods of activity, and one for getting dormant periods right after periods of activity. Refers to #7382
* refactor(lifecycle): simplify clickhouse sql logic

  This updates the SQL to be comprised of two queries: one for getting new, returning, and resurrecting periods of activity, and one for getting dormant periods right after periods of activity. Refers to #7382

* refactor(lifecyle): use `ClickhouseEventQuery` to build event query
* format
* Use bounded_person_activity_by_period for both sides of dormant join
* refactor(lifecycle): reduce pdi2 join by one

  This means we're now under the current query memory limit for orgs with around 20m distinct_ids. It does remove some readability though :(

* update snapshot
* Add further comments to query
* Add further comments to query
* Add further comments to query
* Remove dead variables
* Refactor person_query overriding
* Lifecycle refactoring continued
* Update lifecycle tests (except people ones)
* Make lifecycle people endpoint happy
* Remove django lifecycle tests
* Add some edge case tests
* Add missing type

Co-authored-by: Harry Waye <[email protected]>
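In spirit, the two queries described above look something like the sketch below, assuming a hypothetical `activity_by_day` relation with one row per person per active day plus that person's created_at, and integer person ids so a LEFT JOIN miss shows up as 0 (none of this is the PR's literal SQL):

```sql
-- Query 1: classify each active day as new / returning / resurrecting by
-- checking created_at and whether the person was also active the day before.
SELECT a.person_id,
       a.day,
       multiIf(
           a.day = toDate(a.created_at), 'new',
           prev.person_id != 0, 'returning',    -- active on the prior day too
           'resurrecting'                       -- active again after a gap
       ) AS status
FROM activity_by_day AS a
LEFT JOIN activity_by_day AS prev
    ON prev.person_id = a.person_id AND prev.day = a.day - INTERVAL 1 DAY

-- Query 2: a person is dormant on the day right after an active day
-- with no activity of their own.
SELECT a.person_id, a.day + INTERVAL 1 DAY AS dormant_day
FROM activity_by_day AS a
LEFT JOIN activity_by_day AS next
    ON next.person_id = a.person_id AND next.day = a.day + INTERVAL 1 DAY
WHERE next.person_id = 0
```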
In what situation are you experiencing subpar performance?
Lifecycle query is quite slow.
Causes
1. Query over all of time
Digging into the query, we do a subquery per user without filtering on timestamp.
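Roughly, it has this shape (a simplified sketch with assumed table/column names and team id; the full production query is linked below):

```sql
-- Simplified sketch: earliest event per person with no lower bound on
-- timestamp, so ClickHouse has to read every partition of `events`.
SELECT pdi.person_id AS person_id, min(e.timestamp) AS earliest
FROM events AS e
INNER JOIN person_distinct_id AS pdi ON pdi.distinct_id = e.distinct_id
WHERE e.team_id = 2          -- note: no `timestamp >= ...` condition
GROUP BY pdi.person_id
```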
This is to identify if the user we see in the period is a new user or someone we've seen before.
This is very expensive, since we end up reading all partitions in ClickHouse. See also #5459 for a similar problem elsewhere.
Full query can be found here: https://github.com/PostHog/posthog/blob/master/ee/clickhouse/sql/trends/lifecycle.py
2. (Minor) No join/property-filter pushdown for person properties
Instead, each person property filter ends up doing another subquery over pdis + persons, which makes the query even slower. Compared to the above this is relatively minor, but still notable.
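Each person-property filter currently contributes a subquery of roughly this shape (illustrative sketch; names and literals are assumptions) instead of being pushed down into a single persons join:

```sql
-- Sketch: one subquery like this per person-property filter, each one
-- re-reading person_distinct_id and person.
SELECT distinct_id
FROM person_distinct_id
WHERE team_id = 2
  AND person_id IN (
      SELECT id FROM person
      WHERE team_id = 2
        AND JSONExtractString(properties, 'email') LIKE '%@example.com'
  )
```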
How to reproduce
Environment
Additional context
We aren't able to do "over all of time" queries in ClickHouse - we should always have a time limit. Would love some product input on what to do here. A couple of options:
- Use the `person.created_at` timestamp over detecting first-visit time. This sounds appealing, but I'm not sure this data can be relied on - we don't use it elsewhere. It also requires always joining with `person`, and likely rewriting the query completely.

Other product questions:
cc @paolodamico and @marcushyett-ph for product input
cc @EDsCODE for core analytics context
Thank you for your performance issue report – we want PostHog to go supersonic!