-
Notifications
You must be signed in to change notification settings - Fork 1.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Significant performance issues with funnels on Postgres #5519
Labels
performance
Has to do with performance. For PRs, runs the clickhouse query performance suite
Comments
paolodamico
added
the
performance
Has to do with performance. For PRs, runs the clickhouse query performance suite
label
Aug 9, 2021
@paolodamico mind putting me in touch with the user in users slack? Would love to take a look and debug this, but this might require a bigger dataset or access I don't have :) |
Done! |
Problem reported by another user, funnels not loading in the insights page. |
hazzadous
pushed a commit
that referenced
this issue
Sep 15, 2021
We're using this to join onto person distinct ids in lots of places to this really should be indexed. There is almost certainly a better composite index here however. Relates to #5519 although I have yet to look into if this is a resolution to the issue specifically. The specific query for #5519 looks like this, from looking `pytest posthog/api/test/test_insight.py -k test_insight_funnels_basic_post`: ``` SELECT "posthog_person"."uuid", "posthog_person"."created_at", "posthog_person"."team_id", "posthog_person"."properties", "posthog_person"."is_user_id", MIN("step_0"."step_ts") as "step_0", MIN("step_1"."step_ts") as "step_1" FROM ( SELECT "pdi"."person_id", MIN("posthog_event"."timestamp") AS "step_ts" FROM posthog_event JOIN posthog_persondistinctid pdi ON pdi.distinct_id = posthog_event.distinct_id WHERE ( "posthog_event"."timestamp" >= '2020-10-08T00:00:00+00:00'::timestamptz AND "posthog_event"."timestamp" <= '2020-11-15T13:10:40.739763+00:00'::timestamptz AND "posthog_event"."event" = 'user signed up' AND "posthog_event"."team_id" = 2 ) GROUP BY "pdi"."person_id" ) step_0 LEFT JOIN LATERAL ( SELECT "pdi"."person_id", MIN("posthog_event"."timestamp") AS "step_ts" FROM posthog_event JOIN posthog_persondistinctid pdi ON pdi.distinct_id = posthog_event.distinct_id WHERE ( "posthog_event"."timestamp" >= '2020-10-08T00:00:00+00:00'::timestamptz AND "posthog_event"."timestamp" <= '2020-11-15T13:10:40.739763+00:00'::timestamptz AND "pdi"."person_id" = "step_0"."person_id" AND "posthog_event"."event" = 'user did things' AND "posthog_event"."team_id" = 2 AND "posthog_event"."timestamp" >= "step_0"."step_ts" ) GROUP BY "pdi"."person_id" ) step_1 ON TRUE JOIN posthog_person ON posthog_person.id = "step_0".person_id WHERE "step_0".person_id IS NOT NULL GROUP BY "posthog_person"."uuid", "posthog_person"."created_at", "posthog_person"."team_id", "posthog_person"."properties", "posthog_person"."is_user_id" ```
hazzadous
pushed a commit
that referenced
this issue
Sep 16, 2021
Have tested on prod with team_id=2. Seems to work at making the query complete in time. Quite quick there but not much data. There's probably a lot more we could do here but I don't want to burn too much time on this. Future improvements could be: 1. Remove the first subquery and just do a join 2. Investigate if there's benefit in removing the group by and using distinct on. I don't know enough about this tbh. 3. Remove all the string manipulations with re. Just write up the query by hand. 4. Remove the implicit coupling between query_bodies gen and the `for step, qb` look. It's probably best to do this in one pass. Closes #5519
hazzadous
pushed a commit
that referenced
this issue
Sep 20, 2021
* docs(funnel): add comments around how the funnel query is build for pg * chore(funnel): explicitly couple `_gen_lateral_bodies` to `_build_query` `_gen_lateral_bodies` result seems to always be passed to `_build_query` so it doesn't make too much sense to have them required to be called in conjunction. There are probably further changes that could be made. There is some implicit coupling around the return value and the joining of the bodies into a LATERAL JOIN, which should be made explicit. * chore(funnels): reduce the query generation complexity This is mainly just to make it a little more clear how the query fits together. I have done something naughty here in that I have changed from a using a GROUP BY to using DISTINCT ON for the top level. I think you'd get uniqueness of funnel path anyway, as each lateral join should only produce only one result. * perf(funnels): add filter on pdi.team_id to speed up query Have tested on prod with team_id=2. Seems to work at making the query complete in time. Quite quick there but not much data. There's probably a lot more we could do here but I don't want to burn too much time on this. Future improvements could be: 1. Remove the first subquery and just do a join 2. Investigate if there's benefit in removing the group by and using distinct on. I don't know enough about this tbh. 3. Remove all the string manipulations with re. Just write up the query by hand. 4. Remove the implicit coupling between query_bodies gen and the `for step, qb` look. It's probably best to do this in one pass. Closes #5519
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
In what situation are you experiencing subpar performance?
An incredibly helpful user today, shared this experience on funnels taking way too much time and resources (exact numbers TBC) to execute with me after upgrading to 1.27 (Postgres). User reported event volume is pretty small (exact numbers coming soon).
How to reproduce
See above.
Environment
Additional context
This seems like a huge blocker on Postgres, even if we start considering deprecating Postgres usage, this might be one of those things to actually address. Thoughts? @macobo @neilkakkar @EDsCODE
Thank you for your performance issue report – we want PostHog to go supersonic!
The text was updated successfully, but these errors were encountered: