
Lifecycle query is slow #7382

Closed
macobo opened this issue Nov 26, 2021 · 9 comments · Fixed by #8021

Labels: performance (Has to do with performance. For PRs, runs the clickhouse query performance suite)

Comments

@macobo (Contributor) commented Nov 26, 2021

In what situation are you experiencing subpar performance?

Lifecycle query is quite slow.

Causes

1. Query over all of time

Digging into the query, we do this subquery per user without filtering on timestamp:

JOIN (
    SELECT DISTINCT person_id, {trunc_func}(min(events.timestamp)) earliest FROM events
    JOIN
    ({GET_TEAM_PERSON_DISTINCT_IDS}) pdi on events.distinct_id = pdi.distinct_id
    WHERE team_id = %(team_id)s AND {event_query} {filters}
    GROUP BY person_id
) earliest ON e.person_id = earliest.person_id

This is to identify if the user we see in the period is a new user or someone we've seen before.

This is very expensive since we end up reading all partitions in ClickHouse. See also #5459 for a similar problem elsewhere.

Full query can be found here: https://github.com/PostHog/posthog/blob/master/ee/clickhouse/sql/trends/lifecycle.py

2. (Minor) No join/property-filter pushdown for person properties

Instead, each person property filter ends up doing another subquery over pdis + persons, which makes the query even slower.

Compared to the above this is relatively minor, but still notable.

How to reproduce

  1. Visit a lifecycle insight

Environment

  • PostHog Cloud
  • self-hosted PostHog, version/commit: please provide

Additional context

We aren't able to do "over all of time" queries in ClickHouse - we should always have a time limit. Would love some product input on what to do here:

A couple of options:

  1. Kill the resurrected/new differentiation
  2. Introduce a fixed time range for detecting resurrected (e.g. +3 months to chosen time range)
  3. Leverage the person.created_at timestamp instead of detecting first visit time. This sounds appealing, but I'm not sure this data can be relied on - we don't use it elsewhere. It also requires always joining with person, and likely rewriting the query completely. (See the sketch below.)
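
A hedged sketch of what option 3 could look like - replacing the min(timestamp) events subquery quoted above with a join on person. The join shape and the handling of duplicated person rows are assumptions here, not a query from this codebase:

JOIN (
    -- Derive first activity from person.created_at instead of scanning
    -- all events; min() + GROUP BY roughly handles duplicated person rows.
    SELECT id AS person_id, {trunc_func}(min(created_at)) AS earliest
    FROM person
    WHERE team_id = %(team_id)s
    GROUP BY id
) earliest ON e.person_id = earliest.person_id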

Other product questions:

  1. Should this query support aggregating by groups over users?

cc @paolodamico and @marcushyett-ph for product input
cc @EDsCODE for core analytics context

Thank you for your performance issue report – we want PostHog to go supersonic!

@macobo added the performance and insights labels on Nov 26, 2021
@macobo (Contributor, Author) commented Dec 8, 2021

@paolodamico and @marcushyett-ph, given this is one of the goals for next sprint, let's settle the business question: are you in favor of using the person.created_at column for detecting resurrected users?

@EDsCODE (Member) commented Dec 8, 2021

Looping in @hazzadous to follow this in preparation for next sprint.

person.created_at makes sense as long as we validate the data is trustworthy

@paolodamico (Contributor) commented:

In a general sense, product-wise, I strongly suggest keeping the base functionality, as it's a key part of the lifecycle insight (even though the feature has low usage).

The proposed alternatives make sense to me conceptually, but let's dive a little bit deeper.

  • How do we set that created_at attribute? Do we set it automatically when creating a person record?
  • When was this attribute introduced?

Perhaps we can run a quick sanity-check query to see if there are any major discrepancies between the first event seen for a user and this date (a sketch follows below)? We could also do a one-off sync job.
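
A hedged sketch of such a sanity check - the table and column names are assumptions based on this thread, and the team_id is hypothetical:

-- Compare each person's created_at against their first observed event;
-- frequent large gaps would mean created_at can't stand in for first activity.
SELECT
    count() AS persons,
    countIf(dateDiff('day', created_at, first_event) > 1) AS off_by_more_than_a_day
FROM
(
    SELECT pdi.person_id AS person_id, min(e.timestamp) AS first_event
    FROM events e
    JOIN person_distinct_id pdi ON e.distinct_id = pdi.distinct_id
    WHERE e.team_id = 2 -- hypothetical team_id
    GROUP BY person_id
) firsts
JOIN
(
    SELECT id AS person_id, min(created_at) AS created_at
    FROM person
    WHERE team_id = 2
    GROUP BY id
) p USING (person_id)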

@marcushyett-ph (Contributor) commented:

+1 to @paolodamico and @EDsCODE, do we have any anecdotal reasons not to trust created_at?

@EDsCODE (Member) commented Dec 8, 2021

I just checked and I think created_at was an original field on the model, so we should have good data on it. The one thing with created_at is that it will lock us into only being able to do $pageview lifecycles. We're making the assumption that a created person always hits a $pageview, which is pretty fair, but this won't be true for other events.

(Though this isn't a blocker, because we could eventually materialize tables around events to precalculate the earliest timestamp per person and event.)

@hazzadous (Contributor) commented Dec 17, 2021

Sitrep

  • have added a benchmark PR
  • have looked into what is slow. Initial findings: querying events over all time is not that slow, at least for the cases I tested. The slow part, by the looks of it, is loading all pdis into memory for the join. Placing pdis on the left of the join appears to speed this up.

Either we switch the order, or put in a Join table. I wonder if using a smaller join key might also help. But at any rate, filtering the pdis by person created date should reduce the size of the join considerably (sketched below).
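
A hedged sketch of that last idea - pre-filtering the pdi subquery to persons created within (or shortly before) the queried range, so less data has to be held in memory for the hash join. Table names, templating, and the date bound follow the query quoted in the issue body but are illustrative:

SELECT pdi.person_id, {trunc_func}(min(e.timestamp)) AS earliest
FROM events e
JOIN (
    -- Only keep distinct_ids of persons created since a cutoff,
    -- shrinking the right-hand table that gets materialized in memory.
    SELECT distinct_id, person_id
    FROM person_distinct_id
    WHERE team_id = %(team_id)s
      AND person_id IN (
          SELECT id FROM person
          WHERE team_id = %(team_id)s
            AND created_at >= toDateTime('2021-01-01 00:00:00')
      )
) pdi ON e.distinct_id = pdi.distinct_id
WHERE e.team_id = %(team_id)s
GROUP BY pdi.person_id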

I'll follow up with more details and actions.

@marcushyett-ph (Contributor) commented:

Nice! @hazzadous, totally anecdotal and probably not useful context:

If I remove the filter for internal users, lifecycle queries are much faster. Not sure if there's anything in that statement to help identify bottlenecks.

@hazzadous (Contributor) commented:

Sitrep: @macobo is working on #7663, which should considerably speed up teams/projects with a large number of distinct_ids. This at least makes the query in isolation faster. A couple of other opportunities:

  1. Even if the pdi query is fast, we still need to perform a join with a large right table (multiple millions of rows in some cases), which appears to be slow, as the right table needs to be loaded into memory to perform the hash join. We can make the right table smaller, or keep it in memory, e.g. with a Join table.
  2. As @marcushyett-ph mentions, removing the internal users filter speeds things up considerably. For filtering, we are parsing with JSONExtractRaw for all persons within a team for properties that are not materialized. This requires reading the entire properties column and JSON-parsing it, which is both IO- and CPU-heavy; I'm not sure which is the bottleneck yet. The immediate option here is ensuring that any properties used for filtering are materialized (see the sketch after this list). If hits are sparse, data skipping indices might help, although I suspect the query pattern here doesn't fit. As @macobo mentions here, there are potentially small gains from using Array for properties. Further, using a nested key/value structure may help with limiting IO and avoiding the CPU cost of parsing JSON.
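
A hedged sketch of the materialized-column option - the ALTER syntax is standard ClickHouse, but the table, the pmat_email column name (borrowed from later in this thread), and the filter are assumptions:

-- Materialize the property once at write time...
ALTER TABLE person
    ADD COLUMN IF NOT EXISTS pmat_email String
    MATERIALIZED JSONExtractString(properties, 'email');

-- ...then filter on the plain column instead of JSON-parsing every row:
SELECT id
FROM person
WHERE team_id = 2 -- hypothetical team_id
  AND pmat_email NOT ILIKE '%@posthog.com' -- "internal users"-style filter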

I'll do some experimentation to see what works, but we have the materialisation option as an immediate tool. The join speed looks more involved.

@hazzadous (Contributor) commented:

After speaking to @macobo, it seems email is materialized for us as pmat_email, but lifecycle isn't using it.

hazzadous pushed a commit that referenced this issue Jan 4, 2022
Previously we were using the first event that matched the filtering
parameters. This could be expensive if there are lots of events/users
and event filtering doesn't utilize sorting or index skipping much.

Instead we use created_at as the date of first activity, regardless
of any filtering that may have been applied to events. Note that this
may not be as selective as the query on events, but fingers crossed this
is an outlier. Note that this change also diverges from the current
functionality: previously we would consider the first activity for a
specific event type, but now created_at is implicitly the earliest of
any event.

This PR doesn't handle the optimisation of further filtering the persons
by any person filters that may be applied, to ensure the right-hand
earliest JOIN is as small as it can be.

Refers to #7382
hazzadous pushed a commit that referenced this issue Jan 6, 2022
This updates the SQL to be comprised of two queries, one for getting
new, returning, and resurrecting periods of activity, one for getting
dormant periods right after periods of activity.

Refers to #7382
macobo pushed a commit that referenced this issue Jan 13, 2022
This updates the SQL to be comprised of two queries, one for getting
new, returning, and resurrecting periods of activity, one for getting
dormant periods right after periods of activity.

Refers to #7382
macobo added a commit that referenced this issue Jan 13, 2022
* refactor(lifecycle): simplify clickhouse sql logic

This updates the SQL to be comprised of two queries, one for getting
new, returning, and resurrecting periods of activity, one for getting
dormant periods right after periods of activity.

Refers to #7382

* refactor(lifecycle): use `ClickhouseEventQuery` to build event query

* format

* Use bounded_person_activity_by_period for both sides of dormant join

* refactor(lifecycle): reduce pdi2 join by one

This means we're now under the current query memory limit for orgs with
around 20m distinct_ids. It does remove some readability though :(

* update snapshot

* Add further comments to query

* Add further comments to query

* Add further comments to query

* Remove dead variables

* Refactor person_query overriding

* Lifecycle refactoring continued

* Update lifecycle tests (except people ones)

* Make lifecycle people endpoint happy

* Remove django lifecycle tests

* Add some edge case tests

* Add missing type

Co-authored-by: Harry Waye <[email protected]>