
Event filters: suggest properties that actually have data #4166

Closed
samwinslow opened this issue Apr 29, 2021 · 20 comments · Fixed by #7500
Labels: enhancement (New feature or request), P1 (Urgent, non-breaking: no crash but low usability)

Comments

samwinslow (Contributor) commented Apr 29, 2021

Is your feature request related to a problem?

When filtering events, it's overwhelming to see every property that any event could potentially have. If you forget the exact name of a property, you either scroll through the list of all properties or rely on fuzzy search. The size of this list is n * p, where n is the number of event types and p is the average number of unique properties per event type, so as an organization adds more event types the problem gets worse.

You can already see this phenomenon on app.posthog.com:

[Screen recording: Screen.Cast.2021-04-29.at.12.30.17.PM.mp4]

Describe the solution you'd like

We should get smarter about ranking / filtering the properties in the filter dropdown. The majority of the properties in the list might not apply to the chosen event type. Fuse fuzzy search helps a little, but we could A) give it a ranking parameter based on how many matching events actually have each property, or B) include / exclude custom properties outright based on a new schema described below.

For the example below - insight viewed - we could predict that the user might want to filter on the Insight property, and exclude or de-rank properties that are undefined for insight viewed events.

[Screenshot: Screen Shot 2021-04-29 at 12 06 52 PM]

This warrants some discussion re: the events schema. As you can see here, there are no $pageview events that have app_rating, no installed_app events that have is_first_movie, etc. Queries that look for those event-property combinations will of course return 0 rows.

[Screenshot: Screen Shot 2021-04-29 at 12 41 02 PM]

Do we already have a table which represents all the custom property keys PostHog has "seen" on an event? This seems like the simplest solution. The schema would look like this, and incoming events would append properties as needed.

| event         | properties         |
|---------------|--------------------|
| rated_app     | ['app_rating']     |
| installed_app | []                 |
| watched_movie | ['is_first_movie'] |

For efficiency, updates to this table could be batched rather than being processed for every incoming event.


samwinslow added the enhancement (New feature or request) and analytics-filters labels on Apr 29, 2021
samwinslow (Contributor, Author) commented:

@timgl tagging you since this is Fuse-related.

Would love your feedback as well as @paolodamico's - especially as to whether you guys agree this is a problem worth fixing, and how to fit this into sprint priorities if so.

timgl (Collaborator) commented Apr 29, 2021

I think this would be useful (it's definitely a source of confusion for users, I've noticed). The only issue is that it's an expensive query to populate the table with this data initially, as it'll need to loop through every event ever received. I think there's a solution where we only enable this for new teams and optionally create a management command that populates the table for existing teams, so that we can at least run this on Cloud.

mariusandra (Collaborator) commented:

We just merged in this: https://github.com/PostHog/plugin-server/pull/331/files

An easy way forward is to create a third table, posthog_eventpropertydefinitions, and populate it with pretty much the same system, just with both event name and property name fields.

I'd advise against using postgres string arrays for this. They're very annoying to update, as you always need to select the array, patch it and then update the row again. With concurrent servers accessing the same data and with large enough arrays, it'll be a mess sooner or later. Best just to split it up.
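A minimal sketch of what that third table could look like, assuming one row per (team, event, property) pair rather than an array column; all names here are illustrative, not an actual PostHog migration:

-- Hypothetical Postgres table: one row per (team, event, property) seen during ingestion.
CREATE TABLE posthog_eventpropertydefinitions (
    id UUID PRIMARY KEY,
    team_id INTEGER NOT NULL,
    event VARCHAR(400) NOT NULL,
    property VARCHAR(400) NOT NULL,
    CONSTRAINT eventpropertydefinitions_uniq UNIQUE (team_id, event, property)
);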

paolodamico (Contributor) commented May 13, 2021

Got some more feedback on this from a user yesterday. When adding a property filter, she expected to be able to search across all matching values. In her case, she was trying to filter for a specific user, so she was searching for a particular email address; we weren't suggesting that address because it wasn't in the first results. Ideally we would allow searching across all property values, but depending on technical complexity, if that's not possible we would at least want to update the UI/UX of the filters to make it clear that we're only suggesting values, not showing hard results for a search.

paolodamico (Contributor) commented:

Other than the above, there are also some other cases that cause confusion. For instance, for "Active Feature Flags" we use contains and then you're supposed to type the feature flag you care about, but this is neither clear to users nor a good experience.

timgl (Collaborator) commented May 25, 2021

I'm going to make a little bit of progress on this now

paolodamico (Contributor) commented May 25, 2021

Related #3161 & #4398

timgl (Collaborator) commented Jun 14, 2021

I've tried to make progress on this for a while, but apart from the one PR I did (edit: and the one I'm about to open), I don't think I'll have enough time to improve this.

The way I would solve this is:

  • Add an event field to PropertyDefinition and add event to the unique constraint. Make sure that you dedupe this in the API (roughly sketched below).
  • Update the plugin server to add the event if it doesn't exist.
  • When filtering, show any properties that are definitely on the event first.
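As a rough illustration of the first bullet, the underlying Postgres change might look something like the following; the table and constraint names are assumptions for the sketch, not the actual migration:

-- Hypothetical migration sketch: add an event column to the property definition table
-- and include it in the uniqueness constraint (the existing (team_id, name) unique
-- constraint would also need to be dropped or replaced).
ALTER TABLE posthog_propertydefinition
    ADD COLUMN event VARCHAR(400) NULL;

ALTER TABLE posthog_propertydefinition
    ADD CONSTRAINT propertydefinition_team_event_name_uniq
    UNIQUE (team_id, event, name);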

paolodamico added the P1 (Urgent, non-breaking: no crash but low usability) label on Sep 16, 2021
pauldambra (Member) commented:

ClickHouse 21.11 introduces JSONExtractKeys, which could be used to map existing event and property links (if we were on the latest version).

neilkakkar (Contributor) commented:

re: JSONExtractKeys, we already have JSONExtractKeysAndValuesRaw, which we can use to get the keys, like so: https://github.com/PostHog/posthog/blob/master/ee/clickhouse/queries/funnels/funnel_correlation.py#L471

(Just mentioning so we know that if we do want to go this route, we're not blocked on upgrading CH.)
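For illustration, getting the keys via JSONExtractKeysAndValuesRaw boils down to something like this (the team_id and event values are just examples from this thread):

-- x.1 takes the key from each (key, value) tuple returned by JSONExtractKeysAndValuesRaw.
SELECT arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties)) AS property_keys
FROM events
WHERE team_id = 2 AND event = 'insight viewed'
LIMIT 10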

timgl (Collaborator) commented Nov 23, 2021

I've got a super simple solution to this that won't involve parsing all events.

When we hit property_definitions we can return both a list of all properties and a sample of the last 1k events received for the specific event, extracting the relevant keys from that sample. We then display those sampled properties first, as they are likely the most relevant, but also show all other properties in case we missed any.

We could do the same with the values to make those more relevant.
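A hedged sketch of what that sampling query could look like in ClickHouse (the team_id, event and sample size are placeholders):

-- Extract distinct property keys from a sample of the most recent 1k events of one type.
SELECT DISTINCT arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) AS property
FROM (
    SELECT properties
    FROM events
    WHERE team_id = 2 AND event = 'insight viewed'
    ORDER BY timestamp DESC
    LIMIT 1000
)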

paolodamico (Contributor) commented:

Duplicate scope of #4791

timgl (Collaborator) commented Dec 2, 2021

@paolodamico I don't think it is? #4791 is about events, this is about properties

timgl reopened this Dec 2, 2021
paolodamico (Contributor) commented:

You're right @timgl, apologies. Adding to our sprint planning. It's indeed quite relevant.

macobo (Contributor) commented Dec 3, 2021

I just chatted with @paolodamico about having this logic in the plugin server, and I'm against it for scalability reasons.

The basic problem boils down to the plugin server and Postgres being hard to scale, so we should do as little there as possible. See a more in-depth analysis here, but if this part of the ingestion pipeline is slow it will slow down the whole of ingestion, and if it is buggy it will bring ingestion down. We've already had several broken releases due to exactly this sort of cause.

Alternative 1: Materialized views

My alternative suggestion: do this in ClickHouse using materialized views! Create a new table, event_properties, with something like the following schema:

team_id 
event_name
property_name

SORT BY (team_id, event_name, property_name)

Next, create a materialized view from the events table into that table, populating these columns (e.g. using array joins). Lastly, use that table to query property usage.

The advantage here is decoupling this ingestion from postgres/plugin-server while still having it be calculated in the background.
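A rough sketch of Alternative 1 (the engine choice and exact names are assumptions, not a reviewed design):

-- Target table keyed by (team_id, event_name, property_name).
CREATE TABLE event_properties ON CLUSTER posthog (
    team_id Int64,
    event_name String,
    property_name String
) ENGINE = ReplacingMergeTree()
ORDER BY (team_id, event_name, property_name);

-- Materialized view that fans each incoming event out into one row per property key.
CREATE MATERIALIZED VIEW event_properties_mv ON CLUSTER posthog TO event_properties
AS SELECT
    team_id,
    event AS event_name,
    arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) AS property_name
FROM events;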

This isn't super proven though, hence...

Alternative 2: cronjob

Alternatively, doing this in a cronjob isn't a bad idea either. As long as we're not touching Postgres, the schema is optimized for it, and the calculation is thought out (which it wasn't for past cronjobs), it will work well! In that case the logic would be similar to cohort refreshing: we just issue a single ClickHouse query that updates the stats.

cc @yakkomajuri @mariusandra

yakkomajuri (Contributor) commented Dec 3, 2021

For the record, I was looped into this but don't have full context. Like @macobo, I did suggest materialized views or a cronjob.

However, when asked about a potential solution with a cache and infrequent, non-granular updates, I did say it could be worth experimenting with.

Nevertheless, I had a brief pairing session with @mariusandra and think for his PR (the one relating to this issue), something like this could do the job neatly:

CREATE MATERIALIZED VIEW event_property_volume_mv ON CLUSTER posthog
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
POPULATE
AS SELECT team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property, count() as volume
FROM events 
WHERE timestamp > '2021-12-01'
GROUP BY team_id, event, property

We ran this in prod and POPULATE worked nicely to give us some data to play around with, but we didn't manage to get it to update with incoming data. Surely there are mistakes in the query above, but I won't be spending more time on this. Just leaving it here to spark ideas.

mariusandra (Collaborator) commented Dec 3, 2021

The entire field of software engineering can basically be summarised with the question "when exactly do you cache things?".

So let's explore the various caching options.

Approach 1: query time from "events"

We got this far when pairing with Tim:

select team_id, event, property, count(property) as total from (
  select team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property from events where team_id = 2 and timestamp > '2021-11-01'
)
group by team_id, event, property
order by total desc

For our team for one month, this runs in ~20sec. Reducing the timerange and somehow optimising the query could help, but we will run into a limit somewhere, and this just isn't sustainable. Sampling just e.g. 1000 events might increase speed, but at the cost of data integrity, and I'm not sure we want that tradeoff.

We could cache results after querying, but that'll only get us so far.

Keep in mind that this total is just one of the fields we want. Ideally we're also interested in created_at and last_seen_at. This is also just for our own team.

Approach 2: materialized views from "events"

Pairing with Yakko, we realised a materialized view backed by SummingMergeTree could help count the properties per event.

We still want last_seen_at, and not just total_volume, and this won't help with that, but getting one of the values would already be a start.

CREATE MATERIALIZED VIEW event_property_volume_mv ON CLUSTER posthog
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
POPULATE
AS SELECT team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property, count() as volume
FROM events 
WHERE timestamp > '2021-12-01'
GROUP BY team_id, event, property

The problems:

  • We could not get live updates working. Probably doing something wrong and this is fixable.
  • With POPULATE above, we had data in this table, but there was nothing new coming in. Without POPULATE, the count remained at zero.
  • The total was wildly off (~251 vs ~641) when we were using FINAL vs SUM(). I don't know what to trust here. From what I know, running FINAL on production queries, on tables that update often, is not recommended.
  • This specific MV approach removes the option of backfilling data (other than e.g. the last day), as the initial query to create the table will be too slow otherwise, and we can't add rows later. I guess this can be improved by using another table somewhere in the pipeline?
  • To get the data back, in order to show the properties in the taxonomic filter, we need to query both CH and PG (CH gives the total_count, PG gives names, descriptions and other unused properties - think of global filters). Then we have to combine the two results into a unified sorted list that makes sense. I'm not sure that's doable without fetching full sets from both sources, using django to combine them to one list, and responding with a paginated slice. This sounds wasteful and slow.

I'd happily accept pointers on how to improve this approach.

Approach 3. stream processing in the plugin server

This is PR #7500, and IMO this is precisely the type of thing the plugin server should be doing. After all, the plugin server is a place that deals with ingestion, plugins, scheduled tasks, load distribution, live action matching, person updates, making coffee, watering the plants, etc, etc. That's by design and not an accident.

The general approach of the plugin server is: We have the data, as it's coming in. Let's extract the maximum value out of that data, right here, right now. We could even call this "Kafka Stream Processing" if we wanted to get fancy.

To illustrate and prove this approach, we moved action matching into the plugin server, so it could operate on the data as it's flying in, instead of later making a backward-looking query to ask for the same data we already had in the system. This increased the load on Postgres somewhat (fetch the user if needed, etc.), but we're handling it fine from what I can see.

Hence, for me, the plugin server fits like a glove for the task of tallying event properties (and gathering the created_at and last_seen_at fields), and I fail to see how making a few extra Postgres queries every 10k events or so is going to bring us down. I really don't see it, and would prefer to discuss with benchmarks and numbers instead of going into endless "what if" fear-mongering :). You have much more experience with wrangling huge Postgres installations, and normally I'd defer to that experience, but I'm not convinced without proof that this will become our new pain point.

Personally I'd just love to try this in production, behind an env variable that we can quickly disable, and iterate from there.

I consider ClickHouse's materialized views a sort-of equivalent "at ingestion" stream processing approach, and I'm not opposed to them (other than complicated queries), but I don't consider them superior.

Hence, I'm still not convinced, and think avoiding this approach is just premature optimisation. If this really becomes the part that slows down ingestion, we can a) add Kafka and insert into PG in the background, or b) replace PG with CH and send everything through Kafka anyway (but then we'd probably still need a sum table, a max table and a min table for the different values?).

Extra consideration: dates and taxonomy

We have issue #6619: there's no intelligence about the types of properties, and everything is practically a string (or number or bool). This is causing issues with timestamps in properties: you can only filter them like strings, so they're quite useless.

Having a separate event<->property mapping in Postgres allows us to annotate those properties with whatever information we want. For example, type = DateTime, format = ISO 8601 with timestamp. We'd fetch this at query time and use the metadata for better date conversion functions in CH.

This metadata could be used for other purposes too, e.g. converting a user_id property to a link in your backend (type=String, format="http://backend/admin/users/{id}")... but this is getting way off topic.

Approach 4: reduce granularity.

Same as approach 3, but we store "last seen at" at date granularity (dropping the time), and instead of "total_count" we keep "seen_on_days", which we only increment if the stored "last_seen_at" date is not today.
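As a hypothetical sketch of approach 4 in Postgres (table and column names are illustrative, and a unique constraint on (team_id, event, property) is assumed), the per-batch upsert could look like this:

-- seen_on_days only increments once per calendar day; last_seen_at is kept at date granularity.
INSERT INTO posthog_eventproperty (team_id, event, property, last_seen_at, seen_on_days)
VALUES (2, 'insight viewed', 'insight', CURRENT_DATE, 1)
ON CONFLICT (team_id, event, property) DO UPDATE
SET seen_on_days = posthog_eventproperty.seen_on_days
        + CASE WHEN posthog_eventproperty.last_seen_at < CURRENT_DATE THEN 1 ELSE 0 END,
    last_seen_at = CURRENT_DATE;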

Summary

I'd still like to run this behind an env variable in production to see if the load is insane, or if we can handle it without skipping a beat.

posthog-contributions-bot (Contributor) commented:

This issue has 2106 words at 17 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

macobo (Contributor) commented Dec 3, 2021

I think the above comment basically boils down to

I'm not convinced without proof that this will become our new pain point.

My proof:

  1. Past experience
  2. We've had several broken releases this year due to overloading postgres in various ways - both in cronjobs and plugin server.
  3. I spent a few weeks earlier in the year debugging the plugin server - it is hard-bottlenecked on Postgres, where it spends the majority of its time. It's not easy to scale this out.

As I said in my linked comment from before:

As we scale, scaling the plugin server and especially Postgres is hard; we will be spending man-years building towards a more scalable solution. That's really expensive, and delaying this work is crucial for the sake of the company. This means avoiding unneeded work and being clever about the work that is done.

The later we can do the hard scaling work, the better for us as a company. As-is, the plugin server will fall down as we scale 10x or 100x, and faster if we keep adding Postgres-reliant logic onto our critical ingestion path. Those problems are much, much harder to solve than the one this ticket is solving.

All that said - if you're committed, you do you. I'm not owning plugin-server nor this feature currently and am just calling out a bad idea when I see it.

If you want help working out an alternative solution and getting it functional, also happy to help!

macobo (Contributor) commented Dec 3, 2021

Discussed this async with Marius, main points:

  1. This is fine to go live with the current plugin-server approach. The team owning this ticket is the main decider, not me.
  2. The intent was to point out a scalability issue we're facing and to propose alternative solutions of similar complexity, in the hope of avoiding long-term issues. The goal was not to stop work on this.
  3. I was and still am available to help prototype these solutions as needed. If issues arise, I'm still convinced I can make both the cronjob and mat column approaches work.

I will be continuing to call out issues if I see risk for the company, but given how multiple people are now aware of this particular solution we're already in a better spot than before. :)
