
Event filters: suggest properties that actually have data #4166

Closed
samwinslow opened this issue Apr 29, 2021 · 20 comments · Fixed by #7500
Labels: enhancement (New feature or request), P1 (Urgent, non-breaking: no crash but low usability)

Comments

samwinslow (Contributor) commented Apr 29, 2021

Is your feature request related to a problem?

When filtering events, it's overwhelming to see every property that any event could potentially have. If you forget the exact name of a property, you either scroll through the list of all properties or rely on fuzzy search. The size of this list is n * p, where n is the number of event types and p is the average number of unique properties per event type, so as an organization adds more event types the problem gets worse.

You can already see this phenomenon on app.posthog.com:

[Screen recording: Screen.Cast.2021-04-29.at.12.30.17.PM.mp4]

Describe the solution you'd like

We should get smarter about ranking / filtering the properties in the filter dropdown. The majority of the properties in the list might not apply to the chosen event type. Fuse fuzzy search helps a little, but we could A) give it a ranking parameter based on how many matching events actually have each property, or B) include / exclude custom properties outright based on a new schema described below.

For the example below - insight viewed - we could predict that the user might want to filter on the Insight property, and exclude or de-rank properties that are undefined for insight viewed events.

[Screenshot: Screen Shot 2021-04-29 at 12 06 52 PM]

This warrants some discussion re: the events schema. As you can see here, there are no $pageview events that have app_rating, no installed_app events that have is_first_movie, etc. Queries that look for those event-property combinations will of course return 0 rows.

[Screenshot: Screen Shot 2021-04-29 at 12 41 02 PM]

Do we already have a table which represents all the custom property keys PostHog has "seen" on an event? This seems like the simplest solution. The schema would look like this, and incoming events would append properties as needed.

| event         | properties         |
|---------------|--------------------|
| rated_app     | ['app_rating']     |
| installed_app | []                 |
| watched_movie | ['is_first_movie'] |

For efficiency, updates to this table could be batched rather than being processed for every incoming event.


samwinslow added the enhancement (New feature or request) and analytics-filters labels on Apr 29, 2021
samwinslow (Contributor, Author) commented:

@timgl tagging you since this is Fuse-related.

Would love your feedback as well as @paolodamico's - especially as to whether you guys agree this is a problem worth fixing, and how to fit this into sprint priorities if so.

timgl (Collaborator) commented Apr 29, 2021

I think this would be useful (it's definitely a source of confusion for users, I've noticed). The only issue is that it's an expensive query to populate the table with this data initially, as it'll need to loop through every event ever received. I think there's a solution where we only enable this for new teams and optionally create a management command that populates the table for existing teams, so that we can at least run this on Cloud.

mariusandra (Collaborator) commented:

We just merged in this: https://github.com/PostHog/plugin-server/pull/331/files

An easy way forward is to create a third table, posthog_eventpropertydefinitions, and populate it with pretty much the same system, just with both event name and property name fields.

I'd advise against using postgres string arrays for this. They're very annoying to update, as you always need to select the array, patch it and then update the row again. With concurrent servers accessing the same data and with large enough arrays, it'll be a mess sooner or later. Best just to split it up.
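A minimal sketch of what that third table could look like, assuming one row per (team, event, property) pair rather than an array column; all names here are illustrative, not an actual PostHog migration:

-- Hypothetical Postgres table: one row per (team, event, property) seen during ingestion.
CREATE TABLE posthog_eventpropertydefinitions (
    id UUID PRIMARY KEY,
    team_id INTEGER NOT NULL,
    event VARCHAR(400) NOT NULL,
    property VARCHAR(400) NOT NULL,
    CONSTRAINT eventpropertydefinitions_uniq UNIQUE (team_id, event, property)
);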

paolodamico (Contributor) commented May 13, 2021

Got some more feedback on this from a user yesterday. When adding a property filter, she expected to be able to search across all matching values. In her case, she was trying to filter for a specific user, so she was searching for a particular email address; we weren't suggesting that address because it wasn't in the first results. Ideally we would allow searching across all property values, but depending on technical complexity, if that's not possible we would at least want to update the UI/UX of the filters to make it clear that we're only suggesting values, not showing hard results for a search.

paolodamico (Contributor) commented:

Other than the above, there are also some other cases that cause confusion. For instance, for "Active Feature Flags" we use contains and then you're supposed to type the feature flag you care about, but this is neither clear to users nor a good experience.

timgl (Collaborator) commented May 25, 2021

I'm going to make a little bit of progress on this now

paolodamico (Contributor) commented May 25, 2021

Related #3161 & #4398

timgl (Collaborator) commented Jun 14, 2021

I've tried to make progress on this for a while, but apart from the one PR I did (edit: and the one I'm about to open), I don't think I'll have enough time to improve this.

The way I would solve this is:

  • Add an event field to PropertyDefinition and add event to the unique constraint. Make sure that you dedupe this in the API (roughly sketched below).
  • Update the plugin server to add the event if it doesn't exist.
  • When filtering, show any properties that are definitely on the event first.
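As a rough illustration of the first bullet, the underlying Postgres change might look something like the following; the table and constraint names are assumptions for the sketch, not the actual migration:

-- Hypothetical migration sketch: add an event column to the property definition table
-- and include it in the uniqueness constraint (the existing (team_id, name) unique
-- constraint would also need to be dropped or replaced).
ALTER TABLE posthog_propertydefinition
    ADD COLUMN event VARCHAR(400) NULL;

ALTER TABLE posthog_propertydefinition
    ADD CONSTRAINT propertydefinition_team_event_name_uniq
    UNIQUE (team_id, event, name);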

paolodamico added the P1 (Urgent, non-breaking: no crash but low usability) label on Sep 16, 2021
pauldambra (Member) commented:

ClickHouse 21.11 introduces JSONExtractKeys, which could be used to map existing event and property links (if we were on the latest version).

neilkakkar (Contributor) commented:

re: JSONExtractKeys, we already have JSONExtractKeysAndValuesRaw, which we can use to get the keys, like so: https://github.com/PostHog/posthog/blob/master/ee/clickhouse/queries/funnels/funnel_correlation.py#L471

(Just mentioning so we know that if we do want to go this route, we're not blocked on upgrading CH.)
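For illustration, getting the keys via JSONExtractKeysAndValuesRaw boils down to something like this (the team_id and event values are just examples from this thread):

-- x.1 takes the key from each (key, value) tuple returned by JSONExtractKeysAndValuesRaw.
SELECT arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties)) AS property_keys
FROM events
WHERE team_id = 2 AND event = 'insight viewed'
LIMIT 10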

timgl (Collaborator) commented Nov 23, 2021

I've got a super simple solution to this that won't involve parsing all events.

When we hit property_definitions we can return both a list of all properties and a sample of the last 1k events received for the specific event, extracting the relevant keys from that sample. We then display those sampled properties first, as they are likely the most relevant, but also show all other properties in case we missed any.

We could do the same with the values to make those more relevant.
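A hedged sketch of what that sampling query could look like in ClickHouse (the team_id, event and sample size are placeholders):

-- Extract distinct property keys from a sample of the most recent 1k events of one type.
SELECT DISTINCT arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) AS property
FROM (
    SELECT properties
    FROM events
    WHERE team_id = 2 AND event = 'insight viewed'
    ORDER BY timestamp DESC
    LIMIT 1000
)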

paolodamico (Contributor) commented:

Duplicate scope of #4791

timgl (Collaborator) commented Dec 2, 2021

@paolodamico I don't think it is? #4791 is about events, this is about properties

timgl reopened this Dec 2, 2021
paolodamico (Contributor) commented:

You're right @timgl, apologies. Adding to our sprint planning. It's indeed quite relevant.

macobo (Contributor) commented Dec 3, 2021

I just chatted with @paolodamico about having this logic in the plugin server, and I'm against it for scalability reasons.

The basic problem boils down to the plugin server and Postgres being hard to scale, so we should do as little there as possible. See a more in-depth analysis here, but if this part of the ingestion pipeline is slow it will slow down the whole of ingestion, and if it is buggy it will bring ingestion down. We've already had several broken releases due to exactly this sort of cause.

Alternative 1: Materialized views

My alternative suggestion: do this in ClickHouse using materialized views! Create a new table, event_properties, with something like the following schema:

team_id 
event_name
property_name

SORT BY (team_id, event_name, property_name)

Next, create a materialized view from the events table into that table, populating these columns (e.g. using array joins). Lastly, use that table to query property usage.

The advantage here is decoupling this ingestion from postgres/plugin-server while still having it be calculated in the background.
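A rough sketch of Alternative 1 (the engine choice and exact names are assumptions, not a reviewed design):

-- Target table keyed by (team_id, event_name, property_name).
CREATE TABLE event_properties ON CLUSTER posthog (
    team_id Int64,
    event_name String,
    property_name String
) ENGINE = ReplacingMergeTree()
ORDER BY (team_id, event_name, property_name);

-- Materialized view that fans each incoming event out into one row per property key.
CREATE MATERIALIZED VIEW event_properties_mv ON CLUSTER posthog TO event_properties
AS SELECT
    team_id,
    event AS event_name,
    arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) AS property_name
FROM events;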

This isn't super proven though, hence...

Alternative 2: cronjob

Alternatively, doing this in a cronjob isn't a bad idea either. As long as we're not touching Postgres, the schema is optimized for it, and the calculation is thought out (which it wasn't for past cronjobs), it will work well! In that case the logic would be similar to cohort refreshing: we just issue a single ClickHouse query that updates the stats.

cc @yakkomajuri @mariusandra

yakkomajuri (Contributor) commented Dec 3, 2021

For the record, I was looped into this but don't have full context. Like @macobo, I did suggest materialized views or a cronjob.

However, when asked about a potential solution with a cache and infrequent, non-granular updates, I did say it could be worth experimenting with.

Nevertheless, I had a brief pairing session with @mariusandra and think for his PR (the one relating to this issue), something like this could do the job neatly:

CREATE MATERIALIZED VIEW event_property_volume_mv ON CLUSTER posthog
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
POPULATE
AS SELECT team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property, count() as volume
FROM events 
WHERE timestamp > '2021-12-01'
GROUP BY team_id, event, property

We ran this in prod and POPULATE worked nicely to give us some data to play around with, but we didn't manage to get it to update with incoming data. Surely there are mistakes in the query above, but I won't be spending more time on this. Just leaving it here to spark ideas.

mariusandra (Collaborator) commented Dec 3, 2021

The entire field of software engineering can basically be summarised with the question "when exactly do you cache things?".

So let's explore the various caching options.

Approach 1: query time from "events"

We got this far when pairing with Tim:

select team_id, event, property, count(property) as total from (
  select team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property from events where team_id = 2 and timestamp > '2021-11-01'
)
group by team_id, event, property
order by total desc

For our team for one month, this runs in ~20sec. Reducing the timerange and somehow optimising the query could help, but we will run into a limit somewhere, and this just isn't sustainable. Sampling just e.g. 1000 events might increase speed, but at the cost of data integrity, and I'm not sure we want that tradeoff.

We could cache results after querying, but that'll only get us so far.

Keep in mind that this total is just one of the fields we want. Ideally we're also interested in created_at and last_seen_at. This is also just for our own team.

Approach 2: materialized views from "events"

Pairing with Yakko, we realised a materialized view backed by SummingMergeTree could help count the properties per event.

We still want last_seen_at, and not just total_volume, and this won't help with that, but getting one of the values would already be a start.

CREATE MATERIALIZED VIEW event_property_volume_mv ON CLUSTER posthog
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
POPULATE
AS SELECT team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property, count() as volume
FROM events 
WHERE timestamp > '2021-12-01'
GROUP BY team_id, event, property

The problems:

  • We could not get live updates working. Probably doing something wrong and this is fixable.
  • With POPULATE above, we had data in this table, but there was nothing new coming in. Without POPULATE, the count remained at zero.
  • The total was wildly off (~251 vs ~641) when we were using FINAL vs SUM(). I don't know what to trust here. From what I know, running FINAL on production queries, on tables that update often, is not recommended.
  • This specific MV approach removes the option of backfilling data (other than e.g. the last day), as the initial query to create the table will be too slow otherwise, and we can't add rows later. I guess this can be improved by using another table somewhere in the pipeline?
  • To get the data back, in order to show the properties in the taxonomic filter, we need to query both CH and PG (CH gives the total_count, PG gives names, descriptions and other unused properties - think of global filters). Then we have to combine the two results into a unified sorted list that makes sense. I'm not sure that's doable without fetching full sets from both sources, using django to combine them to one list, and responding with a paginated slice. This sounds wasteful and slow.

I'd happily accept pointers on how to improve this approach.

Approach 3. stream processing in the plugin server

This is PR #7500, and IMO this is precisely the type of thing the plugin server should be doing. After all, the plugin server is a place that deals with ingestion, plugins, scheduled tasks, load distribution, live action matching, person updates, making coffee, watering the plants, etc, etc. That's by design and not an accident.

The general approach of the plugin server is: We have the data, as it's coming in. Let's extract the maximum value out of that data, right here, right now. We could even call this "Kafka Stream Processing" if we wanted to get fancy.

To illustrate and prove this approach, we moved action matching into the plugin server, so it could operate on the data as it's flying in, instead of later making a backward-looking query to ask for the same data we already had in the system. This increased the load on Postgres somewhat (fetch the user if needed, etc.), but we're handling it fine from what I can see.

Hence, for me, the plugin server fits like a glove for the task of tallying event properties (and gathering the created_at and last_seen_at fields), and I fail to see how making a few extra Postgres queries every 10k events or so is going to bring us down. I really don't see it, and would prefer to discuss with benchmarks and numbers instead of going into endless "what if" fear-mongering :). You have much more experience with wrangling huge Postgres installations, and normally I'd defer to that experience, but I'm not convinced without proof that this will become our new pain point.

Personally I'd just love to try this in production, behind an env variable that we can quickly disable, and iterate from there.

I consider ClickHouse's materialized views a sort-of equivalent "at ingestion" stream processing approach, and I'm not opposed to them (other than complicated queries), but I don't consider them superior.

Hence, I'm still not convinced, and think avoiding this approach is just premature optimisation. If this really becomes the part that slows down ingestion, we can a) add Kafka and insert into PG in the background, or b) replace PG with CH and send everything through Kafka anyway (but then we'd probably still need a sum table, a max table and a min table for the different values?).

Extra consideration: dates and taxonomy

We have issue #6619: there's no intelligence about the types of properties, and everything is practically a string (or number or bool). This is causing issues with timestamps in properties: you can only filter them like strings, so they're quite useless.

Having a separate event<->property mapping in Postgres allows us to annotate those properties with whatever information we want. For example, type = DateTime, format = ISO 8601 with timestamp. We'd fetch this at query time and use the metadata for better date conversion functions in CH.

This metadata could be used for other purposes too, e.g. converting a user_id property to a link in your backend (type=String, format="http://backend/admin/users/{id}")... but this is getting way off topic.

Approach 4: reduce granularity.

Same as approach 3, but we store "last seen at" at date granularity (dropping the time), and instead of "total_count" we keep "seen_on_days", which we only increment if the stored "last_seen_at" date is not today.
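As a hypothetical sketch of approach 4 in Postgres (table and column names are illustrative, and a unique constraint on (team_id, event, property) is assumed), the per-batch upsert could look like this:

-- seen_on_days only increments once per calendar day; last_seen_at is kept at date granularity.
INSERT INTO posthog_eventproperty (team_id, event, property, last_seen_at, seen_on_days)
VALUES (2, 'insight viewed', 'insight', CURRENT_DATE, 1)
ON CONFLICT (team_id, event, property) DO UPDATE
SET seen_on_days = posthog_eventproperty.seen_on_days
        + CASE WHEN posthog_eventproperty.last_seen_at < CURRENT_DATE THEN 1 ELSE 0 END,
    last_seen_at = CURRENT_DATE;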

Summary

I'd still like to run this behind an env variable in production to see if the load is insane, or if we can handle it without skipping a beat.

posthog-contributions-bot (Contributor) commented:

This issue has 2106 words at 17 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

macobo (Contributor) commented Dec 3, 2021

I think the above comment basically boils down to

I'm not convinced without proof that this will become our new pain point.

My proof:

  1. Past experience
  2. We've had several broken releases this year due to overloading postgres in various ways - both in cronjobs and plugin server.
  3. I spent a few weeks earlier in the year debugging the plugin server - it is hard-bottlenecked on Postgres, where it spends the majority of its time. It's not easy to scale this out.

As I said in my linked comment from before:

As we scale, scaling the plugin server and especially Postgres is hard; we will be spending man-years building towards a more scalable solution. That's really expensive, and delaying this work is crucial for the sake of the company. This means avoiding unneeded work and being clever about the work that is done.

The later we can do the hard scaling work, the better for us as a company. As-is, the plugin server will fall down as we scale 10x or 100x, and faster if we keep adding Postgres-reliant logic onto our critical ingestion path. Those problems are much, much harder to solve than the one this ticket is solving.

All that said - if you're committed, you do you. I'm not owning plugin-server nor this feature currently and am just calling out a bad idea when I see it.

If you want help working out an alternative solution and getting it functional, also happy to help!

macobo (Contributor) commented Dec 3, 2021

Discussed this async with Marius, main points:

  1. This is fine to go live with the current plugin-server approach. The team owning this ticket is the main decider, not me.
  2. The intent was to point out a scalability issue we're facing and to propose alternative solutions of similar complexity, in the hope of avoiding long-term issues. The goal was not to stop work on this.
  3. I was and still am available to help prototype these solutions as needed. If issues arise, I'm still convinced I can make both the cronjob and mat column approaches work.

I will be continuing to call out issues if I see risk for the company, but given how multiple people are now aware of this particular solution we're already in a better spot than before. :)
