Event filters: suggest properties that actually have data #4166
Comments
@timgl tagging you since this is Fuse-related. Would love your feedback as well as @paolodamico's - especially as to whether you guys agree this is a problem worth fixing, and how to fit this into sprint priorities if so.
I think this would be useful (it's definitely a source of confusion for users, I've noticed). The only issue is that it's an expensive query to populate the table with this data initially, as it'll need to loop through every event ever received. I think there's a solution to this where we only enable this for new teams and optionally create a management command that populates this table for existing teams, so that we can at least run this on cloud.
We just merged in this: https://github.com/PostHog/plugin-server/pull/331/files An easy way forward is to create a third table. I'd advise against using Postgres string arrays for this; they're very annoying to update, as you always need to
Got some more feedback on this from a user yesterday. When adding a property filter, she expected to be able to search across all matches. In her case, she was trying to filter for a specific user, so she was searching for a particular email address. We weren't suggesting that address because it wasn't in the first results. Ideally we would allow searching across all property values; if that's not possible for technical reasons, we would at least want to update the UI/UX of the filters to make it clear that we're only suggesting values, not showing hard results for a search.
Other than the above, there are also some other cases that cause confusion. For instance, for "Active Feature Flags" we use a contains filter and then you're supposed to type the feature flag you care about, but this is neither clear to users nor a good experience.
I'm going to make a little bit of progress on this now
I've tried to make progress on this for a while, but apart from the one PR I did (edit: and the one I'm about to do) I don't think I'll have enough time to improve this. The way I would solve this is:
ClickHouse 21.11 introduces JSONExtractKeys, which could be used to map existing event and property links (if we were on latest)
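For illustration, a rough sketch of how JSONExtractKeys could be used to derive that event-to-property mapping. This assumes the usual events table with a JSON properties column; the query itself is only illustrative, not one from the codebase:

```sql
-- Hedged sketch: derive an event -> property-key mapping with JSONExtractKeys
-- (available from ClickHouse 21.11). The time window is an arbitrary example.
SELECT
    team_id,
    event,
    arrayJoin(JSONExtractKeys(properties)) AS property,
    count() AS volume
FROM events
WHERE timestamp > now() - INTERVAL 30 DAY
GROUP BY team_id, event, property
ORDER BY volume DESC
```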
re: (Just mentioning so we know if we do want to go this route, we're not blocked by upgrading CH)
I've got a super simple solution to this that won't involve parsing all events. When we hit We could do the same with the values to make those more relevant.
Duplicate scope of #4791
@paolodamico I don't think it is? #4791 is about events, this is about properties
You're right @timgl, apologies. Adding to our sprint planning. It's indeed quite relevant.
I just chatted with @paolodamico regarding having this logic in the plugin server, and I'm against it for scalability reasons. The basic problem boils down to plugin-server and postgres being hard to scale, so we should do as little there as possible. See a more in-depth analysis here, but this part of the ingestion pipeline being slow will slow down ingestion as a whole, and it being buggy will bring ingestion down. We've already had several broken releases due to exactly the same sort of causes.

Alternative 1: Materialized views

My alternative suggestion: do this in ClickHouse using materialized views! So create a new table, event_properties, with something like the following schema:
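A minimal sketch of what such an event_properties table could look like (the columns are an assumption, modelled on the materialized view discussed later in this thread, not a schema from the codebase):

```sql
-- Hypothetical sketch of the proposed table; column names and types are assumptions
CREATE TABLE event_properties ON CLUSTER posthog
(
    team_id  Int64,
    event    String,
    property String,
    volume   UInt64
)
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
```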
Next, create another materialized view from the events table to that table, populating these columns (e.g. using array joins). Lastly, use that table to query the property usage. The advantage here is decoupling this ingestion from postgres/plugin-server while still having it be calculated in the background. This isn't super proven though, hence...

Alternative 2: cronjob

Alternatively, doing this in a cronjob isn't a bad idea either - as long as we're not touching postgres, the schema is optimized for it, and the calculation is thought out (which it isn't for past cronjobs), it will work well! In that case the logic would be similar to cohort refreshing, in that we just issue a single ClickHouse query that will update stats.
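For the cronjob variant, the periodic refresh could be a single ClickHouse insert-select along these lines (a hedged sketch; the target table and the one-day window are assumptions):

```sql
-- Hypothetical periodic refresh: tally property keys seen since the last run
INSERT INTO event_properties (team_id, event, property, volume)
SELECT
    team_id,
    event,
    arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) AS property,
    count() AS volume
FROM events
WHERE timestamp > now() - INTERVAL 1 DAY
GROUP BY team_id, event, property
```

Because the assumed target is a SummingMergeTree, repeated inserts for the same (team_id, event, property) would simply be summed on merge.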
For the record, I was looped into this but don't have full context. Like @macobo, I did suggest materialized views or a cronjob. However, when asked about a potential solution with a cache and infrequent, non-granular updates, I did say it could be worth experimenting with. Nevertheless, I had a brief pairing session with @mariusandra and think for his PR (the one relating to this issue), something like this could do the job neatly:

```sql
CREATE MATERIALIZED VIEW event_property_volume_mv ON CLUSTER posthog
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
POPULATE
AS SELECT team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property, count() as volume
FROM events
WHERE timestamp > '2021-12-01'
GROUP BY team_id, event, property
```

We ran this in prod, and POPULATE worked nicely to give us some data to play around with, but we didn't manage to get this to update with the incoming data. Surely there are mistakes in the query above, but I won't be spending more time on this. Just leaving it here to spark ideas.
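For context, reading suggestions back out of that view could look something like the query below (a sketch only; the team_id and event values are placeholders, and rows still need to be summed at read time because a SummingMergeTree only collapses them on merges):

```sql
-- Hedged sketch: fetch the most common property keys for one event type
SELECT property, sum(volume) AS volume
FROM event_property_volume_mv
WHERE team_id = 2
  AND event = 'insight viewed'
GROUP BY property
ORDER BY volume DESC
LIMIT 20
```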
The entire field of software engineering can basically be summarised with the question "when exactly do you cache things?". So let's explore the various caching options.

Approach 1: query time from "events"

We got this far when pairing with Tim:

```sql
select team_id, event, property, count(property) as total from (
select team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property from events where team_id = 2 and timestamp > '2021-11-01'
)
group by team_id, event, property
order by total desc
```

For our team for one month, this runs in ~20sec. Reducing the timerange and somehow optimising the query could help, but we will run into a limit somewhere, and this just isn't sustainable. Sampling just e.g. 1000 events might increase speed, but at the cost of data integrity, and I'm not sure we want that tradeoff. We could cache results after querying, but that'll only get us so far. Keep in mind, that this

Approach 2: materialized views from "events"

Pairing with Yakko, we realised a materialized view backed by

We still want

```sql
CREATE MATERIALIZED VIEW event_property_volume_mv ON CLUSTER posthog
ENGINE = SummingMergeTree()
ORDER BY (team_id, event, property)
POPULATE
AS SELECT team_id, event, arrayJoin(arrayMap(x -> x.1, JSONExtractKeysAndValuesRaw(properties))) as property, count() as volume
FROM events
WHERE timestamp > '2021-12-01'
GROUP BY team_id, event, property
```

The problems:
I'd happily accept pointers on how to improve this approach.

Approach 3: stream processing in the plugin server

This is PR #7500, and IMO this is precisely the type of thing the plugin server should be doing. After all, the plugin server is a place that deals with ingestion, plugins, scheduled tasks, load distribution, live action matching, person updates, making coffee, watering the plants, etc, etc. That's by design and not an accident. The general approach of the plugin server is: we have the data as it's coming in, so let's extract the maximum value out of that data, right here, right now. We could even call this "Kafka Stream Processing" if we wanted to get fancy.

To illustrate and prove this approach, we moved action matching into the plugin server, so it could operate on the data as it's flying in, instead of later making a backward-looking query to ask for the same data we already had in the system. This increased load on postgres somewhat (fetch the user if needed, etc), but we're handling it fine from what I can see. Hence, for me, the plugin server fits like a glove for the task of tallying event properties (and gathering the

Personally I'd just love to try this on production, behind an

I consider ClickHouse's materialized views a sort-of equivalent "at ingestion" stream processing approach, and I'm not opposed to them (other than the complicated queries), but I don't consider them superior. Hence, I'm still not convinced, and think avoiding this approach is just premature optimisation. If this really becomes the part that slows down ingestion, we can a) add Kafka and insert to PG in the background, or b) replace PG with CH and send everything through Kafka anyway (but, still, we need a

Extra consideration: dates and taxonomy

We have Issue #6619 --> there's no intelligence on the types of properties, and everything is practically a string (or number or bool). This is causing issues with timestamps in properties: you can only filter them like strings, aka they're quite useless. Having a separate event<->property mapping in Postgres allows us to annotate those properties with whatever information we want. For example,

This metadata could be used for other purposes, e.g. convert an

Approach 4: reduce granularity

Same as nr 3, but we save "last seen at" at the date level (skipping the time), and instead of "total_count" we have "seen_on_days", which we only

Summary

I'd still like to run this behind an
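To make Approach 3 slightly more concrete: per incoming event, the plugin server's write to Postgres could boil down to an upsert like the one below. This is only a sketch; the table and column names are assumptions, not necessarily what PR #7500 uses.

```sql
-- Hypothetical per-event upsert in Postgres; values shown are placeholders.
-- Assumes a unique constraint on (team_id, event, property).
INSERT INTO posthog_eventproperty (team_id, event, property, total_count, last_seen_at)
VALUES (2, 'insight viewed', 'Insight', 1, now())
ON CONFLICT (team_id, event, property)
DO UPDATE SET
    total_count  = posthog_eventproperty.total_count + 1,
    last_seen_at = now();
```

Approach 4 would be the same idea, except last_seen_at is truncated to a date and total_count becomes a seen_on_days counter, presumably incremented at most once per day.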
This issue has 2106 words at 17 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:
Is this issue intended to be sprawling? Consider adding label
I think the above comment basically boils down to
My proof:
As I said in my linked comment from before:
The later we can do hard scaling work, the better for us as a company. As-is, the plugin server will fall down as we scale 10x or 100x, and it will fall down faster if we keep adding postgres-reliant logic onto our critical ingestion path. Those problems are much, much harder to solve than the ones this ticket is solving. All that said - if you're committed, you do you. I'm not owning the plugin-server nor this feature currently and am just calling out a bad idea when I see it. If you want help working out an alternative solution and getting it functional, I'm also happy to help!
Discussed this async with Marius, main points:
I will be continuing to call out issues if I see risk for the company, but given how multiple people are now aware of this particular solution, we're already in a better spot than before. :)
Is your feature request related to a problem?
When filtering events, it's overwhelming to see properties that every event potentially has. If you forget the exact name of a property, you either scroll through the list of all properties or rely on fuzzy-search. The size of this list is `n * p`, where `n` is the number of event types and `p` is the average number of unique property types on each event type. So as an organization adds more event types, the problem gets worse.

You can already see this phenomenon on app.posthog.com:
Screen.Cast.2021-04-29.at.12.30.17.PM.mp4
Describe the solution you'd like
We should get smarter about ranking / filtering the properties in the filter dropdown. The majority of the properties in the list might not apply to the chosen event type. Fuse fuzzy-search helps a little bit, but we could A) give it a ranking parameter based on how many matching events actually have each property; or B) binary include / exclude custom properties based on a new schema described below.
For the example below - `insight viewed` - we could predict that the user might want to filter on the `Insight` property, and exclude or de-rank properties that are undefined for `insight viewed` events.

This warrants some discussion re: the events schema. As you can see here, there are no `$pageview` events that have `app_rating`, no `installed_app` events that have `is_first_movie`, etc. Queries that look for those event-property combinations will of course return 0 rows.

Do we already have a table which represents all the custom property keys PostHog has "seen" on an event? This seems like the simplest solution. The schema would look like this, and incoming events would append properties as needed.
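A minimal sketch of such a mapping table (the table and column names are illustrative assumptions, not an existing PostHog table):

```sql
-- Hypothetical Postgres table recording which property keys have been seen per event type
CREATE TABLE event_property_keys (
    team_id  integer NOT NULL,
    event    text    NOT NULL,  -- e.g. 'insight viewed'
    property text    NOT NULL,  -- e.g. 'Insight'
    PRIMARY KEY (team_id, event, property)
);
```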
For efficiency, updates to this table could be batched rather than being processed for every incoming event.
Describe alternatives you've considered
Additional context
Thank you for your feature request – we love each and every one!