
feat: add a session recording metadata table #10294

Closed
wants to merge 26 commits

Conversation

pauldambra
Member

Problem

see #2142 or https://github.com/PostHog/product-internal/pull/316 for context

Changes

Adds a session_recording_metadata table. It has a schema and setup similar to the session_recording_events table. It stores the start and end time of a session and adds a column for data about snapshot locations.

The table is populated via Kafka and uses the ReplacingMergeTree engine. If a session is processed more than once, this allows a client to load the metadata, alter it, and write a "new" metadata row to the table.
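For context, a minimal illustration of the ReplacingMergeTree behaviour this relies on (table and column names here are illustrative, not the schema added in this PR):

    # rows sharing the ORDER BY key are collapsed on merge, keeping the one with the
    # highest version column, so re-processing a session can simply write a new row
    EXAMPLE_REPLACING_TABLE_SQL = """
    CREATE TABLE example_session_metadata
    (
        session_id VARCHAR,
        snapshot_data_location VARCHAR,
        _timestamp DateTime
    )
    ENGINE = ReplacingMergeTree(_timestamp)
    ORDER BY (session_id)
    """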


How did you test this code?

running it to see that the tables are created

@pauldambra requested a review from hazzadous June 14, 2022 12:00
SESSION_RECORDING_METADATA_TABLE_SQL = lambda: (
SESSION_RECORDING_METADATA_TABLE_BASE_SQL
+ """PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (toHour(timestamp), session_id, timestamp, uuid)
Member Author

Should order by have team id since we will query by team id and session id?

Contributor

Yes. See ee/clickhouse/sql/session_recording_events.py - this should be the first key.
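For illustration, a sketch of the lambda with team_id leading the sort key (mirroring the diff above; not the final definition):

    SESSION_RECORDING_METADATA_TABLE_SQL = lambda: (
        SESSION_RECORDING_METADATA_TABLE_BASE_SQL
        + """PARTITION BY toYYYYMMDD(timestamp)
    ORDER BY (team_id, toHour(timestamp), session_id, timestamp, uuid)
    """
    )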

@pauldambra requested a review from macobo June 14, 2022 12:01
@pauldambra
Member Author

Not certain how best to test this...

@hazzadous what's the best way to create new kafka topics... Terraform to create in staging and then manually in prod?

@hazzadous
Contributor

@pauldambra this is manual at the moment. Yakko was implementing something here although it looks like it still needs work. Something along these lines would be preferable over Terraform IMO so we can handle this uniformly across the types of deployments.

@hazzadous
Contributor

(fwiw I believe self-hosted may be configured to automatically create topics although I'd need to check that)
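For reference, a minimal sketch of creating the topic manually with kafka-python's admin client (the broker address, partition count, and replication factor are assumptions for a local setup):

    from kafka.admin import KafkaAdminClient, NewTopic

    from ee.kafka_client.topics import KAFKA_SESSION_RECORDING_METADATA

    # assumed local broker; staging/prod would point at the real brokers
    admin = KafkaAdminClient(bootstrap_servers="localhost:9092")
    admin.create_topics([NewTopic(name=KAFKA_SESSION_RECORDING_METADATA, num_partitions=1, replication_factor=1)])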

SESSION_RECORDING_METADATA_TABLE_BASE_SQL = """
CREATE TABLE IF NOT EXISTS {table_name} ON CLUSTER '{cluster}'
(
uuid UUID,
Contributor

What is this UUID? Who sets it?

Member Author

Ah! It's the UUID of the snapshot event... so redundant here!

window_id VARCHAR,
session_start DateTime64(6, 'UTC'),
session_end DateTime64(6, 'UTC'),
snapshot_data_location VARCHAR -- no trailing comma, extra_fields leads with one
Contributor

Can we remove this comment? This would generate confusion in the SQL log.

SESSION_RECORDING_METADATA_TABLE_BASE_SQL
+ """PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (toHour(timestamp), session_id, timestamp, uuid)
SETTINGS index_granularity=512
Contributor

Why the lower granularity?

Member Author

Was copied from session_recording_events :)

I guess it's there to reduce the amount of snapshot_data ClickHouse reads unnecessarily while querying, so it's not applicable to this table.

""".format(
target_table=(
"writable_session_recording_metadata"
if settings.CLICKHOUSE_REPLICATION
Contributor

Assume this is always true in this code.

from ee.kafka_client.topics import KAFKA_SESSION_RECORDING_METADATA

SESSION_RECORDING_METADATA_DATA_TABLE = (
lambda: "sharded_session_recording_metadata" if settings.CLICKHOUSE_REPLICATION else "session_recording_metadata"
Contributor

Build this as if CLICKHOUSE_REPLICATION was true.
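A sketch of that simplification, keeping the lambda shape from the diff but dropping the conditional:

    SESSION_RECORDING_METADATA_DATA_TABLE = lambda: "sharded_session_recording_metadata"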

migrations.RunSQL(SESSION_RECORDING_METADATA_TABLE_MV_SQL()),
]

if CLICKHOUSE_REPLICATION:
Contributor

Build this as if CLICKHOUSE_REPLICATION was true.

@pauldambra
Member Author

https://clickhouse.com/docs/en/engines/table-engines/integrations/kafka/

Says "Process streams as they become available."

Which I'm reading as "it is safe to add ClickHouse changes before the kafka topic is available"

CREATE TABLE IF NOT EXISTS {table_name} ON CLUSTER '{cluster}'
(
uuid UUID,
timestamp DateTime64(6, 'UTC'),
Contributor

What does this timestamp represent?

Member Author

As with UUID. It's the snapshot event timestamp. Not applicable here!

SESSION_RECORDING_METADATA_TABLE_SQL = lambda: (
SESSION_RECORDING_METADATA_TABLE_BASE_SQL
+ """PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (toHour(timestamp), session_id, timestamp, uuid)
Contributor

This order by key is out of whack. Basically this should match your list/fetch query for efficient lookups.

  1. Include team_id
  2. What is timestamp? I don't think this would be used in any queries, meaning all queries devolve into full table scans with this order by?

Member Author

Interesting...

This is copied over from session_recording_events

When loading the snapshot data we query by team_id and session_id.

But when listing sessions...

MIN(timestamp) AS start_time,
MAX(timestamp) AS end_time,
dateDiff('second', toDateTime(MIN(timestamp)), toDateTime(MAX(timestamp))) as duration,

The listing API allows date_from and date_to filters

We query on the API filter date_from being after the recording start_time and the API filter date_to being before the recording end_time, both explicitly on the aggregates above and implicitly by checking that the snapshot event timestamp falls within that range.

So we would be querying by start and end time... (and by duration, which needs adding to the table setup in this PR)

In case it affects the order by: we'll be joining this table with the events table on distinct_id.
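For illustration, the duration column mentioned above could be added to the base SQL along these lines (a hypothetical excerpt; the column type is an assumption):

    # hypothetical fragment of SESSION_RECORDING_METADATA_TABLE_BASE_SQL with duration added
    SESSION_RECORDING_METADATA_DURATION_SKETCH = """
        session_start DateTime64(6, 'UTC'),
        session_end DateTime64(6, 'UTC'),
        duration Int64,
        snapshot_data_location VARCHAR
    """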

SESSION_RECORDING_METADATA_TABLE_SQL = lambda: (
SESSION_RECORDING_METADATA_TABLE_BASE_SQL
+ """PARTITION BY toYYYYMMDD(session_end)
ORDER BY (team_id, session_id, session_start, session_end)
Contributor

@macobo Jun 15, 2022

This sort key doesn't work for listing sessions - all queries basically devolve into full table scans since session_id is unique per session and doesn't help with filtering. Having session_start/session_end after that in the sort key doesn't affect behavior at all.

Can you list out what the full list and fetch queries look like against this table? We should design the sort key accordingly.

Member Author

Loading the recordings list page

The payload by default includes date_from: -7d and
session_recording_duration: {"type":"recording","key":"duration","value":60,"operator":"gt"}, and lets you edit these as well as add person/cohort and event filters.

These get templated into ClickhouseSessionRecordingList

We'll be joining between session_recording_events and session_recording_metadata. Ignoring the join (to make the query smaller here), this will generate a query close to:

        SELECT
            session_id,
            window_id,
            session_start,
            session_end,
            duration,
            distinct_id
        FROM session_recording_metadata
        WHERE team_id = %(team_id)s
        AND session_start >= %(start_time)s -- defaults to seven days ago
        AND session_end <= %(end_time)s -- defaults to end of today
        AND duration > %(recording_duration)s -- defaults to one minute

That generates the session recording list. It would be joined with the equivalent for session_recording_events and person, and sometimes with events, to allow filtering. I'm assuming you don't need that context here.

Loading Snapshot Data (playback)

(if we don't, or before we, calculate and load directly from object storage)

    SELECT session_id, window_id, snapshot_data_location
    FROM session_recording_metadata
    WHERE
        team_id = %(team_id)s
        AND session_id = %(session_id)s
    ORDER BY session_start

The API expects to use this query to get the snapshot data, decompress it, reconstruct the chunks, and then return paged chunks grouped by window_id.
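A rough sketch of the grouping-by-window_id part of that step (names are illustrative, not the actual API code; decompression and paging are omitted):

    from itertools import groupby
    from operator import itemgetter

    def chunks_by_window(rows):
        # rows: (session_id, window_id, snapshot_data_location) tuples from the query above,
        # assumed already ordered so that rows for the same window_id are adjacent
        return {
            window_id: [location for _, _, location in group]
            for window_id, group in groupby(rows, key=itemgetter(1))
        }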

Loading Metadata

This follows "Loading Snapshot Data" and then calculates "segments" and "start and end times by window id".

For session_recording_metadata these could be written as JSON to ClickHouse or to object storage and loaded directly. Probably best in object storage, to support loading everything for a session without going to the DB.

Contributor

@macobo Jun 15, 2022

Suggestion for partition and sort key based on these queries:

PARTITION BY toYYYYMMDD(session_start)
ORDER BY (team_id, toStartOfHour(session_start), session_id)

Why this ordering? It's optimized for the list query - we only look at the hours of data where a session started.

I also include session_id for the single session fetching.
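Applied to the lambda from this diff, that suggestion would read roughly (a sketch of the proposal, not the final definition):

    SESSION_RECORDING_METADATA_TABLE_SQL = lambda: (
        SESSION_RECORDING_METADATA_TABLE_BASE_SQL
        + """PARTITION BY toYYYYMMDD(session_start)
    ORDER BY (team_id, toStartOfHour(session_start), session_id)
    """
    )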

@pauldambra
Member Author

I wanted to see this work locally before proposing to merge it...

But I get an error whether I use a Python or a console producer. With this script:

    import json
    import time

    from kafka import KafkaProducer

    # import paths below are assumptions for running this inside the posthog repo
    from ee.kafka_client.topics import KAFKA_SESSION_RECORDING_METADATA
    from posthog.client import sync_execute
    from posthog.settings import KAFKA_HOSTS, KAFKA_PRODUCER_RETRIES

    producer = KafkaProducer(
        retries=KAFKA_PRODUCER_RETRIES,
        bootstrap_servers=KAFKA_HOSTS,
        security_protocol="PLAINTEXT",
        request_timeout_ms=2000,
    )

    producer.send(
        topic=KAFKA_SESSION_RECORDING_METADATA,
        value=json.dumps(
            {
                "team_id": 1,
                "distinct_id": "12345",
                "session_id": "12345",
                "window_id": "12345",
                "session_start": "2012-04-01T12:34:56",
                "session_end": "2012-04-01T16:34:56",
                "duration": 4,
                "snapshot_data_location": "somewhere",
            }
        ).encode("utf-8"),
        key="12345".encode("utf-8"),
    )

    # poll until the row shows up in ClickHouse
    query_result = []
    while not query_result:
        query_result = sync_execute(
            """
            select * from session_recording_metadata
            """
        )
        print(f"query result is: {query_result}")
        time.sleep(1)

I can see that the message reaches Kafka, but ClickHouse prints an error:

Code: 41. DB::ParsingException: Cannot parse datetime: Cannot parse DateTime from String: while converting source column _timestamp to destination column _timestamp: while executing 'FUNCTION _CAST(_timestamp :: 7, DateTime :: 9) -> _CAST(_timestamp, DateTime) DateTime : 10': while pushing to view default.session_recording_metadata_mv (2c160edd-2e26-4dfe-b832-c784ada963a7). (CANNOT_PARSE_DATETIME), Stack trace (when copying this message, always include the lines below):

cc @macobo in case this means anything to you :)

@pauldambra
Member Author

The materialized view table was missing a comma in the definition. This didn't fail the migration, because it meant _timestamp was actually being defined as an alias for snapshot_data_location.

(screenshot attached: 2022-06-16 at 12:49:51)
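A sketch of the relevant part of the fix (a hypothetical excerpt; the full materialized view SQL is in the diff):

    # without the comma, "snapshot_data_location _timestamp" is parsed as
    # "snapshot_data_location AS _timestamp", which is what produced the DateTime cast error
    MV_SELECT_EXCERPT = """
        snapshot_data_location,
        _timestamp
    """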

I've now produced to the Kafka topic from Python and via the console producer in the bitnami Kafka image, and seen the data arrive in the table.

@pauldambra requested a review from macobo June 16, 2022 11:53
@pauldambra marked this pull request as ready for review June 16, 2022 11:53
Contributor

@macobo left a comment

This looks reasonable to me.

A couple of considerations:

  • It might be wise to hold off merging this until you have other prototype PRs ready, in case there's anything needing fixing up here. Adding additional migrations is more expensive.
  • If you have a review buddy for this project, have them review this as well!

@macobo
Contributor

macobo commented Jun 17, 2022

Actually, please also update posthog/clickhouse/schema.py.

Note that @EDsCODE also seems to have recently updated code conventions around these files so paths likely need updating.

@posthog-bot
Contributor

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

