
Save old session recordings to S3 #2142

Closed
macobo opened this issue Oct 30, 2020 · 46 comments
Labels
enhancement New feature or request

Comments

@macobo
Contributor

macobo commented Oct 30, 2020

Is your feature request related to a problem?

We prune session recordings regularly. However, accessing old recordings can give a lot of insight.

We could (if credentials are provided) store the session data in S3 or a compatible service. This way of storing data is less expensive and faster for long-lived recordings.

Describe the solution you'd like

User adds their credentials under project settings and we back this data up, storing references to saved sessions in postgres.

On Cloud, we could allow this retention if the user is on a higher tier.

Describe alternatives you've considered

Additional context

Split this out from #1846 as clickhouse work has made this a bit harder to do.

Thank you for your feature request – we love each and every one!

@macobo added the enhancement and session recording labels on Oct 30, 2020
@paolodamico
Contributor

Great stuff! Just adding to the conversation here: as this will permit storing recordings for longer time periods (even indefinitely), I suggest we ship this with a mechanism to comply with GDPR (or equivalent) requests and erase recordings that pertain to the people who file them.

@paolodamico
Contributor

Please see relevant Slack thread. Currently the recordings are only kept on Clickhouse-based installations for 3 weeks. There's no way to preserve recordings for longer (either all or a subset). As such, it may be more important to consider something like this.

Some customers have reported pain with this (see thread for details).

@macobo
Contributor Author

macobo commented Jun 1, 2022

@pauldambra @rcmarron @hazzadous @timgl @yakkomajuri some feedback on the current plan for session recordings.

Outline/my understanding of the plan

Instead of storing individual snapshot events for session recordings in clickhouse, we save each event to S3, storing pointers in clickhouse and fetching the events from s3 when fetching the session.

Problems with the approach

Performance

On average, session recordings have 150 events, p95 being 500 events and p99 of 2691 events (metabase)

This means to load a given session we need to:

  1. Still query clickhouse, fetching N payloads containing pointers to that session
  2. Load each event separately from S3 (or an alternative store)

(1) is still the same old query - this will still load a bunch of data.
(2) is executed serially in python, meaning that if each request takes 100ms we end up spending 15s loading an average session.

Expense

Retrieval: S3 GET requests cost $0.0004 per 1,000 requests (source). This means that loading a thousand average recordings (~150 events each, i.e. ~150,000 GET requests) will cost us at least $0.06 (which isn't terrible).

Saving these sessions: We would be doing on the order of billions of PUT requests per month to S3. According to their pricing page, this would cost us thousands to tens of thousands of dollars per month. metabase

Ingestion

Ingestion/plugin-server is a really performance-sensitive piece of infrastructure where adding 100ms per event ingested can mean we aren't able to keep up ingestion.

Note that forwarding each event to S3 means a harder time keeping plugin-server lively.

Alternatives

My original proposal to @rcmarron was to:

  1. Create 1-N files per recording rather than per event.
  2. Upload to S3 in batches, asynchronously from ingestion.
  3. Store recording-to-S3 pointers in a table in ch or postgres for fast retrieval (a rough sketch of such a table follows below).
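
For point 3, a minimal sketch of what such a pointer table might look like (Postgres flavour; every name and type here is an assumption for illustration only, not an agreed schema):

CREATE TABLE session_recording_s3_files (
    team_id     bigint      NOT NULL,
    session_id  text        NOT NULL,
    file_index  integer     NOT NULL,  -- which of the 1-N files per recording this row points at
    s3_key      text        NOT NULL,  -- e.g. 'recordings/<team_id>/<session_id>/<file_index>.json.gz'
    start_time  timestamptz NOT NULL,  -- first snapshot timestamp covered by this file
    end_time    timestamptz NOT NULL,  -- last snapshot timestamp covered by this file
    PRIMARY KEY (team_id, session_id, file_index)
);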

Note

I want this to succeed, so if I've missed something obvious as to why the current plan will work, let me know.

@mariusandra
Collaborator

Great analysis! I've raised similar points earlier in a review and in a few calls.

The alternative proposal seems solid, and we can still reuse the object storage work.

@rcmarron
Contributor

rcmarron commented Jun 3, 2022

It seems like a next step here is to hash out more details on what an alternate approach could look like.

Product requirements

Note: this isn't an exhaustive list, but these are important

  • Recordings should be viewable as they're being created with a <1 minute delay
    • This is critical for the experience when users are first setting up recordings
    • Some users use recordings to verify new deployments are "working in the wild". Having minimal delay here is huge
  • Data for a specific recording can arrive over a long period of time (on the order of a day after the first recording events)
  • Data for viewing a recording should be retrievable in chunks of a reasonable size.
    • Some recordings can be several MB. A user shouldn't have to download the entire thing before beginning to watch the recording.
  • Users should be able to fetch recordings where specific events occurred
    • In practice, this means a recording list should exist within clickhouse
  • We should have a story for customers who do not have blob storage set up.

Proposal

I'm very open to other options here, but wanted to put some thoughts on paper to start a discussion (I believe this captures what @macobo and I had talked through a couple months ago)

  • Ingestion remains untouched
    • Data is put into ClickHouse as it is today
  • Create a background job that moves older recording data to s3
    • For recordings older than X days, we:
      • Read them out of ClickHouse
      • Chunk them into appropriate sizes
      • Write them to s3
      • Write a row to a "session recording table" in clickhouse (containing some metadata about the recording as well - start time, end time, where the data is stored, number of chunks, etc.); a rough sketch of such a table follows after this list
  • List for recordings is generated by merging data from the existing event list and the "session recording table" above
    • I'm unclear on how this would work, but @macobo had mentioned there might be some good options that take advantage of clickhouse table types. Needs hashing out
  • When fetching a specific recording, we retrieve the recording from the table above to see where the data is stored.
    • If it's stored in Clickhouse, then we follow today's code path for generating metadata and paging data down
    • If it's stored in s3, then we do something very similar to the above, but fetching the data from s3
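
A rough sketch (all names and types here are my own assumptions for illustration, not an agreed schema) of what that "session recording table" could look like in ClickHouse:

CREATE TABLE session_recordings
(
    team_id     UInt64,
    session_id  String,
    start_time  DateTime64(6),
    end_time    DateTime64(6),
    storage     Enum8('clickhouse' = 1, 's3' = 2),  -- where the snapshot data currently lives
    s3_prefix   String,                             -- empty while the data is still in ClickHouse
    chunk_count UInt32,                             -- number of chunks written to S3
    size_bytes  UInt64
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(start_time)
ORDER BY (team_id, start_time, session_id)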

The big advantages of this approach are that:

  • The list view remains in ClickHouse
  • The approach gracefully falls back to the existing behavior if users do not have a blob storage solution set up

cc @pauldambra @mariusandra

@macobo
Contributor Author

macobo commented Jun 3, 2022

@rcmarron thanks for the writeup, this sounds good to me, especially capturing the nuances of splitting the recording into chunks!

Couple of additions on my part:

Product requirements

  • ClickHouse shouldn't be storing extra data after snapshot events have been moved to S3.

Additions to proposal

  • Partition clickhouse session events by day
  • After all session recordings of a day have been moved off to S3, either manually drop the partition or do this via TTL (rough sketch below)
  • We set a maximum recording duration of 24 hours - anything longer than that is acceptable to "split" into multiple recordings.
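
For illustration, the partitioning/expiry mechanisms above could look roughly like this (the table name and day partition key come from the existing schema mentioned later in the thread; the exact partition and TTL values are assumptions):

-- day-level partitioning is declared at table creation, e.g. PARTITION BY toYYYYMMDD(timestamp)

-- option 1: once a day's recordings are confirmed to be in S3, drop that day's partition
ALTER TABLE session_recording_events DROP PARTITION 20201030;

-- option 2: let ClickHouse expire old snapshot rows automatically via a table TTL
ALTER TABLE session_recording_events MODIFY TTL toDate(timestamp) + INTERVAL 3 WEEK;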

Other notes

  • We should be moving recordings away continually, but retain a buffer in clickhouse of events we're not yet touching. If we're okay with (or want) splitting recordings into chunks, we can leverage that.

List for recordings is generated by merging data from the existing event list and the "session recording table" above

Run two queries (which should return identical schemas) - one for sessions in object storage, one for sessions in clickhouse - and deduplicate (uniq). You can also UNION ALL these - see the sketch below :)

Note that if the clickhouse list query becomes slow long-term, we can set up projections to speed that query up!
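
A minimal sketch of that merged list query, reusing the hypothetical session_recordings table sketched earlier for the object-storage side (all names are illustrative only):

SELECT session_id, min(start_time) AS start_time, max(end_time) AS end_time
FROM
(
    -- recordings already rolled up into object storage
    SELECT session_id, start_time, end_time
    FROM session_recordings
    WHERE team_id = 2 AND storage = 's3'

    UNION ALL

    -- recordings still living in clickhouse
    SELECT session_id, min(timestamp) AS start_time, max(timestamp) AS end_time
    FROM session_recording_events
    WHERE team_id = 2
    GROUP BY session_id
)
GROUP BY session_id
ORDER BY start_time DESC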

@hazzadous
Contributor

Late to the party 🥳

One more alternative. Given that we already have a buffer that is Kafka, we do not need to push these into ClickHouse. Rather, we can batch and push the data to S3 on the fly, thereby keeping ClickHouse only for analytics workloads related to sessions, e.g. for calculation of session duration.

  1. to some extent this deals with the expense point; we would need to tweak some knobs re: batch size etc.
  2. to some extent it deals with the performance on read. We'd be reading anything with a prefix of $session_id, and we can put in some pagination to handle progressive loading
  3. it sidesteps any issues with the plugin-server by using Kafka Connect to handle this instead. We could also just write a Consumer specifically for this task if we wanted more control. tbh I haven't looked at what the plugin-server is doing with these events.

This would avoid needing to do any batch jobs to push the data from ClickHouse to S3.

Disclaimer: I have never used Kafka Connect or the S3 sink before, so we may want to instead consider using our own Consumer. It should, however, be battle tested at large scale.

I've put together a prototype here #prsoverissues

@hazzadous
Contributor

(also, happy to help out however is needed)

@pauldambra
Member

love the kafka to s3 connector idea @hazzadous!

(2) is executed serially in python, meaning that if each request takes 100ms we end up spending 15s loading an average session.

I'd expect s3/minio reads to be way faster than that... but #9901 was only intended to be a deployable step. I was anticipating passing the paths back instead of the data and having a slim API endpoint to let the UI read them. That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us.

@macobo
Contributor Author

macobo commented Jun 7, 2022

Given that we already have a buffer that is Kafka, we do not need to push these into ClickHouse. Rather we can batch and push data to S3 in batches on the fly, thereby keeping ClickHouse only for analytics workloads related to sessions, e.g. for calculation of session duration.

Note that one of the problems outlined in #2142 (comment) is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead.

You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?

it sidesteps any issues with the plugin-server by using Kafka Connect to handle this instead. We could also just write a Consumer specifically for this task if we wanted more control. tbh I haven't looked at what the plugin-server is doing with these events.

Plugin-server currently

  • Creates missing people based off of $snapshot events
  • Calls onSnapshot plugin events

You can find the code in plugin-server/src/worker/ingestion/process-event.ts

@pauldambra

That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us.

Is that the request pattern for session recordings? My expectation is that the vast majority of recordings are read between 0..1 times, with the actual value being closer to 0.01 in most cases. In that case the caching isn't really relevant, as almost all recordings are opened at most once.

@hazzadous
Contributor

Note that one of the problems outlined in #2142 (comment) is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead.

Note that the S3 connector handles batching events into single S3 objects. For instance, you can give each object a size limit, message limit, or time limit, or a combination of these - e.g. each batch should be no greater than 10 MB or 30 seconds, whichever comes first. I think this should suffice on the PUT cost front, although there's still a risk there.

As for slow reads, that could be an issue in some cases 🤔 This can partially be tackled with parallel reads, which is pretty scalable for S3, but it's something to watch out for. If we wanted to, we could periodically compact, or have a separate connector that handles larger chunks at the expense of latency.

Plugin-server currently

As for the other functions of the plugin-server, we can leave them as they are if required. Either we continue sending the same message as we do now (see the prototype) or we send a stripped down version. Either way, I propose we initially continue sending the messages as they are; we can feature flag the session snapshot endpoint, and nothing else needs to change for the time being.

My expectation is that the vast majority recordings are read between 0..1

I'm not sure exactly what the pattern looks like, I'll have a look at the logs/PostHog insight for that!

@hazzadous
Contributor

You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?

This is how it's implemented in the prototype. Not optimized, but gives the general flow of things.

@guidoiaquinti
Contributor

Data is put into ClickHouse as it is today

I would like to challenge this point as well. The proposal from @hazzadous looks very solid to me and I don't see a major downside to implementing it.

@posthog-contributions-bot
Contributor

This issue has 2002 words at 13 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

@pauldambra
Member

pauldambra commented Jun 8, 2022

This issue has 2002 words at 13 comments.

so responding to a specific part...

Data is put into ClickHouse as it is today

Instead of having an ongoing process where we write to ClickHouse and then migrate to S3 after a delay, we have a setting/flag that enables writing to S3, plus an async migration that moves data from ClickHouse to S3 once.

Before that migration has started: we write to CH and S3 and fall back to reading from CH.
While it is running: we write snapshot data only to S3, and read only from S3.

  • downside: Need to read from the appropriate store while migration is running
  • upside: the overall complexity is lower

After it has run: no snapshot data is written to or stored in CH.

At this point we don't need to send snapshot data to the plugin server, so it is handling smaller messages.

NB I'm making two assumptions

  1. Reading/writing to CH and object storage have similar latency - we're still transferring the same amount of data over the network (at least by the end of the migration).

  2. We want to allow people to carry on storing recordings in CH - don't turn on the setting/flag and never run the migration; the code has two read/write paths based on that setting.

@pauldambra
Member

pauldambra commented Jun 8, 2022

So, I think we're deciding between two options (if we ignore the option of not doing this at all):

| | A | B |
|---|---|---|
| Ingestion | Write all to CH | Write all except snapshot to CH, write snapshot to S3 |
| Chunking | On a timer, chunk from CH to S3 with kafka connect | At write time, chunk to S3 with kafka connect |
| Read | From S3, fallback to CH | From CH and S3 (to cover recordings started in CH and finished in S3) (Table TTL means CH-stored data is eventually removed and all recordings only have snapshots in S3) |

In both options

  • the list view remains in CH
  • no new tables are necessary (???)

@pauldambra
Member

I think that

  • what if kafka is down
  • what if CH is down
  • what if network is slow

affect both options equally

@mariusandra
Collaborator

mariusandra commented Jun 8, 2022

This sounds like an hour-long Thursday meeting. Let's get everyone in a room and talk it over? @pauldambra @hazzadous @guidoiaquinti @macobo @rcmarron

To add more context and more fuel to the issue, my big fear is this:

At write time, chunk to S3 with kafka connect

First, it looks like kafka connect is another container you need to run via docker/helm? Adding another dependency like that is not free... 🤔 Also, those 13 todos in that PR scare me 😅. Doable of course, but it's a big project to get this just right.

Second, how can we reliably ensure that the stream of snapshot events for each $team_id/$session_id/$window_id combo gets chunked into as few 100 KB - 1 MB pieces as possible? This means not smaller, not larger.

In my mind, the main complexity with this project is in ensuring we can reliably chunk the data. Otherwise the worst case "we will pay a 💩-ton of money" scenario outlined by Karl here kicks in, and we'll have a new project to implement what we could have done in the first place :).

Does kafka-connect employ some kind of dark magic to ensure such chunking at our volumes, or what am I missing here? We read data from Kafka in batches, and then mark the entire batch as processed. The batches in the plugin server usually go up to a few hundred events. We could possibly make them bigger, but if you assume we're receiving snapshots from a few thousand unique users every minute, it's highly unlikely that more than just a few events for each $team_id/$session_id/$window_id end up in one batch.

With this, we'll reduce the files uploaded to S3 by a factor of 1x (expected) to 3x (best case), not the 50x we could achieve with post-batching.

Is there some special place where kafka-connect stores these in-flight events to ensure decently sized chunks? In memory? Can we configure it to use clickhouse for buffering? 🤣

So what am I missing?

@pauldambra
Member

pauldambra commented Jun 8, 2022

Does kafka-connect employ some kind of dark magic to assure such chunking with our volumes

yes :) https://github.com/PostHog/posthog/pull/10142/files#diff-44c65cdf7d29921a0b4dcf47453356304816877fad1a587637b769fd0dc0580dR32

So there's no code for us to maintain to do it; it is set to a 10x saving, but we can tweak that as needed.

We read data from Kafka in batches,

No reading for us to do... write to Kafka (which is the storage while not committed to disk). kafka-connect writes to S3 in chunks for us.

@mariusandra
Collaborator

Is that the max or min saving? Where does it buffer the events to assure it?

@hazzadous
Contributor

@mariusandra these partitions are not Kafka partitions afaik

@mariusandra
Collaborator

Where does it store the events then? 😆 Or at least the metadata regarding which event in the last gigabyte of data belongs to which partition? How often is a kafka batch committed? How frequently does this batch to S3?

I get that kafka-connect has a lot of configuration options, but those are wishes, not guarantees. We still can't escape math... I think. Please tell me if I'm wrong, how I'm wrong.

The way I see it, we have this huge unknown question that throws the feasibility of the entire implementation into question, and nobody can tell me how it'll actually work. Instead the approach is "let's spend the next 2 weeks on all the other details of the implementation, and then let's hope the main issue is also resolved".

And, disclaimer, I can be wrong, but:

it's mathematically correct but not how this works

So how does it work? Can anyone please tell me before we invest more time in this? 😅

I scheduled a meeting for tomorrow to discuss this in sync, in case that's needed.

@mariusandra
Collaborator

@mariusandra do you happen to know what the post-processing requirements would be e.g. in terms of numbers of queries of ClickHouse per second, and volume of data, ClickHouse memory requirements?

I don't have the numbers right now, but basically everything that goes in must go out 🤷 😆. So it'll just select * from each_partition_in_clickhouse, and either group by team_id or group by team_id, window_id or yolo. I'm not sure. In any case we'll have "all the time in the world" to move the data from one place to another.

@fuziontech
Member

fuziontech commented Jun 9, 2022

Drive by suggestion 🚗

We could just use ClickHouse to manage this.

I've made a sample table posthog.testing_s3_session_recording_events on CH8 (backup node) that is 99% backed by S3 and partially backed by st1 EBS volume. This is totally configurable.

If you log into metabase and run the following on CH8

select table, disk_name, count(1) parts, sum(bytes_on_disk) bytes_on_disk from system.parts
where table = 'testing_s3_session_recording_events'
group by table, disk_name
order by bytes_on_disk desc;

You can verify that most of the parts for the table reside in s3 with the above query.

This was mainly inspired by Altinity's post https://altinity.com/blog/clickhouse-and-s3-compatible-object-storage which I think we will want to use for all long-term data retention at some point. Having a tiered storage solution would be really valuable and this seems like a great first step.

The nice thing about this is:

  • All merges happen on disk on CH and then parts are beamed up to S3
    • This avoids small files on S3 (the cost that @macobo was mentioning which is a real issue)
  • Moving to S3 is done depending on some rule
    • TTL
    • CH disk capacity
    • We can also manually move parts/partitions to S3 if we want that level of control
  • ClickHouse manages everything behind the scenes (less complexity in our app)
  • Is totally transparent to the app
  • We'll need to do this with events eventually

Negatives here:

  • We can't shard this table. Maybe this is ok if performance is solid enough on S3.

Go ahead and query this table to see what the performance implications are, but they should be relatively minimal.

At the time of writing we are about 10% complete with the upload to S3, with 550 GB across 4,211 files. This means we have an average file size of ~130 MB, which is really quite ideal for S3 - much better than a file per part of a session recording, or even a file per session recording. ClickHouse is really quite efficient at grabbing just the parts of files out of S3 that it needs, since it has the metadata store that describes what it needs to grab out of a specific file on S3 for a query. ClickHouse intelligently uses S3's Range query feature, which grabs just the bits of a file that are needed for the results. 💥

Give the table a try and let me know how performance looks. 👀

@mariusandra
Collaborator

This is sweet. To understand the dynamics better, what happens on reads over large volumes, or, said differently, will we get charged an integer amount of dollars every time someone opens the session recordings page or makes any query with a filter?

@macobo
Contributor Author

macobo commented Jun 9, 2022

Note that the S3-backed table doesn't do any "session rollups". Meaning if we query the table, it reads the whole partition (or maybe only a subset of it) from S3, then constructs the recording in-memory inside clickhouse. In essence we're doing what we're doing now, just on a slower disk.

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.

@pauldambra
Member

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.

On this...

I simulated this by changing the DEFAULT_CHUNK_LIMIT when viewing a recording. #10198

We load the pages of recording data serially. So loading a larger blob is faster overall.

But...

We start playing the video as soon as the first blob is available. So it appears quicker to the viewer if the first blob (or first few) is quick to download even if the browser takes longer overall.

So in order to maintain a snappy start to playback of recordings...

Bundling the files to S3 probably needs to slowly increase the size of the blobs over the life of the recording so that we can buffer faster than playback

e.g.

instead of 5 x 10MB blobs

2 x 0.5MB
2 x 1MB
1 x 2MB
1 x 5MB
4 x 10MB

(or whatever works in practice)

@pauldambra
Member

dumping comments from sync discussion here before thinking about actual implementation steps

Kafka-Connect

  • Concerns about management and complexity in scaling this if writing to S3 on ingestion
  • Concerns about whether additional service is needed if writing to S3 as a post-processed step

Backing CH Tables with S3

  • Concerns about query performance
  • Particularly given snapshot data and recording metadata are in the same table
    • Which supports listing, paging, filtering operations

Post Processing

Desired end-state

  • No changes to SDK
  • Recordings metadata stored in a ClickHouse table with no TTL
  • Recordings snapshot data stored in a separate ClickHouse table with no TTL
  • Sessions have a maximum length (e.g. 24 hours)
  • After that maximum length an entire ClickHouse partition can be moved to object storage and the partition dropped
  • When viewing, snapshot data is read from object storage or ClickHouse based on the age of the session recording (or, probably less complex: read from both and return whichever has the data)

Pros

  • Can watch a recording very soon after a session starts
  • Can load a recording with little buffering
  • Can reduce amount of expensive CH storage
  • Can have longer (or indefinite) TTL on recordings
  • Write to S3 can be less efficient because viewers aren’t waiting for it
  • Doesn’t rule out using kafka-connect or backing CH tables with S3 but doesn’t require them

Needs to have solutions for

  • recordings that finished before new tables were used
  • recordings in progress during move to new tables
  • recordings started after move to new tables

@mariusandra
Collaborator

mariusandra commented Jun 9, 2022

So... we had the call, and I think the conclusions were these. Correct me if I'm wrong.

Option 3, JamS3.

  • This option is very cool, but is still half of a solution. We do need the metadata in ClickHouse, in a place where we can quickly query it. Otherwise queries like these (basically any listing query), will repeatedly fetch a lot of partitions from S3, causing problems and 💸.
  • To get around that, we need to split the session recording events table into session_recordings_metadata and session_recordings_data (names pending). The first we'd query for events, etc. The second we'd use to store the data. We would give the second table a TTL to move things to S3, and only query it when we find events that match our time range or other filters.
  • In order to paginate the results (this is an absolute must to enable fast playback, and player seeking support), we still need to fetch all pages of a recording in our API, chunk the parts, add metadata (e.g. which part contains which timestamps), discard all the chunks except the ones the user requested, and then send that to the user. This is a lot of wasted work and redundant queries. Yet doing this, and sending the user the first chunk, is faster than sending all the chunks.

Conclusion: this could work well, but it won't be faster than what we have now (must read all chunks to serve the first), and we do need to split the tables to metadata and data.

Option 2, Post-Processing

  • Initially we do everything as it is now, but after some TTL (1h? 24h? 14d?), we manually batch and move session recordings to S3 with a cronjob.
  • To do this, we still need a split of tables: metadata and data (names pending). The latter we'd migrate and delete one partition at a time when it's ready.
  • When chunking we pre-calculate the metadata and create the most conveniently sized json files that our users can fetch directly.

Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Option 1, Live Streaming via kafka-connect (partially implemented)

  • In this originally proposed option, we move events to S3 as they come in
  • We could create a ~60sec in-memory buffer to gather events belonging to one recording, before uploading them to S3
  • We're constrained by memory usage
  • This requires adding one more service to docker-compose, which basically just does await fetch in a loop, and comes with unknown maintenance hurdles.
  • Instead of await pushToKafka we could just do await readFromKafkaAndPushToS3 in the plugin server, and get virtually the same results with no extra services.
  • There's no way (according to math) that this can be faster than the other two approaches. The number of events will be the same (or 2-10x less... but this is moot since we grow 2x very quickly), we'll be talking to an external service that we need to individually query and request files from, all those operations are measured in dollars, and we need to run them many times per recording due to the pagination code, which will not be simplified.
  • All in all, I think this is a bad idea that we want to avoid.

@fuziontech
Member

fuziontech commented Jun 9, 2022

This option is very cool, but is still half of a solution. We do need the metadata in ClickHouse, in a place where we can quickly query it. Otherwise queries like these (basically any listing query), will repeatedly fetch a lot of partitions from S3, causing problems and 💸.

It shouldn't be too bad. That query is only pulling the following columns because ClickHouse is columnar:
session_id, window_id, timestamp, distinct_id, has_full_snapshot

Meaning that the actual amount of data pulled from s3 is pretty small (since the actual payloads are not getting pulled).

The table itself is ordered by:
team_id, toHour(timestamp), session_id, timestamp, uuid

which means that as long as we are able to hit these 'indexes' the size of the files pulled from S3 can be scoped down to a relatively small portion.

We are partitioned by:
PARTITION BY toYYYYMMDD(timestamp)

The number of files that are pulled from s3 can be calculated with this formula (ideally):
partitions x columns

I just ran this query as an example:

SELECT session_id,
	any(window_id) as window_id,
	MIN(timestamp) AS start_time,
	MAX(timestamp) AS end_time,
	dateDiff(
		'second',
		toDateTime(MIN(timestamp)),
		toDateTime(MAX(timestamp))
	) as duration,
	any(distinct_id) as distinct_id,
	SUM(has_full_snapshot) as full_snapshots
FROM testing_s3_session_recording_events
WHERE team_id = 2
	and timestamp >= '2022-05-21'
	and timestamp <= '2022-05-27'
GROUP BY session_id
HAVING full_snapshots > 0

It executed in
10544 rows in set. Elapsed: 0.658 sec. Processed 387.93 thousand rows, 82.80 MB (589.68 thousand rows/s., 125.85 MB/s.)

The number of files selectively range gotten from S3 should be roughly:
5 columns x 7 partitions = 35 files

I'm sure that there is a way we could better format the schema to improve this (metadata and data like @mariusandra suggests). I'd love to hop on a call to talk about this and how ClickHouse is storing this merge tree in S3. This just makes our app a lot simpler and keeps the complexity inside of CH for the orchestration of recording storage.

Edit:
I do like this idea:

Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3.

@pauldambra
Member

ok, so that suggests a way forward

  1. split the snapshot data and metadata for recordings by adding a new snapshot_data table (rough sketch after this list) - I could do with pairing with @hazzadous, as I've remained blissfully ignorant of CH and Kafka infrastructure until now
  2. keep TTL on both tables
  3. experiment with backing the snapshot_data table onto S3
  4. either that doesn't perform well enough and we continue on to post processing, or it's at least good enough for rock and roll and we can pause and concentrate elsewhere (imagining a world where I work on the team's sprint goals instead of side quests 🤣)
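
As a very rough sketch of steps 1 and 3 (all names, the TTL, and the 's3_policy' storage policy are assumptions to be replaced during the pairing):

-- metadata stays on local disk and keeps driving the list view
CREATE TABLE session_recording_metadata
(
    team_id     UInt64,
    session_id  String,
    window_id   String,
    distinct_id String,
    timestamp   DateTime64(6)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (team_id, toHour(timestamp), session_id, timestamp)
TTL toDate(timestamp) + INTERVAL 3 WEEK;

-- snapshot payloads live in a separate table that can be backed by S3
CREATE TABLE session_recording_snapshot_data
(
    team_id       UInt64,
    session_id    String,
    timestamp     DateTime64(6),
    snapshot_data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (team_id, session_id, timestamp)
TTL toDate(timestamp) + INTERVAL 3 WEEK
SETTINGS storage_policy = 's3_policy';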

@pauldambra
Member

I'm treating #10234 as a sketch... Think of it as putting the wrong answer on the internet to shake the right answer loose :)

@mariusandra
Collaborator

@fuziontech that is really cool. I don't have ssh access to the cluster (too lazy/busy to set it up now), and metabase doesn't report the "MB read" number. However, when I removed the team_id clause, it took 5 seconds to give some results, probably reading around a gigabyte in back-of-the-napkin math. This sounds like it'll add up in dollars. We'll eventually have teams recording as many sessions as we do now combined... so I guess we should still store the metadata locally, and just the recordings in S3.

@pauldambra if you'll keep working on it (which is fine by me), this will basically be your sprint goal, as getting this just right isn't a 1-day project. An MVP split could be 🤔

@mariusandra
Collaborator

Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3.

More metadata :party_parrot:

The constraint is that we'll have events (avg 150 per recording) of varying sizes - some 100 bytes, some 2+ MB. However, we want stable pagination based on both the recording timestamp and chunk sizes. We also need to know where in those chunks the "keyframes" are to be able to seek anywhere.

Maybe clickhouse could create that metadata itself with a query, and then we make another query to just fetch the relevant chunks... 🤔
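
A rough sketch of the kind of query that could produce that chunk metadata inside ClickHouse (assumes a version with window-function support; the ~1 MB chunk target and names are arbitrary):

SELECT
    session_id,
    timestamp,
    length(snapshot_data) AS event_bytes,
    -- running payload size within the recording, in playback order
    sum(length(snapshot_data)) OVER (PARTITION BY session_id ORDER BY timestamp) AS running_bytes,
    -- bucket events into ~1 MB chunks; a follow-up query could then fetch only the requested chunk
    intDiv(sum(length(snapshot_data)) OVER (PARTITION BY session_id ORDER BY timestamp), 1048576) AS chunk_index
FROM session_recording_events
WHERE team_id = 2
  AND session_id = '...'
ORDER BY timestamp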

@macobo
Contributor Author

macobo commented Jun 10, 2022

@fuziontech you're measuring the wrong query.

So session recording has 3 components:

  1. Ingestion - getting data in
  2. "List all sessions query" - this is the query you were measuring
  3. "Get a single recording query"

It's number 3 that needs the most consideration in a "just-store-the-table-in-S3" scenario.

In a store-sessions-in-S3 scenario, number 3 won't work well because it's reading the column containing 99% of the data (metabase). The normal rules of clickhouse apply here - to get a single recording we will be reading multiple partitions and, due to granularity, much more extra data than we need for that single recording.

Note this will be slower than what we have currently, and what we have is a problem right now: grafana - if you add the lag of fetching from S3 it will only get slower.

I think when discussing this project we keep missing the important detail here: conceptually, session recordings are not the events, but rather the sum of them. The most efficient way to solve (3) is to have one file (or multiple files) per recording containing all the events for that recording, which you can directly request/stream on the client side.

We side-stepped that by leveraging clickhouse initially. Rolling up sessions is kind of hard™ to do in a streaming fashion IMO, which is why I'm still in favor of the postprocessing approach where we move whole recordings to S3 in batches after the fact.

@pauldambra
Member

this basically will be your sprint goal,

team of one 🤣

Not complaining, it's valuable work... Just worried about pulling on other people's time and being a distraction for them


We query snapshots using

        SELECT session_id, window_id, distinct_id, timestamp, snapshot_data
        FROM session_recording_events
        WHERE
            team_id = %(team_id)s
            AND session_id = %(session_id)s
        ORDER BY timestamp

and page in memory...

If we split the snapshots and metadata then that's the only query we need to care about for S3 performance.

So, as is, we'd read all session snapshots from S3 multiple times for each recording. If we can make this "page-able" (definitely a word) then we'd read each bit of snapshot data once

We need the split in order to do post processing... and it'd be great to figure out how to make paging easier anyway.


I'd really like to get to a point where we can test the speed of reading from S3. IME (admittedly reading in AWS Lambda not in ECS) it is ridonculously fast. That may not be true in this context but would be good to know one way or another so we don't fix for problems we don't have (or build for affordances we don't have)

@pauldambra
Member

pauldambra commented Jun 10, 2022

I'm still in favor of the postprocessing approach

Yep, I think absent discovering the S3 tables are incredible (enough) or uncovering un-fixable (also definitely a word) problems with post-processing then that's the approach now

If for no other reason than we've spent enough time debating it and need to make stuff and see if we discover some real reason to change course now :)

@macobo
Contributor Author

macobo commented Jun 10, 2022

Note this will be slower than what we have currently, and what we have is a problem right now: grafana - if you add the lag of fetching from S3 it will only get slower.

Potentially confusion-adding note on this, I ran a quick test on the table @fuziontech mentioned:

SELECT
    session_id,
    window_id,
    distinct_id,
    timestamp,
    snapshot_data
FROM
    testing_s3_session_recording_events
WHERE
    team_id = 2
    AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'
ORDER BY
    timestamp

Against the current table (which should be disk-based, not 100% though) : 2467 rows in set. Elapsed: 8.706 sec. Processed 773.40 thousand rows, 120.53 MB (88.84 thousand rows/s., 13.84 MB/s.)
S3-backed testing_s3_session_recording_events table: 2467 rows in set. Elapsed: 9.674 sec. Processed 313.12 thousand rows, 88.77 MB (32.37 thousand rows/s., 9.18 MB/s.)

These seem comparable read performance wise, however:

  • not sure where the difference is coming from between the two queries in rows read, maybe the test table doesn't have as much data? Also note the mb/s is slower with the new table.
  • This recording is 21MB uncompressed and we are reading 4-5x the MB needed from disk (or potentially more than that due to compression)
  • This approach would be slower in practice as we'd need to also send the data to app -> compress it -> client loads from api. Loading a precompressed file from S3 would be faster than that given all the steps it skips.
See size query
SELECT
    formatReadableSize(sum(length(snapshot_data)))
FROM
    testing_s3_session_recording_events
WHERE
    team_id = 2
    AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'

@lancedouglas1

In the interim, is it possible to "just" drop session recordings into us users' own S3 buckets, using session_id-based naming and dated folders, and let us users worry about both pricing and loading the recordings? As for events and metadata, that's all secondary for me anyway - I need the recordings archived for compliance auditing purposes.

@mariusandra
Collaborator

Hey @lancedouglas1 this is actively being worked on: #10799


@pauldambra
Member

we now store session recordings in S3, offering 30 days (up from 21 days) retention for free users and 90 days retention (up from 21) for paid

@guidoiaquinti
Contributor

Nice work @pauldambra and team! 👏
