
Save old session recordings to S3 #2142

Closed
macobo opened this issue Oct 30, 2020 · 46 comments
Labels
enhancement New feature or request

Comments

@macobo
Contributor

macobo commented Oct 30, 2020

Is your feature request related to a problem?

We prune session recordings regularly. However, accessing old recordings can give a lot of insight.

We could (if credentials are provided) store the session data in S3 or a compatible service. This way of storing data is less expensive and faster for long-lived recordings.

Describe the solution you'd like

User adds their credentials under project settings and we back this data up, storing references to saved sessions in postgres.

On Cloud, we could allow this retention if the user is on a higher tier.

Describe alternatives you've considered

Additional context

Split this out from #1846 as clickhouse work has made this a bit harder to do.

Thank you for your feature request – we love each and every one!

@macobo added the enhancement and session recording labels on Oct 30, 2020
@paolodamico
Contributor

Great stuff! Just adding to the conversation here: as this will permit storing recordings for longer time periods (even indefinitely), I suggest we ship this with a mechanism to comply with GDPR (or equivalent) requests and erase recordings that pertain to the people who file them.

@paolodamico
Contributor

Please see relevant Slack thread. Currently the recordings are only kept on Clickhouse-based installations for 3 weeks. There's no way to preserve recordings for longer (either all or a subset). As such, it may be more important to consider something like this.

Some customers have reported pain with this (see thread for details).

@macobo
Contributor Author

macobo commented Jun 1, 2022

@pauldambra @rcmarron @hazzadous @timgl @yakkomajuri some feedback on the current plan for session recordings.

Outline/my understanding of the plan

Instead of storing individual snapshot events for session recordings in clickhouse, we save each event to S3, storing pointers in clickhouse and fetching the events from s3 when fetching the session.

Problems with the approach

Performance

On average, session recordings have 150 events, p95 being 500 events and p99 of 2691 events (metabase)

This means to load a given session we need to:

  1. Still query clickhouse, fetching N payloads containing pointers to that session
  2. Load each event separately from S3 (or an alternative store)

(1) is still the same old query - this will still load a bunch of data.
(2) is executed serially in python, meaning that if each request takes 100ms we end up spending 15s loading an average session.

Expense

Retrieval: S3 GET requests cost $0.0004 per 1,000 requests (source). This means that loading a thousand average recordings (~150 events each, i.e. ~150,000 GET requests) will cost us at least $0.06 (which isn't terrible).

Saving these sessions: We would be doing on the order of billions of PUT requests per month to S3. According to their pricing page, this would cost us thousands to tens of thousands of dollars per month. metabase

Ingestion

Ingestion/plugin-server is a really performance-sensitive piece of infrastructure where adding 100ms per event ingested can mean we aren't able to keep up ingestion.

Note that forwarding each event to S3 means a harder time keeping plugin-server lively.

Alternatives

My original proposal to @rcmarron was to:

  1. Create 1-N files per recording rather than per event.
  2. Upload to S3 in batches, asynchronously from ingestion.
  3. Store recording-to-S3 pointers in a table in ch or postgres for fast retrieval (a rough sketch of such a table follows below).
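
For point 3, a minimal sketch of what such a pointer table might look like (Postgres flavour; every name and type here is an assumption for illustration only, not an agreed schema):

CREATE TABLE session_recording_s3_files (
    team_id     bigint      NOT NULL,
    session_id  text        NOT NULL,
    file_index  integer     NOT NULL,  -- which of the 1-N files per recording this row points at
    s3_key      text        NOT NULL,  -- e.g. 'recordings/<team_id>/<session_id>/<file_index>.json.gz'
    start_time  timestamptz NOT NULL,  -- first snapshot timestamp covered by this file
    end_time    timestamptz NOT NULL,  -- last snapshot timestamp covered by this file
    PRIMARY KEY (team_id, session_id, file_index)
);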

Note

I want this to succeed, so if I've missed something obvious as to why the current plan will work, let me know.

@mariusandra
Collaborator

Great analysis! I've raised similar points earlier in a review and in a few calls.

The alternative proposal seems solid, and we can still reuse the object storage work.

@rcmarron
Contributor

rcmarron commented Jun 3, 2022

It seems like a next step here is to hash out more details on what an alternate approach could look like.

Product requirements

Note: this isn't an exhaustive list, but these are important

  • Recordings should be viewable as they're being created with a <1 minute delay
    • This is critical for the experience when users are first setting up recordings
    • Some users use recordings to verify new deployments are "working in the wild". Having minimal delay here is huge
  • Data for a specific recording can arrive over a long period of time (on the order of a day after the first recording events)
  • Data for viewing a recording should be retrievable in chunks of a reasonable size.
    • Some recordings can be several MB. A user shouldn't have to download the entire thing before beginning to watch the recording.
  • Users should be able to fetch recordings where specific events occurred
    • In practice, this means a recording list should exist within clickhouse
  • We should have a story for customers who do not have blob storage set up.

Proposal

I'm very open to other options here, but wanted to put some thoughts on paper to start a discussion (I believe this captures what @macobo and I had talked through a couple months ago)

  • Ingestion remains untouched
    • Data is put into ClickHouse as it is today
  • Create a background job that moves older recording data to s3
    • For recordings older than X days, we:
      • Read them out of ClickHouse
      • Chunk them into appropriate sizes
      • Write them to s3
      • Write a row to a "session recording table" in clickhouse (containing some metadata about the recording as well - start time, end time, where the data is stored, number of chunks, etc.); a rough sketch of such a table follows after this list
  • List for recordings is generated by merging data from the existing event list and the "session recording table" above
    • I'm unclear on how this would work, but @macobo had mentioned there might be some good options that take advantage of clickhouse table types. Needs hashing out
  • When fetching a specific recording, we retrieve the recording from the table above to see where the data is stored.
    • If it's stored in Clickhouse, then we follow today's code path for generating metadata and paging data down
    • If it's stored in s3, then we do something very similar to the above, but fetching the data from s3
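
A rough sketch (all names and types here are my own assumptions for illustration, not an agreed schema) of what that "session recording table" could look like in ClickHouse:

CREATE TABLE session_recordings
(
    team_id     UInt64,
    session_id  String,
    start_time  DateTime64(6),
    end_time    DateTime64(6),
    storage     Enum8('clickhouse' = 1, 's3' = 2),  -- where the snapshot data currently lives
    s3_prefix   String,                             -- empty while the data is still in ClickHouse
    chunk_count UInt32,                             -- number of chunks written to S3
    size_bytes  UInt64
)
ENGINE = ReplacingMergeTree()
PARTITION BY toYYYYMM(start_time)
ORDER BY (team_id, start_time, session_id)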

The big advantages of this approach are that:

  • The list view remains in ClickHouse
  • The approach gracefully falls back to the existing behavior if users do not have a blob storage solution set up

cc @pauldambra @mariusandra

@macobo
Contributor Author

macobo commented Jun 3, 2022

@rcmarron thanks for the writeup, this sounds good to me, especially capturing the nuances of splitting the recording into chunks!

Couple of additions on my part:

Product requirements

  • ClickHouse shouldn't be storing extra data after snapshot events have been moved to S3.

Additions to proposal

  • Partition clickhouse session events by day
  • After all session recordings of a day have been moved off to S3, either manually drop the partition or do this via TTL (rough sketch below)
  • We set a maximum recording duration of 24 hours - anything longer than that is acceptable to "split" into multiple recordings.
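
For illustration, the partitioning/expiry mechanisms above could look roughly like this (the table name and day partition key come from the existing schema mentioned later in the thread; the exact partition and TTL values are assumptions):

-- day-level partitioning is declared at table creation, e.g. PARTITION BY toYYYYMMDD(timestamp)

-- option 1: once a day's recordings are confirmed to be in S3, drop that day's partition
ALTER TABLE session_recording_events DROP PARTITION 20201030;

-- option 2: let ClickHouse expire old snapshot rows automatically via a table TTL
ALTER TABLE session_recording_events MODIFY TTL toDate(timestamp) + INTERVAL 3 WEEK;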

Other notes

  • We should be moving recordings away continually, but retain a buffer in clickhouse of events we're not yet touching. If we're okay with (or want) splitting recordings into chunks, we can leverage that.

List for recordings is generated by merging data from the existing event list and the "session recording table" above

Run two queries (which should return identical schemas) - one for sessions in object storage, one for sessions in clickhouse - and deduplicate (uniq). You can also UNION ALL these - see the sketch below :)

Note that if the clickhouse list query becomes slow long-term, we can set up projections to speed that query up!
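
A minimal sketch of that merged list query, reusing the hypothetical session_recordings table sketched earlier for the object-storage side (all names are illustrative only):

SELECT session_id, min(start_time) AS start_time, max(end_time) AS end_time
FROM
(
    -- recordings already rolled up into object storage
    SELECT session_id, start_time, end_time
    FROM session_recordings
    WHERE team_id = 2 AND storage = 's3'

    UNION ALL

    -- recordings still living in clickhouse
    SELECT session_id, min(timestamp) AS start_time, max(timestamp) AS end_time
    FROM session_recording_events
    WHERE team_id = 2
    GROUP BY session_id
)
GROUP BY session_id
ORDER BY start_time DESC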

@hazzadous
Contributor

Late to the party 🥳

One more alternative. Given that we already have a buffer that is Kafka, we do not need to push these into ClickHouse. Rather, we can batch and push the data to S3 on the fly, thereby keeping ClickHouse only for analytics workloads related to sessions, e.g. for calculation of session duration.

  1. to some extent this deals with the expense point; we would need to tweak some knobs re: batch size etc.
  2. to some extent it deals with the performance on read. We'd be reading anything with a prefix of $session_id, and we can put in some pagination to handle progressive loading
  3. it sidesteps any issues with the plugin-server by using Kafka Connect to handle this instead. We could also just write a Consumer specifically for this task if we wanted more control. tbh I haven't looked at what the plugin-server is doing with these events.

This would avoid needing to do any batch jobs to push the data from ClickHouse to S3.

Disclaimer: I have never used Kafka Connect or the S3 sink before, so we may want to instead consider using our own Consumer. It should, however, be battle tested at large scale.

I've put together a prototype here #prsoverissues

@hazzadous
Contributor

(also, happy to help out however is needed)

@pauldambra
Member

love the kafka to s3 connector idea @hazzadous!

(2) is executed serially in python, meaning that if each request takes 100ms we end up spending 15s loading an average session.

I'd expect s3/minio reads to be way faster than that... but #9901 was only intended to be a deployable step. I was anticipating passing the paths back instead of the data and having a slim API endpoint to let the UI read them. That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us.

@macobo
Contributor Author

macobo commented Jun 7, 2022

Given that we already have a buffer that is Kafka, we do not need to push these into ClickHouse. Rather we can batch and push data to S3 in batches on the fly, thereby keeping ClickHouse only for analytics workloads related to sessions, e.g. for calculation of session duration.

Note that one of the problems outlined in #2142 (comment) is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead.

You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?

it sidesteps any issues with the plugin-server by using Kafka Connect to handle this instead. We could also just write a Consumer specifically for this task if we wanted more control. tbh I haven't looked at what the plugin-server is doing with these events.

Plugin-server currently

  • Creates missing people based off of $snapshot events
  • Calls onSnapshot plugin events

You can find the code in plugin-server/src/worker/ingestion/process-event.ts

@pauldambra

That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us.

Is that the request pattern for session recordings? My expectation is that the vast majority of recordings are read between 0..1 times, with the actual value being closer to 0.01 in most cases. In that case the caching isn't really relevant, as almost all recordings are opened at most once.

@hazzadous
Contributor

Note that one of the problems outlined in #2142 (comment) is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead.

Note that the S3 connector handles batching events into single S3 objects. For instance, you can give each object a size limit, message limit, or time limit, or a combination of these - e.g. each batch should be no greater than 10 MB or 30 seconds, whichever comes first. I think this should suffice on the PUT cost front, although there's still a risk there.

As for slow reads, that could be an issue in some cases 🤔 This can partially be tackled with parallel reads, which is pretty scalable for S3, but it's something to watch out for. If we wanted to, we could periodically compact, or have a separate connector that handles larger chunks at the expense of latency.

Plugin-server currently

As for the other functions of the plugin-server, we can leave them as they are if required. Either we continue sending the same message as we do now (see the prototype) or we send a stripped down version. Either way, I propose we initially continue sending the messages as they are; we can feature flag the session snapshot endpoint, and nothing else needs to change for the time being.

My expectation is that the vast majority recordings are read between 0..1

I'm not sure exactly what the pattern looks like, I'll have a look at the logs/PostHog insight for that!

@hazzadous
Contributor

You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?

This is how it's implemented in the prototype. Not optimized, but gives the general flow of things.

@guidoiaquinti
Contributor

Data is put into ClickHouse as it is today

I would like to challenge this point as well. The proposal from @hazzadous looks very solid to me and I don't see a major downside to implementing it.

@posthog-contributions-bot
Contributor

This issue has 2002 words at 13 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:

  1. Write some code and submit a pull request! Code wins arguments
  2. Have a sync meeting to reach a conclusion
  3. Create a Request for Comments and submit a PR with it to the meta repo or product internal repo

Is this issue intended to be sprawling? Consider adding label epic or sprint to indicate this.

@pauldambra
Member

pauldambra commented Jun 8, 2022

This issue has 2002 words at 13 comments.

so responding to a specific part...

Data is put into ClickHouse as it is today

Instead of having an ongoing process where we write to ClickHouse and then migrate to S3 after a delay, we have a setting/flag that enables writing to S3, plus an async migration that moves data from ClickHouse to S3 once.

Before that migration has started: we write to CH and S3 and fall back to reading from CH.
While it is running: we write snapshot data only to S3, and read only from S3.

  • downside: Need to read from the appropriate store while migration is running
  • upside: the overall complexity is lower

After it has run: no snapshot data is written to or stored in CH.

At this point we don't need to send snapshot data to the plugin server, so it is handling smaller messages.

NB I'm making two assumptions

  1. Reading/writing to CH and object storage have similar latency - we're still transferring the same amount of data over the network (at least by the end of the migration).

  2. We want to allow people to carry on storing recordings in CH - don't turn on the setting/flag and never run the migration; the code has two read/write paths based on that setting.

@pauldambra
Member

pauldambra commented Jun 8, 2022

So, I think we're deciding between two options (if we ignore the option of not doing this at all):

| | A | B |
|---|---|---|
| Ingestion | Write all to CH | Write all except snapshot to CH, write snapshot to S3 |
| Chunking | On a timer, chunk from CH to S3 with kafka connect | At write time, chunk to S3 with kafka connect |
| Read | From S3, fallback to CH | From CH and S3 (to cover recordings started in CH and finished in S3) (Table TTL means CH-stored data is eventually removed and all recordings only have snapshots in S3) |

In both options

  • the list view remains in CH
  • no new tables are necessary (???)

@pauldambra
Member

I think that

  • what if kafka is down
  • what if CH is down
  • what if network is slow

affect both options equally

@mariusandra
Collaborator

mariusandra commented Jun 8, 2022

This sounds like an hour-long Thursday meeting. Let's get everyone in a room and talk it over? @pauldambra @hazzadous @guidoiaquinti @macobo @rcmarron

To add more context and more fuel to the issue, my big fear is this:

At write time, chunk to S3 with kafka connect

First, it looks like kafka connect is another container you need to run via docker/helm? Adding another dependency like that is not free... 🤔 Also, those 13 todos in that PR scare me 😅. Doable of course, but it's a big project to get this just right.

Second, how can we reliably ensure that the stream of snapshot events for each $team_id/$session_id/$window_id combo gets chunked into as few 100 KB - 1 MB pieces as possible? This means not smaller, not larger.

In my mind, the main complexity with this project is in ensuring we can reliably chunk the data. Otherwise the worst case "we will pay a 💩-ton of money" scenario outlined by Karl here kicks in, and we'll have a new project to implement what we could have done in the first place :).

Does kafka-connect employ some kind of dark magic to ensure such chunking at our volumes, or what am I missing here? We read data from Kafka in batches, and then mark the entire batch as processed. The batches in the plugin server usually go up to a few hundred events. We could possibly make them bigger, but if you assume we're receiving snapshots from a few thousand unique users every minute, it's highly unlikely that more than just a few events for each $team_id/$session_id/$window_id end up in one batch.

With this, we'll reduce the files uploaded to S3 by a factor of 1x (expected) to 3x (best case), not the 50x we could achieve with post-batching.

Is there some special place where kafka-connect stores these in-flight events to ensure decently sized chunks? In memory? Can we configure it to use clickhouse for buffering? 🤣

So what am I missing?

@pauldambra
Member

pauldambra commented Jun 8, 2022

Does kafka-connect employ some kind of dark magic to assure such chunking with our volumes

yes :) https://github.com/PostHog/posthog/pull/10142/files#diff-44c65cdf7d29921a0b4dcf47453356304816877fad1a587637b769fd0dc0580dR32

So there's no code for us to maintain to do it; it is set to a 10x saving, but we can tweak that as needed.

We read data from Kafka in batches,

No reading for us to do... write to Kafka (which is the storage while not committed to disk). kafka-connect writes to S3 in chunks for us.

@mariusandra
Collaborator

Is that the max or min saving? Where does it buffer the events to assure it?

@hazzadous
Contributor

@mariusandra these partitions are not Kafka partitions afaik

@mariusandra
Collaborator

Where does it store the events then? 😆 Or at least the metadata regarding which event in the last gigabyte of data belongs to which partition? How often is a kafka batch committed? How frequently does this batch to S3?

I get that kafka-connect has a lot of configuration options, but those are wishes, not guarantees. We still can't escape math... I think. Please tell me if I'm wrong, how I'm wrong.

The way I see it, we have this huge unknown question that throws the feasibility of the entire implementation into question, and nobody can tell me how it'll actually work. Instead the approach is "let's spend the next 2 weeks on all the other details of the implementation, and then let's hope the main issue is also resolved".

And, disclaimer, I can be wrong, but:

it's mathematically correct but not how this works

So how does it work? Can anyone please tell me before we invest more time in this? 😅

I scheduled a meeting for tomorrow to discuss this in sync, in case that's needed.

@mariusandra
Collaborator

@mariusandra do you happen to know what the post-processing requirements would be e.g. in terms of numbers of queries of ClickHouse per second, and volume of data, ClickHouse memory requirements?

I don't have the numbers right now, but basically everything that goes in must go out 🤷 😆. So it'll just select * from each_partition_in_clickhouse, and either group by team_id or group by team_id, window_id or yolo. I'm not sure. In any case we'll have "all the time in the world" to move the data from one place to another.

@fuziontech
Member

fuziontech commented Jun 9, 2022

Drive by suggestion 🚗

We could just use ClickHouse to manage this.

I've made a sample table posthog.testing_s3_session_recording_events on CH8 (backup node) that is 99% backed by S3 and partially backed by st1 EBS volume. This is totally configurable.

If you log into metabase and run the following on CH8

select table, disk_name, count(1) parts, sum(bytes_on_disk) bytes_on_disk from system.parts
where table = 'testing_s3_session_recording_events'
group by table, disk_name
order by bytes_on_disk desc;

You can verify that most of the parts for the table reside in s3 with the above query.

This was mainly inspired by Altinity's post https://altinity.com/blog/clickhouse-and-s3-compatible-object-storage which I think we will want to use for all long-term data retention at some point. Having a tiered storage solution would be really valuable and this seems like a great first step.

The nice thing about this is:

  • All merges happen on disk on CH and then parts are beamed up to S3
    • This avoids small files on S3 (the cost that @macobo was mentioning which is a real issue)
  • Moving to S3 is done depending on some rule
    • TTL
    • CH disk capacity
    • We can also manually move parts/partitions to S3 if we want that level of control
  • ClickHouse manages everything behind the scenes (less complexity in our app)
  • Is totally transparent to the app
  • We'll need to do this with events eventually

Negatives here:

  • We can't shard this table. Maybe this is ok if performance is solid enough on S3.

Go ahead and query this table to see what the performance implications are, but they should be relatively minimal.

At the time of writing we are about 10% complete with the upload to S3, with 550 GB across 4,211 files. This means we have an average file size of ~130 MB, which is really quite ideal for S3 - much better than a file per part of a session recording, or even a file per session recording. ClickHouse is really quite efficient at grabbing just the parts of files out of S3 that it needs, since it has the metadata store that describes what it needs to grab out of a specific file on S3 for a query. ClickHouse intelligently uses S3's Range query feature, which grabs just the bits of a file that are needed for the results. 💥

Give the table a try and let me know how performance looks. 👀

@mariusandra
Collaborator

This is sweet. To understand the dynamics better, what happens on reads over large volumes, or, said differently, will we get charged an integer amount of dollars every time someone opens the session recordings page or makes any query with a filter?

@macobo
Contributor Author

macobo commented Jun 9, 2022

Note that the S3-backed table doesn't do any "session rollups". Meaning if we query the table, it reads the whole partition (or maybe only a subset of it) from S3, then constructs the recording in-memory inside clickhouse. In essence we're doing what we're doing now, just on a slower disk.

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.

@pauldambra
Member

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.

On this...

I simulated this by changing the DEFAULT_CHUNK_LIMIT when viewing a recording. #10198

We load the pages of recording data serially. So loading a larger blob is faster overall.

But...

We start playing the video as soon as the first blob is available. So it appears quicker to the viewer if the first blob (or first few) is quick to download even if the browser takes longer overall.

So in order to maintain a snappy start to playback of recordings...

Bundling the files to S3 probably needs to slowly increase the size of the blobs over the life of the recording so that we can buffer faster than playback

e.g.

instead of 5 x 10MB blobs

2 x 0.5MB
2 x 1MB
1 x 2MB
1 x 5MB
4 x 10MB

(or whatever works in practice)

@pauldambra
Member

dumping comments from sync discussion here before thinking about actual implementation steps

Kafka-Connect

  • Concerns about management and complexity in scaling this if writing to S3 on ingestion
  • Concerns about whether additional service is needed if writing to S3 as a post-processed step

Backing CH Tables with S3

  • Concerns about query performance
  • Particularly given snapshot data and recording metadata are in the same table
    • Which supports listing, paging, filtering operations

Post Processing

Desired end-state

  • No changes to SDK
  • Recordings metadata stored in a ClickHouse table with no TTL
  • Recordings snapshot data stored in a separate ClickHouse table with no TTL
  • Sessions have a maximum length (e.g. 24 hours)
  • After that maximum length an entire ClickHouse partition can be moved to object storage and the partition dropped
  • When viewing, snapshot data is read from object storage or ClickHouse based on the age of the session recording (or, probably less complex: read from both and return whichever has the data)

Pros

  • Can watch a recording very soon after a session starts
  • Can load a recording with little buffering
  • Can reduce amount of expensive CH storage
  • Can have longer (or indefinite) TTL on recordings
  • Write to S3 can be less efficient because viewers aren’t waiting for it
  • Doesn’t rule out using kafka-connect or backing CH tables with S3 but doesn’t require them

Needs to have solutions for

  • recordings that finished before new tables were used
  • recordings in progress during move to new tables
  • recordings started after move to new tables

@mariusandra
Collaborator

mariusandra commented Jun 9, 2022

So... we had the call, and I think the conclusions were these. Correct me if I'm wrong.

Option 3, JamS3.

  • This option is very cool, but is still half of a solution. We do need the metadata in ClickHouse, in a place where we can quickly query it. Otherwise queries like these (basically any listing query), will repeatedly fetch a lot of partitions from S3, causing problems and 💸.
  • To get around that, we need to split the session recording events table into session_recordings_metadata and session_recordings_data (names pending). The first we'd query for events, etc. The second we'd use to store the data. We would give the second table a TTL to move things to S3, and only query it when we find events that match our time range or other filters.
  • In order to paginate the results (this is an absolute must to enable fast playback, and player seeking support), we still need to fetch all pages of a recording in our API, chunk the parts, add metadata (e.g. which part contains which timestamps), discard all the chunks except the ones the user requested, and then send that to the user. This is a lot of wasted work and redundant queries. Yet doing this, and sending the user the first chunk, is faster than sending all the chunks.

Conclusion: this could work well, but it won't be faster than what we have now (must read all chunks to serve the first), and we do need to split the tables to metadata and data.

Option 2, Post-Processing

  • Initially we do everything as it is now, but after some TTL (1h? 24h? 14d?), we manually batch and move session recordings to S3 with a cronjob.
  • To do this, we still need a split of tables: metadata and data (names pending). The latter we'd migrate and delete one partition at a time when it's ready.
  • When chunking we pre-calculate the metadata and create the most conveniently sized json files that our users can fetch directly.

Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Option 1, Live Streaming via kafka-connect (partially implemented)

  • In this originally proposed option, we move events to S3 as they come in
  • We could create a ~60sec in-memory buffer to gather events belonging to one recording, before uploading them to S3
  • We're constrained by memory usage
  • This requires adding one more service to docker-compose, which basically just does await fetch in a loop, and comes with unknown maintenance hurdles.
  • Instead of await pushToKafka we could just do await readFromKafkaAndPushToS3 in the plugin server, and get virtually the same results with no extra services.
  • There's no way (according to math) that this can be faster than the other two approaches. The number of events will be the same (or 2-10x less... but this is moot since we grow 2x very quickly), we'll be talking to an external service that we need to individually query and request files from, all those operations are measured in dollars, and we need to run them many times per recording due to the pagination code, which will not be simplified.
  • All in all, I think this is a bad idea that we want to avoid.

@fuziontech
Member

fuziontech commented Jun 9, 2022

This option is very cool, but is still half of a solution. We do need the metadata in ClickHouse, in a place where we can quickly query it. Otherwise queries like these (basically any listing query), will repeatedly fetch a lot of partitions from S3, causing problems and 💸.

It shouldn't be too bad. That query is only pulling the following columns because ClickHouse is columnar:
session_id, window_id, timestamp, distinct_id, has_full_snapshot

Meaning that the actual amount of data pulled from s3 is pretty small (since the actual payloads are not getting pulled).

The table itself is ordered by:
team_id, toHour(timestamp), session_id, timestamp, uuid

which means that as long as we are able to hit these 'indexes' the size of the files pulled from S3 can be scoped down to a relatively small portion.

We are partitioned by:
PARTITION BY toYYYYMMDD(timestamp)

The number of files that are pulled from s3 can be calculated with this formula (ideally):
partitions x columns

I just ran this query as an example:

SELECT session_id,
	any(window_id) as window_id,
	MIN(timestamp) AS start_time,
	MAX(timestamp) AS end_time,
	dateDiff(
		'second',
		toDateTime(MIN(timestamp)),
		toDateTime(MAX(timestamp))
	) as duration,
	any(distinct_id) as distinct_id,
	SUM(has_full_snapshot) as full_snapshots
FROM testing_s3_session_recording_events
WHERE team_id = 2
	and timestamp >= '2022-05-21'
	and timestamp <= '2022-05-27'
GROUP BY session_id
HAVING full_snapshots > 0

It executed in
10544 rows in set. Elapsed: 0.658 sec. Processed 387.93 thousand rows, 82.80 MB (589.68 thousand rows/s., 125.85 MB/s.)

The number of files selectively range gotten from S3 should be roughly:
5 columns x 7 partitions = 35 files

I'm sure that there is a way we could better format the schema to improve this (metadata and data like @mariusandra suggests). I'd love to hop on a call to talk about this and how ClickHouse is storing this merge tree in S3. This just makes our app a lot simpler and keeps the complexity inside of CH for the orchestration of recording storage.

Edit:
I do like this idea:

Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3.

@pauldambra
Member

ok, so that suggests a way forward

  1. split the snapshot data and metadata for recordings by adding a new snapshot_data table (rough sketch after this list) - I could do with pairing with @hazzadous, as I've remained blissfully ignorant of CH and Kafka infrastructure until now
  2. keep TTL on both tables
  3. experiment with backing the snapshot_data table onto S3
  4. either that doesn't perform well enough and we continue on to post processing, or it's at least good enough for rock and roll and we can pause and concentrate elsewhere (imagining a world where I work on the team's sprint goals instead of side quests 🤣)
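
As a very rough sketch of steps 1 and 3 (all names, the TTL, and the 's3_policy' storage policy are assumptions to be replaced during the pairing):

-- metadata stays on local disk and keeps driving the list view
CREATE TABLE session_recording_metadata
(
    team_id     UInt64,
    session_id  String,
    window_id   String,
    distinct_id String,
    timestamp   DateTime64(6)
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (team_id, toHour(timestamp), session_id, timestamp)
TTL toDate(timestamp) + INTERVAL 3 WEEK;

-- snapshot payloads live in a separate table that can be backed by S3
CREATE TABLE session_recording_snapshot_data
(
    team_id       UInt64,
    session_id    String,
    timestamp     DateTime64(6),
    snapshot_data String
)
ENGINE = MergeTree()
PARTITION BY toYYYYMMDD(timestamp)
ORDER BY (team_id, session_id, timestamp)
TTL toDate(timestamp) + INTERVAL 3 WEEK
SETTINGS storage_policy = 's3_policy';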

@pauldambra
Member

I'm treating #10234 as a sketch... Think of it as putting the wrong answer on the internet to shake the right answer loose :)

@mariusandra
Collaborator

@fuziontech that is really cool. I don't have ssh access to the cluster (too lazy/busy to set it up now), and metabase doesn't report the "MB read" number. However, when I removed the team_id clause, it took 5 seconds to give some results, probably reading around a gigabyte in back-of-the-napkin math. This sounds like it'll add up in dollars. We'll eventually have teams recording as many sessions as we do now combined... so I guess we should still store the metadata locally, and just the recordings in S3.

@pauldambra if you'll keep working on it (which is fine by me), this will basically be your sprint goal, as getting this just right isn't a 1-day project. An MVP split could be 🤔

@mariusandra
Collaborator

Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3.

More metadata :party_parrot:

The constraint is that we'll have events (avg 150 per recording) of varying sizes - some 100 bytes, some 2+ MB. However, we want stable pagination based on both the recording timestamp and chunk sizes. We also need to know where in those chunks the "keyframes" are to be able to seek anywhere.

Maybe clickhouse could create that metadata itself with a query, and then we make another query to just fetch the relevant chunks... 🤔
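
A rough sketch of the kind of query that could produce that chunk metadata inside ClickHouse (assumes a version with window-function support; the ~1 MB chunk target and names are arbitrary):

SELECT
    session_id,
    timestamp,
    length(snapshot_data) AS event_bytes,
    -- running payload size within the recording, in playback order
    sum(length(snapshot_data)) OVER (PARTITION BY session_id ORDER BY timestamp) AS running_bytes,
    -- bucket events into ~1 MB chunks; a follow-up query could then fetch only the requested chunk
    intDiv(sum(length(snapshot_data)) OVER (PARTITION BY session_id ORDER BY timestamp), 1048576) AS chunk_index
FROM session_recording_events
WHERE team_id = 2
  AND session_id = '...'
ORDER BY timestamp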

@macobo
Contributor Author

macobo commented Jun 10, 2022

@fuziontech you're measuring the wrong query.

So session recording has 3 components:

  1. Ingestion - getting data in
  2. "List all sessions query" - this is the query you were measuring
  3. "Get a single recording query"

It's number 3 that needs the most consideration in a "just-store-the-table-in-S3" scenario.

In a store-sessions-in-S3 scenario, number 3 won't work well because it's reading the column containing 99% of the data (metabase). The normal rules of clickhouse apply here - to get a single recording we will be reading multiple partitions and, due to granularity, much more extra data than we need for that single recording.

Note this will be slower than what we have currently, and what we have is a problem right now: grafana - if you add the lag of fetching from S3 it will only get slower.

I think when discussing this project we keep missing the important detail here: conceptually, session recordings are not the events, but rather the sum of them. The most efficient way to solve (3) is to have one file (or multiple files) per recording containing all the events for that recording, which you can directly request/stream on the client side.

We side-stepped that by leveraging clickhouse initially. Rolling up sessions is kind of hard™ to do in a streaming fashion IMO, which is why I'm still in favor of the postprocessing approach where we move whole recordings to S3 in batches after the fact.

@pauldambra
Member

this basically will be your sprint goal,

team of one 🤣

Not complaining, it's valuable work... Just worried about pulling on other people's time and being a distraction for them


We query snapshots using

        SELECT session_id, window_id, distinct_id, timestamp, snapshot_data
        FROM session_recording_events
        WHERE
            team_id = %(team_id)s
            AND session_id = %(session_id)s
        ORDER BY timestamp

and page in memory...

If we split the snapshots and metadata then that's the only query we need to care about for S3 performance.

So, as is, we'd read all session snapshots from S3 multiple times for each recording. If we can make this "page-able" (definitely a word) then we'd read each bit of snapshot data once

We need the split in order to do post processing... and it'd be great to figure out how to make paging easier anyway.


I'd really like to get to a point where we can test the speed of reading from S3. IME (admittedly reading in AWS Lambda not in ECS) it is ridonculously fast. That may not be true in this context but would be good to know one way or another so we don't fix for problems we don't have (or build for affordances we don't have)

@pauldambra
Member

pauldambra commented Jun 10, 2022

I'm still in favor of the postprocessing approach

Yep, I think absent discovering the S3 tables are incredible (enough) or uncovering un-fixable (also definitely a word) problems with post-processing then that's the approach now

If for no other reason than we've spent enough time debating it and need to make stuff and see if we discover some real reason to change course now :)

@macobo
Contributor Author

macobo commented Jun 10, 2022

Note this will be slower than what we have currently, and what we have is a problem right now: grafana - if you add the lag of fetching from S3 it will only get slower.

Potentially confusion-adding note on this, I ran a quick test on the table @fuziontech mentioned:

SELECT
    session_id,
    window_id,
    distinct_id,
    timestamp,
    snapshot_data
FROM
    testing_s3_session_recording_events
WHERE
    team_id = 2
    AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'
ORDER BY
    timestamp

Against the current table (which should be disk-based, not 100% though) : 2467 rows in set. Elapsed: 8.706 sec. Processed 773.40 thousand rows, 120.53 MB (88.84 thousand rows/s., 13.84 MB/s.)
S3-backed testing_s3_session_recording_events table: 2467 rows in set. Elapsed: 9.674 sec. Processed 313.12 thousand rows, 88.77 MB (32.37 thousand rows/s., 9.18 MB/s.)

These seem comparable read performance wise, however:

  • not sure where the difference is coming from between the two queries in rows read, maybe the test table doesn't have as much data? Also note the mb/s is slower with the new table.
  • This recording is 21MB uncompressed and we are reading 4-5x the MB needed from disk (or potentially more than that due to compression)
  • This approach would be slower in practice as we'd need to also send the data to app -> compress it -> client loads from api. Loading a precompressed file from S3 would be faster than that given all the steps it skips.
See size query
SELECT
    formatReadableSize(sum(length(snapshot_data)))
FROM
    testing_s3_session_recording_events
WHERE
    team_id = 2
    AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'

@lancedouglas1

In the interim, is it possible to "just" drop session recordings into us users' own S3 buckets, using session_id-based naming and dated folders, and let us users worry about both pricing and loading the recordings? As for events and metadata, that's all secondary for me anyway - I need the recordings archived for compliance auditing purposes.

@mariusandra
Collaborator

Hey @lancedouglas1 this is actively being worked on: #10799


@pauldambra
Member

we now store session recordings in S3, offering 30 days (up from 21 days) retention for free users and 90 days retention (up from 21) for paid

@guidoiaquinti
Contributor

Nice work @pauldambra and team! 👏
