Save old session recordings to S3 #2142
Comments
Great stuff! Just adding to the conversation here: since this will permit storing recordings for longer time periods (even indefinitely), I suggest we ship it with a mechanism to comply with GDPR (or equivalent) requests and erase recordings that pertain to the people who file them.
Please see the relevant Slack thread. Currently, recordings are only kept for 3 weeks on ClickHouse-based installations. There's no way to preserve recordings for longer (either all of them or a subset), so it may be increasingly important to consider something like this. Some customers have reported pain with this (see the thread for details).
@pauldambra @rcmarron @hazzadous @timgl @yakkomajuri some feedback on the current plan for session recordings.

Outline/my understanding of the plan

Instead of storing individual snapshot events for session recordings in ClickHouse, we save each event to S3, storing pointers in ClickHouse and fetching the events from S3 when fetching the session.

Problems with the approach

Performance

On average, session recordings have 150 events, p95 being 500 events and p99 being 2691 events (metabase). This means that to load a given session we need to:
1. run a query against ClickHouse to get the pointers to that session's events, and
2. issue one S3 GET per event (~150 on average, thousands at the tail).

(1) is still the same old query - this will still load a bunch of data.

Expense

Retrieval: S3 GET requests cost $0.0004 per 1,000 (source). This means that loading a thousand recordings will, on average, cost us at least $0.06 (which isn't terrible).

Saving these sessions: we would be doing on the order of billions of PUT requests per month to S3. According to their pricing page, this would cost us thousands to tens of thousands of dollars per month (metabase).

Ingestion

Ingestion/plugin-server is a really performance-sensitive piece of infrastructure, where adding 100ms per event ingested can mean we aren't able to keep up with ingestion. Note that forwarding each event to S3 makes it harder to keep the plugin-server lively.

Alternatives

My original proposal to @rcmarron was to:
Note

I want this to succeed, so if I missed something obvious as to why this will work, let me know.
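To make the request-cost arithmetic above concrete, here's a back-of-the-envelope sketch (purely illustrative: the prices are the public per-1,000-request S3 rates quoted above, and the monthly PUT volume is an assumed example, not a measured number):

```typescript
// Back-of-the-envelope S3 request costs for the one-object-per-event approach.
const GET_PRICE_PER_1K = 0.0004 // USD per 1,000 GET requests
const PUT_PRICE_PER_1K = 0.005 // USD per 1,000 PUT requests
const AVG_EVENTS_PER_RECORDING = 150

// Loading 1,000 average recordings => ~150,000 GETs => ~$0.06
const readCostPer1kRecordings = ((1_000 * AVG_EVENTS_PER_RECORDING) / 1_000) * GET_PRICE_PER_1K

// "Billions of PUT requests per month" - assuming 2 billion for illustration => ~$10,000/month
const assumedMonthlyPuts = 2_000_000_000
const monthlyWriteCost = (assumedMonthlyPuts / 1_000) * PUT_PRICE_PER_1K

console.log({ readCostPer1kRecordings, monthlyWriteCost }) // ≈ { 0.06, 10000 }
```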
Great analysis! I've raised similar points earlier in a review and in a few calls. The alternative proposal seems solid, and we can still reuse the object storage work. |
It seems like a next step here is to hash out more details on what an alternate approach could look like.

Product requirements

Note: this isn't an exhaustive list, but these are important
Proposal

I'm very open to other options here, but I wanted to put some thoughts on paper to start a discussion (I believe this captures what @macobo and I talked through a couple of months ago)
The big advantages of this approach are that:
@rcmarron thanks for the writeup, this sounds good to me, especially capturing the nuances of splitting the recording into chunks! A couple of additions on my part:

Product requirements
Additions to proposal
Other notes
Run two queries (which should return identical schemas) - one for sessions in object storage, one for sessions in ClickHouse - and uniq (deduplicate) the results. You can also UNION ALL these :) Note that if the ClickHouse list query becomes slow long-term, we can set up projections to speed it up!
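For illustration, the merged listing could look roughly like this - a sketch only, where `session_recording_events_s3` and the selected columns are placeholders rather than the real schema:

```typescript
// Hypothetical "two sources, one list" query - the second table name is made up.
const LIST_SESSIONS_SQL = `
    SELECT session_id, any(distinct_id) AS distinct_id, min(timestamp) AS start_time, max(timestamp) AS end_time
    FROM (
        SELECT session_id, distinct_id, timestamp FROM session_recording_events WHERE team_id = 2
        UNION ALL
        SELECT session_id, distinct_id, timestamp FROM session_recording_events_s3 WHERE team_id = 2
    )
    GROUP BY session_id
`
```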
Late to the party 🥳 One more alternative: given that we already have a buffer in Kafka, we do not need to push these into ClickHouse. Rather, we can batch and push the data to S3 on the fly, thereby keeping ClickHouse only for analytics workloads related to sessions, e.g. calculating session duration.
This would avoid needing to do any batch jobs to push the data from ClickHouse to S3. Disclaimer: I have never used Kafka Connect or the S3 sink before; we may want to consider using our own consumer instead. It should, however, be battle-tested at large scale. I've put together a prototype here #prsoverissues
(also, happy to help out however is needed) |
love the kafka to s3 connector idea @hazzadous!
I'd expect s3/minio reads to be way faster than that... but #9901 was only intended to be a deployable step. I was anticipating passing the paths back instead of the data and having a slim API endpoint to let the UI read them. That would let that data end up in cloudflare and browser caches so if there are subsequent views they don't need to come from us. |
Note that one of the problems outlined in #2142 (comment) is that we're still storing every single session recording event as a separate object in S3. From a retrieval perspective this is slow/unnecessary, as we have a concept of sessions instead. You mentioned prefixing by $session_id - perhaps that solves that concern for us? Can you elaborate a bit/go into more technical detail?
Plugin-server currently
You can find the code in plugin-server/src/worker/ingestion/process-event.ts
Is that the request pattern for session recordings? My expectation is that the vast majority of recordings are read between 0 and 1 times, with the actual value being closer to 0.01 in most cases. In this case the caching isn't really relevant, as almost no recording is opened more than once.
Note that the S3 connector handles batching events into single S3 objects. For instance, you can give each object a size limit, message limit, or time limit, or a combination of these - e.g. each batch should be no greater than 10MB or 30 seconds, whichever comes first. I think this should suffice on the PUT-cost front, although there's still a risk there. As for slow reads, that could be an issue in some cases 🤔 this can partially be tackled with parallel reads, which is pretty scalable for S3, but it's something to watch out for. If we wanted to, we could periodically compact, or have a separate connector that handles larger chunks at the expense of latency.
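For reference, a hedged sketch of what that batching configuration might look like for the Confluent S3 sink, expressed as the object you'd POST to the Kafka Connect REST API (property names are from the connector's documentation as I recall them, and the topic/bucket values are placeholders - double-check before relying on this):

```typescript
// Rough sketch of an S3 sink connector config - values are placeholders.
const sessionRecordingSinkConfig = {
    'connector.class': 'io.confluent.connect.s3.S3SinkConnector',
    'topics': 'session_recording_events', // placeholder topic name
    'storage.class': 'io.confluent.connect.s3.storage.S3Storage',
    'format.class': 'io.confluent.connect.s3.format.json.JsonFormat',
    's3.bucket.name': 'posthog-session-recordings', // placeholder bucket
    's3.region': 'us-east-1',
    // Batching knobs: commit an object after N records, or rotate after a time window.
    'flush.size': '1000',
    'rotate.interval.ms': '30000',
    // Partition objects by a message field so one session's events share a key prefix.
    'partitioner.class': 'io.confluent.connect.storage.partitioner.FieldPartitioner',
    'partition.field.name': 'session_id',
}
```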
As for the other functions of the plugin-server, we can leave them as they already are if that's required. Either we continue sending the same message as we do now (see the prototype) or we send a stripped-down version. Either way, I propose we continue sending the messages as they are for now; we can feature flag the sessions snapshot endpoint, and nothing else needs to change for the time being.
I'm not sure exactly what the pattern looks like, I'll have a look at the logs/PostHog insight for that! |
This is how it's implemented in the prototype. Not optimized, but gives the general flow of things. |
I would like to challenge this point as well. The proposal from @hazzadous looks very solid to me, and I don't see a major downside to implementing it.
This issue has 2002 words at 13 comments. Issues this long are hard to read or contribute to, and tend to take very long to reach a conclusion. Instead, why not:
Is this issue intended to be sprawling? Consider adding label |
so responding to a specific part...
Instead of having an ongoing process where we write to ClickHouse and then migrate to S3 after a delay, we have a setting/flag that enables writing to S3, plus an async migration that moves data from ClickHouse to S3 once. Before that migration has started: we write to CH and S3 and fall back to reading from CH.
After it has run: no snapshot data is written to or stored in CH. At this point we don't need to send snapshot data to the plugin server, so it is handling smaller messages.

NB: I'm making two assumptions
We're still transferring the same amount of data over the network (at least by the end of the migration)
Don't turn on the setting/flag. Never run the migration. The code has two read/write paths based on that setting.
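A minimal sketch of what the write side of those two paths could look like - every name here (`s3WritesEnabled`, `migrationComplete`, the sink functions) is hypothetical, not existing PostHog code:

```typescript
// Hypothetical dual write path controlled by an instance setting: before the one-off
// migration completes we write to both stores (ClickHouse stays the read fallback);
// afterwards snapshot data only goes to object storage.
interface SnapshotEvent {
    teamId: number
    sessionId: string
    windowId: string
    timestamp: string
    snapshotData: string
}

interface SnapshotSinks {
    writeToClickhouse: (e: SnapshotEvent) => Promise<void>
    writeToObjectStorage: (e: SnapshotEvent) => Promise<void>
}

async function writeSnapshot(
    event: SnapshotEvent,
    settings: { s3WritesEnabled: boolean; migrationComplete: boolean },
    sinks: SnapshotSinks
): Promise<void> {
    if (settings.s3WritesEnabled) {
        await sinks.writeToObjectStorage(event)
    }
    if (!settings.s3WritesEnabled || !settings.migrationComplete) {
        // Keep ClickHouse populated until the migration has run, so reads can fall back to it.
        await sinks.writeToClickhouse(event)
    }
}
```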
So, I think we're deciding between two options (if we ignore the option of not doing it)
In both options
I think that
affect both options equally |
This sounds like an hour-long Thursday meeting. Let's get everyone in a room and talk it over? @pauldambra @hazzadous @guidoiaquinti @macobo @rcmarron

To add more context and more fuel to the issue, my big fear is this:
First, it looks like kafka-connect is another container you need to run via docker/helm? Adding another dependency like that is not free... 🤔 Also, those 13 todos in that PR scare me 😅. Doable of course, but it's a big project to get this just right.

Second, how can we reliably assure that the stream of snapshot events for each session ends up in S3 in decently sized chunks? In my mind, the main complexity with this project is in assuring we can reliably chunk the data. Otherwise the worst-case "we will pay a 💩-ton of money" scenario outlined by Karl here kicks in, and we'll have a new project to implement what we could have done in the first place :). Does kafka-connect employ some kind of dark magic to assure such chunking at our volumes, or what am I missing here?

We read data from Kafka in batches, and then mark the entire batch as processed. The batches in the plugin server usually go up to a few hundred events. We could possibly make them bigger, but if you assume we're receiving snapshots from a few thousand unique users every minute, it's highly unlikely that more than just a few events for each session end up in the same batch. With this, we'll reduce the number of files uploaded to S3 by 1x (expected) to 3x (best case), not the 50x we could get with post-batching.

Is there some special place where kafka-connect stores these in-flight events to assure decently sized chunks? In memory? Can we configure it to use ClickHouse for buffering? 🤣. So what am I missing?
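To put rough numbers on the chunking concern above (all values illustrative - they just restate the arithmetic in the comment):

```typescript
// Illustrative only - the point is the ratio, not the exact values.
const eventsPerKafkaBatch = 500 // "batches ... usually go up to a few hundred events"
const activeSessions = 3_000 // "a few thousand unique users every minute"
const avgEventsPerRecording = 150

// Grouping one batch by session yields, on average, well under one event per session:
const eventsPerSessionPerBatch = eventsPerKafkaBatch / activeSessions // ≈ 0.17

// So per-batch grouping gives objects holding ~1-3 events (a 1x-3x reduction in object count),
// versus rolling a whole finished recording up into a handful of blobs (~50x fewer objects).
console.log({ eventsPerSessionPerBatch, rollupReduction: avgEventsPerRecording / 3 })
```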
So there's no code for us to maintain to do it; it is set to a 10x saving, but we can tweak that as needed.
No reading for us to do... write to Kafka (which is the storage while not committed to disk). kafka-connect writes to S3 in chunks for us. |
Is that the max or min saving? Where does it buffer the events to assure it? |
@mariusandra these partitions are not Kafka partitions afaik |
Where does it store the events then? 😆 Or at least the metadata regarding which event in the last gigabyte of data belongs to which partition? How often is a Kafka batch committed? How frequently does this get batched to S3?

I get that kafka-connect has a lot of configuration options, but those are wishes, not guarantees. We still can't escape math... I think. Please tell me if I'm wrong, and how I'm wrong.

The way I see it, we have this huge unknown question that throws the feasibility of the entire implementation into question, and nobody can tell me how it'll actually work. Instead the approach is "let's spend the next 2 weeks on all the other details of the implementation, and then let's hope the main issue is also resolved".

And, disclaimer, I can be wrong, but:
So how does it work? Can anyone please tell me before we invest more time in this? 😅 I've scheduled a meeting for tomorrow to discuss this in sync, in case that's needed.
I don't have the numbers right now, but basically everything that goes in must go out 🤷 😆. So it'll just |
Drive-by suggestion 🚗 We could just use ClickHouse to manage this. I've made a sample table. If you log into metabase and run the following on
You can verify that most of the parts for the table reside in S3 with the above query. This was mainly inspired by Altinity's post https://altinity.com/blog/clickhouse-and-s3-compatible-object-storage, which I think we will want to use for all long-term data retention at some point. Having a tiered storage solution would be really valuable, and this seems like a great first step. The nice thing about this is:
Negatives here:
Go ahead and query this table to see what the performance implications are, but they should be relatively minimal. At the time of writing we are about 10% of the way through the upload to S3, with 550GB uploaded across 4211 files. This works out to an average file size of ~130MB, which is really quite ideal for S3 - much better than a file per part of a session recording, or even a file per session recording.

ClickHouse is really quite efficient at grabbing just the parts of files out of S3 that it needs, since it has a metadata store describing what it needs to read from a specific file on S3 for a query. It intelligently uses S3's Range request feature, which grabs just the bits of a file that are needed for the results. 💥

Give the table a try and let me know how performance looks. 👀
This is sweet. To understand the dynamics better, what happens on reads over large volumes, or, said differently, will we get charged an integer amount of dollars every time someone opens the session recordings page or makes any query with a filter? |
Note that the S3-backed table doesn't do any "session rollups". Meaning: if we query the table, it reads the whole partition (or maybe only a subset of it) from S3, then constructs the recording in memory inside ClickHouse. In essence we're doing what we're doing now, just on a slower disk.

I still maintain that rolling up sessions into one or multiple "blobs" is the way to go for read-performance reasons. That said, I'm willing to be surprised if it turns out this isn't needed at all.
On this... I simulated this by changing the

We load the pages of recording data serially, so loading a larger blob is faster overall. But... we start playing the video as soon as the first blob is available, so it appears quicker to the viewer if the first blob (or first few) is quick to download, even if the browser takes longer overall.

So, in order to maintain a snappy start to playback of recordings, bundling the files to S3 probably needs to slowly increase the size of the blobs over the life of the recording, so that we can buffer faster than playback - e.g. instead of 5 x 10MB blobs, start with 2 x 0.5MB ones (or whatever works in practice)
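As a sketch of what that could mean in practice (the numbers and the function are invented for illustration, not a concrete proposal): keep the first blobs tiny so playback can start almost immediately, then grow the blob size so the total object count stays low.

```typescript
// Hypothetical blob-size schedule: small chunks at the start of a recording for a fast
// time-to-first-frame, growing geometrically so later data ships in fewer, larger objects.
function targetBlobSizeBytes(blobIndex: number): number {
    const initial = 512 * 1024 // 0.5 MB for the first blob
    const growthFactor = 2
    const max = 10 * 1024 * 1024 // cap at 10 MB
    return Math.min(initial * growthFactor ** blobIndex, max)
}

// First few targets: 0.5 MB, 1 MB, 2 MB, 4 MB, 8 MB, 10 MB, 10 MB, ...
const schedule = Array.from({ length: 7 }, (_, i) => targetBlobSizeBytes(i))
console.log(schedule.map((bytes) => (bytes / 1024 / 1024).toFixed(1) + ' MB'))
```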
dumping comments from sync discussion here before thinking about actual implementation steps

Kafka-Connect
Backing CH Tables with S3
Post Processing

Desired end-state
Pros
Needs to have solutions for
So... we had the call, and I think the conclusions were these. Correct me if I'm wrong.

Option 3, JamS3.
Conclusion: this could work well, but it won't be faster than what we have now (must read all chunks to serve the first), and we do need to split the tables to separate metadata from snapshot data.

Option 2, Post-Processing
Conclusion: this is similar to JamS3, but we do more piping on the app side. Since we pre-calculate the chunks and do upfront optimizations, playback will be 🔥.

Option 1, Live Streaming via kafka-connect (partially implemented)
It shouldn't be too bad. That query is only pulling the following columns, because ClickHouse is columnar: Meaning that the actual amount of data pulled from S3 is pretty small (since the actual payloads are not getting pulled). The table itself is ordered by: which means that as long as we are able to hit these 'indexes', the size of the files pulled from S3 can be scoped down to a relatively small portion. We are partitioned by: The number of files that are pulled from S3 can be calculated with this formula (ideally):

I just ran this query as an example:

```sql
SELECT session_id,
any(window_id) as window_id,
MIN(timestamp) AS start_time,
MAX(timestamp) AS end_time,
dateDiff(
'second',
toDateTime(MIN(timestamp)),
toDateTime(MAX(timestamp))
) as duration,
any(distinct_id) as distinct_id,
SUM(has_full_snapshot) as full_snapshots
FROM testing_s3_session_recording_events
WHERE team_id = 2
and timestamp >= '2022-05-21'
and timestamp <= '2022-05-27'
GROUP BY session_id
HAVING full_snapshots > 0
```

It executed in

The number of files selectively range-gotten from S3 should be roughly:

I'm sure that there is a way we could better format the schema to improve this (metadata and data, like @mariusandra suggests). I'd love to hop on a call to talk about this and how ClickHouse is storing this merge tree in S3. This just makes our app a lot simpler and keeps the complexity inside of CH for the orchestration of recording storage.

Edit:
Is there any way we can re-schema so that the chunks are optimized in CH? This seems ideal either way. Grabbing things from disk or S3. |
ok, so that suggests a way forward
I'm treating #10234 as a sketch... Think putting the wrong answer on the internet to shake the right answer loose :) |
@fuziontech that is really cool. I don't have ssh access to the cluster (too lazy/busy to set it up now), and metabase doesn't report the "MB read" number. However, when I removed the

@pauldambra if you'll keep working on it (which is fine by me), this will basically be your sprint goal, as getting this just right isn't a 1-day project. An MVP split could be 🤔
More metadata :party_parrot: The constraint is that we'll have events (avg 150 per recording) of varying sizes - some 100 bytes, some 2+MB. However, we want stable pagination based on both the recording timestamp and chunk sizes. We also need to know where in those chunks the "keyframes" are, to be able to seek anywhere. Maybe ClickHouse could create that metadata itself with a query, and then we make another query to just fetch the relevant chunks... 🤔
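One hedged way to picture that metadata (the shape below is invented for illustration, not an existing PostHog structure): a per-recording index describing each chunk's time range, size, and whether it starts from a full snapshot - enough for stable pagination and for seeking without downloading everything first.

```typescript
// Hypothetical per-recording chunk index - names and shape are illustrative only.
interface ChunkIndexEntry {
    objectKey: string // where this chunk lives in object storage
    startTimestamp: number // unix ms of the first event in the chunk
    endTimestamp: number // unix ms of the last event in the chunk
    sizeBytes: number
    hasFullSnapshot: boolean // a "keyframe" the player can start rendering from
}

// To seek to time t: start from the latest keyframe chunk at or before t and fetch onwards.
function chunksForSeek(index: ChunkIndexEntry[], t: number): ChunkIndexEntry[] {
    const sorted = [...index].sort((a, b) => a.startTimestamp - b.startTimestamp)
    let start = 0
    sorted.forEach((chunk, i) => {
        if (chunk.hasFullSnapshot && chunk.startTimestamp <= t) start = i
    })
    return sorted.slice(start)
}
```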
@fuziontech you're measuring the wrong query. So session recording has 3 components:
It's no. 3 that needs the most consideration in a "just-store-the-table-in-s3" scenario. In a store-sessions-in-s3 scenario, no. 3 won't work well because it's reading the column containing 99% of the data (metabase). The normal rules of ClickHouse apply here - to get a single recording we will be reading multiple partitions and, due to granularity, much more extra data compared to what we need for that single recording. Note this will be slower than what we have currently, and what we have is a problem right now: grafana - if you add the lag of fetching from S3, it will only get slower.

I think when discussing this project we keep missing the important detail here: conceptually, session recordings are not the events, but rather the sum of them. The most efficient way to solve (3) is to have a file (or multiple ones) per recording containing all the events for that recording, which you can directly request/stream on the client side. We side-stepped that by leveraging ClickHouse initially.

Rolling up sessions while streaming is kind of hard™ IMO, which is why I'm still in favor of the post-processing approach, where we move whole recordings to S3 in batches after the fact.
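To make the post-processing idea concrete, a minimal sketch under assumptions (the function, bucket, and key layout are hypothetical; the S3 calls use the AWS SDK v3 `PutObjectCommand`): a periodic job takes a finished recording's events, orders and compresses them into a single blob, uploads it, and returns the key so a pointer can be stored and the raw ClickHouse rows cleaned up later.

```typescript
import { S3Client, PutObjectCommand } from '@aws-sdk/client-s3'
import { gzipSync } from 'zlib'

const s3 = new S3Client({ region: 'us-east-1' })

interface SnapshotRow {
    sessionId: string
    timestamp: string // ISO timestamp of the event
    snapshotData: string // JSON-encoded rrweb event
}

// Hypothetical batch-job step: roll one finished recording up into a single compressed object.
async function rollUpRecording(teamId: number, sessionId: string, rows: SnapshotRow[]): Promise<string> {
    const ordered = [...rows].sort((a, b) => a.timestamp.localeCompare(b.timestamp))
    const body = gzipSync(JSON.stringify(ordered.map((row) => JSON.parse(row.snapshotData))))
    const key = `session_recordings/team_id/${teamId}/session_id/${sessionId}/data.json.gz`

    await s3.send(new PutObjectCommand({ Bucket: 'posthog-recordings', Key: key, Body: body }))
    // The caller would then record `key` in a metadata table and let a TTL/cleanup job
    // drop the raw snapshot rows from ClickHouse.
    return key
}
```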
team of one 🤣 Not complaining, it's valuable work... just worried about pulling on other people's time and being a distraction for them.

We query snapshots using
and page in memory... If we split the snapshots and metadata, then that's the only query we need to care about for S3 performance. So, as is, we'd read all session snapshots from S3 multiple times for each recording. If we can make this "page-able" (definitely a word), then we'd read each bit of snapshot data once.

We need the split in order to do post-processing... and it'd be great to figure out how to make paging easier anyway.

I'd really like to get to a point where we can test the speed of reading from S3. IME (admittedly reading in AWS Lambda, not in ECS) it is ridonculously fast. That may not be true in this context, but it would be good to know one way or another, so we don't fix problems we don't have (or build for affordances we don't have)
Yep, I think absent discovering the S3 tables are incredible (enough), or uncovering un-fixable (also definitely a word) problems with post-processing, then that's the approach now.

If for no other reason than we've spent enough time debating it and need to make stuff and see if we discover some real reason to change course now :)
Potentially confusion-adding note on this: I ran a quick test on the table @fuziontech mentioned:

```sql
SELECT
session_id,
window_id,
distinct_id,
timestamp,
snapshot_data
FROM
testing_s3_session_recording_events
WHERE
team_id = 2
AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'
ORDER BY
timestamp
```

Against the current table (which should be disk-based, not 100% sure though): These seem comparable read-performance-wise, however:
See size query:

```sql
SELECT
formatReadableSize(sum(length(snapshot_data)))
FROM
testing_s3_session_recording_events
WHERE
team_id = 2
AND session_id = '180f03251a873b-06dda7934f4b06-34736704-384000-180f03251a92109'
```
In the interim, is it possible to "just" drop session recordings into us users' own S3 buckets, using session_id-based naming and dated folders, and let us users worry about both pricing and loading the recordings? As for events and metadata, that's all secondary for me anyway; I need the recordings archived for compliance-auditing purposes.
Hey @lancedouglas1 this is actively being worked on: #10799 |
we now store session recordings in S3, offering 30 days of retention (up from 21 days) for free users and 90 days of retention (up from 21) for paid users
Nice work @pauldambra and team! 👏 |
Is your feature request related to a problem?
We prune session recordings regularly. However, accessing old recordings can give a lot of insight.
We could (if credentials are provided) store the session data in S3 or a compatible service. This way of storing data is less expensive and faster for long-lived recordings.
Describe the solution you'd like
The user adds their credentials under project settings and we back this data up, storing references to saved sessions in Postgres.
On Cloud, we could allow retention if the user is on a higher tier.
Describe alternatives you've considered
Additional context
Split this out from #1846, as the ClickHouse work has made this a bit harder to do.
Thank you for your feature request – we love each and every one!