
feat: session recording to s3 - posthog only #9901

Closed
wants to merge 35 commits

Conversation

@pauldambra (Member) commented May 23, 2022

Problem

see #9294

Changes

Adds the ability to write session recordings to object storage for configured teams.

Still stores recordings to ClickHouse so that recordings can continue to be displayed if the feature is turned off.

How did you test this code?

Added tests and ran it locally:

  • shows in preflight check only when healthy (don't scare people while it isn't mandatory)
  • records and views sessions correctly to events when the team is not on the allow list
  • records and views sessions correctly to storage when the team is on the allow list
  • records and views sessions correctly when the team is on the list and configured to use object storage, but the docker service is stopped

If storage is enabled but not available, there is a delay before recordings can be viewed, as the read is attempted before falling back to the unaltered original ClickHouse data.

return snapshot_data
}

if (team_id !== 2) {

@hazzadous (Contributor) commented May 23, 2022:

I gather this needs to be replaced with an ENABLED_SESSION_RECORDINGS_OBJECT_STORAGE_TEAM_IDS env var or something less long!

@pauldambra (Member, Author) replied:

ENABLED_SESSION_RECORDINGS_OBJECT_STORAGE_TEAM_IDS_ALLOW_LIST_FOR_TO_ALLOW_MORE_SAFELY_THE_TESTING_OF_THIS_FEATURE


@hazzadous (Contributor) left a review:

Looks good to me aside from the team_id hardcoding.

@pauldambra force-pushed the session_recording_to_s3_posthog_only branch from b5a4c09 to b46c460 on May 23, 2022 18:40
@@ -766,6 +779,57 @@ export class EventsProcessor {
}
}

private tryStoreSessionRecordingToObjectStorage(

A Contributor commented:

Why is this named try?

@pauldambra (Member, Author) replied:

It's what I was thinking when I made the method... It can just as easily be storeSessionRecording :)

On the same hunk, a Contributor commented:

So this class is doing way too much. Please put this elsewhere, e.g. extract class/method/etc - we don't want this class to also be responsible for the business logic of object storage.


const storageWriteTimer = new Date()

this.objectStorage.putObject(params, (err: any, resp: any) => {

A Contributor commented:

Plugin server works off of promises, not callbacks. Please create a promise-based API to be consistent.
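The promise-based shape being asked for can be sketched as a thin wrapper over the callback API. This is illustrative only, not the PR's code: `client` is a stand-in with the aws-sdk v2 callback signature, and the names are hypothetical.

```typescript
// Minimal sketch: wrap a callback-style putObject in a promise so callers
// can `await` it. `client` is a stand-in for the real S3 client; only the
// (params, callback) shape is assumed here.
interface PutParams {
    Bucket: string
    Key: string
    Body: string
}

const client = {
    putObject(params: PutParams, cb: (err: Error | null, resp?: { ETag: string }) => void): void {
        // pretend the upload succeeded, echoing the key back in the ETag
        cb(null, { ETag: 'etag-for-' + params.Key })
    },
}

function putObject(params: PutParams): Promise<{ ETag: string }> {
    return new Promise((resolve, reject) => {
        client.putObject(params, (err, resp) => {
            if (err) {
                reject(err)
            } else {
                resolve(resp as { ETag: string })
            }
        })
    })
}
```

A caller can then `await putObject(...)` and handle failures with ordinary try/catch, consistent with the rest of the plugin server.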


this.objectStorage.putObject(params, (err: any, resp: any) => {
if (err) {
console.error(err)
}
})

A Contributor commented (on the console.error line):

Debug logging.

this.pluginsServer.statsd?.timing('session_data.storage_upload.timing', storageWriteTimer, tags)

A Contributor commented:

This timer will always be zero since we call it just after queueing a callback.

@pauldambra (Member, Author) replied:

homer-hedge
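The fix implied by this comment is to stop the timer only after the upload settles. A sketch under assumed names (`fakeUpload` stands in for the promisified putObject; the statsd call is indicated by a comment):

```typescript
// Sketch: the duration is only meaningful if we await the upload before
// reading the clock, rather than timing right after queueing a callback.
function fakeUpload(): Promise<void> {
    return new Promise((resolve) => setTimeout(resolve, 50))
}

async function storeWithTiming(): Promise<number> {
    const start = Date.now()
    await fakeUpload() // the await is what makes the measurement cover the upload
    // in the PR this is where statsd?.timing(...) would record a real duration
    return Date.now() - start
}
```

With the callback version, `Date.now() - start` would be read before the upload even started, which is why the reported timing was always near zero.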

@@ -150,6 +150,7 @@ export interface PluginsServerConfig extends Record<string, any> {
OBJECT_STORAGE_SECRET_ACCESS_KEY: string
OBJECT_STORAGE_SESSION_RECORDING_FOLDER: string
OBJECT_STORAGE_BUCKET: string
OBJECT_STORAGE_TEAM_ALLOW_LIST: 'all' | number[]

A Contributor commented:

I don't see code that's parsing this.

On the same hunk, a Contributor commented:

Suggestion: Let's continue the same pattern as with other env vars rather than introducing a new one. So remove 'all' and make this a string, same as CONVERSION_BUFFER_ENABLED_TEAMS. If we want to turn this on for everyone, let's yeet the code or add a separate var.
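Following the CONVERSION_BUFFER_ENABLED_TEAMS pattern the reviewer refers to, the env var would be a comma-separated string of team ids parsed once at startup. A sketch with an illustrative function name:

```typescript
// Sketch: parse a comma-separated team-id string (e.g. "2, 3,42") into a
// Set<number>, ignoring blanks and non-numeric entries.
function parseTeamAllowList(raw: string): Set<number> {
    return new Set(
        raw
            .split(',')
            .map((part) => parseInt(part.trim(), 10))
            .filter((id) => !isNaN(id))
    )
}
```

The resulting `Set` makes the per-event allow-list check an O(1) `allowList.has(teamId)` instead of string comparisons.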

@@ -26,7 +26,7 @@ def _query_recording_snapshots(self) -> List[SessionRecordingEvent]:
window_id=window_id,
distinct_id=distinct_id,
timestamp=timestamp,
- snapshot_data=json.loads(snapshot_data),
+ snapshot_data=try_read_from_object_storage(session_id, snapshot_data),

A Contributor commented:

Why try - as a code reader my immediate question is what happens if it fails?

@pauldambra force-pushed the session_recording_to_s3_posthog_only branch 2 times, most recently from 9e989d1 to 8d969b2 on May 24, 2022 18:40
@pauldambra marked this pull request as ready for review May 27, 2022 09:12
try {
- connectObjectStorage(serverConfig)
+ objectStorage = connectObjectStorage(serverConfig)

A Contributor commented:

This try-catch is dead code - let's streamline all of this and remove the status.infos/try-catches etc.

@pauldambra (Member, Author) replied:

Have stripped it right back.

@@ -122,20 +124,44 @@ describe('e2e', () => {
})

test('snapshot captured, processed, ingested', async () => {

A Contributor commented:

This test is doing too much. Can you add a separate test where object storage is explicitly enabled and check the behavior of it?

@pauldambra (Member, Author) replied:

There are more focussed tests that prove that object_storage_path is created correctly... stripped this back to just checking that it is set. There's definitely follow-up work coming on this so will refactor/improve tests then.

expect(event.distinct_id).toEqual('some-id')
expect(event.snapshot_data).toEqual({ timestamp: 123 })

const expectedFolderDate = castTimestampOrNow(now, TimestampFormat.DateOnly)

@macobo (Contributor) commented May 27, 2022:

Should this be a separate/new test rather than tacking new behavior onto an existing one?

Especially since the name of the test is now wrong/misleading.

@pauldambra (Member, Author) replied:

I split the test (after first trying to move them into their own file) and they started not generating events... I'll leave this to a follow-up so I can concentrate on why that happened with less surrounding noise.

}
},
}
} catch (e) {

A Contributor commented:

Q: Why would this ever throw?

@pauldambra (Member, Author) replied:

I don't think it would... the only place that might is initialising the s3 client, and that's lazy so shouldn't throw.

However, I wanted to make sure that initialising object storage couldn't stop the plugin server starting.

The Contributor replied:

I don't think there's any value in being so defensive here given there are no awaits/anything unusual. This style IMO makes the code harder to follow, covers no known use cases and potentially hides important bugs we want to be aware of/bail early on.

@pauldambra requested a review from macobo May 27, 2022 13:58
@mariusandra (Collaborator) left a review:

Had a look, left some thoughts.

My biggest concern is how this handles cloud scale. There's no way things will be fast or reliable if, for one recording, we need to first upload a thousand tiny JSON files to S3, and then download and combine all of them. I'm worried about the huge latencies.

  1. For each click event, just by moving my mouse I often generate between a dozen and a hundred session recording events. Even though our biggest-volume customers don't use session recordings, we have as many session recording events/snapshots coming in each minute on cloud as non-session-recording events. Is await uploadToS3 the best thing we can do?
  2. What if S3 is down?
  3. Can we or should we batch these instead, or upload async? I'd imagine a recordings table in postgres, with one row representing one .json file for everything between two full snapshots in a recording. So about 200 snapshots would be combined into one downloadable URL.

Edit: don't take my questions as blockers for trying it out... but know that we have just entered the forest for the trek towards Mt Doom. We haven't even seen the Orcs yet.
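The batching idea in point 3 can be sketched roughly as follows. All names here (`SnapshotEvent`, `SessionBatcher`) are illustrative, not from the PR:

```typescript
// Sketch: buffer snapshot events per session and flush one combined object
// per window instead of one tiny file per event.
interface SnapshotEvent {
    sessionId: string
    data: Record<string, unknown>
}

class SessionBatcher {
    private buffers = new Map<string, SnapshotEvent[]>()
    public flushedObjects: string[] = []

    constructor(private flushThreshold: number) {}

    add(event: SnapshotEvent): void {
        const buffer = this.buffers.get(event.sessionId) ?? []
        buffer.push(event)
        this.buffers.set(event.sessionId, buffer)
        if (buffer.length >= this.flushThreshold) {
            // one combined upload instead of `flushThreshold` tiny ones;
            // a real implementation would putObject this JSON blob
            this.flushedObjects.push(JSON.stringify(buffer))
            this.buffers.set(event.sessionId, [])
        }
    }
}
```

A real version would also flush on a timer and on full-snapshot boundaries, as the review suggests, so partial windows aren't held indefinitely.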


const aws = require('aws-sdk')

let S3: typeof aws.S3 | null = null

@mariusandra (Collaborator) commented:

I'm usually against globals. Why not make ObjectStorage into an actual class, and store this as an instance variable? The pojo/yolo-object returned from connectObjectStorage is stored on the hub anyway, so might as well be a real class.
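The shape being suggested might look like the sketch below: the client becomes an instance field instead of a module-level `let`. `S3Like` is a stand-in interface; the real constructor would build the aws-sdk client here.

```typescript
// Sketch: ObjectStorage as a class holding its client as instance state,
// so the hub stores a real object rather than a pojo plus a module global.
interface S3Like {
    endpoint: string
}

class ObjectStorage {
    private readonly client: S3Like

    constructor(endpoint: string) {
        // a real implementation would do `new aws.S3({ endpoint, ... })`
        this.client = { endpoint }
    }

    describe(): string {
        return `object storage at ${this.client.endpoint}`
    }
}
```

This also makes the storage trivially mockable in tests, since a fake instance can be passed wherever the hub expects one.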

public async fetchSessionRecordingEvents(
sessionId?: string | undefined
): Promise<PostgresSessionRecordingEvent[] | SessionRecordingEvent[]> {
const predicate = !!sessionId ? ` WHERE session_id = '${sessionId}'` : ''

@mariusandra (Collaborator) commented:

Are we sure inlining the ID in the query without any sanitization is a good idea?
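The parameterized alternative keeps the session id out of the SQL text and passes it as a bound value, the (text, values) shape node-postgres uses. `buildQuery` below is illustrative; no database connection is made:

```typescript
// Sketch: build the query with a $1 placeholder so user input never
// becomes part of the SQL string itself.
function buildQuery(sessionId?: string): { text: string; values: string[] } {
    const base = 'SELECT * FROM session_recording_events'
    return sessionId
        ? { text: `${base} WHERE session_id = $1`, values: [sessionId] }
        : { text: base, values: [] }
}
```

Even a malicious session id like `x'; DROP TABLE ...; --` then travels only in `values`, where the driver binds it as data.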

Comment on lines +58 to +64: @mariusandra (Collaborator) suggested a change:

- const altered_data = { ...snapshot_data }
- // don't delete the snapshot data **yet**, or if we have to roll back we lose data
- // this makes the timings less accurate because we're storing and loading the data twice
- // delete altered_data.data
- altered_data['object_storage_path'] = object_storage_path
- return JSON.stringify(altered_data)
+ // don't delete the snapshot data **yet**, or if we have to roll back we lose data
+ // this makes the timings less accurate because we're storing and loading the data twice
+ return JSON.stringify({ ...snapshot_data, object_storage_path })

@posthog-bot commented:

This PR hasn't seen activity in a week! Should it be merged, closed, or further worked on? If you want to keep it open, post a comment or remove the stale label – otherwise this will be closed in another week.

@fuziontech self-requested a review June 8, 2022 19:53
@pauldambra closed this Jun 14, 2022
7 participants