
Missing session recordings for user #2927

Closed

macobo opened this issue Jan 13, 2021 · 8 comments
Labels: bug (Something isn't working right) · P0 (Critical, breaking issue: page crash, missing functionality)

Comments


macobo commented Jan 13, 2021

Bug description

This came from a client report: https://posthog.slack.com/archives/C01HY5TRZP0/p1610460342002100

For this user, session recording was not working despite being enabled. Digging in a bit more, our db correctly contained a lot of their events, but not the full-payload snapshot ones.

Expected behavior

Session recording events are correctly received

How to reproduce

Visit the site referenced in the Slack thread above and observe that no recordings show up.

Environment

All

Additional context

Globally this seems to happen relatively rarely, but it happens consistently for this site.

I dug into this technically and what I found was pretty fascinating:

  1. We gzip request payloads using fflate and send them to /s. I noticed the largest initial request, containing the full snapshot event, got a response {status: 1}.
  2. Inspecting the payload, there are ~6 NULL bytes near the beginning. According to my googling, NULL bytes should be valid in gzip-compressed data.
  3. Copying the curl command and sending it to localhost, capture fails with "Not a gzipped file".
  4. Inspecting request.body locally reveals that only the bytes up to the NULL bytes are read, even though content-length indicates the body should be larger.

I asked about this behavior on the django IRC channel and googled around, but could not find anything indicating whether NULL bytes in a request body are unacceptable or whether django is misbehaving here.

Diagnosis

At least locally, django uses runserver.py under the hood, which in turn uses a WSGIServer.

In the end, the thing that reads the request body buffer is cpython's BufferedIO implementation, which assumes the NULL bytes it reads are to be ignored.

Not sure whether that's correct, but that's the world we're in. Not 100% sure that gunicorn behaves identically.

Misc context

Example sentry error: https://sentry.io/organizations/posthog/issues/2079528998/events/d8bf9c6fca384ee988d1288073da5ae4/?project=1899813&statsPeriod=14d (Not a gzipped file (b'\x1f\x00')), though my own failing request is not reported there.
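That error message is easy to reproduce directly. A minimal python sketch (standalone, not our capture code): gzip checks the two magic bytes b'\x1f\x8b' first, so a body whose second byte arrives as a NULL fails the header check with exactly this error.

import gzip

payload = gzip.compress(b'{"event": "$snapshot"}')  # valid body, starts with b'\x1f\x8b'

corrupted = b'\x1f\x00' + payload[2:]  # second magic byte replaced by a NULL
try:
    gzip.decompress(corrupted)
except gzip.BadGzipFile as err:  # OSError subclass; python 3.8+
    print(err)  # Not a gzipped file (b'\x1f\x00')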

Curl request for reproducing NULL byte behavior:

curl 'localhost:8000/s/?compression=gzip-js&ip=1&_=1610460868479' \
  -H 'authority: app.posthog.com' \
  -H 'user-agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36' \
  -H 'dnt: 1' \
  -H 'content-type: text/plain' \
  -H 'accept: */*' \
  -H 'sec-fetch-site: cross-site' \
  -H 'sec-fetch-mode: cors' \
  -H 'sec-fetch-dest: empty' \
  -H 'accept-language: en-GB,en;q=0.9,en-US;q=0.8,et;q=0.7' \
  -H 'cookie: phtoolbar=yes' \
  --data-binary $'\u0008\u0000\u0000\u0000\u0000\u0000\u0000\u0003' \
  --compressed

Next steps

To avoid a CORS preflight request we send the request as content-type text/plain. Using application/x-www-form-urlencoded would bloat the request payload size significantly. One thing worth trying is multipart/form-data.
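For context, a hedged sketch of the server side (illustrative only, not our actual /s endpoint): a django view that gunzips the raw body whenever the compression=gzip-js query parameter from the curl command above is present, independent of the text/plain content type.

import gzip

from django.http import JsonResponse
from django.views.decorators.csrf import csrf_exempt

@csrf_exempt
def capture(request):
    body = request.body  # read off the WSGI input stream discussed above
    if request.GET.get('compression') == 'gzip-js':
        body = gzip.decompress(body)  # raises BadGzipFile if the body was truncated
    # ... decode json, enqueue events ...
    return JsonResponse({'status': 1})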

Extra complexity here might come from needing to make sure newer posthog-js stays compatible with older versions of posthog.

It is also worth updating fflate, which has seen some recent development.

Thank you for your bug report – we love squashing them!

@macobo added the bug and session recording labels on Jan 13, 2021

macobo commented Jan 15, 2021

Seems like this was an issue with curl rather than with event capture – I set up a test suite (code below for reference) and the problem did not occur when the request was made by the browser.

Something else is (still) going wrong for that company, not sure why session recording is not working.

POC test

/// <reference types="cypress" />


describe('Event capture', () => {
    given('options', () => ({ capture_pageview: false }))
    given('sessionRecording', () => false)
    given('supportedCompression', () => ['gzip', 'lz64'])

    // :TRICKY: Use a custom start command over beforeEach to deal with given2 not being ready yet.
    const start = () => {
        cy.route({
            method: 'POST',
            url: '**/decide/*',
            response: {
                config: { enable_collect_everything: true },
                editorParams: {},
                featureFlags: ['session-recording-player'],
                isAuthenticated: false,
                supportedCompression: ['gzip-js', 'lz64'],
            },
        }).as('decide')

        cy.visit('./playground/cypress')
        cy.posthogInit(given.options)
    }

    it('works', () => {
        start()

        cy.readFile('../posthog/string.json').then((body) => {
            console.log(body)
            cy.posthog().then(($posthog) => {
                $posthog._handle_queued_event('http://localhost:8000/s', body)
                console.log($posthog._handle_queued_event)
            })
        })
    })
})

@macobo changed the title from "Sending session recording events fails due to NULL bytes in request payload" to "Missing session recordings for user" on Jan 15, 2021

macobo commented Jan 20, 2021

Reproducing

Managed to reproduce this locally by:

  1. Blocking array.js from prod for myself, using the bookmarklet from localhost

As above, normal events show up in the db but full_snapshot ones do not.

Extra scripts, for historical reference

Digging in, I saved the full snapshot events to a file:

    # patched locally into the capture path, where `data` is the parsed snapshot event
    if len(data["snapshot_data"]) > 1000000:
        open('dump.json', 'w').write(json.dumps(data))

And attempted to run this via the python shell:

# replay the dumped snapshot event through our kafka/clickhouse producer
from ee.clickhouse.sql.session_recording_events import INSERT_SESSION_RECORDING_EVENT_SQL
from ee.kafka_client.client import ClickhouseProducer
from ee.kafka_client.topics import KAFKA_SESSION_RECORDING_EVENTS
from datetime import datetime
import json

data = json.loads(open('dump.json').read())
p = ClickhouseProducer()
produce = lambda: p.produce(sql=INSERT_SESSION_RECORDING_EVENT_SQL, topic=KAFKA_SESSION_RECORDING_EVENTS, data=data)
produce() or datetime.now()  # the echoed timestamp is handy for correlating with logs

Nothing interesting shows up in the clickhouse logs after this: docker cp posthog_clickhouse_1:/var/log/clickhouse-server/clickhouse-server.log chserver.log; less chserver.log

Local fix

I think this narrows things down to either kafka or clickhouse. For this site, the raw json message is 3962402 bytes. Might we be running into kafka message size limitations? (Kafka's default maximum message size is roughly 1MB, well below that.)

As per https://stackoverflow.com/questions/21020347/how-can-i-send-large-messages-with-kafka-over-15mb I made a couple of changes locally:

diff --git a/ee/docker-compose.ch.yml b/ee/docker-compose.ch.yml
index e70a5aed..b2dfdf59 100644
--- a/ee/docker-compose.ch.yml
+++ b/ee/docker-compose.ch.yml
@@ -106,3 +106,6 @@ services:
         environment:
             KAFKA_ADVERTISED_HOST_NAME: kafka
             KAFKA_ZOOKEEPER_CONNECT: zookeeper:2181
+            KAFKA_MESSAGE_MAX_BYTES: 1000048588
+            KAFKA_REPLICA_FETCH_MAX_BYTES: 1000048588
+            KAFKA_FETCH_MESSAGE_MAX_BYTES: 1000048588
diff --git a/ee/kafka_client/client.py b/ee/kafka_client/client.py
index 7c7f9aed..fad6a08e 100644
--- a/ee/kafka_client/client.py
+++ b/ee/kafka_client/client.py
@@ -34,7 +34,7 @@ class _KafkaProducer:
         elif KAFKA_BASE64_KEYS:
             self.producer = helper.get_kafka_producer(value_serializer=lambda d: d)
         else:
-            self.producer = KP(bootstrap_servers=KAFKA_HOSTS)
+            self.producer = KP(bootstrap_servers=KAFKA_HOSTS, max_request_size=1000048588)
 
     @staticmethod
     def json_serializer(d):

After this, session recording worked.

@fuziontech mind bumping these in the production kafka cluster and in our producers?
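For reference, a minimal sketch of the failure mode with kafka-python's KafkaProducer (the KP in the diff above); the broker address and topic name here are illustrative:

from kafka import KafkaProducer
from kafka.errors import MessageSizeTooLargeError

producer = KafkaProducer(bootstrap_servers='localhost:9092')  # max_request_size defaults to 1048576

snapshot = b'x' * 3962402  # same size as the raw json message above
try:
    producer.send('session_recording_events', snapshot)
except MessageSizeTooLargeError as err:
    print(err)  # raised client-side, before anything reaches the broker

Note that bumping max_request_size on the producer only helps if message.max.bytes (and replica.fetch.max.bytes) are raised on the brokers as well, which is what the docker-compose changes above do.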


macobo commented Jan 21, 2021

Tried sending the raw insert statement to clickhouse as suggested by @fuziontech. This, however, runs into max query size limits:


0. DB::parseQueryAndMovePosition(DB::IParser&, char const*&, char const*, std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, bool, unsigned long, unsigned long) @ 0xe938c39 in /usr/bin/clickhouse
1. ? @ 0xdeca297 in /usr/bin/clickhouse
2. DB::executeQuery(std::__1::basic_string<char, std::__1::char_traits<char>, std::__1::allocator<char> > const&, DB::Context&, bool, DB::QueryProcessingStage::Enum, bool) @ 0xdec9e9d in /usr/bin/clickhouse
3. DB::TCPHandler::runImpl() @ 0xe56a3b6 in /usr/bin/clickhouse
4. DB::TCPHandler::run() @ 0xe5771b7 in /usr/bin/clickhouse
5. Poco::Net::TCPServerConnection::start() @ 0x10d4f71f in /usr/bin/clickhouse
6. Poco::Net::TCPServerDispatcher::run() @ 0x10d5112e in /usr/bin/clickhouse
7. Poco::PooledThread::run() @ 0x10e823e9 in /usr/bin/clickhouse
8. Poco::ThreadImpl::runnableEntry(void*) @ 0x10e7e31a in /usr/bin/clickhouse
9. start_thread @ 0x9609 in /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
10. clone @ 0x122293 in /usr/lib/x86_64-linux-gnu/libc-2.31.so
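The trace comes from the query parser (DB::parseQueryAndMovePosition), which points at clickhouse's max_query_size limit (262144 bytes by default): inlining a ~4MB payload into the SQL text blows past it. A hedged sketch with clickhouse-driver, with table and column names that are illustrative rather than our actual schema, showing the difference between inlining values and passing them as parameters:

from clickhouse_driver import Client

client = Client('localhost')
snapshot_data = 'x' * 3962402  # ~4MB payload, as above

# Inlining the value into the SQL text counts against max_query_size and
# fails in the parser, as in the trace above:
#   client.execute(f"INSERT INTO session_recording_events (snapshot_data) VALUES ('{snapshot_data}')")

# Passing values as parameters streams them outside the parsed query text,
# so max_query_size no longer applies:
client.execute(
    'INSERT INTO session_recording_events (snapshot_data) VALUES',
    [{'snapshot_data': snapshot_data}],
)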

We're currently on Heroku with our kafka setup, but we'll move off of it soon. Hopefully we can update the max sizes then.

@fuziontech - assigning you this issue until the kafka cluster has moved.


macobo commented Jan 22, 2021

Another user reported this error.

macobo added a commit that referenced this issue Jan 25, 2021
Temporarily solves #2927,
though the error will rear its head again with plugin-based ingestion.

Opened https://github.com/PostHog/posthog/issues/3068 for long-term
solution.
fuziontech pushed a commit that referenced this issue Jan 27, 2021
* Hotfix for missing session recording events

Temporarily solves #2927,
though the error will rear its head again with plugin-based ingestion.

Opened https://github.com/PostHog/posthog/issues/3068 for long-term
solution.

* kwargs everywhere

* use settings over kwargs
@EDsCODE added the P0 (Critical, breaking issue: page crash, missing functionality) label on Feb 11, 2021
@paolodamico (Contributor) commented:

Every week, 1-2 users mention that they have problems getting all their sessions recorded. I usually just explain our current technical limitations to them, but is there some other, more helpful information we can ask them for to improve this?

In addition to the above, for reference: a user recently mentioned that PostHog captures somewhere between 30-50% of their sessions while FullStory captures 100%.


macobo commented Apr 27, 2021

> but is there some other, more helpful information we can ask them for to improve this?

Not right now. These are pretty clear product/technical limitations we're dealing with here, and both getting visibility into them and fixing them require development focus from our side.

@paolodamico (Contributor) commented:

What's the state of this? @rcmarron @alexkim205

@rcmarron (Contributor) commented:

For customers who have upgraded posthog-js, this seems to be resolved. The percentage of "missing recordings" is down from 10-50% (depending on the website) to <1%.

At some point, we should look into the remaining 1%, but I think it's safe to close for now.
