Frequent Websocket 1006 (closed without reason) errors being seen. #27
Comments
We are seeing this as well (and #31 looks like the same issue). My current hypothesis is that it has to do with the cutover to live-streaming. We never see it if we stay firmly in the distant past (consuming records at a rate below the average event rate of ~1000 ev/sec), or if we omit the cursor parameter and start live-streaming right away. But if we pick a time well in the past (say, 5 minutes) and consume events at approximately the current server rate limit (5000 ev/sec), we get disconnected at roughly the point where we've caught up with the live stream. I think it goes something like this:
I'm still not 100% sure I understand the cutover logic, but it looks like it keeps feeding records from the DB until the cursor gets within 0.5 seconds of now. Meanwhile, once it gets within 1 second, it starts receiving records from the live stream, and the logic at the start of emitToSubscriber ensures that the replayed events are discarded once the subscriber is live-streaming. I'm not sure what the ideal solution is, but just spitballing: one (perhaps too costly) strategy would be to give the outbox a capacity greater than the replay event rate limit (which looks to be 10x the server rate limit). That might make it take longer to drop clients that could hold up the live stream, though. Perhaps subscribers that are too far behind could be culled periodically, if checking the length of the channel on each push is undesirable?
@cwegrzyn I believe you're exactly on the right track. Adding on to your last points: the 10x replay event limit affects how quickly the DB cursor can advance. It slows the cursor down if the client sets up event filtering that would lead to very sparse output, since the subscriber event limit only applies to events that actually get sent. For a client connecting with no filters, this has the effect of completely filling up the outbox (as long as the DB cursor can actually go fast), since the outbox can only be drained at the lower rate. Even though the outbox is full, this does not get the client removed during the early part of replay, because the send on the outbox channel from replay is blocking. (So a slow client slows the DB cursor down even more, and the channel just wastes a full outbox worth of memory on the cursor/client rendezvous.) Once we reach the cutover threshold, every live-tailed send on the channel threatens to trigger a disconnect, as you note. I'm not 100% sure why this doesn't happen more often on the less-overloaded instances.

Overlap period, from cutover threshold (now - 1s) to when replay ends (now - 0.5s): this is more specific to #42, but from here the subscriber will drop "old" events instead of emitting them, so when the live tail emits its first newer event, Jetstream effectively drops all the remaining replay events. It would be nice to know if this is by design, but that's probably a question for #42.

Possible fixes: since during replay the outbox is acting as a kind of rendezvous buffer anyway, I think it would make sense to just bypass it and do a blocking send directly to the client. Replay already happens in its own goroutine, so it's not at risk of bogging down other clients. I'm imagining that the subscriber's outbox then becomes a kind of on-ramp around cutover time: the live tail starts pushing events to it when the replay is within 1s. The consumer grabs the first one and holds onto it, comparing replay sequences until they match. It can then drop the replay and pick up from the outbox, for a seamless cutover. Maybe?
I am unsure if this is a problem or by design, but about every 5-10 seconds my websocket closes with error code 1006 and no provided reason. I am using the TypeScript WebSocketKeepAlive class found in the atproto repo. This keeps my websocket alive, so I can consume it correctly, but it seemed odd that the socket would close every 5 to 10 seconds. If this is expected, feel free to close this issue.