Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

buffer messages when websocket connection is interrupted #2871

Merged
merged 5 commits into from
Oct 6, 2017

Conversation

rgbkrk
Copy link
Member

@rgbkrk rgbkrk commented Sep 27, 2017

This adds an (unbounded) queue to the ZMQChannelsHandler to replay messages when a user reconnects.

This assists those that lose a connection and keep their tab open. It does not help with the general problem of wanting a long running notebook to be saved in the background.

@rgbkrk rgbkrk requested a review from minrk September 27, 2017 22:12
@rgbkrk
Copy link
Member Author

rgbkrk commented Sep 27, 2017

D'oh. Of course. This is per connection and this goes away when the connection is severed.

@rgbkrk
Copy link
Member Author

rgbkrk commented Sep 28, 2017

I'm really unsure of how I could work around this to buffer messages to replay to clients when they reconnect.

@minrk
Copy link
Member

minrk commented Sep 28, 2017

Yeah, it would need to go somewhere else. Right now, we create a zmq socket per websocket connection. That zmq socket is destroyed when the websocket connection is lost.

What we would need to do is:

  1. leave the zmq connections open after the websocket goes away (potentially with a timeout, but kernel shutdown is also a reasonable time to close)
  2. when clients close, start buffering
  3. persist transient state of cell:msg_id mapping to be restored on reconnect, so that the right handlers are hooked up to deal with the resuming messages

I think this is going to be tricky without moving the document state to the server, but may still be worth exploring.

@rgbkrk
Copy link
Member Author

rgbkrk commented Sep 28, 2017

I think this is going to be tricky without moving the document state to the server, but may still be worth exploring.

Yeah I still want that in the long term, it's hard to keep promising that in the short term though.

Yeah, it would need to go somewhere else. Right now, we create a zmq socket per websocket connection. That zmq socket is destroyed when the websocket connection is lost.

Yikes, alright. Do we just keep one extra zmq socket around then in that case. I see now how I was spinning some wheels hopelessly -- the zmq connection is created here when the websocket connection is created.

@minrk
Copy link
Member

minrk commented Sep 28, 2017

Do we just keep one extra zmq socket around then in that case.

For the lost connection case (not new tab, new browser, etc.), then replay should be fine, and all we need to track is the session_id, which is what identifies a browser session. This should work:

  1. on websocket close, leave zmq stream open and start buffering instead of closing, as we do now.
  2. on websocket open, check session_id and reconnect to zmq_stream and replay if one already exists for that session_id. If session_id doesn't match, clear any buffers that may be open (i.e. don't allow multiple viewers to have this resume behavior on the same kernel).

@minrk
Copy link
Member

minrk commented Sep 28, 2017

Saving the hairy cell:msg_id problem for another task makes sense to me. We can focus on the lost connection case, which will improve several real use cases.

@rgbkrk
Copy link
Member Author

rgbkrk commented Sep 28, 2017

Based on conversations with Min, I'm going to take a new stab at this within the ZMQChannelsHandler and:

  • Attaching the buffer to the kernel object
  • There will only be one buffer per kernel
  • This is only intended to make reconnecting with the same tab work well (we aren't worried about multiple sessions since we don't work well with multiple tabs now anyways...)

@blink1073
Copy link
Contributor

blink1073 commented Sep 28, 2017

@rgbkrk rgbkrk changed the title [WIP] buffer messages in the ZMQStreamHandler [WIP] buffer messages in the ZMQChannelsHandler Oct 3, 2017
@alexgarel
Copy link

alexgarel commented Oct 3, 2017

Just as you implements this particular mechanism, in case you want, you may also think of multi concurrent user notebooks in this form:

  • multiple user are working on the same notebook
  • as soon as a user submit a cell, it's value is updated in the other user notebook (in case of conflict, interface propose something)
  • as soon as the result is ready, it is displayed in all notebooks

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 3, 2017

We're going to tackle multi user concurrent notebooks with a server side state of the notebook, which is not done today. This is only a band aid. Check out https://github.com/jupyterlab/jupyterlab-google-drive for the current alpha for realtime.

@minrk
Copy link
Member

minrk commented Oct 3, 2017

I've pushed an implementation that works. It's still an unbounded list.

  • buffer is per-kernel
  • session_key is stored because only the same session that is buffered will resume properly
  • on any new connection to a kernel, buffer is flushed.
    If session_key matches, buffer is replayed.
    Otherwise, it is discarded (this will be refreshed pages, new browser windows).
  • buffer is an unbounded list for now
  • buffer is never discarded until the kernel shuts down

Things we could do:

  1. limit the lifetime of the buffer (e.g. expire and close after 10 minutes?)
  2. limit the size of the buffer (msg count and/or total bytes)

- buffer is per-kernel
- session_key is stored because only a single session can resume the buffer and we can't be sure
- on any new connection to a kernel, buffer is flushed.
  If session_key matches, it is replayed.
  Otherwise, it is discarded.
- buffer is an unbounded list for now
rather than establishing new connections

fixes failure to resume shell channel
def open(self, kernel_id):
super(ZMQChannelsHandler, self).open()
self.kernel_manager.notify_connect(kernel_id)

# on new connections, flush the message buffer
replay_buffer = self.kernel_manager.stop_buffering(kernel_id, self.session_key)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok cool you made the kernel manager dictate stopping the buffering

self.log.info("Replaying %s buffered messages", len(replay_buffer))
for channel, msg_list in replay_buffer:
stream = self.channels[channel]
self._on_zmq_reply(stream, msg_list)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What should we do if we fail during the replay?

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 3, 2017

PASS ws_closed_error: kernel_idle.Kernel was triggered
Timeout for http://localhost:8888/a@b/
Is the notebook server running?
FAIL "function () {
        return this.evaluate(function (events) {
            return IPython._events_triggered.length >= events.length;
        }, [events]);
    }" did not evaluate to something truthy in 10000ms
#    type: uncaughtError
#    file: /home/travis/build/jupyter/notebook/notebook/tests/services/kernel.js
#    error: "function () {
        return this.evaluate(function (events) {
            return IPython._events_triggered.length >= events.length;
        }, [events]);
    }" did not evaluate to something truthy in 10000ms
#    stack: not provided
Captured console.log:

Should we skip this test for now?

@minrk
Copy link
Member

minrk commented Oct 4, 2017

Should we skip this test for now?

No, I think it's a real bug.

minrk added 2 commits October 4, 2017 09:55
instead of in `create_stream`, which is not called on reconnect
- dismiss 'connection lost' dialog on reconnect
- set busy status on reconnect (if not busy, idle will come soon after via kernel_ready)
@minrk
Copy link
Member

minrk commented Oct 4, 2017

Bug fixed. I also improved the dialog/status handling when connection is lost.

@minrk
Copy link
Member

minrk commented Oct 4, 2017

Here's what it looks like now with a connection drop and reconnect in the middle of execution:

replay

The dialog is dismissed automatically when the connection is restored.

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 4, 2017

This is working pretty well. Even for a flaky VPN + wifi dropoff, I'm only losing one message within a one second gap.

@blink1073
Copy link
Contributor

What if this were implemented as a response to a message from the frontend that said "hey, I was disconnected but I'm back, here's the last msg_id I got, please send the rest"?

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 4, 2017

here's the last msg_id I got, please send the rest

Oooooh, I like that. I mean, it will be a bit strange as we'll be putting a new API on top of the current API over the websockets though.

@minrk
Copy link
Member

minrk commented Oct 5, 2017

@blink1073 that's definitely possible and would be more robust. To do that, we would need to maintain a cache of messages on all channels all the time, in case of dropped connections in the future, rather than only when connections are lost. Doing that points to a different mechanism, because it makes it inevitable that the cache gets huge, rather than possible. To do that, we need to spill to disk with an efficient culling mechanism. To me, that sounds like a job for sqlite.

If I were doing that, I would make last_msg_id part of the websocket connect request to trigger the replay.

@alexgarel
Copy link

If you use sqlite for messages, I would suggest default sqlite database should be in memory database.
Of course persistent database would allow to gets past results even after a jupyter server restart ! That sounds cool. But you'll have to find a policy to remove dead kernel messages.

@rgbkrk rgbkrk changed the title [wip] buffer messages when websocket connection is interrupted buffer messages when websocket connection is interrupted Oct 6, 2017
@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 6, 2017

I think this is probably good to go ahead with for the time being as further enchancements are going to require a lot more coordination between the frontends and the server.

@gnestor gnestor merged commit 43a9780 into jupyter:master Oct 6, 2017
@gnestor
Copy link
Contributor

gnestor commented Oct 6, 2017

Thanks @rgbkrk and @minrk!!

@rgbkrk rgbkrk deleted the buffer-messages branch October 6, 2017 16:47
@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 6, 2017

Thanks @gnestor and @minrk!

@minrk
Copy link
Member

minrk commented Oct 7, 2017

@alexgarel I think it can't be in-memory by default because the main point of using sqlite would be to limit how much memory the message cache will take by spilling to disk.

@kevin-bates
Copy link
Member

@rgbkrk & @minrk - First, sorry for posting this to a closed PR but it has all the necessary context. Second, I apologize for my long windedness. TL;DR: In looking at the buffering code, it seems like buffer replay never comes into play (at this time).

I needed to look into buffering support relative to a recent NB2KG issue. Since NB2KG proxies the notebook's kernel management to a gateway server (Kernel Gateway or Enterprise Gateway), I was concerned that buffering replay wouldn't work. However, after looking into this more deeply, I find these behaviors between traditional notebook and nb2kg-enabled notebooks to be the same. I also find that when the network connection between the browser and notebook is dropped, the "replay" as indicated by the video above is not taking place from the buffering code, but, rather from ZMQ directly (at least that's my assumption). I say this because I don't see any debug entries about buffering messages and replaying or discarding them. Instead, those messages are produced only after one closes the notebook browser tab, then opens that same active notebook again (which Kyle mentions above as well in describing when the PR changes take place). As a result, I don't understand when the buffering replay will actually occur since it seems like re-opening an active notebook triggers the creation of a new session - which indirectly is part of the key (along with kernel_id) into the replayability of buffered messages.

Do I have the correct understanding with respect to the state of buffering and its replay? If not, under what circumstances does the implemented buffering get replayed?

Could you explain why session is part of the key for determining whether the buffered messages are to be replayed or not? I'm guessing it represents the "channel instances" since different connections of the same kernel instance will use different channels - is that right? If so, would it be possible to equate the channel "sets" from the previous (buffered) session to the new session and drop the use of session_key?

Thank you for your time and increasing my understanding.

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 10, 2018

They are attached to a session because otherwise the frontend has no way of associating output messages (by msg_id) to the cells they originated from.

@kevin-bates
Copy link
Member

Thanks Kyle. So I think buffered message usage from NB2KG is doubly screwed (once replay is figured out) because the session_key generated in the Gateway server will be different than that generated in the (client) Notebook server where the messages are derived. Is this session key the same value as the session entry in each message (or can be derived from them)? If so, it would be nice if the message buffer initialized the session key in that manner.

Any hints on how to trigger buffer replays (and not discards)? If not, I'll post in gitter.

Thanks.

@rgbkrk
Copy link
Member Author

rgbkrk commented Oct 10, 2018

I'm never in gitter, hopefully you'll find someone to help.

I'd have to sit down again with the raw session messages to know for sure. Happy to meet about this at some point in the coming weeks (I'm out of town at the moment).

@alexgarel
Copy link

They are attached to a session because otherwise the frontend has no way of associating output messages (by msg_id) to the cells they originated from.

Could it be associated with a notebook session id (instead of connection), that would be generated on client side (as a document DOM attribute), and that is given at websocket opening ?

@kevin-bates
Copy link
Member

If the session id you're referring to is the one in the header (vs parent_header) that does appear to be consistent across the disconnection scenario. However, even if I remove the checks for session_key (so that replay is purely a function of matching kernel_id), the replayed buffer contents (just as with the current message results) have no where to go since the underlying websocket is different (at least that's my suspicion as to why nothing appears). That said, the js code is totally foreign to me and I think that's where the answers lie.

Seems like there needs to be some tolerance where the received messages can 'switch' to a different "channel set" in cases where the previous 'set' is no longer around.

@gnestor
Copy link
Contributor

gnestor commented Nov 7, 2018

Does this sound related to jupyter/jupyter#83?

@kevin-bates
Copy link
Member

Does this sound related to jupyter/jupyter#83?

Yes, I believe the issue you reference is the portion of this PR that is not addressed. That is, when a notebook is closed and re-opened, buffered messages received in the interim are discarded on the re-open due to it being a different WS connection. I took a run at this via #4105/#4110 but later learned the current implementation is working as designed. (I hadn't been able to get any messages to replay correctly, but Min's evidence on the corresponding PR showed its possible, so I dropped "my case".)

@rgbkrk
Copy link
Member Author

rgbkrk commented Nov 8, 2018

The only way you can get messages to replay is if you lose connection, primarily on a remote server.

@bthorsted
Copy link

If I put my laptop in hibernate with an open jupyterlab tab with a running notebook inside and then wake up the laptop again, no messages are replayed at all and it doesn't receive new messages either bewfore I actively tell it to run a new cell. In effect this means, that if I am e.g. running a cell that takes two days to complete, then I cannot see any results made after I put the computer in hibernate. So the only solution is to try and stay connected for as long as possible or the silly way: opening a remote desktop to the jupyter server, opening a browser on the remote and running the notebooks like that. Any solution expected soon?

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 28, 2021
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants