
Proxy upgrades -> Users disruption and how to reduce them in JupyterLab #1496

consideRatio opened this issue Nov 27, 2019 · 3 comments

@consideRatio (Member)
When I make a Helm chart upgrade, some issues typically arise that disrupt the users. This issue is meant to investigate those disruptions and consider what we can do about them.

As a user in JupyterLab, I get a dialog box saying the connection was lost, suggesting a restart or dismissal of the popup. Here is the source code relating to that popup. It appears whenever the browser fails to reach the spawned server itself, which it does through the proxy, as detected from the user's browser.

Upgrade scenarios

  • If only the hub restarts, I think everything should be fine, as that shouldn't interrupt the connection between the browser and the proxy or its channel of traffic towards the user server. User browser -> Singleuser server is intact!
  • If only the proxy restarts, then the popup will show. Also, when the proxy comes back up it has no routes configured, and it will need to wait for the hub to refresh it, which it does periodically every 30 seconds if I recall correctly.
  • If both the proxy and hub restart, then the proxy will need to get updated again by the hub with the routes etc. that it forgets on restart (CHP does this), so the proxy can fail until the hub has come back online as well.

The unconfigured proxy time window

There is a window where the proxy accepts traffic but has not yet been configured with how to route it, other than sending unknown routes to the hub by default. This can be mitigated by using Traefik instead of CHP, since Traefik can retain its configuration between restarts rather than being configured only in-memory.

Summary

Two kinds of issues can follow.

  1. The proxy is down and we can't reach singleuser servers.
  2. The proxy is up but not configured, so we can't reach singleuser servers and instead end up at the hub, which may or may not be up. If the hub is down, we may get a "service unavailable" response because the proxy tries to reach the hub through a k8s service that found no hub pods. If the hub pod hasn't yet been reclassified from ready, we may instead get a timeout response from the proxy reporting that it failed to reach the destination. I'm not sure, but I think it may respond with 408 Request Timeout then. (A client-side sketch for telling these cases apart follows below.)
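
A minimal client-side sketch, assuming a browser-style fetch, of how these cases might be told apart. The status-code mapping (503 for a backendless k8s service, 408/504 for a proxy timeout) is an assumption taken from the descriptions above, not verified CHP behavior:

```typescript
// Hedged sketch: classify why a request to the singleuser server (or hub)
// failed. The status-code mapping is an assumption, not verified proxy behavior.
type Outcome = 'ok' | 'proxy-down' | 'hub-unavailable' | 'gateway-timeout';

async function classify(url: string, timeoutMs = 5000): Promise<Outcome> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const resp = await fetch(url, { signal: controller.signal });
    if (resp.ok) {
      return 'ok';
    }
    if (resp.status === 503) {
      // Case 2a: proxy is up, but the k8s service found no hub pods.
      return 'hub-unavailable';
    }
    if (resp.status === 408 || resp.status === 504) {
      // Case 2b: proxy is up but timed out reaching its destination.
      return 'gateway-timeout';
    }
    return 'hub-unavailable';
  } catch {
    // Case 1: fetch rejects outright when the proxy itself is unreachable
    // (connection refused, DNS failure), or when our own timeout aborts.
    return 'proxy-down';
  } finally {
    clearTimeout(timer);
  }
}
```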

Solution ideas

We could update the JupyterLab hub extension to not be so eager to suggest something unsuitable, namely pressing a restart button and arriving at a hub that may also be unreachable because the proxy is down.

The big question is what kind of logic we want for the browser-based JupyterLab behavior.
I think it makes sense to consider connectivity to both the user server and the hub itself, and for how long we have waited to get connectivity back. We should at least weigh the hub's connectivity alongside the singleuser server's, because redirecting the user to another dead service isn't good.

When considering these options, keep in mind that there is one more scenario where this button currently works great: when your server has crashed for some reason but everything else works. What makes it not work great is when the proxy in front of the singleuser server and the hub is temporarily down, and it's even worse if the hub is down as well while this happens.
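
As a sketch of that decision, reusing the hypothetical classify() helper from the summary above: only offer the restart button when the server is gone but the hub still answers, since that is the one scenario where the button already works great.

```typescript
// Hypothetical decision logic, building on classify() above. Only when the
// singleuser server is down but the hub is reachable does a restart make
// sense; in every other combination we would redirect to a dead service.
async function shouldOfferRestart(
  serverUrl: string,
  hubApiUrl: string
): Promise<boolean> {
  if ((await classify(serverUrl)) === 'ok') {
    return false; // server is fine; the connection blip has passed
  }
  return (await classify(hubApiUrl)) === 'ok';
}
```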

@consideRatio changed the title from "Upgrade and disruptions for users" to "Proxy upgrades -> Users disruption and how to reduce them in JupyterLab" on Nov 27, 2019
@betatim (Member) commented Nov 27, 2019

Maybe the JupyterLab hub extension could check whether it can reach the proxy (does it have a public health endpoint?) and whether it can reach the JupyterHub API (something like https://hub.mybinder.org/hub/api). This might help separate the case "my kernel crashed" from "everything is down right now". If we can tell the difference between these two cases, we can use a different timeout/check interval for each, as well as decide when to offer the "restart stuff" button to the user.
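
For the hub half of that check, a minimal sketch assuming the public, unauthenticated /hub/api endpoint (which returns the hub version as JSON) as the liveness signal:

```typescript
// Sketch of a hub liveness probe: GET <base>/hub/api responds with a small
// JSON document (the hub version) without authentication, so an ok response
// means both the proxy and the hub are answering.
async function hubIsUp(hubBaseUrl: string): Promise<boolean> {
  try {
    const resp = await fetch(new URL('hub/api', hubBaseUrl).href);
    return resp.ok;
  } catch {
    return false; // network-level failure: proxy or hub unreachable
  }
}
```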

The other thing to think about is how often the proxy actually restarts during routine upgrades. I don't think of the proxy as something that changes frequently, so my guess is that active users don't notice that an upgrade has taken place because their session continues (while the hub or whatnot is being upgraded). The actionable thing here would be to check that the proxy pod really only restarts when it has to. Maybe there are a few cases where it restarts but didn't need to.

Another thing I am wondering about is the following scenario: a user connects and is running JupyterLab, then they step away for ten minutes (aren't watching the screen), we restart the hub/the proxy/both hub and proxy, and the user returns.

In none of the three cases did their user pod get touched, so ideally in all three cases the user returns and doesn't notice anything. Does this happen? I think if we only restart the hub or only the proxy, then this should be what happens. I am not sure what happens if we restart both hub and proxy. There will be intermittent errors, but if the user isn't watching their screen they could be resolved by the time the user comes back.

@consideRatio (Member, Author)
> I don't think of the proxy as something that changes frequently, so my guess is that active users don't notice that an upgrade has taken place because their session continues (while the hub or whatnot is being upgraded).

I want to consider potential improvements anyhow. I've updated my hub/proxy with feedback from my colleagues often enough to care :D

The issue of the proxy being unconfigured for a time window after restart only matters for CHP, though. Traefik will handle that well, so I don't care to somehow optimize CHP for it.

Questions

  • Does the dialog box with restart/dismiss in JupyterLab disappear when connectivity is restored? (A polling sketch follows after this list.)
  • Does the proxy pod only restart when it has to?
  • Why did my user's kernel (using JupyterLab) die while the hub/proxy was down for a while?
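
On the first question, a hedged sketch of how the extension could make the answer be "yes" itself, again reusing the hypothetical classify() helper from above. The dismiss callback is a stand-in for whatever the real dialog API offers:

```typescript
// Hypothetical auto-dismiss loop: poll the singleuser server and invoke the
// caller-supplied dismiss() callback once it answers again. The dialog API
// is abstracted away as a plain callback on purpose.
async function autoDismissWhenBack(
  serverUrl: string,
  dismiss: () => void,
  intervalMs = 5000
): Promise<void> {
  while ((await classify(serverUrl)) !== 'ok') {
    await new Promise(resolve => setTimeout(resolve, intervalMs));
  }
  dismiss();
}
```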

@betatim (Member) commented Nov 27, 2019

I think with an HA setup of Traefik most problems will go away. Restarting the hub can be done (I think) in a way where the new hub pod is started and only when it is ready (via a k8s readiness probe) is the old pod killed.

The reason I was thinking about how much people notice if they aren't watching during the upgrade is that, if they don't notice anything when they don't catch you "in the act", you can do an update after working hours/at 3am, scheduled with a cron (or "at") job. Which is "hacky" but super cheap & quick to do ;)
