Proxy upgrades -> user disruption and how to reduce it in JupyterLab #1496
Comments
Maybe the JupyterLab hub extension could check if it can reach the proxy (does it have a public health endpoint?) and if it can reach the JupyterHub API (something like https://hub.mybinder.org/hub/api). This might help separate the case "my kernel crashed" from "everything is down right now". If we can tell the difference between these two cases, we can use a different timeout/check interval for each case, as well as decide when to offer the "restart stuff" button to the user.

The other thing to think about is how often the proxy is actually restarted during routine upgrades. I don't think of the proxy as something that changes frequently, so my guess is that active users don't notice that an upgrade has taken place because their session continues (while the hub or whatnot is being upgraded). The actionable thing here would be to check whether the proxy pod really only restarts when it has to; maybe there are a few cases where it restarts but didn't need to.

Another scenario I am wondering about: a user connects and is running JupyterLab, then they step away for ten minutes (aren't watching the screen), we restart the hub, the proxy, or both, and then the user returns. In none of the three cases did their user pod get touched, so ideally the user returns and doesn't notice anything. Does this happen? I think if we only restart the hub or only the proxy then this should be what happens. I am not sure what happens if we restart both hub and proxy. There will be intermittent errors, but if the user isn't watching their screen they could be resolved by the time the user comes back.
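To make that idea concrete, here is a minimal browser-side sketch in TypeScript. It assumes the hub's public REST API root (e.g. `https://<host>/hub/api`, which answers without authentication) and the user server's `/api/status` endpoint can serve as cheap "is it up" probes, and that the browser's same-origin session cookie is enough for the latter; the function names and endpoint choices are illustrative, not what the hub extension does today.

```ts
// Probe a URL with a timeout; any network error, timeout, or non-2xx response
// counts as "not reachable" for the purposes of this sketch.
async function isReachable(url: string, timeoutMs = 5000): Promise<boolean> {
  const controller = new AbortController();
  const timer = setTimeout(() => controller.abort(), timeoutMs);
  try {
    const resp = await fetch(url, { signal: controller.signal });
    return resp.ok;
  } catch {
    return false; // network error, timeout, or CORS failure
  } finally {
    clearTimeout(timer);
  }
}

// Probe both the user's server (through the proxy) and the hub, and classify
// the failure so "my server crashed" can be told apart from "everything is
// down right now". serverUrl and hubUrl are base URLs ending in a slash.
async function classifyOutage(serverUrl: string, hubUrl: string) {
  const [serverUp, hubUp] = await Promise.all([
    isReachable(`${serverUrl}api/status`),
    isReachable(`${hubUrl}hub/api`),
  ]);
  if (serverUp) return 'all-good';
  return hubUp ? 'server-down' : 'everything-down';
}
```

With a classification like this, the extension could poll the failing endpoint on a short interval and the healthy one on a longer interval, matching the idea above of using different timeouts per case.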
I want to consider potential improvements anyhow. I've upgraded my hub/proxy with feedback from my colleagues often enough to care :D The issue of the proxy being unconfigured for a time window after restart only matters for CHP, though. Traefik handles that well, so I don't care to optimize CHP for it. Questions
I think with an HA setup of Traefik most problems will go away. Restarting the hub can (I think) be done in a way where the new hub pod is started and only when it is ready (via a k8s readiness probe) is the old pod killed. The reason I was thinking about how much people notice if they aren't watching during the upgrade is that, if they don't notice anything unless they catch you "in the act", you can do an update after working hours/at 3am, scheduled with a cron (or "at") job. Which is "hacky" but super cheap & quick to do ;)
If I make a Helm chart upgrade, some issues typically arise that disrupt the users. This issue is meant to investigate these and consider what we can do about them.
As a user in JupyterLab, I get a dialog box saying the connection was lost, offering to restart the server or dismiss the popup. Here is the source code relating to that popup. The dialog appears whenever the browser fails to reach the spawned server, which it does through the proxy, as detected from the user's browser.
Upgrade scenarios
The unconfigured proxy time window
There is a window where the proxy accepts traffic but has not yet been configured with routes, so everything other than the default route falls through to the hub. This can be mitigated by using Traefik instead of CHP, since Traefik can retain its routing configuration between restarts rather than holding it only in memory.
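As a rough illustration of how browser-side code could notice it is inside this window, here is a hedged sketch (a heuristic of my own, not something CHP or JupyterLab provides): if a request aimed at the user's server ends up on a `/hub/...` URL, it was most likely caught by the proxy's default route and forwarded to the hub.

```ts
// Hypothetical heuristic: detect whether a request for the user's server was
// swallowed by the proxy's default route and handed to the hub instead, which
// is what happens while CHP is still unconfigured after a restart.
// serverUrl is the user's server base URL ending in a slash, e.g. /user/<name>/.
async function caughtByDefaultRoute(serverUrl: string): Promise<boolean> {
  try {
    const resp = await fetch(`${serverUrl}api/status`, { redirect: 'follow' });
    // Requests for /user/<name>/... that reach the hub get redirected to a
    // /hub/... URL, so a final URL containing /hub/ suggests the proxy route
    // for this server was missing.
    return new URL(resp.url).pathname.includes('/hub/');
  } catch {
    // A network-level failure means the proxy itself (or the network) is down,
    // which is a different failure mode than "up but unconfigured".
    return false;
  }
}
```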
Summary
Two kinds of issues can follow.
Solution ideas
We could update the JupyterLab hub extension to not be so eager to suggest an action that isn't suitable, such as pressing a restart button only to arrive at a hub that may also be unreachable because the proxy is down.
The big question is what kind of logic we want in the browser-based JupyterLab extension.
I think it makes sense to consider connectivity to both the user server and the hub itself, and how long we have waited to get connectivity back. At the very least we should take hub connectivity into account alongside single-user server connectivity, because redirecting the user to another dead service isn't good.
When considering these options, keep in mind that there is one scenario where this button already works great: when your server has crashed for some reason but everything else works. What makes it not work great is when the proxy in front of the single-user server and the hub is temporarily down, and it's even worse if the hub is down as well while this happens.
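For example, the decision could be sketched roughly like this (the names and thresholds are mine, purely illustrative): only offer the restart-via-hub action when the hub actually answers, and otherwise keep retrying quietly for a grace period, since a proxy or hub upgrade often resolves itself within a minute or two.

```ts
// Hypothetical decision logic for the "connection lost" dialog; the inputs
// would come from probes like the sketches above, plus how long we've failed.
type Action = 'dismiss' | 'offerRestart' | 'keepWaiting' | 'reportOutage';

function decideAction(
  serverReachable: boolean,
  hubReachable: boolean,
  secondsSinceFirstFailure: number,
  graceSeconds = 120
): Action {
  if (serverReachable) {
    // Nothing is wrong (anymore); close any pending dialog.
    return 'dismiss';
  }
  if (hubReachable) {
    // "My server crashed but everything else works": the existing restart
    // button is the right thing to offer, since the hub can respawn the pod.
    return 'offerRestart';
  }
  if (secondsSinceFirstFailure < graceSeconds) {
    // Proxy and/or hub unreachable: likely an upgrade in progress. Redirecting
    // the user to a dead hub would make things worse, so keep retrying quietly.
    return 'keepWaiting';
  }
  return 'reportOutage';
}
```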