Move pilot-hubs cluster to a regional k8s cluster for better availability #1102
Comments
@yuvipanda +1 from me. Even if it costs marginally more, I think it's preferable if it increases the reliability and stability of the service or gives us extra flexibility for upgrades. |
To signal that this is reasonably stable infrastructure, we're removing 'pilot' from a few domain names. I've set up a wildcard domain *.2i2c.cloud to point to the pilot-hubs cluster's nginx-ingress service IP, so merging this would just switch these 3 hubs over to drop the .pilot part of their domain. This does mean our 'primary' cluster becomes a bit more important, so we should make it a little more resilient. See 2i2c-org#1105 and 2i2c-org#1102 Ref 2i2c-org#989
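A quick way to check that a set of hub hostnames actually resolves to the ingress IP is sketched below. This is only an illustration: the function name `find_mismatched_hosts`, the hostnames, and the IP address are all hypothetical and not taken from the 2i2c infrastructure. The resolver is injectable so the check can be exercised without real DNS.

```python
import socket


def find_mismatched_hosts(expected_ip, hostnames, resolve=socket.gethostbyname):
    """Return {hostname: resolved_ip_or_None} for hosts that do not resolve to expected_ip.

    `resolve` defaults to the standard library resolver; pass a fake
    callable to test without touching the network.
    """
    mismatched = {}
    for host in hostnames:
        try:
            ip = resolve(host)
        except OSError:
            # socket.gaierror (a subclass of OSError) is raised when lookup fails
            ip = None
        if ip != expected_ip:
            mismatched[host] = ip
    return mismatched


if __name__ == "__main__":
    # Hypothetical values: replace with the real ingress IP and hub domains.
    ingress_ip = "203.0.113.10"
    hubs = ["hub-one.example.org", "hub-two.example.org"]
    print(find_mismatched_hosts(ingress_ip, hubs))
```

An empty result means every hostname points at the expected IP. The same check could be reused after any later change to the A records.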
Less prone to k8s API failure this way, although it costs about $70 a month. Ref 2i2c-org#1248 Ref 2i2c-org#1102
This hit the neurohackademy hub today, as a transient k8s master failure 'borked' the hub. It failed with:
It then restarted, but kinda got 'stuck', so new servers were not being started (https://2i2c.freshdesk.com/a/tickets/163). I restarted the pod to unbork it, but we should really be moving this to a regional cluster. |
We should push forward 2i2c-org/team-compass#423 in order to effectively coordinate and schedule the downtime that this migration will inevitably incur |
I raised the priority on 423 (and this one) to reflect this need, @sgibson91. |
Another, easier option is to just start up another cluster (same project) that's regional, and put all new hubs there. Over time, things just move over there. |
Even when that is an option, I think we definitely need to make an iteration on 2i2c-org/team-compass#423 and maybe this one is the forcing function to make that happen 😉. |
I am +0.5 on this only because it means we could call the new cluster something other than |
You are tempting me to reprioritize, @sgibson91 😉!
|
I actually think we should just do it this way now. We can probably copy over data and change ingress/URL points much more "behind the scenes". We could do what I did with the Pangeo hubs and say "we will be running backups at X and Y datetimes; data created after Y definitely won't be migrated". Communicating that message to every community on the cluster would still need a rudimentary version of 423. |
I can't quite parse which of the two levels of quoting 'this way' refers to :D Can you clarify, @sgibson91? |
Sorry @yuvipanda , I'm now agreeing with you! |
From @yuvipanda in #423:
I think that would be the rudimentary process we would need to follow. |
This is why I think spinning up a new cluster, duplicating the hubs on it, and then using rsync to move the home directories will be a better scenario. There doesn't need to be any downtime, as we just change the IP address the A records point to. And we can minimise the amount of data lost by running a few rsyncs up to the point of switching the IPs. |
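The "few rsyncs up to the point of switching" idea could look roughly like the sketch below. It is a minimal illustration, not the procedure that was actually used: the function names, the paths, and the number of passes are hypothetical, and it assumes the old and new home directory storage are both reachable from wherever the script runs. Each pass after the first only transfers what changed since the previous one, so the final pass just before the A record switch should be short.

```python
import subprocess


def build_rsync_command(src, dest, dry_run=False):
    """Build an rsync invocation that mirrors the contents of src into dest."""
    cmd = [
        "rsync",
        "-a",             # archive mode: preserve permissions, times, symlinks
        "--numeric-ids",  # keep uid/gid numbers rather than mapping by name
        "--delete",       # remove files from dest that were deleted in src
    ]
    if dry_run:
        cmd.append("--dry-run")
    # A trailing slash on src copies the directory's contents, not the directory itself
    cmd += [src.rstrip("/") + "/", dest.rstrip("/") + "/"]
    return cmd


def sync_until_cutover(src, dest, passes=3, runner=subprocess.run):
    """Run several rsync passes; later passes only transfer what changed."""
    for _ in range(passes):
        runner(build_rsync_command(src, dest), check=True)


if __name__ == "__main__":
    # Hypothetical paths: print the command rather than running it.
    print(" ".join(build_rsync_command("/export/old-homes", "/export/new-homes", dry_run=True)))
```

Note that `--delete` makes the destination an exact mirror, so it should only be pointed at a directory that holds nothing except the migrated data.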
Sounds like a reasonable plan to me! |
A related update: we caught a bug in Kubespawner that causes the hub to be unable to restart after a k8s master outage, see jupyterhub/kubespawner#627. I don't know if a release has been made yet though. |
- Removes custom version of jupyterhub installed, as that has been merged into latest z2jh
- We keep the version of oauthenticator, as I'm not sure it has been merged and released

Ref 2i2c-org#1055 Ref 2i2c-org#1102 Ref 2i2c-org#1589
Huge thanks to @sgibson91 for leading the effort in getting that bug fixed in kubespawner! It has been deployed now, but we should still move at some point soon. |
The pilot-hubs cluster (running everything under *.pilot.2i2c.cloud) is a simple zonal k8s cluster with a single apiserver. While this is mostly ok, it can mean some downtime and reduced API reliability. It also means k8s apiserver upgrades cannot happen with zero downtime, so any upgrade of the apiserver (which happens automatically!) will cause an outage.
A regional cluster is highly available, with 3 apiservers running in an HA configuration, which gives us more reliability. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters has more info.
https://2i2c.freshdesk.com/a/tickets/102 is probably caused by intermittent slowness in k8s apiserver response.
Implementation
pilot-hubs
, as zonal clusters cannot be upgraded to regional