Move pilot-hubs cluster to a regional k8s cluster for better availability #1102

Open · 1 of 3 tasks
yuvipanda opened this issue Mar 15, 2022 · 16 comments

@yuvipanda
Member

yuvipanda commented Mar 15, 2022

The pilot-hubs cluster (running everything under *.pilot.2i2c.cloud) is using a simple, single-apiserver zonal k8s cluster. While this is mostly OK, it can mean some downtime and reduced API reliability. It also means no zero-downtime k8s apiserver upgrades, so any upgrade of the apiserver (which happens automatically!) will cause an outage.

A regional cluster is highly available, with 3 apiservers running in an HA configuration, which gives us more reliability. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters has more info.

https://2i2c.freshdesk.com/a/tickets/102 is probably caused by intermittent slowness in k8s apiserver response.
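For reference, a minimal sketch of how to check whether an existing GKE cluster is zonal or regional from Python (assumes the google-cloud-container client is installed; the project, location, and cluster names below are illustrative placeholders, not our real ones):

import subprocess  # unused here, just keeping imports minimal
from google.cloud import container_v1

# Illustrative placeholders -- substitute the real project / location / cluster.
PROJECT = "example-project"
LOCATION = "us-central1-b"   # a zone for zonal clusters, a region for regional ones
CLUSTER = "example-cluster"

client = container_v1.ClusterManagerClient()
cluster = client.get_cluster(
    name=f"projects/{PROJECT}/locations/{LOCATION}/clusters/{CLUSTER}"
)

# A zonal cluster reports a single zone (e.g. "us-central1-b") as its location,
# while a regional cluster reports a region (e.g. "us-central1").
print(f"{cluster.name} location: {cluster.location}")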

Implementation

@choldgraf
Member

@yuvipanda +1 from me. Even if it costs marginally more, if it increases the reliability and stability of the service or gives us extra flexibility for upgrades, I think that's preferable.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 17, 2022
To signal that this is reasonably stable infrastructure,
we're removing 'pilot' from a few domain names. I've set up
a wildcard domain *.2i2c.cloud to point to the pilot-hubs
cluster's nginx-ingress service IP, so merging this would
just switch out these 3 hubs to get rid of the .pilot
part of their domain.

This does mean our 'primary' cluster becomes a bit more
important, so we should make it a little more resilient.
See 2i2c-org#1105
and 2i2c-org#1102

Ref 2i2c-org#989
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Apr 27, 2022
Less prone to k8s API failure this way, although it costs about
$70 a month

Ref 2i2c-org#1248
Ref 2i2c-org#1102
@yuvipanda
Member Author

This hit the neurohackademy hub today, as a transient k8s master failure 'borked' the hub. It failed with:

[I 2022-07-30 02:24:19.658 JupyterHub log:189] 200 POST /hub/api/users/nicobruno92/activity ([email protected]) 86.49ms
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f8f3d70d6a0>
[E 2022-07-30 02:24:20.351 JupyterHub reflector:351] Watching resources never recovered, giving up
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 285, in _watch_and_update
        resource_version = await self._list_and_update()
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 228, in _list_and_update
        initial_resources_raw = await list_method(**kwargs)
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
        response_data = await self.request(
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
        return (await self.request("GET", url,
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
        r = await self.pool_manager.request(**args)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/client.py", line 535, in _request
        conn = await self._connector.connect(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
        proto = await self._create_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
        _, proto = await self._create_direct_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
        raise last_exc
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
        transp, proto = await self._wrap_create_connection(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc
    aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.3.240.1:443 ssl:default [Connect call failed ('10.3.240.1', 443)]
    
[C 2022-07-30 02:24:20.352 JupyterHub spawner:2326] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f8f3e126d60>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7f8f3f9c09c0>
RuntimeError: coroutine ignored GeneratorExit

The hub then restarted, but kinda got 'stuck', leading to new servers not being started: https://2i2c.freshdesk.com/a/tickets/163.

I restarted the pod to unbork it, but we should really be moving this to a regional cluster.
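For illustration, a minimal sketch (not KubeSpawner's actual code; the function name and structure are made up) of the retry-with-backoff pattern that keeps a list/watch loop alive through transient apiserver failures instead of giving up and halting the hub:

import asyncio
import logging

import aiohttp

logger = logging.getLogger(__name__)

async def keep_watching(list_once, max_backoff=30):
    # `list_once` is any coroutine that hits the k8s API (illustrative
    # placeholder). On a transient connection failure, back off and retry
    # rather than giving up permanently.
    backoff = 1
    while True:
        try:
            await list_once()
            backoff = 1  # reset after a successful call
        except aiohttp.ClientConnectorError as exc:
            logger.warning("apiserver unreachable (%s); retrying in %ds", exc, backoff)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)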

@sgibson91
Member

We should push forward 2i2c-org/team-compass#423 in order to effectively coordinate and schedule the downtime that this migration will inevitably incur

@damianavila
Contributor

damianavila commented Aug 1, 2022

I raised the priority on 423 (and this one) to reflect this need, @sgibson91.

@yuvipanda
Member Author

Another option that is easier is to just start up another cluster (same project) that's regional, and put all new hubs there. And over time, things just move over there.

@damianavila
Contributor

Another option that is easier is to just start up another cluster (same project) that's regional, and put all new hubs there. And over time, things just move over there.

Even when that is an option, I think we definitely need to make an iteration on 2i2c-org/team-compass#423 and maybe this one is the forcing function to make that happen 😉.

@sgibson91
Member

sgibson91 commented Aug 5, 2022

just start up another cluster (same project) that's regional

I am +0.5 on this only because it means we could call the new cluster something other than pilot-hubs and finally eradicate pilot from 2i2c! But agree that 423 will be much more broadly impactful to our operations.

@damianavila
Contributor

I am +0.5 on this only because it means we could call the new cluster something other than pilot-hubs and finally eradicate pilot from 2i2c!

You are tempting me to reprioritize, @sgibson91 😉!
"Joke" aside,

But agree that 423 will be much more broadly impactful to our operations.

@sgibson91
Member

Another option that is easier is to just start up another cluster (same project) that's regional, and put all new hubs there. And over time, things just move over there.

Even when that is an option, I think we definitely need to make an iteration on 2i2c-org/team-compass#423 and maybe this one is the forcing function to make that happen 😉.

I actually think we should just do it this way now. We can probably copy over data and change ingress/URL points much more "behind the scenes". We could do what I did with the Pangeo hubs and say "we will be running backups at X and Y datetimes; data created after Y definitely won't be migrated".

How we communicate that message to every community on the cluster would still need a rudimentary version of 423

@yuvipanda
Member Author

I actually think we should just do it this way now.

I can't quite parse which of the two levels of quoting 'this way' refers to :D Can you clarify, @sgibson91?

@sgibson91
Member

Sorry @yuvipanda , I'm now agreeing with you!

@damianavila
Contributor

How we communicate that message to every community on the cluster would still need a rudimentary version of 423

From @yuvipanda in #423:

I think a canonical list of community reps + a process to email them is a better fit than one group with everyone.

I think that would be the rudimentary process we would need to follow.
I would add that the emails should be sent from support so we have every thread centralized.
I bet it would not be easy to coordinate a good shared time, but maybe I am wrong with my intuition...

@sgibson91
Member

I bet it would not be easy to coordinate a good shared time, but maybe I am wrong with my intuition...

This is why I think spinning up a new cluster, duplicating the hubs on it, and then using rsync to move the home directories will be a better scenario. There doesn't need to be any downtime, as we just change the IP address the A records point to. And we can minimise the amount of data lost by running a few rsyncs up to the point of switching the IPs.
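A minimal sketch of that repeated-rsync idea (the host and paths below are placeholders, not the real pilot-hubs NFS setup):

import subprocess

# Illustrative placeholders for the old cluster's NFS export and the new one's.
SRC = "old-nfs-host:/export/home-01/"
DST = "/export/home-01/"

def sync_home_directories():
    # -a preserves ownership/permissions/timestamps; --delete keeps the
    # destination an exact mirror, so repeated runs only transfer changes.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)

# Run a few times while the old cluster is still live, then once more
# right before switching the A records to the new ingress IP.
sync_home_directories()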

@damianavila
Contributor

Sounds like a reasonable plan to me!

@sgibson91
Member

sgibson91 commented Aug 29, 2022

A related update: we caught a bug in Kubespawner that causes the hub to be unable to restart after a k8s master outage: jupyterhub/kubespawner#627. I don't know if a release has been made yet, though.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 8, 2022
- Removes custom version of jupyterhub installed, as that
  has been merged into latest z2jh
- We keep the version of oauthenticator, as I'm not sure
  it has been merged and released

Ref 2i2c-org#1055
Ref 2i2c-org#1102
Ref 2i2c-org#1589
@yuvipanda
Member Author

Huge thanks to @sgibson91 for leading the effort in getting that bug fixed in kubespawner! It has been deployed now, but we should still move at some point soon.

@damianavila damianavila moved this from Needs Shaping / Refinement to Waiting in DEPRECATED Engineering and Product Backlog Sep 13, 2022
@damianavila damianavila moved this from Waiting to Ready to work in DEPRECATED Engineering and Product Backlog Jul 5, 2023