Move pilot-hubs cluster to a regional k8s cluster for better availability #1102

Open · 1 of 3 tasks
yuvipanda opened this issue Mar 15, 2022 · 16 comments

@yuvipanda
Member

yuvipanda commented Mar 15, 2022

The pilot-hubs cluster (running everything under *.pilot.2i2c.cloud) is using a simple, single-apiserver zonal k8s cluster. While this is mostly OK, it can mean some downtime and reduced API reliability. It also means no zero-downtime k8s apiserver upgrades, so any upgrade of the apiserver (which happens automatically!) will cause an outage.

A regional cluster is highly available, with 3 apiservers running in an HA configuration, which gives us more reliability. https://cloud.google.com/kubernetes-engine/docs/concepts/regional-clusters has more info.

https://2i2c.freshdesk.com/a/tickets/102 is probably caused by intermittent slowness in k8s apiserver response.
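For reference, a minimal sketch of how to check whether an existing GKE cluster is zonal or regional from Python (assumes the google-cloud-container client is installed; the project, location, and cluster names below are illustrative placeholders, not our real ones):

import subprocess  # unused here, just keeping imports minimal
from google.cloud import container_v1

# Illustrative placeholders -- substitute the real project / location / cluster.
PROJECT = "example-project"
LOCATION = "us-central1-b"   # a zone for zonal clusters, a region for regional ones
CLUSTER = "example-cluster"

client = container_v1.ClusterManagerClient()
cluster = client.get_cluster(
    name=f"projects/{PROJECT}/locations/{LOCATION}/clusters/{CLUSTER}"
)

# A zonal cluster reports a single zone (e.g. "us-central1-b") as its location,
# while a regional cluster reports a region (e.g. "us-central1").
print(f"{cluster.name} location: {cluster.location}")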

Implementation

@choldgraf
Member

@yuvipanda +1 from me. Even if it costs marginally more, if it increases the reliability and stability of the service or gives us extra flexibility for upgrades, I think that's preferable.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Mar 17, 2022
To signal that this is reasonably stable infrastructure,
we're removing 'pilot' from a few domain names. I've set up
a wildcard domain *.2i2c.cloud to point to the pilot-hubs
cluster's nginx-ingress service IP, so merging this would
just switch out these 3 hubs to get rid of the .pilot
part of their domain.

This does mean our 'primary' cluster becomes a bit more
important, so we should make it a little more resilient.
See 2i2c-org#1105
and 2i2c-org#1102

Ref 2i2c-org#989
yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Apr 27, 2022
Less prone to k8s API failure this way, although it costs about
$70 a month

Ref 2i2c-org#1248
Ref 2i2c-org#1102
@yuvipanda
Member Author

This hit the neurohackademy hub today, as a transient k8s master failure 'borked' the hub. It failed with:

[I 2022-07-30 02:24:19.658 JupyterHub log:189] 200 POST /hub/api/users/nicobruno92/activity ([email protected]) 86.49ms
ERROR:asyncio:Unclosed client session
client_session: <aiohttp.client.ClientSession object at 0x7f8f3d70d6a0>
[E 2022-07-30 02:24:20.351 JupyterHub reflector:351] Watching resources never recovered, giving up
    Traceback (most recent call last):
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 285, in _watch_and_update
        resource_version = await self._list_and_update()
      File "/usr/local/lib/python3.9/site-packages/kubespawner/reflector.py", line 228, in _list_and_update
        initial_resources_raw = await list_method(**kwargs)
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/api_client.py", line 185, in __call_api
        response_data = await self.request(
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 193, in GET
        return (await self.request("GET", url,
      File "/usr/local/lib/python3.9/site-packages/kubernetes_asyncio/client/rest.py", line 177, in request
        r = await self.pool_manager.request(**args)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/client.py", line 535, in _request
        conn = await self._connector.connect(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 542, in connect
        proto = await self._create_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 907, in _create_connection
        _, proto = await self._create_direct_connection(req, traces, timeout)
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1206, in _create_direct_connection
        raise last_exc
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 1175, in _create_direct_connection
        transp, proto = await self._wrap_create_connection(
      File "/usr/local/lib/python3.9/site-packages/aiohttp/connector.py", line 992, in _wrap_create_connection
        raise client_error(req.connection_key, exc) from exc
    aiohttp.client_exceptions.ClientConnectorError: Cannot connect to host 10.3.240.1:443 ssl:default [Connect call failed ('10.3.240.1', 443)]
    
[C 2022-07-30 02:24:20.352 JupyterHub spawner:2326] Pods reflector failed, halting Hub.
ERROR:asyncio:Task was destroyed but it is pending!
task: <Task pending name='Task-3' coro=<shared_client.<locals>.close_client_task() running at /usr/local/lib/python3.9/site-packages/kubespawner/clients.py:58> wait_for=<Future pending cb=[<TaskWakeupMethWrapper object at 0x7f8f3e126d60>()]>>
Exception ignored in: <coroutine object shared_client.<locals>.close_client_task at 0x7f8f3f9c09c0>
RuntimeError: coroutine ignored GeneratorExit

The hub then restarted, but kinda got 'stuck', leading to new servers not being started: https://2i2c.freshdesk.com/a/tickets/163.

I restarted the pod to unbork it, but we should really be moving this to a regional cluster.
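For illustration, a minimal sketch (not KubeSpawner's actual code; the function name and structure are made up) of the retry-with-backoff pattern that keeps a list/watch loop alive through transient apiserver failures instead of giving up and halting the hub:

import asyncio
import logging

import aiohttp

logger = logging.getLogger(__name__)

async def keep_watching(list_once, max_backoff=30):
    # `list_once` is any coroutine that hits the k8s API (illustrative
    # placeholder). On a transient connection failure, back off and retry
    # rather than giving up permanently.
    backoff = 1
    while True:
        try:
            await list_once()
            backoff = 1  # reset after a successful call
        except aiohttp.ClientConnectorError as exc:
            logger.warning("apiserver unreachable (%s); retrying in %ds", exc, backoff)
            await asyncio.sleep(backoff)
            backoff = min(backoff * 2, max_backoff)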

@sgibson91
Member

We should push forward 2i2c-org/team-compass#423 in order to effectively coordinate and schedule the downtime that this migration will inevitably incur

@damianavila
Contributor

damianavila commented Aug 1, 2022

I raised the priority on 423 (and this one) to reflect this need, @sgibson91.

@yuvipanda
Member Author

Another option that is easier is to just start up another cluster (same project) that's regional, and put all new hubs there. And over time, things just move over there.

@damianavila
Contributor

Another option that is easier is to just start up another cluster (same project) that's regional, and put all new hubs there. And over time, things just move over there.

Even when that is an option, I think we definitely need to make an iteration on 2i2c-org/team-compass#423 and maybe this one is the forcing function to make that happen 😉.

@sgibson91
Member

sgibson91 commented Aug 5, 2022

just start up another cluster (same project) that's regional

I am +0.5 on this only because it means we could call the new cluster something other than pilot-hubs and finally eradicate pilot from 2i2c! But agree that 423 will be much more broadly impactful to our operations.

@damianavila
Contributor

I am +0.5 on this only because it means we could call the new cluster something other than pilot-hubs and finally eradicate pilot from 2i2c!

You are tempting me to reprioritize, @sgibson91 😉!
"Joke" aside,

But agree that 423 will be much more broadly impactful to our operations.

@sgibson91
Member

Another option that is easier is to just start up another cluster (same project) that's regional, and put all new hubs there. And over time, things just move over there.

Even when that is an option, I think we definitely need to make an iteration on 2i2c-org/team-compass#423 and maybe this one is the forcing function to make that happen 😉.

I actually think we should just do it this way now. We can probably copy over data and change ingress/URL points much more "behind the scenes". We could do what I did with the Pangeo hubs and say "we will be running backups at X and Y datetimes; data created after Y definitely won't be migrated".

How we communicate that message to every community on the cluster would still need a rudimentary version of 423

@yuvipanda
Member Author

I actually think we should just do it this way now.

I can't quite parse which of the two levels of quoting 'this way' refers to :D Can you clarify, @sgibson91?

@sgibson91
Member

Sorry @yuvipanda , I'm now agreeing with you!

@damianavila
Contributor

How we communicate that message to every community on the cluster would still need a rudimentary version of 423

From @yuvipanda in #423:

I think a canonical list of community reps + a process to email them is a better fit than one group with everyone.

I think that would be the rudimentary process we would need to follow.
I would add that the emails should be sent from support so we have every thread centralized.
I bet it would not be easy to coordinate a good shared time, but maybe I am wrong with my intuition...

@sgibson91
Member

I bet it would not be easy to coordinate a good shared time, but maybe I am wrong with my intuition...

This is why I think spinning up a new cluster, duplicating the hubs on it, and then using rsync to move the home directories will be a better scenario. There doesn't need to be any downtime, as we just change the IP address the A records point to. And we can minimise the amount of data lost by running a few rsyncs up to the point of switching the IPs.
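A minimal sketch of that repeated-rsync idea (the host and paths below are placeholders, not the real pilot-hubs NFS setup):

import subprocess

# Illustrative placeholders for the old cluster's NFS export and the new one's.
SRC = "old-nfs-host:/export/home-01/"
DST = "/export/home-01/"

def sync_home_directories():
    # -a preserves ownership/permissions/timestamps; --delete keeps the
    # destination an exact mirror, so repeated runs only transfer changes.
    subprocess.run(["rsync", "-a", "--delete", SRC, DST], check=True)

# Run a few times while the old cluster is still live, then once more
# right before switching the A records to the new ingress IP.
sync_home_directories()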

@damianavila
Contributor

Sounds like a reasonable plan to me!

@sgibson91
Member

sgibson91 commented Aug 29, 2022

A related update: we caught a bug in Kubespawner that causes the hub to be unable to restart after a k8s master outage: jupyterhub/kubespawner#627. I don't know if a release has been made yet, though.

yuvipanda added a commit to yuvipanda/pilot-hubs that referenced this issue Sep 8, 2022
- Removes custom version of jupyterhub installed, as that
  has been merged into latest z2jh
- We keep the version of oauthenticator, as I'm not sure
  it has been merged and released

Ref 2i2c-org#1055
Ref 2i2c-org#1102
Ref 2i2c-org#1589
@yuvipanda
Member Author

Huge thanks to @sgibson91 for leading the effort in getting that bug fixed in kubespawner! It has been deployed now, but we should still move at some point soon.

@damianavila damianavila moved this from Needs Shaping / Refinement to Waiting in DEPRECATED Engineering and Product Backlog Sep 13, 2022
@damianavila damianavila moved this from Waiting to Ready to work in DEPRECATED Engineering and Product Backlog Jul 5, 2023