Transition LEAP cloud infra to use shared nodes #2209
Comments
Thanks for working on this. Please let me know before any changes are made, so I can make sure that:
Quick update: I have created and populated the new teams in our org (https://github.com/orgs/leap-stc/teams/leap-pangeo-users/teams). These are non-overlapping at the moment.
@jbusecke I suggest that I work on this on Monday morning until midday, Swedish time (UTC+1). Is that okay?
There are currently three running users, so I won't cancel their sessions and start an upgrade right now. @jbusecke should we aim for next week on Monday morning, Swedish time (UTC+1), before you wake up in UTC-6?
Sorry for the delay. I was really sick all last week and not able to stare at a screen. I have notified the LEAP community to log out and stop any long-running computation before Sun Mar 12, and said that service should be back to normal on Monday after 8 AM. Is that correct @consideRatio? I will make another announcement closer to the weekend. Thank you for working on this!
@jbusecke absolutely correct!! And you have populated the GitHub teams.
I still see some discrepancies between the old/new teams, but I will figure this out on my end.
@jbusecke I've now performed the most disruptive maintenance. As part of this, only users of either … In this issue, the part about the dask-gateway option remains, but the other parts are addressed. As part of doing the maintenance in #2237, I also noticed an optimization made for a workshop a while back. After the workshop, when nodes were no longer started ahead of users arriving, that optimization was likely slowing down startup of user servers unless they started the tensorflow image specifically. So I think in this setup there are now three reasons for faster startup.
If you wish for even faster startups and better performance, I suggest revising the CPU limitation we agreed on for this maintenance. I've thought about this tricky topic a lot and summarized some of those thoughts in #2228 and in this FIXME note.
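To make the shared-node idea above concrete, here is a minimal sketch of how user pods could be pinned to one shared node pool with a single pre-pulled image, assuming a KubeSpawner-based JupyterHub. The node label, image tag, and values are illustrative assumptions, not the actual LEAP configuration.

```python
# Hypothetical jupyterhub_config.py fragment (KubeSpawner). All values below
# are illustrative assumptions, not the real LEAP hub configuration.
c = get_config()  # noqa: F821 - provided by JupyterHub when loading the config

# Schedule every user pod onto the shared n2-highmem-16 node pool, so servers
# start on nodes that are typically already running.
c.KubeSpawner.node_selector = {
    "hub.jupyter.org/node-purpose": "user",  # assumed label on the shared pool
}

# Use one default image so it can be pre-pulled and kept warm on those nodes,
# avoiding image-pull delays when a server starts.
c.KubeSpawner.image = "pangeo/pangeo-notebook:latest"  # illustrative tag
```

With a setup along these lines, a new server mostly waits for the pod to be scheduled rather than for a node to boot or an image to download.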
So far this is working great. Many thanks @consideRatio. I think the startup times, as I (and others) experience them, are absolutely sufficient at the moment.
Is there anything that is required from my side to move this change forward?
No, I think nothing! @jbusecke do you feel strongly about the CPU limits? I want to make sure you have a good user experience, and I think limiting CPU can be a significant drawback even for users who request a lot of CPU in order not to be limited - that is because they are then less likely to fit on an existing node and may end up waiting for node startup, image pulling, etc. Overall, it currently seems like a lose / lose / lose in terms of UX / cost / energy efficiency to me.
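A quick back-of-the-envelope sketch of the packing trade-off described above; the allocatable-CPU number is a rough assumption and the sketch ignores memory and system overhead.

```python
# Rough illustration only: how many user servers fit on one shared node as the
# per-user CPU request grows. Ignores memory, daemonsets, and system reserves.
NODE_CPU = 16  # approximate allocatable CPU on an n2-highmem-16 node

for request in (1, 2, 4, 8, 16):
    users_per_node = NODE_CPU // request
    print(f"request={request:>2} CPU -> about {users_per_node} user server(s) per node")
```

Once a user no longer fits on a running node, a new node has to be brought up, which is where the extra startup wait comes from.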
I think I would like to give people some time to give feedback on their experience before changing things more, but I am certainly open to iterating on this further. From my perspective, the dask-gateway refactor is of higher priority. I am not sure if we should track that in a different issue, to keep things neat and enable an ongoing discussion here about the CPU limits?
Okay! Let us know if you want to remove the CPU limit or, for example, have it be 4x the requested CPU or similar.
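For concreteness, a "limit = 4x the request" choice could look roughly like the KubeSpawner override below; the numbers are made up for illustration and are not the hub's actual settings.

```python
# Hypothetical per-profile override: guarantee a small CPU share for packing,
# and let the user burst up to 4x that guarantee. Values are illustrative.
cpu_request = 2  # CPU the scheduler reserves for this user

kubespawner_override = {
    "cpu_guarantee": cpu_request,
    "cpu_limit": 4 * cpu_request,  # burstable ceiling, 4x the guarantee
    "mem_guarantee": "16G",
    "mem_limit": "16G",
}
```

The guarantee keeps node packing predictable, while the higher limit lets otherwise idle CPU on the node be used opportunistically.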
I'm closing this issue now; the dask-gateway work is now represented by #2364 and #2051. I'm not able to prioritize that with my own time at the moment =/
Together with @jbusecke, I've distilled a support ticket to [improve user startup times] and this request for information, among other things, into these technical work items.
Node pool changes
The user node pool now uses the n2-highmem-16 machine type (16 CPU, 128 GB mem).

New profile list
One profile for the leap-pangeo-base-access GitHub group, and one for leap-pangeo-full-access. They differ in the sense that one provides the node share options (1, 2, 4 CPU) and the other the higher options (1, 2, 4, 8, 16 CPU). A rough sketch of such a profile list is shown below.

Followup of actions
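The profile list sketch referenced above, assuming a KubeSpawner-based JupyterHub; display names, memory numbers, and the exact group-to-profile wiring are assumptions for illustration, not the actual LEAP configuration.

```python
# Hypothetical jupyterhub_config.py sketch of the two profiles described above.
# Display names, memory numbers, and group wiring are illustrative only.
c = get_config()  # noqa: F821 - provided by JupyterHub when loading the config


def node_share_choices(cpu_options):
    """Build one node-share choice per CPU count on a 16 CPU / 128 GB node."""
    return {
        f"cpu_{n}": {
            "display_name": f"~{n} CPU, ~{n * 8} GB RAM",
            "kubespawner_override": {
                "cpu_guarantee": n,
                "mem_guarantee": f"{n * 8}G",
                "mem_limit": f"{n * 8}G",
            },
        }
        for n in cpu_options
    }


c.KubeSpawner.profile_list = [
    {
        "display_name": "Base access (leap-pangeo-base-access)",
        "profile_options": {
            "node_share": {
                "display_name": "Node share",
                "choices": node_share_choices([1, 2, 4]),
            },
        },
    },
    {
        "display_name": "Full access (leap-pangeo-full-access)",
        "profile_options": {
            "node_share": {
                "display_name": "Node share",
                "choices": node_share_choices([1, 2, 4, 8, 16]),
            },
        },
    },
]
```

Restricting each profile to its matching GitHub team would be configured separately and is not shown here.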