Transition LEAP cloud infra to use shared nodes #2209
Comments
Thanks for working on this. Please let me know before any changes are made, so I can make sure that:
Quick update: I have created and populated the new teams in our org (https://github.com/orgs/leap-stc/teams/leap-pangeo-users/teams). These are non-overlapping at the moment.
@jbusecke I suggest that I work on this on Monday morning until midday, Swedish time (UTC+1). Is that okay?
There are currently three running users, so I won't cancel their sessions and start an upgrade right now. @jbusecke should we aim for next week on Monday morning, Swedish time (UTC+1), before you wake up in UTC-6?
Sorry for the delay. I was really sick all last week and not able to stare at a screen. I have notified the LEAP community to log out and stop any long-running computation before Sun Mar 12, and said that service should be back to normal on Monday after 8 AM. Is that correct @consideRatio? I will make another announcement closer to the weekend. Thank you for working on this!
@jbusecke absolutely correct!! And you have populated the GitHub teams.
I still see some discrepancies between the old/new teams, but I will figure this out on my end.
@jbusecke I've now performed the most disruptive maintenance. As part of this, only users of either … In this issue, the part about the dask-gateway option remains, but the other parts are addressed. As part of doing the maintenance in #2237, I also noticed an optimization made for a workshop a while back. After the workshop, when nodes were no longer started ahead of users arriving, that optimization was likely slowing down startup of user servers unless they started the tensorflow image specifically. So I think in this setup there are now three reasons for faster startup.
If you wish for even faster startups and better performance, I suggest revising the CPU limitation we agreed on for this maintenance. I've thought about this tricky topic a lot and summarized some of those thoughts in #2228 and in this FIXME note.
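To make the shared-node idea above concrete, here is a minimal sketch of how user pods could be pinned to one shared node pool with a single pre-pulled image, assuming a KubeSpawner-based JupyterHub. The node label, image tag, and values are illustrative assumptions, not the actual LEAP configuration.

```python
# Hypothetical jupyterhub_config.py fragment (KubeSpawner). All values below
# are illustrative assumptions, not the real LEAP hub configuration.
c = get_config()  # noqa: F821 - provided by JupyterHub when loading the config

# Schedule every user pod onto the shared n2-highmem-16 node pool, so servers
# start on nodes that are typically already running.
c.KubeSpawner.node_selector = {
    "hub.jupyter.org/node-purpose": "user",  # assumed label on the shared pool
}

# Use one default image so it can be pre-pulled and kept warm on those nodes,
# avoiding image-pull delays when a server starts.
c.KubeSpawner.image = "pangeo/pangeo-notebook:latest"  # illustrative tag
```

With a setup along these lines, a new server mostly waits for the pod to be scheduled rather than for a node to boot or an image to download.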
So far this is working great. Many thanks @consideRatio. I think the startup times, as I (and others) experience them, are absolutely sufficient at the moment.
Is there anything that is required from my side to move this change forward?
No, I think nothing! @jbusecke do you feel strongly about the CPU limits? I want to make sure you have a good user experience, and I think limiting CPU can be a significant drawback even for users who request a lot of CPU in order not to be limited - that is because they are then less likely to fit on an existing node and may end up waiting for node startup, image pulling, etc. Overall, it currently seems like a lose / lose / lose in terms of UX / cost / energy efficiency to me.
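A quick back-of-the-envelope sketch of the packing trade-off described above; the allocatable-CPU number is a rough assumption and the sketch ignores memory and system overhead.

```python
# Rough illustration only: how many user servers fit on one shared node as the
# per-user CPU request grows. Ignores memory, daemonsets, and system reserves.
NODE_CPU = 16  # approximate allocatable CPU on an n2-highmem-16 node

for request in (1, 2, 4, 8, 16):
    users_per_node = NODE_CPU // request
    print(f"request={request:>2} CPU -> about {users_per_node} user server(s) per node")
```

Once a user no longer fits on a running node, a new node has to be brought up, which is where the extra startup wait comes from.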
I think I would like to give people some time to give feedback on their experience before changing things more, but I am certainly open to iterating on this further. From my perspective, the dask-gateway refactor is of higher priority. I am not sure if we should track that in a different issue, to keep things neat and enable an ongoing discussion here about the CPU limits?
Okay! Let us know if you want to remove the CPU limit or, for example, have it be 4x the requested CPU or similar.
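For concreteness, a "limit = 4x the request" choice could look roughly like the KubeSpawner override below; the numbers are made up for illustration and are not the hub's actual settings.

```python
# Hypothetical per-profile override: guarantee a small CPU share for packing,
# and let the user burst up to 4x that guarantee. Values are illustrative.
cpu_request = 2  # CPU the scheduler reserves for this user

kubespawner_override = {
    "cpu_guarantee": cpu_request,
    "cpu_limit": 4 * cpu_request,  # burstable ceiling, 4x the guarantee
    "mem_guarantee": "16G",
    "mem_limit": "16G",
}
```

The guarantee keeps node packing predictable, while the higher limit lets otherwise idle CPU on the node be used opportunistically.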
I'm closing this issue now; the dask-gateway work is now represented by #2364 and #2051. I'm not able to prioritize that with my own time at the moment =/
Together with @jbusecke, I've distilled a support ticket to [improve user startup times] and this request for information, among other things, into these technical work items.
Node pool changes
The user node pool now uses the n2-highmem-16 machine type (16 CPU, 128 GB mem).

New profile list
One profile for the leap-pangeo-base-access GitHub group, and one for leap-pangeo-full-access. They differ in the sense that one provides the node share options (1, 2, 4 CPU) and the other the higher options (1, 2, 4, 8, 16 CPU). A rough sketch of such a profile list is shown below.

Followup of actions
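The profile list sketch referenced above, assuming a KubeSpawner-based JupyterHub; display names, memory numbers, and the exact group-to-profile wiring are assumptions for illustration, not the actual LEAP configuration.

```python
# Hypothetical jupyterhub_config.py sketch of the two profiles described above.
# Display names, memory numbers, and group wiring are illustrative only.
c = get_config()  # noqa: F821 - provided by JupyterHub when loading the config


def node_share_choices(cpu_options):
    """Build one node-share choice per CPU count on a 16 CPU / 128 GB node."""
    return {
        f"cpu_{n}": {
            "display_name": f"~{n} CPU, ~{n * 8} GB RAM",
            "kubespawner_override": {
                "cpu_guarantee": n,
                "mem_guarantee": f"{n * 8}G",
                "mem_limit": f"{n * 8}G",
            },
        }
        for n in cpu_options
    }


c.KubeSpawner.profile_list = [
    {
        "display_name": "Base access (leap-pangeo-base-access)",
        "profile_options": {
            "node_share": {
                "display_name": "Node share",
                "choices": node_share_choices([1, 2, 4]),
            },
        },
    },
    {
        "display_name": "Full access (leap-pangeo-full-access)",
        "profile_options": {
            "node_share": {
                "display_name": "Node share",
                "choices": node_share_choices([1, 2, 4, 8, 16]),
            },
        },
    },
]
```

Restricting each profile to its matching GitHub team would be configured separately and is not shown here.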