Iterate on what instance sizes we ensure is setup for all clusters #3307
Comments
As another data point for why I think we should be using much smaller instances, I heard this from @jmunroe today:
I think in general our user node sizes are far too big, and we should definitely make them much smaller. The utilization information I provided in #3304 (comment) is about reducing the total cost for things like this, as $5-10/user/day is an order of magnitude worse than what it should be.
I have not looked into the details of what each unique user is doing. Assuming that each user is always using the same share of a node, I would think the increased cost is due to usage being distributed throughout a 24 hour period. 24 users using a hub during the same one hour will cost much less than each of those 24 using the same machine for 1 hour each spread out over a day. Or perhaps unique users are using the service for much longer periods when they were doing project work. I think that data about usage patterns (and the actual amount of RAM used) is in our dataset -- we need to examine it in more detail.
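A back-of-the-envelope sketch of that effect, with made-up numbers (the hourly node price and per-node user capacity below are purely illustrative, not from our billing data):

```python
# Illustrative comparison of concurrent vs. spread-out usage.
# The hourly price and per-node capacity are made-up numbers.
node_price_per_hour = 1.00  # hypothetical cost of one large user node
users_per_node = 24         # hypothetical number of users that fit on it

# Case 1: all 24 users show up in the same hour -> one node-hour total.
concurrent_cost_per_user = node_price_per_hour * 1 / users_per_node

# Case 2: the same 24 users each use the hub for one hour, spread out over
# a day -> a node is up for ~24 node-hours to serve the same 24 user-hours.
spread_cost_per_user = node_price_per_hour * 24 / users_per_node

print(f"concurrent: ${concurrent_cost_per_user:.3f}/user/day")  # ~$0.042
print(f"spread out: ${spread_cost_per_user:.3f}/user/day")      # ~$1.000
```

Same total amount of user time, but roughly a 24x difference in cost per user -- the kind of gap that could explain an order-of-magnitude difference in $/user/day.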
We went with 4 / 16 / 64 CPU highmem machines.
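For reference, a sketch of how those CPU counts map onto the highmem instance families mentioned below (n2-highmem on GCP, r5 on AWS); the exact pairings are my reading of the providers' naming schemes, not a confirmed record of what was deployed:

```python
# Sketch: highmem instance names corresponding to 4 / 16 / 64 CPUs.
# Pairings follow GCP's n2-highmem-<vCPUs> and AWS's r5 sizing
# (roughly 32 / 128 / 512 GB of memory); illustrative only.
HIGHMEM_INSTANCES = {
    4: {"gcp": "n2-highmem-4", "aws": "r5.xlarge"},
    16: {"gcp": "n2-highmem-16", "aws": "r5.4xlarge"},
    64: {"gcp": "n2-highmem-64", "aws": "r5.16xlarge"},
}
```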
Context

This is a followup to #3304 about the node pools and associated instance sizes we ensure are set up for all our clusters, to enable us to flexibly make changes without needing additional infra work.

In #3304 (comment) @yuvipanda asked for a different set of instance sizes than initially suggested in the PR. This issue is about looking back later and refining the decision we ended up making. What set of instance sizes do we think makes sense to ensure all clusters have available?
In a private message @yuvipanda also wanted something smaller for the dask-gateway workers' instance size(s). Currently we have a single node pool set up for all clusters, with daskhubs having either n2-highmem-16 or r5.4xlarge. The history here is that we had many separate node pools before #2687, but the dask-gateway user couldn't choose between them, so the cluster-autoscaler decided what to scale up to handle a pending pod. This historically led to a mix of many instances with different capacities, often used inefficiently, for example 8 CPU workers running alone on 16 CPU nodes (see the sketch below). The idea behind using a single node pool was to avoid this and other issues while also making it easier to implement an initial version of #3344.
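To make that inefficiency concrete, here is a minimal sketch of the waste when a worker lands alone on an oversized node; the helper function is hypothetical, not part of our tooling:

```python
# Minimal sketch: fraction of a node's CPUs actually requested by the
# dask-gateway worker pods scheduled onto it.
def node_utilization(worker_cpus: int, node_cpus: int, workers_on_node: int = 1) -> float:
    return (worker_cpus * workers_on_node) / node_cpus

# One 8 CPU worker alone on a 16 CPU node: half the node is paid for
# but never requested by any pod.
print(node_utilization(worker_cpus=8, node_cpus=16))  # 0.5

# The same worker on a right-sized 8 CPU node.
print(node_utilization(worker_cpus=8, node_cpus=8))   # 1.0
```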