
Iterate on what instance sizes we ensure are set up for all clusters #3307

Closed
consideRatio opened this issue Oct 24, 2023 · 3 comments
Labels
allocation:internal-eng tech:cloud-infra Optimization of cloud infra to reduce costs etc.

Comments

@consideRatio
Contributor

consideRatio commented Oct 24, 2023

This is a follow-up to #3304 about the node pools and associated instance sizes we ensure are set up for all our clusters, to enable us to flexibly make changes without needing additional infra work.

In #3304 (comment) @yuvipanda asked for a different set of instance sizes than initially suggested in the PR. This issue is about revisiting and refining, at some future point, the decision we ended up making. What set of instance sizes do we think makes sense to ensure all clusters have available?

In a private message @yuvipanda also asked for something smaller for the dask-gateway workers' instance size(s). Currently we have a single node pool set up for all clusters with daskhubs, using either n2-highmem-16 or r5.4xlarge. The history here is that we had many separate node pools before #2687, but the dask-gateway user couldn't choose between them, so the cluster-autoscaler decided what to scale up to handle a pending pod. This historically led to a mix of many instances with different capacities, often used inefficiently, for example 8 CPU workers running alone on 16 CPU nodes. The idea behind using a single node pool was to avoid this and other issues, while also making it easier to implement an initial version of #3344.
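The packing inefficiency described above can be sketched with some simple arithmetic. The vCPU counts below are real for the named instance types; the hourly price is a made-up placeholder, not a real quote.

```python
# Illustration of the inefficiency described above: a single 8-CPU
# dask worker pod scheduled alone on a 16-CPU node.
node_cpus = 16          # n2-highmem-16 and r5.4xlarge both have 16 vCPUs
worker_cpu_request = 8  # the pending worker pod's CPU request

utilization = worker_cpu_request / node_cpus
print(f"CPU utilization: {utilization:.0%}")  # 50%

# With an assumed $1.00/hour node price (placeholder), half of each
# node-hour is paid for but unused:
hourly_price = 1.00
wasted = hourly_price * (1 - utilization)
print(f"Wasted per node-hour: ${wasted:.2f}")  # $0.50
```

With a single node pool whose instance size matches the worker sizes users actually request, this gap shrinks.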

Context

@yuvipanda
Member

As another data point for why I think we should be using much smaller instances, I heard this from @jmunroe today:

I found that during the course itself, the cloud costs run at $1-$2/user/day for the two-week course, but rise to more like $5-$10/user/day when the participants are doing project-related work. The tricky bit is that during the course itself, the number of unique users is a factor of ten larger than the number who appear to be using the service for project work after the course.

I think in general our user node sizes are far too big, and we should definitely make them much smaller. The utilization information I provided in #3304 (comment) is aimed at reducing the total cost in cases like this one, as $5-$10/user/day is an order of magnitude worse than it should be.

@jmunroe
Contributor

jmunroe commented Dec 18, 2023

I have not looked into the details of what each unique user is doing. Assuming that each user always uses the same share of a node, I would think the increased cost is due to usage being distributed throughout a 24-hour period. 24 users using a hub during the same one hour will cost much less than each of those 24 using the same machine for 1 hour each, spread out over a day. Or perhaps unique users are using the service for much longer periods when doing project work.

I think that data about usage pattern (and actually amount of RAM used) is in our dataset -- we need to examine it in more detail.
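The cost intuition in the comment above can be sketched under some simplifying assumptions: each user needs the same share of a node for one hour, a node fits all 24 concurrent users, nodes are billed per node-hour, and idle nodes scale away immediately. The node price is a placeholder value.

```python
hourly_node_price = 1.00  # placeholder, not a real quote
users = 24

# Case 1: all 24 users arrive in the same hour -> one node-hour total.
concurrent_cost = hourly_node_price * 1
print(f"concurrent: ${concurrent_cost:.2f} total, "
      f"${concurrent_cost / users:.4f}/user")

# Case 2: one user per hour over 24 hours -> a node is up for
# 24 node-hours, mostly idle.
spread_cost = hourly_node_price * 24
print(f"spread out: ${spread_cost:.2f} total, "
      f"${spread_cost / users:.2f}/user")

# Same total user-hours of work, but 24x the cost per user.
```

Under these assumptions, spread-out usage is 24x more expensive per user than concentrated usage, which is the same order-of-magnitude gap as the $1-$2 vs. $5-$10/user/day figures quoted above.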

@consideRatio
Contributor Author

We went with 4 / 16 / 64 CPU highmem machines.
