
Iterate on what instance sizes we ensure are set up for all clusters #3307

Closed
consideRatio opened this issue Oct 24, 2023 · 3 comments
Labels
allocation:internal-eng tech:cloud-infra Optimization of cloud infra to reduce costs etc.

Comments

@consideRatio
Contributor

consideRatio commented Oct 24, 2023

This is a follow-up to #3304 about the node pools and associated instance sizes we ensure are set up for all our clusters, to enable us to flexibly make changes without needing additional infra work.

In #3304 (comment) @yuvipanda asked for a different set of instance sizes than initially suggested in the PR. This issue is about revisiting and refining, at some future point, the decision we ended up making. What set of instance sizes do we think makes sense to ensure all clusters have available?

In a private message @yuvipanda also asked for something smaller for the dask-gateway workers' instance size(s). Currently we have a single node pool set up for all clusters with daskhubs, using either n2-highmem-16 or r5.4xlarge. The history here is that we had many separate node pools before #2687, but the dask-gateway user couldn't choose between them, so the cluster-autoscaler decided what to scale up to handle a pending pod. This historically led to a mix of many instances with different capacities, often used inefficiently, for example 8 CPU workers running alone on 16 CPU nodes. The idea behind using a single node pool was to avoid this and other issues, while also making it easier to implement an initial version of #3344.
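The packing inefficiency described above can be sketched with some simple arithmetic. The vCPU counts below are real for the named instance types; the hourly price is a made-up placeholder, not a real quote.

```python
# Illustration of the inefficiency described above: a single 8-CPU
# dask worker pod scheduled alone on a 16-CPU node.
node_cpus = 16          # n2-highmem-16 and r5.4xlarge both have 16 vCPUs
worker_cpu_request = 8  # the pending worker pod's CPU request

utilization = worker_cpu_request / node_cpus
print(f"CPU utilization: {utilization:.0%}")  # 50%

# With an assumed $1.00/hour node price (placeholder), half of each
# node-hour is paid for but unused:
hourly_price = 1.00
wasted = hourly_price * (1 - utilization)
print(f"Wasted per node-hour: ${wasted:.2f}")  # $0.50
```

With a single node pool whose instance size matches the worker sizes users actually request, this gap shrinks.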

Context

@yuvipanda
Member

As another data point for why I think we should be using much smaller instances, I heard this from @jmunroe today:

I found that during the course itself, the cloud costs run at $1-$2/user/day for the two-week course, but rise to more like $5-$10/user/day when the participants are doing project-related work. The tricky bit is that during the course itself, the number of unique users is a factor of ten larger than the number who appear to be using the service for project work after the course.

I think in general our user node sizes are far too big, and we should definitely make them much smaller. The utilization information I provided in #3304 (comment) is aimed at reducing the total cost in cases like this one, as $5-$10/user/day is an order of magnitude worse than it should be.

@jmunroe
Contributor

jmunroe commented Dec 18, 2023

I have not looked into the details of what each unique user is doing. Assuming that each user always uses the same share of a node, I would think the increased cost is due to usage being distributed throughout a 24-hour period. 24 users using a hub during the same one hour will cost much less than each of those 24 using the same machine for 1 hour each, spread out over a day. Or perhaps unique users are using the service for much longer periods when doing project work.

I think that data about usage pattern (and actually amount of RAM used) is in our dataset -- we need to examine it in more detail.
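The cost intuition in the comment above can be sketched under some simplifying assumptions: each user needs the same share of a node for one hour, a node fits all 24 concurrent users, nodes are billed per node-hour, and idle nodes scale away immediately. The node price is a placeholder value.

```python
hourly_node_price = 1.00  # placeholder, not a real quote
users = 24

# Case 1: all 24 users arrive in the same hour -> one node-hour total.
concurrent_cost = hourly_node_price * 1
print(f"concurrent: ${concurrent_cost:.2f} total, "
      f"${concurrent_cost / users:.4f}/user")

# Case 2: one user per hour over 24 hours -> a node is up for
# 24 node-hours, mostly idle.
spread_cost = hourly_node_price * 24
print(f"spread out: ${spread_cost:.2f} total, "
      f"${spread_cost / users:.2f}/user")

# Same total user-hours of work, but 24x the cost per user.
```

Under these assumptions, spread-out usage is 24x more expensive per user than concentrated usage, which is the same order-of-magnitude gap as the $1-$2 vs. $5-$10/user/day figures quoted above.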

@consideRatio
Contributor Author

We went with 4 / 16 / 64 CPU highmem machines.
