Transition to using only a single node pool for dask-gateway workers (16 CPU, highmem) #2687

consideRatio · 2023-06-21T17:17:36Z

Currently, users of dask-gateway don't specify the machine type; they just create worker pods that get scheduled on some node that is started up for them.

When dask-gateway is used, a large number of nodes can get started. Seeing 200 nodes, for example, isn't unusual in pangeo-hubs or leap.

I'd like us to transition from having a varying number of node pools for dask workers in different clusters (n1-standard-4, n1-standard-8, n1-standard-16 for example), to having a single node pool of n2-highmem-16 on GCP and r5.4xlarge on AWS.

This way:

- we can reduce the number of nodes needed for X workers by providing larger nodes on average
  - this can keep prometheus-server memory use from blowing up, because it has fewer nodes, each with a separate node-exporter pod, to scrape for metrics
- we can make assumptions about what CPU and memory requests make sense for users, and we help ensure that pods will fit well (it's otherwise easy to request 51% of a node's capacity etc., making only one pod fit)
  - Without helping users like this, they end up making requests like request/limit 4 CPU or similar, which would fit 3.9 pods on a 16 CPU node, which in practice means only 3. I have seen various DaskClusters created where users end up doing this, which makes them use the resources inefficiently.
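The packing arithmetic above can be sketched in a few lines of Python. This is a minimal illustration, not code from the infrastructure repository: the function names are hypothetical, and the allocatable figure of 15600 millicores is an assumption picked only to match the "3.9 pods" figure (the real allocatable capacity of a 16 CPU node depends on the cloud provider and on what system pods reserve).

```python
import math

# ASSUMPTION: allocatable CPU on a 16 CPU node, in millicores. Chosen so that
# a 4 CPU request fits 3.9 times, as in the example above. Real values differ.
ALLOCATABLE_MCPU = 15_600


def pods_per_node(allocatable_mcpu: int, request_mcpu: int) -> int:
    """Whole worker pods that fit on one node, considering CPU requests only."""
    return allocatable_mcpu // request_mcpu


def nodes_needed(workers: int, allocatable_mcpu: int, request_mcpu: int) -> int:
    """Nodes required to schedule `workers` pods with identical CPU requests."""
    per_node = pods_per_node(allocatable_mcpu, request_mcpu)
    if per_node == 0:
        raise ValueError("request is larger than a node's allocatable CPU")
    return math.ceil(workers / per_node)


def requested_fraction(allocatable_mcpu: int, request_mcpu: int) -> float:
    """Fraction of a full node's allocatable CPU that the packed pods request."""
    per_node = pods_per_node(allocatable_mcpu, request_mcpu)
    return per_node * request_mcpu / allocatable_mcpu


if __name__ == "__main__":
    # Hypothetical request sizes: 4 CPU, slightly under 4 CPU, and 51% of a node.
    for request in (4_000, 3_900, 7_956):
        print(
            request,
            pods_per_node(ALLOCATABLE_MCPU, request),
            nodes_needed(200, ALLOCATABLE_MCPU, request),
            round(requested_fraction(ALLOCATABLE_MCPU, request), 2),
        )
```

Under these assumptions, a 4 CPU request leaves roughly a quarter of each node unrequested, while a request sized slightly below a quarter of the allocatable CPU fits four workers per node. Memory requests would need the same check; the sketch looks at CPU only.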

Relevant

This would enable the dask-gateway config from "daskhub / leap: optimize dask-gateway options for a 16 CPU / 128 GB mem worker" (#2364) to be adopted by all daskhub clusters.

Action points

- Transition the clusters' infra
- Document this policy for users

consideRatio · 2023-08-07T09:24:09Z

github-project-automation bot added this to DEPRECATED Engineering and Product Backlog Jun 21, 2023

github-project-automation bot moved this to Needs Shaping / Refinement in DEPRECATED Engineering and Product Backlog Jun 21, 2023

consideRatio added tech:dask-gateway tech:prometheus labels Jun 21, 2023

consideRatio mentioned this issue Aug 4, 2023

Phasing out use of n1 machines in favor of n2 machines on GKE #2923

Closed

19 tasks

consideRatio mentioned this issue Aug 11, 2023

gcp/aws, dask worker nodes: towards single r5.4xlarge/n2-highmem-16 dask worker node pool #2974

Merged

consideRatio mentioned this issue Aug 24, 2023

gcp, dask-worker-nodes: pangeo-hubs to use single dask worker node type #3024

Merged

consideRatio closed this as completed in #3024 Aug 25, 2023

github-project-automation bot moved this from Needs Shaping / Refinement to Complete in DEPRECATED Engineering and Product Backlog Aug 25, 2023

damianavila added this to Sprint Board Aug 25, 2023

damianavila moved this to Done 🎉 in Sprint Board Aug 25, 2023

damianavila assigned consideRatio Aug 25, 2023

This was referenced Oct 31, 2023

daskhub: provide worker resource options for 16CPU/128GB nodes on GKE/EKS #3344

Merged

Iterate on what instance sizes we ensure is setup for all clusters #3307

Closed

jnywong mentioned this issue May 20, 2024

Create how-to on using dask-gateway for communities 2i2c-org/docs#224

Closed
