Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Transition to using only a single node pool for dask-gateway workers (16 CPU, highmem) #2687

Closed
1 of 2 tasks
consideRatio opened this issue Jun 21, 2023 · 1 comment · Fixed by #3024
Closed
1 of 2 tasks

Comments

@consideRatio
Copy link
Contributor

consideRatio commented Jun 21, 2023

Currently, users of dask-gateway don't specify the machine type, they just create worker pods that get scheduled on some node that is being started up.

When dask-gateway is used, many many nodes can get started. Seeing for example 200 nodes isn't unusual in pangeo-hubs or leap.

I'd like us to transition from having varying amounts of node pools for dask workers in different clusters (n1-standard-4, n1-standard-8, n1-standard-16 for example), to having a single node pool of n2-highmem-16 on GCP and r5.4xlarge on AWS.

Like this:

  • we can reduce the amount of nodes needed for X workers by providing on average larger nodes
    • this can reduce prometheus-server memory use blowing up because it has to scrape fewer nodes separate node-exporter pods for metrics
  • we can make assumptions on the kind of cpu and memory requests that can make sense to make for users, and we help ensure that pods will fit well (its otherwise easy to request 51% of a nodes capacity etc, making only one pod fit).
    • Without helping users like this, they end up making requests like request/limit 4 CPU or similar, which would fit 3.9 pods on a 16 CPU node, which means in practice only 3. I have seen various DaskCluster's created where users end up doing these things which makes them use the resources ineffectively.

Relevant

Action points

  • Transition clusters infra
  • Document this policy for users
@consideRatio
Copy link
Contributor Author

consideRatio commented Aug 7, 2023

Clusters not having a single 16 CPU highmem node pool for dask workers

GCP

The machine type proposed should be n2-highmem-16

  • pilot-hubs has a single n1-highmem-4 (now n2-highmem-16)
  • pangeo-hubs has three different n1-standard-[4|8|16]
  • meom-ige has five different n1-standard-[2|8|16|32|64] (now n2-highmem-16)
  • m2lines has four different n1-standard-[2|4|8|16] (now n2-highmem-16)
  • linked-earth has one different e2-highmem-16 (now n2-highmem-16)
  • leap has n2-highmem-16
  • cloudbank has n1-highmem-4 (now n2-highmem-16)
  • awi-ciroh has n2-highmem-16
  • daskhub-template has n2-highmem-16

AWS

The machine type proposed should be r5.4xlarge

  • 2i2c-aws-us has r5.4xlarge
  • carbonplan has four different r5 sizes (now r5.4xlarge)
  • gridsst has r5.4xlarge
  • jupyter-meets-the-earth has three different m5 (r5 is highmem, m5 is not) (now r5.4xlarge)
  • nasa-cryo has r5.4xlarge
  • nasa-ghg has r5.4xlarge
  • nasa-veda has r5.4xlarge
  • openscapes has three different r5 (now r5.4xlarge)
  • smithsonian has r5.4xlarge
  • template has r5.4xlarge
  • victor has r5.4xlarge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
No open projects
Archived in project
Development

Successfully merging a pull request may close this issue.

1 participant