Currently, users of dask-gateway don't specify the machine type. They just create worker pods, which get scheduled on whatever node is being started up.
When dask-gateway is used, a very large number of nodes can get started. Seeing 200 nodes, for example, isn't unusual in pangeo-hubs or leap.
I'd like us to transition from having varying sets of node pools for dask workers in different clusters (n1-standard-4, n1-standard-8, n1-standard-16 for example) to having a single node pool of n2-highmem-16 on GCP and r5.4xlarge on AWS.
Like this:
- we can reduce the number of nodes needed for X workers by providing larger nodes on average
- this can keep prometheus-server's memory use from blowing up, because with fewer nodes it has fewer separate node-exporter pods to scrape for metrics
- we can make assumptions about which CPU and memory requests make sense for users, and we help ensure that pods fit well (it's otherwise easy to request 51% of a node's capacity etc., so that only one pod fits)
Without helping users like this, they end up making requests like request/limit 4 CPU or similar. Because part of a node's CPU is reserved and not allocatable to pods, that fits 3.9 pods on a 16 CPU node, which in practice means only 3. I have seen various DaskClusters created where users end up doing this, which makes them use resources inefficiently.
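
To make the packing arithmetic concrete, here is a minimal sketch. The allocatable values are assumptions for illustration only: 15.6 CPU is chosen to match the "3.9 pods" figure above, and 115 GiB is a made-up number for a 128 GB node. The helper names `pods_per_node` and `even_split` are hypothetical and not part of dask-gateway.

```python
import math

# Assumed allocatable resources on a 16 CPU / 128 GB node after system
# reservations. Illustrative values only, not measured on a real node.
ALLOC_CPU = 15.6
ALLOC_MEM_GIB = 115.0


def pods_per_node(alloc_cpu, alloc_mem_gib, req_cpu, req_mem_gib):
    """Whole pods that fit on one node, limited by the scarcer resource."""
    by_cpu = math.floor(alloc_cpu / req_cpu)
    by_mem = math.floor(alloc_mem_gib / req_mem_gib)
    return min(by_cpu, by_mem)


def even_split(alloc_cpu, alloc_mem_gib, pods):
    """Per-pod requests that divide the allocatable resources evenly."""
    return alloc_cpu / pods, alloc_mem_gib / pods


# A "natural looking" 4 CPU request only fits 3 pods on the node.
naive = pods_per_node(ALLOC_CPU, ALLOC_MEM_GIB, req_cpu=4, req_mem_gib=16)

# Requests derived from the allocatable capacity fit exactly 4 pods.
cpu, mem = even_split(ALLOC_CPU, ALLOC_MEM_GIB, pods=4)
packed = pods_per_node(ALLOC_CPU, ALLOC_MEM_GIB, cpu, mem)

print(f"4 CPU requests: {naive} pods per node")
print(f"{cpu} CPU / {mem} GiB requests: {packed} pods per node")
```

With a single known machine type per cloud, worker sizes like these could be computed once and offered to users as presets, rather than leaving each user to guess requests that happen to pack well.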
Relevant
Action points