
Debug node autoprovisioning: did not match Pod's node affinity #677

Closed · sgibson91 opened this issue Sep 13, 2021 · 7 comments
@sgibson91 (Member) commented Sep 13, 2021:

Description

In #670 we enabled node auto-provisioning. In practice, when we try to create pods we see the following events:

Events:
  Type     Reason             Age   From                 Message
  ----     ------             ----  ----                 -------
  Warning  FailedScheduling   38s   prod-user-scheduler  0/2 nodes are available: 2 node(s) didn't match node selector.
  Warning  FailedScheduling   38s   prod-user-scheduler  0/2 nodes are available: 2 node(s) didn't match node selector.
  Normal   NotTriggerScaleUp  38s   cluster-autoscaler   pod didn't trigger scale-up: 1 node(s) didn't match Pod's node affinity

And this is preventing any new node from coming up.
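
To see the mismatch directly, the selector on the stuck pod can be compared with the labels on the existing nodes (a sketch; jupyter-sgibson91 is the user pod used as an example later in this thread):

# What the stuck pod is asking for
kubectl get pod jupyter-sgibson91 -o jsonpath='{.spec.nodeSelector}'

# What labels the existing nodes actually carry
kubectl get nodes --show-labels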

Value / benefit

We need to spin nodes up!

Implementation details

No response

Tasks to complete

No response

Updates

  • 2021-09-13 - We've decided to manually provision the Pangeo cluster for now, so we have a bit more time to debug this one. Bumped this down to medium impact.
@sgibson91 (Member Author) commented Sep 13, 2021:

From my reading of these docs, the auto-provisioner should create nodes with the same tolerations/node selectors as the pod that is trying to spin up: https://cloud.google.com/kubernetes-engine/docs/how-to/node-auto-provisioning#workload_separation
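
For reference, the workload-separation pattern in those docs pairs a nodeSelector with a matching toleration on the pod, and the auto-provisioner is then supposed to create a node pool carrying that label and taint. A rough sketch of what that looks like on a user pod, using the usual z2jh label/taint keys (quoted from memory, not from our rendered config):

spec:
  nodeSelector:
    hub.jupyter.org/node-purpose: user
  tolerations:
    - key: hub.jupyter.org/dedicated
      operator: Equal
      value: user
      effect: NoSchedule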

@sgibson91 (Member Author):

I did the following:

kubectl get pod jupyter-sgibson91 -o yaml > mypod.yaml

Edited the YAML and removed the nodeSelector

kubectl apply -f mypod.yaml

And that seemed to spin up fine.
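
Roughly the same experiment as a one-liner, assuming yq v4 is available and the original Pending pod is deleted first (a pod's nodeSelector can't be changed in place):

kubectl get pod jupyter-sgibson91 -o yaml \
  | yq eval 'del(.spec.nodeSelector)' - \
  > mypod.yaml
kubectl delete pod jupyter-sgibson91
kubectl apply -f mypod.yaml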

@sgibson91 (Member Author) commented Sep 13, 2021:

So, nodes can't be spun up because the node auto-provisioner is expecting to create nodes with the label node-purpose: core (which it gets from the core pool) and the pods want to be scheduled to node-purpose: user.

In the pangeo hub config file, we've tried setting the following:

singleuser:
  nodeSelector:
    hub.jupyter.org/node-purpose: ""

singleuser:
  nodeSelector: {}

singleuser:
  nodeSelector: null

But none of those were successful at removing the node selector from the user pod.

Instead, I removed the following lines from our basehub chart, and that got us to a place where user pods could be scheduled and would start up, but they'd always be assigned to the core pool. It turns out that the core pool had enough free space even for our largest machines, so I've still not successfully triggered a node auto-provisioning event yet.

https://github.com/2i2c-org/pilot-hubs/blob/658ab0bf507ab35eedd95ac45147e0b0e1babf6e/hub-templates/basehub/values.yaml#L135-L136
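
A quick way to confirm where user pods are actually landing (and hence why auto-provisioning never needs to kick in) is to check the GKE node-pool label on each node, e.g.:

# Which node did the user pod land on?
kubectl get pod jupyter-sgibson91 -o wide

# Which pool does each node belong to?
kubectl get nodes -L cloud.google.com/gke-nodepool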

@tylerpotts:

@sgibson91 For what it's worth, in QHub we haven't tried node auto-provisioning. Instead we have explicitly defined node pools and pods get scheduled onto them. Wish we could be of more help here.

@sgibson91 (Member Author):

Thanks @tylerpotts - that is 2i2c's default too. But when I raised the question about appropriate machine sizes for those pools in #666, we found out Pangeo are using auto-provisioning, and we didn't really have any data to hand for optimising the machine sizes to the expected load.

@tylerpotts:

@sgibson91 We have also been struggling recently with the problem of matching workload to node size. For the most part we have been allocating a single node per user pod/dask pod, which has helped somewhat on the larger-scale clusters.

As for determining the allocatable resources available on the nodes, we have quite a bit of research detailed here that you may find useful: nebari-dev/nebari#792. Unfortunately there doesn't seem to be a linear formula, as Kubernetes reserves variable amounts of millicpu and RAM depending on the size of the node.
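
One way to see that gap per node is to compare capacity with allocatable directly, e.g. (just a sketch):

# Capacity = what the VM has; allocatable = what pods can actually request.
# The difference is what Kubernetes/GKE reserves for that node size.
kubectl get nodes -o custom-columns=NAME:.metadata.name,CPU_CAP:.status.capacity.cpu,CPU_ALLOC:.status.allocatable.cpu,MEM_CAP:.status.capacity.memory,MEM_ALLOC:.status.allocatable.memory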

@choldgraf (Member):

I've added an update to the top comment, to reflect that we're manually provisioning the Pangeo hub for now!
