
Scaling from 0 doesn't work for GPU nodes. Reasons: Insufficient nvidia.com/gpu #5278

Closed
VitaliiBabii opened this issue Oct 28, 2022 · 23 comments
Labels
area/cluster-autoscaler kind/bug lifecycle/rotten

Comments

@VitaliiBabii

VitaliiBabii commented Oct 28, 2022

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
1.22.3

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
```
serverVersion:
  buildDate: "2022-06-16T05:33:55Z"
  compiler: gc
  gitCommit: 5824e3251d294d324320db85bf63a53eb0767af2
  gitTreeState: clean
  gitVersion: v1.22.11
  goVersion: go1.16.15
  major: "1"
  minor: "22"
  platform: linux/amd64
```

What environment is this in?:
We are using AWS EC2 instances to run kubernetes

For GPU we use instances g4dn.2xlarge and g4dn.4xlarge

  Kernel Version:             4.18.0-372.26.1.el8_6.x86_64
  OS Image:                   Rocky Linux 8.6 (Green Obsidian)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.22.5
  Kubelet Version:            v1.22.11
  Kube-Proxy Version:         v1.22.11

What did you expect to happen?:
Cluster Autoscaler scales the GPU Auto Scaling Group up from 0 when new pods request GPU.

What happened instead?:
Cluster Autoscaler does not trigger a scale-up from 0 nodes.

When the ASG has 0 nodes and new pods that request GPU are created, Cluster Autoscaler (CA) does not trigger a scale-up.
In the CA logs I can see a message like:
I1027 11:02:58.428887 1 scale_up.go:300] Pod cuda-vector-add can't be scheduled on 2.gpu-small.nodes.aws-us-east, predicate checking error: Insufficient nvidia.com/gpu; predicateName=NodeResourcesFit; reasons: Insufficient nvidia.com/gpu; debugInfo=
After a CA restart it works as expected, but after some time CA again stops scaling from 0.
It looks like, once the ASG is scaled to 0, CA at some point stops seeing its GPU resources.

How to reproduce it (as minimally and precisely as possible):
With the ASG scaled to 0, just try to run a GPU pod like:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  annotations:
    workload: nvidia-gpu
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=10.0"
      resources:
        limits:
          cpu: 8
          memory: 32Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 4
          memory: 24Gi
          nvidia.com/gpu: 1
  nodeSelector:
    workload: walle-gpu-small
  tolerations:
  - effect: NoSchedule
    key: workload
    operator: Equal
    value: walle-gpu-small
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists

ASG tags like

k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: true
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: 1

don't work either.
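For reference, a fuller scale-from-zero template tag set for this node group, based on my reading of the cluster-autoscaler AWS provider docs, would be something like the sketch below. The workload label and taint values are taken from the pod spec above, and the k8s.amazonaws.com/accelerator value is an assumption for g4dn (T4) instances, so treat this as a sketch rather than a confirmed fix:

```
# Sketch of ASG tags for scaling this node group from 0.
# Label/taint values are taken from the pod spec above; the accelerator value is assumed.
k8s.io/cluster-autoscaler/node-template/label/workload: walle-gpu-small
# Taint tags use the value:effect format, per my reading of the AWS provider docs.
k8s.io/cluster-autoscaler/node-template/taint/workload: walle-gpu-small:NoSchedule
k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: nvidia-tesla-t4
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
```

Even with the resource tag in place, though, the scale-up only starts working again after CA is restarted.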

@VitaliiBabii added the kind/bug label Oct 28, 2022
@vukor

vukor commented Oct 31, 2022

The same issue was fixed by recreating cluster-autoscaler pod. CA image:

k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

@VitaliiBabii
Author

The same issue was fixed by recreating cluster-autoscaler pod. CA image:

k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

For us, recreating the pod fixes the issue only for several hours. It looks like some CA caching bug.

@fradee

fradee commented Dec 9, 2022

Any updates?

@cha7ri

cha7ri commented Dec 15, 2022

We are facing a similar issue (using cluster-autoscaler:v1.21.3 and k8s version v1.21.14)

  • When the ASG size is 0, CA will not scale it up until we restart CA
  • The same issue happens each time the ASG size goes to 0

We tried the tag below, but it doesn't help:

k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: 1

@mucahitkantepe

We are hitting the same issue, any plans?

@qianlei90
Contributor

maybe duplicate: #3802

@DavidFluck

DavidFluck commented Jan 27, 2023

I usually try not to pile on, but for what it's worth, I'm seeing this behavior as well with image version k8s.gcr.io/autoscaling/cluster-autoscaler:v1.24.0, also on a g4dn.xlarge running Amazon Linux. (I can get more specific version information soon, if it's helpful.) As with others, restarting the pod fixes the issue for a while, but after a scale down event, CA then can't scale back up. Let me know if there's anything I can do.

@zemliany

any updates?

@paleozogt

@Zemlyanoy looks like a fix got merged into master (GH-5214). I'm not sure when it's going to be properly released, though.

@jcstanaway

#5214 requires users to add tags to ASGs. Why would CA rely on ASG tags for GPU resource information when it either queries AWS to describe instance types or uses static instance type descriptions, and so already has the GPU resource information?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 5, 2023
@alantang888

Yeah, when the ASG has 1 or more nodes, CA is able to scale it up, but when the ASG has 0 nodes, it can't. Why? The resource info (CPU, memory, GPU) is already recorded in the code (e.g. for G4DN).

This should be considered a bug.

Adding the resources as ASG tags doesn't make sense in this case, unless it is a new instance type that CA has no info for, or some additional resource not included in CA's code.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 9, 2023
@giany

giany commented Aug 22, 2023

Having a similar issue, but with smarter-devices. On my end the error is: reasons: Insufficient smarter-devices/fuse. Full message:

I0822 06:47:05.800276       1 orchestrator.go:466] Pod fcd-tv0b9cba2b-c426-4c28-9975-17d68365c8c5-tzb58 can't be scheduled on eks-Xtra_Large_chipfactory20221020161207768400000001-d4c1fb15-48e2-8613-b42a-92996cf05959, predicate checking error: Insufficient smarter-devices/fuse; predicateName=NodeResourcesFit; reasons: Insufficient smarter-devices/fuse; debugInfo=
I0822 06:47:05.800297       1 orchestrator.go:466] Pod fcd-tv18315f9e-0b57-4521-90b8-862980c705bb-2sbnf can't be scheduled on eks-Xtra_Large_chipfactory20221020161207768400000001-d4c1fb15-48e2-8613-b42a-92996cf05959, predicate checking error: Insufficient smarter-devices/fuse; predicateName=NodeResourcesFit; reasons: Insufficient smarter-devices/fuse; debugInfo=

From what I see, everything works fine as long as only one pod is started (CA scales up from 0 correctly). But if there are two pods that require that extra node, CA can't scale up from 0.
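In case it is useful to anyone else hitting this with a custom extended resource, the analogous scale-from-zero template tag for the ASG would presumably be the one below. This assumes the node-template/resources mechanism accepts arbitrary extended resource names, which is my reading of the AWS provider docs, so it is a sketch rather than a verified fix:

```
# Assumed ASG tag advertising the custom extended resource for scale-from-zero (unverified).
k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse: "1"
```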

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 20, 2024
@elatt

elatt commented Feb 22, 2024

Was there ever a resolution to this? I've seen comments mentioning a fix involving ASG resource tags, but also a call-out that the GPU count information is already available to CA because it is hard-coded. We are using CA 1.22.2. Will upgrading CA and using resource tags resolve this issue?

I ask because I'm also not sure this is just a resource tagging issue: in our setup, with no resource tags, a freshly restarted instance of CA behaves fine, and only after it has been running for some amount of time does it become unable to scale up from zero, with Insufficient nvidia.com/gpu as the reason.

It seems like some other bug is occurring which causes CA to forget that a given node type has a set number of GPUs, and you need to restart the daemon to have it pull the values from code again. Could this be related to this note in the docs?

Special note on GPU instances

The device plugin on nodes that provides GPU resources can take some time to advertise the GPU resource to the cluster. This may cause Cluster Autoscaler to unnecessarily scale out multiple times.

Could this also cause CA to incorrectly cache the fact that a node-group has 0 GPUs?

@JavierAntonYuste

We are running CA 1.28.2 and are experiencing the same behaviour as @elatt. CA seems to work fine, but after some time it becomes unable to scale up GPU nodes, which is fixed by restarting it. In our case we are running CA on EKS, and both the CloudFormation stack and the ASG have the tags k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true' and k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1" to help CA.

Checking the CA log, we obtain the following:

I0307 11:01:42.041800       1 orchestrator.go:546] Pod XXX can't be scheduled on eksctl-nodegroup-spot-gpu-4-16-NodeGroup, predicate checking error: Insufficient nvidia.com/gpu; predicateName=NodeResourcesFit; reasons: Insufficient nvidia.com/gpu; debugInfo=

I am wondering if this could be related to the scale down to 0, since I do not understand why scaling up works right after restarting CA but not after it has been running for some time. Reopening this issue since I see that others are running into the same problem.

/reopen /remove-lifecycle rotten

@Guillermogsjc

Guillermogsjc commented Mar 7, 2024

/reopen

Please, this is currently happening on EKS deployments with ASGs that have the proper tags to help CA's resource estimation (CA 1.28.2 and k8s 1.28).

Right after startup, CA assigns the GPU attributes properly from the ASG tags or inference and correctly scales up for the pod requiring GPU, but after a while it somehow drops the GPU resource attributes of the ASG and does not scale it anymore.

Our temporary workaround is to restart CA via a rollout in a cron with sufficient frequency, but that is a mad solution; CA should not lose GPU resource attributes that worked at startup.
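For anyone wanting to copy that workaround, a minimal sketch of the periodic-restart CronJob is below. The names, namespace, image and schedule are assumptions, and it presupposes CA runs as a Deployment called cluster-autoscaler in kube-system:

```yaml
# Sketch of a periodic cluster-autoscaler restart (assumed names, namespace and schedule).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ca-restarter
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ca-restarter
  namespace: kube-system
rules:
  # kubectl rollout restart needs to read and patch the Deployment.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ca-restarter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ca-restarter
subjects:
  - kind: ServiceAccount
    name: ca-restarter
    namespace: kube-system
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ca-restarter
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # every 6 hours; tune to how often CA "forgets" the GPUs
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ca-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.28  # any image with kubectl works
              command:
                - kubectl
                - rollout
                - restart
                - deployment/cluster-autoscaler
                - --namespace=kube-system
```

It is still a workaround, not a fix, but it keeps GPU scale-from-zero working without manual intervention.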

@JavierAntonYuste

/reopen

@k8s-ci-robot
Contributor

@JavierAntonYuste: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theophanevie

Same issue here :/

@yosefmih

Still experiencing this bug on CA v1.26.2 :(
