
Scaling from 0 doesn't work for GPU nodes. Reasons: Insufficient nvidia.com/gpu #5278

Closed
VitaliiBabii opened this issue Oct 28, 2022 · 23 comments
Labels
area/cluster-autoscaler kind/bug lifecycle/rotten

Comments

@VitaliiBabii

VitaliiBabii commented Oct 28, 2022

Which component are you using?:
cluster-autoscaler

What version of the component are you using?:
1.22.3

Component version:

What k8s version are you using (kubectl version)?:

kubectl version Output
$ kubectl version
```
serverVersion:
  buildDate: "2022-06-16T05:33:55Z"
  compiler: gc
  gitCommit: 5824e3251d294d324320db85bf63a53eb0767af2
  gitTreeState: clean
  gitVersion: v1.22.11
  goVersion: go1.16.15
  major: "1"
  minor: "22"
  platform: linux/amd64
```

What environment is this in?:
We are using AWS EC2 instances to run kubernetes

For GPU we use instances g4dn.2xlarge and g4dn.4xlarge

  Kernel Version:             4.18.0-372.26.1.el8_6.x86_64
  OS Image:                   Rocky Linux 8.6 (Green Obsidian)
  Operating System:           linux
  Architecture:               amd64
  Container Runtime Version:  cri-o://1.22.5
  Kubelet Version:            v1.22.11
  Kube-Proxy Version:         v1.22.11

What did you expect to happen?:
Cluster Autoscaler scales the GPU Auto Scaling Group up from 0 when new pods request GPU.

What happened instead?:
Cluster Autoscaler does not trigger a scale-up from 0 nodes.

When the ASG has 0 nodes and new pods that request GPU are created, Cluster Autoscaler (CA) does not trigger a scale-up.
In the CA logs I can see a message like:
I1027 11:02:58.428887 1 scale_up.go:300] Pod cuda-vector-add can't be scheduled on 2.gpu-small.nodes.aws-us-east, predicate checking error: Insufficient nvidia.com/gpu; predicateName=NodeResourcesFit; reasons: Insufficient nvidia.com/gpu; debugInfo=
After a CA restart it works as expected, but after some time CA again stops scaling from 0.
It looks like, once the ASG is scaled to 0, CA at some point stops seeing its GPU resources.

How to reproduce it (as minimally and precisely as possible):
With the ASG scaled to 0, just try to run a GPU pod like:

apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
  annotations:
    workload: nvidia-gpu
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "docker.io/mirrorgooglecontainers/cuda-vector-add:v0.1"
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          value: all
        - name: NVIDIA_DRIVER_CAPABILITIES
          value: "compute,utility"
        - name: NVIDIA_REQUIRE_CUDA
          value: "cuda>=10.0"
      resources:
        limits:
          cpu: 8
          memory: 32Gi
          nvidia.com/gpu: 1
        requests:
          cpu: 4
          memory: 24Gi
          nvidia.com/gpu: 1
  nodeSelector:
    workload: walle-gpu-small
  tolerations:
  - effect: NoSchedule
    key: workload
    operator: Equal
    value: walle-gpu-small
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  - effect: NoSchedule
    key: node.kubernetes.io/memory-pressure
    operator: Exists

ASG tags like

k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: true
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: 1

don't work either.
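For reference, a fuller scale-from-zero template tag set for this node group, based on my reading of the cluster-autoscaler AWS provider docs, would be something like the sketch below. The workload label and taint values are taken from the pod spec above, and the k8s.amazonaws.com/accelerator value is an assumption for g4dn (T4) instances, so treat this as a sketch rather than a confirmed fix:

```
# Sketch of ASG tags for scaling this node group from 0.
# Label/taint values are taken from the pod spec above; the accelerator value is assumed.
k8s.io/cluster-autoscaler/node-template/label/workload: walle-gpu-small
# Taint tags use the value:effect format, per my reading of the AWS provider docs.
k8s.io/cluster-autoscaler/node-template/taint/workload: walle-gpu-small:NoSchedule
k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: nvidia-tesla-t4
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
```

Even with the resource tag in place, though, the scale-up only starts working again after CA is restarted.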

@VitaliiBabii added the kind/bug label Oct 28, 2022
@vukor

vukor commented Oct 31, 2022

The same issue was fixed by recreating cluster-autoscaler pod. CA image:

k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

@VitaliiBabii
Author

The same issue was fixed by recreating cluster-autoscaler pod. CA image:

k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0

For us, recreating the pod fixes the issue only for several hours. It looks like some CA caching bug.

@fradee

fradee commented Dec 9, 2022

Any updates?

@cha7ri

cha7ri commented Dec 15, 2022

We are facing a similar issue (using cluster-autoscaler:v1.21.3 and k8s version v1.21.14)

  • When the ASG size is 0, CA will not scale it up until we restart CA
  • The same issue happens each time the ASG size goes to 0

We tried the tag below, but it doesn't help:

k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: 1

@mucahitkantepe

We are hitting the same issue, any plans?

@qianlei90
Contributor

maybe duplicate: #3802

@DavidFluck

DavidFluck commented Jan 27, 2023

I usually try not to pile on, but for what it's worth, I'm seeing this behavior as well with image version k8s.gcr.io/autoscaling/cluster-autoscaler:v1.24.0, also on a g4dn.xlarge running Amazon Linux. (I can get more specific version information soon, if it's helpful.) As with others, restarting the pod fixes the issue for a while, but after a scale down event, CA then can't scale back up. Let me know if there's anything I can do.

@zemliany

any updates?

@paleozogt

@Zemlyanoy looks like a fix got merged into master (GH-5214). I'm not sure when it's going to be properly released, though.

@jcstanaway

#5214 requires users to add tags to ASGs. Why would CA rely on ASG tags for GPU resource information when it either queries AWS to describe instance types or uses static instance type descriptions, and so already has the GPU resource information?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jun 5, 2023
@alantang888

Yeah, when the ASG has 1 or more nodes, CA is able to scale it up, but when the ASG has 0 nodes, it can't. Why? The resource info (CPU, memory, GPU) is already recorded in the code (e.g. for G4DN).

This should be considered a bug.

Adding the resources as ASG tags doesn't make sense in this case, unless it is a new instance type that CA has no info for, or some additional resource not included in CA's code.

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jul 9, 2023
@giany

giany commented Aug 22, 2023

Having a similar issue, but with smarter-devices. On my end the error is: reasons: Insufficient smarter-devices/fuse. Full message:

I0822 06:47:05.800276       1 orchestrator.go:466] Pod fcd-tv0b9cba2b-c426-4c28-9975-17d68365c8c5-tzb58 can't be scheduled on eks-Xtra_Large_chipfactory20221020161207768400000001-d4c1fb15-48e2-8613-b42a-92996cf05959, predicate checking error: Insufficient smarter-devices/fuse; predicateName=NodeResourcesFit; reasons: Insufficient smarter-devices/fuse; debugInfo=
I0822 06:47:05.800297       1 orchestrator.go:466] Pod fcd-tv18315f9e-0b57-4521-90b8-862980c705bb-2sbnf can't be scheduled on eks-Xtra_Large_chipfactory20221020161207768400000001-d4c1fb15-48e2-8613-b42a-92996cf05959, predicate checking error: Insufficient smarter-devices/fuse; predicateName=NodeResourcesFit; reasons: Insufficient smarter-devices/fuse; debugInfo=

From what I see, everything works fine as long as only one pod is started (CA scales up from 0 correctly). But if there are two pods that require that extra node, CA can't scale up from 0.
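In case it is useful to anyone else hitting this with a custom extended resource, the analogous scale-from-zero template tag for the ASG would presumably be the one below. This assumes the node-template/resources mechanism accepts arbitrary extended resource names, which is my reading of the AWS provider docs, so it is a sketch rather than a verified fix:

```
# Assumed ASG tag advertising the custom extended resource for scale-from-zero (unverified).
k8s.io/cluster-autoscaler/node-template/resources/smarter-devices/fuse: "1"
```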

@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot closed this as not planned (won't fix, can't repro, duplicate, stale) Jan 20, 2024
@elatt

elatt commented Feb 22, 2024

Was there ever a resolution to this? I've seen comments mentioning a fix involving ASG resource tags, but also a call-out that the GPU count information is already available to CA because it is hard-coded. We are using CA 1.22.2. Will upgrading CA and using resource tags resolve this issue?

I ask because I'm also not sure this is just a resource tagging issue: in our setup, with no resource tags, a freshly restarted instance of CA behaves fine, and only after it has been running for some amount of time does it become unable to scale up from zero, with Insufficient nvidia.com/gpu as the reason.

It seems like some other bug is occurring which causes CA to forget that a given node type has a set number of GPUs, and you need to restart the daemon to have it pull the values from code again. Could this be related to this note in the docs?

Special note on GPU instances

The device plugin on nodes that provides GPU resources can take some time to advertise the GPU resource to the cluster. This may cause Cluster Autoscaler to unnecessarily scale out multiple times.

Could this also cause CA to incorrectly cache the fact that a node-group has 0 GPUs?

@JavierAntonYuste

We are running CA 1.28.2 and are experiencing the same behaviour as @elatt. CA seems to work fine, but after some time it becomes unable to scale up GPU nodes, which is fixed by restarting it. In our case we are running CA on EKS, and both the CloudFormation stack and the ASG have the tags k8s.io/cluster-autoscaler/node-template/label/nvidia.com/gpu: 'true' and k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1" to help CA.

Checking the CA log, we obtain the following:

I0307 11:01:42.041800       1 orchestrator.go:546] Pod XXX can't be scheduled on eksctl-nodegroup-spot-gpu-4-16-NodeGroup, predicate checking error: Insufficient nvidia.com/gpu; predicateName=NodeResourcesFit; reasons: Insufficient nvidia.com/gpu; debugInfo=

I am wondering if this could be related to the scale down to 0, since I do not understand why scaling up works right after restarting CA but not after it has been running for some time. Reopening this issue since I see that others are running into the same problem.

/reopen /remove-lifecycle rotten

@Guillermogsjc

Guillermogsjc commented Mar 7, 2024

/reopen

Please, this is currently happening on EKS deployments with ASGs that have the proper tags to help CA's resource estimation (CA 1.28.2 and k8s 1.28).

Right after startup, CA assigns the GPU attributes properly from the ASG tags or inference and correctly scales up for the pod requiring GPU, but after a while it somehow drops the GPU resource attributes of the ASG and does not scale it anymore.

Our temporary workaround is to restart CA via a rollout in a cron with sufficient frequency, but that is a mad solution; CA should not lose GPU resource attributes that worked at startup.
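For anyone wanting to copy that workaround, a minimal sketch of the periodic-restart CronJob is below. The names, namespace, image and schedule are assumptions, and it presupposes CA runs as a Deployment called cluster-autoscaler in kube-system:

```yaml
# Sketch of a periodic cluster-autoscaler restart (assumed names, namespace and schedule).
apiVersion: v1
kind: ServiceAccount
metadata:
  name: ca-restarter
  namespace: kube-system
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: ca-restarter
  namespace: kube-system
rules:
  # kubectl rollout restart needs to read and patch the Deployment.
  - apiGroups: ["apps"]
    resources: ["deployments"]
    verbs: ["get", "patch"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: ca-restarter
  namespace: kube-system
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: Role
  name: ca-restarter
subjects:
  - kind: ServiceAccount
    name: ca-restarter
    namespace: kube-system
---
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ca-restarter
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"  # every 6 hours; tune to how often CA "forgets" the GPUs
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ca-restarter
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.28  # any image with kubectl works
              command:
                - kubectl
                - rollout
                - restart
                - deployment/cluster-autoscaler
                - --namespace=kube-system
```

It is still a workaround, not a fix, but it keeps GPU scale-from-zero working without manual intervention.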

@JavierAntonYuste

/reopen

@k8s-ci-robot
Contributor

@JavierAntonYuste: You can't reopen an issue/PR unless you authored it or you are a collaborator.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@theophanevie

Same issue here :/

@yosefmih

Still experiencing this bug on CA v1.26.2 :(
