Scaling from 0 doesn't work for GPU nodes. Reasons: Insufficient nvidia.com/gpu #5278
Comments
The same issue was fixed by recreating the cluster-autoscaler pod. CA image:
For us, recreating the pod fixes the issue only for several hours. It looks like some CA cache bug.
Any updates?
We are facing a similar issue (using
We tried the label below, but it doesn't help.
We are hitting the same issue. Any plans?
Maybe a duplicate: #3802
I usually try not to pile on, but for what it's worth, I'm seeing this behavior as well with image version
Any updates?
@Zemlyanoy looks like a fix got put into master (GH-5214). I'm not sure when it's going to be properly released, though.
#5214 requires users to add tags to ASGs. Why would the CA rely on ASG tags for GPU resource information when it either queries AWS to describe instance types or uses static instance type descriptions, and so already has the GPU resource information?
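For context, the scale-from-zero hinting that #5214 relies on uses the Cluster Autoscaler's node-template tag convention on the ASG. A minimal sketch of the kind of tags involved, with illustrative values and a placeholder cluster name (none of these values are taken from this thread):

```yaml
# Illustrative ASG tags for CA scale-from-zero hints (node-template convention);
# the GPU count and accelerator label values below are example assumptions.
k8s.io/cluster-autoscaler/enabled: "true"
k8s.io/cluster-autoscaler/<cluster-name>: "owned"
k8s.io/cluster-autoscaler/node-template/resources/nvidia.com/gpu: "1"
k8s.io/cluster-autoscaler/node-template/label/k8s.amazonaws.com/accelerator: "nvidia-tesla-t4"
```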
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale
Yeah, when the ASG has 1 or more nodes, the CA is able to scale it up, but when the ASG has 0 nodes it can't? This looks like a bug. Adding the resource as a node tag doesn't make any sense in this case, unless it is a new instance type the CA doesn't have that info for, or an additional resource not included in the CA's code.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten
Having a similar issue, but with smart-devices. On my end the error is:
From what I see, everything looks fine as long as only one pod is started (the CA scales up from 0 correctly). But if there are two pods that require that extra node, then the CA can't scale up from 0.
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs. This bot triages issues according to the following rules:
You can:
Please send feedback to sig-contributor-experience at kubernetes/community. /close not-planned
@k8s-triage-robot: Closing this issue, marking it as "Not Planned". In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Was there ever a resolution to this? I've seen comments mentioning a fix via ASG resource tags, but also a call-out that the GPU count information is already available to the CA because it is hard-coded. We are using CA 1.22.2. Will upgrading the CA and using resource tags resolve this issue? I ask because I'm also confused about whether this is just a resource tagging issue: in our setup, with no resource tags, a freshly restarted instance of the CA behaves fine, and it only seems to be after running for some amount of time that it can no longer scale up from zero with
It seems like some other bug is occurring which causes the CA to forget that a given node type has a set number of GPUs, and you need to restart the daemon to have it pull the values from code. Could this be related to the comment in the docs
Could this also cause the CA to incorrectly cache the fact that a node group has 0 GPUs?
We are running CA 1.28.2 and we are experiencing the same behaviour as @elatt. The CA seems to be working fine, but after some time it is unable to scale up GPU nodes, which is fixed by restarting it. In our case we are running the CA on EKS, and both the CloudFormation stack and the ASG have the tags.
Checking the CA log, we obtain the following:
I am wondering if this could be related to the scale down from/to 0, since I don't understand why scaling up works after restarting the CA but not after some time. Reopening this issue since I see that others are running into the same problem. /reopen /remove-lifecycle rotten
/reopen Please, this is currently happening on EKS deployments with ASGs that have the proper tags to help resource estimation for the CA (CA 1.28.2 and k8s 1.28). Right after startup the CA assigns the GPU attributes properly, from ASG tags or inference, and correctly scales up for the pod requiring a GPU, but after a while it somehow drops the GPU resource attributes of the ASG and does not scale it anymore. Our temporary workaround is to trigger a rollout of the CA from a cron with sufficient frequency (see the sketch below), but it is a mad solution; the CA should not lose GPU resource attributes that worked at startup.
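For anyone wanting to replicate that stop-gap, here is a minimal sketch of the periodic restart, assuming the CA runs as the cluster-autoscaler deployment in kube-system and that a ca-restarter service account with permission to patch that deployment exists (both names and the schedule are illustrative, not from this thread):

```yaml
# Hypothetical CronJob that periodically restarts the cluster-autoscaler
# deployment as a stop-gap; the schedule and the ca-restarter service
# account are illustrative assumptions.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: ca-periodic-restart
  namespace: kube-system
spec:
  schedule: "0 */6 * * *"   # every 6 hours
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: ca-restarter   # needs RBAC to patch deployments in kube-system
          restartPolicy: Never
          containers:
            - name: kubectl
              image: bitnami/kubectl:1.28
              command:
                - kubectl
                - -n
                - kube-system
                - rollout
                - restart
                - deployment/cluster-autoscaler
```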
/reopen |
@JavierAntonYuste: You can't reopen an issue/PR unless you authored it or you are a collaborator. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Same issue here :/
Still experiencing this bug on CA v1.26.2 :(
Which component are you using?:
cluster-autoscaler
What version of the component are you using?:
Component version: 1.22.3
What k8s version are you using (kubectl version)?:
kubectl version Output
What environment is this in?:
We are using AWS EC2 instances to run Kubernetes.
For GPUs we use g4dn.2xlarge and g4dn.4xlarge instances.
What did you expect to happen?:
Cluster Autoscaler scales the GPU Auto Scaling Group from 0 when new pods request a GPU.
What happened instead?:
Cluster Autoscaler does not trigger a scale-up from 0 nodes.
When there are 0 nodes in the ASG and new pods are created that request a GPU, Cluster Autoscaler (CA) does not trigger a scale-up.
In the CA logs I can see a message like:
I1027 11:02:58.428887 1 scale_up.go:300] Pod cuda-vector-add can't be scheduled on 2.gpu-small.nodes.aws-us-east, predicate checking error: Insufficient nvidia.com/gpu; predicateName=NodeResourcesFit; reasons: Insufficient nvidia.com/gpu; debugInfo=
After a CA restart it starts to work as expected, but after some time the CA again stops scaling from 0.
It looks like the CA, once the ASG has been scaled to 0, at some point stops seeing the GPU resources.
How to reproduce it (as minimally and precisely as possible):
With the ASG scaled to 0, just try to run a GPU pod like the one sketched below:
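For illustration only, a minimal pod of the kind described (the pod name matches cuda-vector-add from the log above; the image and the single-GPU request are assumptions, not taken from this issue):

```yaml
# Hypothetical minimal GPU pod; any image that requests nvidia.com/gpu will do.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: registry.k8s.io/cuda-vector-add:v0.1   # example CUDA workload image
      resources:
        limits:
          nvidia.com/gpu: 1   # forces scheduling onto a GPU node
```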
ASG labels like
don't work.