Unable to run pod on G5 48xlarge instance, other g5 instance works well #634
Comments
@shivamerla Hi, can you help here? Many thanks :)
@arpitsharma-vw can you check
@shivamerla What if the driver is already installed (as is the case with the EKS GPU AMI)? Will the driver component still try to apply the kernel module config?
Many thanks @shivamerla for your input. I can confirm that we see GSP RM related errors here. But regarding the fix, we installed the GPU operator via OLM (not Helm). I am afraid that these changes will get wiped out again on the next upgrade.
Let me explain how this can be done on OpenShift: First, create a oc create configmap kernel-module-params -n nvidia-gpu-operator --from-file=nvidia.conf=./nvidia.conf Then add the following to the
You can do it either via the Web console, or using this command:
oc patch clusterpolicy/gpu-cluster-policy -n nvidia-gpu-operator --type='json' -p='[{"op": "add", "path": "/spec/driver/kernelModuleConfig/name", "value":"kernel-module-params"}]'
Essentially, the outcome should be the same, no matter if done via Helm or using the method I described. That is, the I believe that the changes will persist as they will be part of the
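The two steps described above (ConfigMap, then ClusterPolicy patch) could be wrapped in a small script along these lines. This is only a sketch: the thread does not show the contents of nvidia.conf, so the `NVreg_EnableGpuFirmware=0` line is an assumption based on the NVIDIA driver parameter commonly used to disable GSP firmware, and the function names are hypothetical. The ConfigMap name, namespace, and JSON patch are the ones quoted in the comment.

```shell
# Assumption: GSP firmware is turned off via the NVreg_EnableGpuFirmware
# module parameter; the thread does not show the real nvidia.conf contents.
write_module_conf() {
  printf 'options nvidia NVreg_EnableGpuFirmware=0\n' > "$1"
}

# JSON patch exactly as quoted in the comment above.
patch_json() {
  printf '%s' '[{"op": "add", "path": "/spec/driver/kernelModuleConfig/name", "value":"kernel-module-params"}]'
}

# Create the ConfigMap from the given file, then point the ClusterPolicy at it.
# $1 = path to nvidia.conf. Requires a logged-in `oc` client.
apply_module_params() {
  oc create configmap kernel-module-params -n nvidia-gpu-operator \
    --from-file=nvidia.conf="$1" &&
  oc patch clusterpolicy/gpu-cluster-policy -n nvidia-gpu-operator \
    --type='json' -p="$(patch_json)"
}
```

A possible invocation would be `write_module_conf ./nvidia.conf && apply_module_params ./nvidia.conf`. Because the setting then lives in the ClusterPolicy and a ConfigMap rather than on the node, it does not depend on how the operator was installed.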
Same here. I'm using EKS 1.29 with the latest AMI with "a fix" awslabs/amazon-eks-ami#1494 (comment). Even the DCGM exporter failed to start with a message:
After applying the suggested fix of disabling GSP:
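One way to confirm whether GSP firmware is actually in use after such a change is to look at the "GSP Firmware Version" field of `nvidia-smi -q`, which NVIDIA documents as showing `N/A` when GSP is not in use. The helper below is a hypothetical sketch, not output from this thread; it reads `nvidia-smi -q` output on stdin and reports only the first GPU it finds.

```shell
# Hypothetical helper: classify GSP state from `nvidia-smi -q` output on stdin.
# Prints "disabled", "enabled (<version>)", or "unknown" if the field is absent.
gsp_status() {
  awk -F': *' '/GSP Firmware Version/ {
    found = 1
    if ($2 == "N/A") print "disabled"; else print "enabled (" $2 ")"
    exit
  }
  END { if (!found) print "unknown" }'
}
```

With the driver container command from the issue template, this could be used as `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi -q | gsp_status`.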
1. Quick Debug Information
2. Issue or feature description
We have an OpenShift cluster with the NVIDIA GPU Operator installed. When we run any pod on a g5.48xlarge machine, we get the following error:
The same pod works well on other instance types such as g5.4xlarge and g5.12xlarge. This behaviour started recently; the same pod previously worked on a g5.48xlarge instance.
We also see the nvidia-dcgm-exporter pod failing with the following error:
3. Steps to reproduce the issue
Assigning a pod to a g5.48xlarge node works, but the pod does not run.
4. Information to attach (optional if deemed irrelevant)
kubectl get pods -n OPERATOR_NAMESPACE
kubectl get ds -n OPERATOR_NAMESPACE
kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
nvidia-smi
from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
journalctl -u containerd > containerd.log
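The cluster-side commands in this checklist can be gathered in one pass with a small helper like the one below. It is a hypothetical sketch (the function name and output file names are made up for illustration) that covers only the kubectl items; `nvidia-smi` and `journalctl` still have to be run on the node or in the driver container.

```shell
# Hypothetical helper: save the kubectl outputs requested by the checklist.
# $1 = operator namespace, $2 = output directory.
collect_gpu_operator_debug() {
  ns="$1"
  out="$2"
  mkdir -p "$out" || return 1
  kubectl get pods -n "$ns" > "$out/pods.txt" 2>&1
  kubectl get ds -n "$ns" > "$out/daemonsets.txt" 2>&1
  # One describe file and one log file per pod in the namespace.
  for pod in $(kubectl get pods -n "$ns" -o name); do
    name="${pod#pod/}"
    kubectl describe -n "$ns" "$pod" > "$out/$name.describe.txt" 2>&1
    kubectl logs -n "$ns" "$pod" --all-containers > "$out/$name.log" 2>&1
  done
}
```

For example, `collect_gpu_operator_debug nvidia-gpu-operator ./gpu-debug` would write the files into `./gpu-debug`, using the namespace mentioned in the comments.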
Logs from nvidia-dcgm-exporter pod
Logs from GPU feature discovery pod:
GPU cluster policy
Collecting full debug bundle (optional):
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: [email protected]