1. What kops version are you running? The command kops version will display this information.
Version 1.23.1 (git-83ccae81a636b8e870e430b6faaeeb5d10d9b832)
2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.
Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:52:18Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
3. What cloud provider are you using?
AWS
4. What commands did you run? What is the simplest way to reproduce this issue?
5. What happened after the commands executed?
I upgraded my cluster to use the managed GPU support described in https://github.com/kubernetes/kops/blob/master/docs/gpu.md instead of a custom nvidia setup, and it has mostly been working great. Occasionally, though, a long-running GPU pod on a loaded cluster stops being able to see the GPU devices after several hours (although existing GPU tasks on that pod continue to run fine on the GPU).
I believe it is hitting the same problem as these issues:
NVIDIA/nvidia-docker#1618
NVIDIA/k8s-device-plugin#289
NVIDIA/nvidia-docker#966
where a cgroup resync ends up causing the container engine to undo the device setup done by the nvidia runtime.
The fix suggested in the first two issues is to run the nodes with cgroupv2, which I could try with a custom AMI (I'm using the kops default right now), but that seems like it might break other things.
The third issue suggests a workaround using nvidia-device-plugin-compat-with-cpumanager.yml (https://github.com/NVIDIA/k8s-device-plugin/blob/master/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml), which I am trying out now, but it will take a few days before I know whether it fixes things.
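For context, as I understand it the main difference in the compat-with-cpumanager variant is that the plugin passes explicit device specs and runs privileged. A simplified sketch of its container spec (not the verbatim upstream file; the image tag is just an example):

# Simplified sketch of the compat-with-cpumanager DaemonSet container;
# see the upstream manifest linked above for the authoritative version.
containers:
  - name: nvidia-device-plugin-ctr
    image: nvcr.io/nvidia/k8s-device-plugin:v0.11.0   # example tag
    args:
      - --fail-on-init-error=false
      - --pass-device-specs      # pass /dev/nvidia* device nodes explicitly so
                                 # cgroup rewrites by the kubelet keep device access
    securityContext:
      privileged: true           # the compat variant runs the plugin privileged
    volumeMounts:
      - name: device-plugin
        mountPath: /var/lib/kubelet/device-plugins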
(I can manually edit the device plugin DaemonSet after the cluster is running and apply the changes from that file, but is there a 'kops' way to do it? Are there any customization points on the managed GPU support I could use to modify the config of the DaemonSet it creates?)
Any thoughts on other ways to address this in kops? This is probably more of an upstream issue, but some combination of versions and settings in kops's managed GPU setup seems to be triggering it. I wasn't seeing this on my old cluster, but that cluster was running older k8s/kops with Docker, a Helm-installed nvidia-device-plugin, and a custom AMI with the nvidia pieces preinstalled (which is what I was hoping to avoid now).
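(For reference, the managed setup I'm referring to is just the toggle from docs/gpu.md in the cluster spec, which from memory looks roughly like this; check the docs for your kops version, since the exact fields may differ:)

spec:
  containerd:
    nvidia:
      enabled: true   # kops then installs the NVIDIA driver, runtime hook, and device plugin on GPU nodes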
Unfortunately, I don't have better repro steps than running some GPU pods on a loaded cluster for several hours, polling nvidia-smi until it fails.
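(For illustration, a minimal polling pod could look something like the following; the pod name and image are placeholders for whatever CUDA-capable workload you have handy.)

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi-poller            # placeholder name
spec:
  restartPolicy: Never
  containers:
    - name: poller
      image: nvidia/cuda:11.4.2-base-ubuntu20.04   # placeholder image
      # Poll nvidia-smi once a minute; exit non-zero as soon as the GPUs disappear.
      command: ["sh", "-c", "while nvidia-smi; do sleep 60; done; echo 'lost GPU visibility'; exit 1"]
      resources:
        limits:
          nvidia.com/gpu: 1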
6. What did you expect to happen?
The pod should continue to be able to see the GPUs on the node.
7. Please provide your cluster manifest. Execute kops get --name my.example.com -o yaml to display your cluster manifest. You may want to remove your cluster name and other sensitive information.
8. Please run the commands with the most verbose logging by adding the -v 10 flag.
9. Anything else do we need to know?
There are no customization points to the DS right now. We only add those on an as-needed basis. Alternatively, one can build a custom kops client with another manifest.
The 1.25 release will most likely use Ubuntu 22.04 by default, which uses cgroupv2. You could try the latest 22.04 AMI with a 1.24 alpha build to see if that helps. 1.24 also ships a newer device plugin.
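(If it helps anyone trying that: the image is overridden per instance group. A sketch, where the AMI name and machine type are placeholders you'd swap for a current Ubuntu 22.04 AMI and your GPU instance type:)

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  labels:
    kops.k8s.io/cluster: my.example.com
  name: gpu-nodes                 # placeholder instance group name
spec:
  # ...other fields (role, minSize, maxSize, subnets) as in your existing instance group
  machineType: g4dn.xlarge        # placeholder GPU instance type
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220506   # example; look up a current Ubuntu 22.04 AMI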
For anyone else who might run into this: the manifest change to nvidia-device-plugin I mentioned above does seem to be working; I haven't seen the problem in the several days since I applied it.
I totally understand not wanting to add more knobs, but it does make using this feature a bit daunting. Operationally it saves a lot of effort, but I worry about subtle version mismatches between the driver, image, plugin, and container runtime, and about having to go back to the 'manual' setup if that happens.
Anyhow, I can just modify the manifest outside of kops for now, and hopefully future images, cgroup changes, or nvidia updates will fix this automatically.
/kind bug