
kops-managed nvidia support, long-running pods eventually can't see GPU anymore #13727

Closed
darintay opened this issue Jun 4, 2022 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

darintay commented Jun 4, 2022

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Version 1.23.1 (git-83ccae81a636b8e870e430b6faaeeb5d10d9b832)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:52:18Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
5. What happened after the commands executed?

I upgraded my cluster to use the kops-managed nvidia support (https://github.com/kubernetes/kops/blob/master/docs/gpu.md) instead of a custom nvidia setup, and it has mostly been working great. Occasionally, however, a long-running GPU pod will stop being able to see the GPU devices after several hours on a loaded cluster (although GPU tasks already running in that pod continue to work fine on the GPU).

I believe it is hitting the same problem as these issues:
NVIDIA/nvidia-docker#1618
NVIDIA/k8s-device-plugin#289
NVIDIA/nvidia-docker#966
where a cgroup resync by the container engine ends up undoing the device access set up by the nvidia runtime.

The fix suggested in the first two is running the nodes with cgroupv2, which I could try with a custom AMI (I'm using the kops default image right now), but that seems like it might break other things.

The third issue suggests a workaround using nvidia-device-plugin-compat-with-cpumanager.yml (https://github.com/NVIDIA/k8s-device-plugin/blob/master/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml), which I am trying out now, but it will take a few days before I know whether it fixes things.

(I can manually edit the device plugin daemonset after the cluster is running and apply the changes from that file, but is there a 'kops' way to do it? Are there any customization points on the managed GPU support that I could use to modify the config of the DaemonSet it creates?)
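For reference, the change I'm applying by hand looks roughly like this (my reading of the compat-with-cpumanager file; the image tag below is a placeholder for whatever the kops-managed DaemonSet already deploys). As far as I can tell, the relevant differences from the default manifest are that the plugin passes device specs to the kubelet and runs privileged:

# Excerpt of the device-plugin container spec, edited into the existing DaemonSet.
containers:
- name: nvidia-device-plugin-ctr
  image: nvcr.io/nvidia/k8s-device-plugin:vX.Y.Z   # placeholder: keep the tag kops already uses
  env:
  - name: PASS_DEVICE_SPECS          # older plugin versions use the --pass-device-specs arg instead
    value: "true"                    # pass the GPU device nodes as explicit device specs so
                                     # kubelet-driven container updates don't drop GPU access
  securityContext:
    privileged: true                 # the compat manifest runs the plugin privileged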

Any thoughts on other ways to address this in kops? It is probably more of an upstream issue, but some combination of versions and settings in kops's managed GPU setup seems to be triggering it. I wasn't seeing this on my old cluster, but that one used an older k8s/kops, docker, a helm-installed nvidia-device-plugin, and a custom AMI with the nvidia pieces set up by hand (which I was hoping to avoid now).

Unfortunately I don't have better repro steps than running some GPU pods on a loaded cluster for several hours and polling nvidia-smi until it fails.
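Concretely, the sort of pod I leave running looks roughly like this (a minimal sketch; the pod name and image are arbitrary, any CUDA image with nvidia-smi works, plus whatever nodeSelector/tolerations your GPU nodes need):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi-poller               # arbitrary name
spec:
  restartPolicy: Never
  containers:
  - name: poller
    image: nvidia/cuda:11.4.2-base-ubuntu20.04   # placeholder CUDA base image
    command: ["bash", "-c", "while nvidia-smi; do sleep 60; done; echo lost GPU visibility; sleep infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1            # request one GPU from the device plugin

When it fails, nvidia-smi inside the pod starts erroring out (the linked issues report "Failed to initialize NVML: Unknown Error"), even though processes that already had the GPU open keep running.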

6. What did you expect to happen?
The pod should continue to be able to see the GPUs on the node.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 2
  name: ***
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: ***
  containerd:
    nvidiaGPU:
      enabled: true
  dnsZone: ***
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      DevicePlugins: "true"
  kubeControllerManager:
    featureGates:
      DevicePlugins: "true"
  kubeProxy:
    featureGates:
      DevicePlugins: "true"
    metricsBindAddress: 0.0.0.0
  kubeScheduler:
    featureGates:
      DevicePlugins: "true"
  kubelet:
    featureGates:
      DevicePlugins: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.23.5
  masterInternalName: ***
  masterPublicName: ***
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  rollingUpdate:
    maxSurge: 2
    maxUnavailable: 0
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.96.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  - cidr: 172.20.64.0/19
    name: us-west-2b
    type: Public
    zone: us-west-2b
  - cidr: 172.20.32.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public


---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-05-23T23:32:52Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: ***
  name: ***
spec:
  machineType: p3.2xlarge
  maxSize: 5
  minSize: 0
  nodeLabels:
    gpu: "1"
    kops.k8s.io/instancegroup: ***
  role: Node
  rootVolumeSize: 100
  subnets:
  - us-west-2c

8. Please run the commands with most verbose logging by adding the -v 10 flag.
9. Anything else do we need to know?

@k8s-ci-robot added the kind/bug label Jun 4, 2022
@olemarkus
Member

There are no customization points for the DS right now; we only add those on an as-needed basis. Alternatively, one can build a custom kops client with a different manifest.

The 1.25 release will most likely use Ubuntu 22.04 by default, which uses cgroupv2. You could try the latest 22.04 AMI with the 1.24 alpha to see if that helps; 1.24 also comes with a newer device plugin.
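If you want to test that on just the GPU instance group, setting spec.image on that InstanceGroup should do it, something like the following (untested sketch; the AMI name is a placeholder, pick the latest 22.04 image for your region):

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: gpu-nodes                    # your existing GPU instance group
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220610   # placeholder name; 099720109477 is Canonical's AWS account
  machineType: p3.2xlarge
  role: Node
  # keep the rest of the instance group spec as-is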


darintay commented Jun 8, 2022

Thanks for the response!

For anyone else who runs into this: the nvidia-device-plugin manifest change I mentioned above does seem to be working; I haven't seen the issue in the several days since I applied it.

Totally understand not wanting to add more knobs, but it does make using this feature a bit daunting. Operationally it saves a bunch of effort, but I worry about subtle version mismatches between the driver / image / plugin / container runtime, and about having to go back to the 'manual' setup if that happens.

Anyhow, I can just modify the manifest outside of kops for now, and hopefully future images / cgroup changes / nvidia updates will make this go away on its own.

@darintay darintay closed this as completed Jun 8, 2022