
kops-managed nvidia support, long-running pods eventually can't see GPU anymore #13727

Closed
darintay opened this issue Jun 4, 2022 · 2 comments
Labels
kind/bug Categorizes issue or PR as related to a bug.

Comments

darintay commented Jun 4, 2022

/kind bug

1. What kops version are you running? The command kops version will display
this information.

Version 1.23.1 (git-83ccae81a636b8e870e430b6faaeeb5d10d9b832)

2. What Kubernetes version are you running? kubectl version will print the
version if a cluster is running or provide the Kubernetes version specified as
a kops flag.

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:58:47Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:52:18Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}

3. What cloud provider are you using?
AWS

4. What commands did you run? What is the simplest way to reproduce this issue?
5. What happened after the commands executed?

I upgraded my cluster to use the kops-managed nvidia support (https://github.com/kubernetes/kops/blob/master/docs/gpu.md) instead of a custom nvidia setup, and it has mostly been working great. Occasionally, however, a long-running GPU pod will stop being able to see the GPU devices after several hours on a loaded cluster (although GPU tasks already running in that pod continue to work fine on the GPU).

I believe it is hitting the same problem as these issues:
NVIDIA/nvidia-docker#1618
NVIDIA/k8s-device-plugin#289
NVIDIA/nvidia-docker#966
where a cgroup resync by the container engine ends up undoing the device access set up by the nvidia runtime.

The fix suggested in the first two is running the nodes with cgroupv2, which I could try with a custom AMI (I'm using the kops default image right now), but that seems like it might break other things.

The third issue suggests a workaround using nvidia-device-plugin-compat-with-cpumanager.yml (https://github.com/NVIDIA/k8s-device-plugin/blob/master/deployments/static/nvidia-device-plugin-compat-with-cpumanager.yml), which I am trying out now, but it will take a few days before I know whether it fixes things.

(I can manually edit the device plugin daemonset after the cluster is running and apply the changes from that file, but is there a 'kops' way to do it? Are there any customization points on the managed GPU support that I could use to modify the config of the DaemonSet it creates?)
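For reference, the change I'm applying by hand looks roughly like this (my reading of the compat-with-cpumanager file; the image tag below is a placeholder for whatever the kops-managed DaemonSet already deploys). As far as I can tell, the relevant differences from the default manifest are that the plugin passes device specs to the kubelet and runs privileged:

# Excerpt of the device-plugin container spec, edited into the existing DaemonSet.
containers:
- name: nvidia-device-plugin-ctr
  image: nvcr.io/nvidia/k8s-device-plugin:vX.Y.Z   # placeholder: keep the tag kops already uses
  env:
  - name: PASS_DEVICE_SPECS          # older plugin versions use the --pass-device-specs arg instead
    value: "true"                    # pass the GPU device nodes as explicit device specs so
                                     # kubelet-driven container updates don't drop GPU access
  securityContext:
    privileged: true                 # the compat manifest runs the plugin privileged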

Any thoughts on other ways to address this in kops? It is probably more of an upstream issue, but some combination of versions and settings in kops's managed GPU setup seems to be triggering it. I wasn't seeing this on my old cluster, but that one used an older k8s/kops, docker, a helm-installed nvidia-device-plugin, and a custom AMI with the nvidia pieces set up by hand (which I was hoping to avoid now).

Unfortunately I don't have better repro steps than running some GPU pods on a loaded cluster for several hours and polling nvidia-smi until it fails.
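Concretely, the sort of pod I leave running looks roughly like this (a minimal sketch; the pod name and image are arbitrary, any CUDA image with nvidia-smi works, plus whatever nodeSelector/tolerations your GPU nodes need):

apiVersion: v1
kind: Pod
metadata:
  name: gpu-smi-poller               # arbitrary name
spec:
  restartPolicy: Never
  containers:
  - name: poller
    image: nvidia/cuda:11.4.2-base-ubuntu20.04   # placeholder CUDA base image
    command: ["bash", "-c", "while nvidia-smi; do sleep 60; done; echo lost GPU visibility; sleep infinity"]
    resources:
      limits:
        nvidia.com/gpu: 1            # request one GPU from the device plugin

When it fails, nvidia-smi inside the pod starts erroring out (the linked issues report "Failed to initialize NVML: Unknown Error"), even though processes that already had the GPU open keep running.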

6. What did you expect to happen?
The pod should continue to be able to see the GPUs on the node.

7. Please provide your cluster manifest. Execute
kops get --name my.example.com -o yaml to display your cluster manifest.
You may want to remove your cluster name and other sensitive information.

apiVersion: kops.k8s.io/v1alpha2
kind: Cluster
metadata:
  creationTimestamp: null
  generation: 2
  name: ***
spec:
  api:
    dns: {}
  authorization:
    rbac: {}
  channel: stable
  cloudProvider: aws
  configBase: ***
  containerd:
    nvidiaGPU:
      enabled: true
  dnsZone: ***
  etcdClusters:
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: main
  - etcdMembers:
    - instanceGroup: master-us-west-2a
      name: a
    - instanceGroup: master-us-west-2b
      name: b
    - instanceGroup: master-us-west-2c
      name: c
    name: events
  iam:
    allowContainerRegistry: true
    legacy: false
  kubeAPIServer:
    featureGates:
      DevicePlugins: "true"
  kubeControllerManager:
    featureGates:
      DevicePlugins: "true"
  kubeProxy:
    featureGates:
      DevicePlugins: "true"
    metricsBindAddress: 0.0.0.0
  kubeScheduler:
    featureGates:
      DevicePlugins: "true"
  kubelet:
    featureGates:
      DevicePlugins: "true"
  kubernetesApiAccess:
  - 0.0.0.0/0
  kubernetesVersion: 1.23.5
  masterInternalName: ***
  masterPublicName: ***
  networkCIDR: 172.20.0.0/16
  networking:
    kubenet: {}
  nonMasqueradeCIDR: 100.64.0.0/10
  rollingUpdate:
    maxSurge: 2
    maxUnavailable: 0
  sshAccess:
  - 0.0.0.0/0
  subnets:
  - cidr: 172.20.96.0/19
    name: us-west-2a
    type: Public
    zone: us-west-2a
  - cidr: 172.20.64.0/19
    name: us-west-2b
    type: Public
    zone: us-west-2b
  - cidr: 172.20.32.0/19
    name: us-west-2c
    type: Public
    zone: us-west-2c
  topology:
    dns:
      type: Public
    masters: public
    nodes: public


---
apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  creationTimestamp: "2022-05-23T23:32:52Z"
  generation: 1
  labels:
    kops.k8s.io/cluster: ***
  name: ***
spec:
  machineType: p3.2xlarge
  maxSize: 5
  minSize: 0
  nodeLabels:
    gpu: "1"
    kops.k8s.io/instancegroup: ***
  role: Node
  rootVolumeSize: 100
  subnets:
  - us-west-2c

8. Please run the commands with most verbose logging by adding the -v 10 flag.
9. Anything else do we need to know?

@k8s-ci-robot added the kind/bug label Jun 4, 2022
@olemarkus
Member

There are no customization points for the DS right now; we only add those on an as-needed basis. Alternatively, one can build a custom kops client with a different manifest.

The 1.25 release will most likely use Ubuntu 22.04 by default, which uses cgroupv2. You could try the latest 22.04 AMI with the 1.24 alpha to see if that helps; 1.24 also comes with a newer device plugin.
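If you want to test that on just the GPU instance group, setting spec.image on that InstanceGroup should do it, something like the following (untested sketch; the AMI name is a placeholder, pick the latest 22.04 image for your region):

apiVersion: kops.k8s.io/v1alpha2
kind: InstanceGroup
metadata:
  name: gpu-nodes                    # your existing GPU instance group
spec:
  image: 099720109477/ubuntu/images/hvm-ssd/ubuntu-jammy-22.04-amd64-server-20220610   # placeholder name; 099720109477 is Canonical's AWS account
  machineType: p3.2xlarge
  role: Node
  # keep the rest of the instance group spec as-is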


darintay commented Jun 8, 2022

Thanks for the response!

For anyone else who runs into this: the nvidia-device-plugin manifest change I mentioned above does seem to be working; I haven't seen the issue in the several days since I applied it.

Totally understand not wanting to add more knobs, but it does make using this feature a bit daunting. Operationally it saves a bunch of effort, but I worry about subtle version mismatches between the driver / image / plugin / container runtime, and about having to go back to the 'manual' setup if that happens.

Anyhow, I can just modify the manifest outside of kops for now, and hopefully future images / cgroup changes / nvidia updates will make this go away on its own.

@darintay darintay closed this as completed Jun 8, 2022