Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enhancement of nvidia-device-plugin-daemonset in gpu-cluster.md #138

Open
wants to merge 1 commit into
base: main
Choose a base branch
from

Conversation

JoeyC-Dev
Copy link
Contributor

@JoeyC-Dev JoeyC-Dev commented Jan 14, 2025

Proposed change: Improve the Nvidia device plugin deamonset for better deployment.

Supporting point:

  1. Currently, if the user is using Spot node, the current version of deamonset will FAIL to deploy due to missing tolerations to relative taint.
  2. Also, the current daemonset will also try to deploy the Pod to AMD GPU / non-GPU node due to missing nodeSelector.
  3. As of now, the latest version of plugin is v0.17.0. Update image version relatively: https://github.com/NVIDIA/k8s-device-plugin/releases
  4. The use of label/taint is following the proposed/supported list:
  5. Weirdly, currently in the document is asking for creating namespace gpu-operator, but there is no actual use in the daemonset. Consider this is a manual set-up, I kept it as in namespace gpu-operator and change the yaml relatively.
    1. Create a namespace using the [`kubectl create namespace`][kubectl-create] command.
    ```bash
    kubectl create namespace gpu-operator
    ```

Having checked, the propsed yaml is working, and it will deploy the ds on Nvidia GPU node.
image
image

Successful deployment:

kubectl logs nvidia-device-plugin-daemonset-7fzlq -n gpu-operator
I0114 09:03:01.133253       1 main.go:235] "Starting NVIDIA Device Plugin" version=<
        d475b2cf
        commit: d475b2cfcf12b983a4975d4fc59d91af432cf28e
 >
I0114 09:03:01.133332       1 main.go:238] Starting FS watcher for /var/lib/kubelet/device-plugins
I0114 09:03:01.133367       1 main.go:245] Starting OS watcher.
I0114 09:03:01.133653       1 main.go:260] Starting Plugins.
I0114 09:03:01.133684       1 main.go:317] Loading configuration.
I0114 09:03:01.134351       1 main.go:342] Updating config with default resource matching patterns.
I0114 09:03:01.134496       1 main.go:353] 
Running with config:
{
  "version": "v1",
  "flags": {
    "migStrategy": "none",
    "failOnInitError": false,
    "mpsRoot": "",
    "nvidiaDriverRoot": "/",
    "nvidiaDevRoot": "/",
    "gdsEnabled": false,
    "mofedEnabled": false,
    "useNodeFeatureAPI": null,
    "deviceDiscoveryStrategy": "auto",
    "plugin": {
      "passDeviceSpecs": false,
      "deviceListStrategy": [
        "envvar"
      ],
      "deviceIDStrategy": "uuid",
      "cdiAnnotationPrefix": "cdi.k8s.io/",
      "nvidiaCTKPath": "/usr/bin/nvidia-ctk",
      "containerDriverRoot": "/driver-root"
    }
  },
  "resources": {
    "gpus": [
      {
        "pattern": "*",
        "name": "nvidia.com/gpu"
      }
    ]
  },
  "sharing": {
    "timeSlicing": {}
  },
  "imex": {}
}
I0114 09:03:01.134503       1 main.go:356] Retrieving plugins.
I0114 09:03:01.162986       1 server.go:195] Starting GRPC server for 'nvidia.com/gpu'
I0114 09:03:01.164044       1 server.go:139] Starting to serve 'nvidia.com/gpu' on /var/lib/kubelet/device-plugins/nvidia-gpu.sock
I0114 09:03:01.166773       1 server.go:146] Registered device plugin for 'nvidia.com/gpu' with Kubelet

Copy link
Contributor

@JoeyC-Dev : Thanks for your contribution! The author(s) have been notified to review your proposed change.

Copy link

Learn Build status updates of commit 07ab281:

✅ Validation status: passed

File Status Preview URL Details
articles/aks/gpu-cluster.md ✅Succeeded

For more details, please refer to the build report.

For any questions, please:

@ttorble
Copy link
Contributor

ttorble commented Jan 14, 2025

@schaffererin

Can you review the proposed changes?

IMPORTANT: When the changes are ready for publication, adding a #sign-off comment is the best way to signal that the PR is ready for the review team to merge.

#label:"aq-pr-triaged"
@MicrosoftDocs/public-repo-pr-review-team

Copy link

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
Get Help
Docs Support Teams Channel
Resolve Merge Conflict

@v-dirichards
Copy link
Contributor

@schaffererin This PR hasn’t had any updates for a while. If it's ready for review, could you sign off? Or should it be closed?

#label:"aq-pr-triaged","aq-followed-up"

@github-actions github-actions bot removed the inactive label Jan 29, 2025
Copy link

This pull request has been inactive for at least 14 days. If you are finished with your changes, don't forget to sign off. See the contributor guide for instructions.
Get Help
Docs Support Teams Channel
Resolve Merge Conflict

@JoeyC-Dev
Copy link
Contributor Author

This PR is critical for users who aren't expert on Kubernetes as they don't know why their driver cannot be installed.
Can anyone kindly approve this PR?

@MicrosoftDocs/public-repo-pr-review-team

@github-actions github-actions bot removed the inactive label Feb 12, 2025
@Court72
Copy link
Contributor

Court72 commented Feb 17, 2025

I sent an email to the content owner today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants