[PoC] Run GPU workload in Gardener cluster and provide concept how to enable GPU in Kyma Runtime #18771

Open
pbochynski opened this issue Jan 17, 2025 · 4 comments
@pbochynski
Contributor

Users want to run their applications on GPUs. To execute code that requires a GPU, the proper drivers must be installed on the node. Investigate what is needed and propose a concept for automating this process. These are the aspects to cover:

  • How do we build the NVIDIA drivers?
  • Where do we push the installer, and how do we deploy it to Kyma Runtimes?
  • How do we install the NVIDIA driver on all GPU nodes?
    • label GPU nodes to prepare a proper node selector
    • run a DaemonSet on the labeled nodes (see the sketch after this list)
  • How do we handle Garden Linux upgrades?
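
A minimal sketch of the labeling step, assuming GPU nodes can be recognized by the well-known node.kubernetes.io/instance-type label and using a hypothetical nvidia.com/gpu.present=true label as the DaemonSet's node selector (the instance-type prefix and the target label name are illustrative, not decided in this issue):

```go
// Sketch: label all GPU nodes so a driver-installer DaemonSet can target them
// via a nodeSelector. Assumes kubeconfig-based access to the cluster.
package main

import (
	"context"
	"fmt"
	"strings"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for i := range nodes.Items {
		node := &nodes.Items[i]
		// Illustrative assumption: GPU machine types start with "g" (e.g. g4dn, g6).
		if strings.HasPrefix(node.Labels["node.kubernetes.io/instance-type"], "g") {
			node.Labels["nvidia.com/gpu.present"] = "true"
			if _, err := client.CoreV1().Nodes().Update(context.Background(), node, metav1.UpdateOptions{}); err != nil {
				panic(err)
			}
			fmt.Println("labeled node", node.Name)
		}
	}
}
```

The installer DaemonSet then only needs a matching nodeSelector (nvidia.com/gpu.present: "true") in its pod template to run exclusively on those nodes.
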
@a-thaler mentioned this issue Jan 16, 2025
@pbochynski self-assigned this Jan 17, 2025
@pbochynski
Contributor Author

Progress update

I was able to build and run the NVIDIA drivers using a fork of https://github.com/gardenlinux/gardenlinux-nvidia-installer.
Fork link: https://github.com/pbochynski/gardenlinux-nvidia-installer
Changes:

  • added a workflow to build and push the installer to GHCR
  • modified the sample values to use that image and a node affinity based on the machine type
  • updated the readme to reflect the newest Garden Linux version with the matching NVIDIA driver and described how to set up the image pull secret (a sketch follows this list)
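
For reference, a minimal sketch of creating that image pull secret programmatically; the secret name, namespace, and credentials are placeholders, and the same result can be achieved with kubectl:

```go
// Sketch: create a docker-registry pull secret for ghcr.io so the installer
// image can be pulled. User, token, and namespace are placeholders.
package main

import (
	"context"
	"encoding/base64"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	auth := base64.StdEncoding.EncodeToString([]byte("GITHUB_USER:GITHUB_TOKEN")) // placeholder credentials
	dockerConfig := fmt.Sprintf(`{"auths":{"ghcr.io":{"auth":"%s"}}}`, auth)

	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: "ghcr-pull-secret", Namespace: "kube-system"},
		Type:       corev1.SecretTypeDockerConfigJson,
		StringData: map[string]string{corev1.DockerConfigJsonKey: dockerConfig},
	}
	if _, err := client.CoreV1().Secrets("kube-system").Create(context.Background(), secret, metav1.CreateOptions{}); err != nil {
		panic(err)
	}
}
```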

License analysis

The drivers are not distributed with Garden Linux because of the NVIDIA license. The license clearly states:

"NVIDIA grants you a non-exclusive, revocable, non-transferable and non-sublicensable license to deploy, for your own use, the SOFTWARE on infrastructure you own or lease and you may not sell, rent, sublicense, distribute or transfer the SOFTWARE or provide commercial hosting services with the SOFTWARE"

Given that, I would rather avoid distributing the driver as Docker images. We can protect the images with a pull secret, but our users have access to that secret, so we cannot fully control who can access and download the image. At best, that approach is suitable only for our own teams; we cannot redistribute the drivers to external customers.

Recommendation

I suggest building a Kyma module that downloads, compiles, and installs the driver when needed. The DaemonSet can be created from the Garden Linux Docker image that contains all the kernel header files required for compilation.
To mitigate potential unavailability of the NVIDIA download servers and to speed up node startup, we can use S3 (BTP Object Store) for caching. The cache would be provided by the cluster owner, so we do not redistribute the software to other entities.
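
A minimal sketch of the cache-or-upstream download, assuming an owner-provided cache bucket and the usual NVIDIA download URL layout (the URLs, the bucket name, and the version string are illustrative):

```go
// fetchDriver downloads the NVIDIA driver run-file, preferring the cluster
// owner's S3/Object Store cache over the NVIDIA download server.
// All URLs below are illustrative placeholders.
package installer

import (
	"fmt"
	"io"
	"net/http"
	"os"
)

func fetchDriver(version string) error {
	urls := []string{
		// hypothetical owner-provided cache bucket (BTP Object Store / S3)
		fmt.Sprintf("https://my-driver-cache.example.com/NVIDIA-Linux-x86_64-%s.run", version),
		// upstream NVIDIA download server as a fallback
		fmt.Sprintf("https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run", version, version),
	}
	for _, url := range urls {
		resp, err := http.Get(url)
		if err != nil {
			continue // source unreachable, try the next one
		}
		if resp.StatusCode != http.StatusOK {
			resp.Body.Close()
			continue // not available here, try the next source
		}
		out, err := os.Create("/tmp/nvidia-driver.run")
		if err != nil {
			resp.Body.Close()
			return err
		}
		_, copyErr := io.Copy(out, resp.Body)
		resp.Body.Close()
		out.Close()
		if copyErr == nil {
			return nil // installer downloaded; compile and install it next
		}
	}
	return fmt.Errorf("driver %s not available from cache or upstream", version)
}
```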

@pbochynski
Contributor Author

Another idea from @a-thaler:
verify whether we can provide GPU usage metrics.
Check this blog post: https://blog.kubecost.com/blog/nvidia-gpu-usage/
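
If the DCGM-exporter approach described in that blog post is used, a minimal sketch of reading GPU utilization from its Prometheus endpoint could look like the following; the service address and the DCGM_FI_DEV_GPU_UTIL metric name follow dcgm-exporter defaults and should be treated as assumptions:

```go
// Sketch: scrape GPU utilization samples from a dcgm-exporter endpoint.
// The endpoint address is an assumption about how the exporter is deployed.
package metrics

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func printGPUUtilization() error {
	resp, err := http.Get("http://dcgm-exporter.kube-system.svc.cluster.local:9400/metrics")
	if err != nil {
		return err
	}
	defer resp.Body.Close()

	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		line := scanner.Text()
		// dcgm-exporter exposes per-GPU utilization as DCGM_FI_DEV_GPU_UTIL{...} <value>
		if strings.HasPrefix(line, "DCGM_FI_DEV_GPU_UTIL") {
			fmt.Println(line)
		}
	}
	return scanner.Err()
}
```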

@a-thaler
Contributor

Based on the outcomes, we agreed to establish a new tutorial-like sample application, https://github.com/kyma-project/gpu-driver, which can be deployed manually.
The application will:

  • be configured to run on the desired nodes via a labelSelector
  • introspect the underlying node for the OS version (see the sketch below)
  • download the driver fitting that OS version (overridable by configuration)
  • build the driver binary
  • install the driver

On an OS upgrade, the application will restart and automatically apply the proper driver for the new OS version.
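
A minimal sketch of the OS introspection and driver selection, assuming the host's /etc/os-release is mounted into the installer pod and that the OS-to-driver mapping and the download URL pattern are provided by configuration (all versions and URLs below are illustrative):

```go
// Sketch: detect the Garden Linux version from the host's /etc/os-release
// (expected to be mounted into the installer pod) and derive the driver
// download URL from a configurable mapping. Values are illustrative.
package installer

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

// hypothetical mapping: Garden Linux VERSION_ID -> matching NVIDIA driver version
var driverByOS = map[string]string{
	"1592.1": "550.90.07", // illustrative values only
}

func detectOSVersion(osReleasePath string) (string, error) {
	f, err := os.Open(osReleasePath)
	if err != nil {
		return "", err
	}
	defer f.Close()

	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.HasPrefix(line, "VERSION_ID=") {
			return strings.Trim(strings.TrimPrefix(line, "VERSION_ID="), `"`), nil
		}
	}
	return "", fmt.Errorf("VERSION_ID not found in %s", osReleasePath)
}

func driverURL(osVersion string) (string, error) {
	driver, ok := driverByOS[osVersion]
	if !ok {
		return "", fmt.Errorf("no driver mapping for Garden Linux %s", osVersion)
	}
	// illustrative URL pattern; the real location is overridable by configuration
	return fmt.Sprintf("https://us.download.nvidia.com/tesla/%s/NVIDIA-Linux-x86_64-%s.run", driver, driver), nil
}
```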

The tool will evolve into a Kyma module in the mid-term.

@a-thaler
Contributor

a-thaler commented Feb 21, 2025

Challenges:

  • For every kernel version, a dedicated base image needs to be used for the pod installing the driver.
  • The image has dependencies such as glibc and zlib, which are known for frequent security findings (severities).
  • The pod has elevated rights on the node.

-> During the upgrade phase, multiple DaemonSets must be running, and DaemonSets keep long-running pods, which is not needed and is a problem with regard to security.

Proposal: shift to a simple operator which

  • has a mapping from kernel version to image version
  • for all nodes that are in scope of a configured nodeSelector and do not yet have a custom annotation, the operator will (see the sketch below)
    • check whether the node has GPUs
    • detect the kernel version
    • spin up a short-lived Job with the fitting image
    • on successful completion of the Job, add the custom annotation to indicate that the driver is installed
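
A minimal sketch of that per-node reconciliation, assuming client-go access, a hypothetical gpu.kyma-project.io/driver-installed annotation, GPU detection via the machine-type label, and a reduced Job spec (all names, images, and kernel versions are illustrative):

```go
// Sketch of the per-node reconciliation: pick the installer image for the
// node's kernel, run a short-lived Job pinned to that node, and annotate the
// node on success. Annotation, label, image, and kernel names are illustrative.
package operator

import (
	"context"
	"fmt"
	"strings"
	"time"

	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

const installedAnnotation = "gpu.kyma-project.io/driver-installed" // hypothetical

// hypothetical mapping from kernel version to installer image
var imageByKernel = map[string]string{
	"6.6.62-cloud-amd64": "ghcr.io/example/nvidia-installer:6.6.62", // illustrative
}

func reconcileNode(ctx context.Context, client kubernetes.Interface, node *corev1.Node) error {
	if node.Annotations[installedAnnotation] == "true" {
		return nil // driver already installed on this node
	}
	// Illustrative GPU check: rely on the machine type exposed by the well-known label.
	if !strings.HasPrefix(node.Labels["node.kubernetes.io/instance-type"], "g") {
		return nil // not a GPU node
	}
	kernel := node.Status.NodeInfo.KernelVersion
	image, ok := imageByKernel[kernel]
	if !ok {
		return fmt.Errorf("no installer image for kernel %s", kernel)
	}

	// Short-lived, privileged Job pinned to the node; cleaned up automatically via TTL.
	ttl := int32(300)
	privileged := true
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Name: "nvidia-install-" + node.Name, Namespace: "kube-system"},
		Spec: batchv1.JobSpec{
			TTLSecondsAfterFinished: &ttl,
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					NodeName:      node.Name,
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:            "installer",
						Image:           image,
						SecurityContext: &corev1.SecurityContext{Privileged: &privileged},
					}},
				},
			},
		},
	}
	if _, err := client.BatchV1().Jobs("kube-system").Create(ctx, job, metav1.CreateOptions{}); err != nil {
		return err
	}

	// Wait for the Job to succeed, then mark the node as done.
	for {
		j, err := client.BatchV1().Jobs("kube-system").Get(ctx, job.Name, metav1.GetOptions{})
		if err != nil {
			return err
		}
		if j.Status.Succeeded > 0 {
			break
		}
		time.Sleep(10 * time.Second)
	}
	if node.Annotations == nil {
		node.Annotations = map[string]string{}
	}
	node.Annotations[installedAnnotation] = "true"
	_, err := client.CoreV1().Nodes().Update(ctx, node, metav1.UpdateOptions{})
	return err
}
```

In a real operator this logic would run inside a reconcile loop driven by a watch on nodes rather than the polling shown here.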
