nvidia arm64 & GPU operator test #583

Draft: wants to merge 4 commits into base: flatcar-master
Conversation

jepio (Member) commented Feb 27, 2025

  • Add SkipFunc implementation for skipping test on unsupported instance types
  • Add GPU operator test (includes nvidia-runtime sysext test)
  • Add Arm64 support to both tests
  • Add AWS support

This relies on the nvidia-runtime sysext from the bakery.

Signed-off-by: Jeremi Piotrowski <[email protected]>
@jepio jepio requested a review from Copilot February 27, 2025 18:41


PR Overview

This pull request adds support for NVIDIA GPU testing by introducing a SkipFunc for unsupported instance types, adding a GPU operator test (including an NVIDIA runtime sysext test), and extending support to the ARM64 architecture and AWS platform.

  • Introduces skipOnNonGpu to conditionally skip tests on unsupported instances.
  • Adds a new test (cl.misc.nvidia.operator) with a complete GPU operator installation and validation workflow.
  • Updates existing NVIDIA installation test to incorporate ARM64 support via template configuration.
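The skip behaviour summarized above can be sketched as a small predicate. This is only an illustration, not the code from the PR: the function name `skipOnNonGpuSketch`, its two-string signature, and the prefix table are all assumptions made for the example, and the real kola `SkipFunc` signature may differ. The prefixes listed are a few well-known NVIDIA-backed instance families, not a complete or authoritative list.

```go
package main

import (
	"fmt"
	"strings"
)

// gpuPrefixes is a hypothetical, deliberately incomplete table of
// instance-type prefixes that carry NVIDIA GPUs. It is NOT taken from the PR.
var gpuPrefixes = map[string][]string{
	"aws":   {"g4dn.", "g5.", "g5g.", "p3."},
	"azure": {"Standard_NC"},
}

// skipOnNonGpuSketch returns true when the test should be skipped,
// i.e. when the instance type does not match any known GPU prefix
// for the given platform. Unknown platforms are always skipped.
func skipOnNonGpuSketch(platform, instanceType string) bool {
	for _, prefix := range gpuPrefixes[platform] {
		if strings.HasPrefix(instanceType, prefix) {
			return false // GPU instance: run the test
		}
	}
	return true // no match: skip
}

func main() {
	fmt.Println(skipOnNonGpuSketch("aws", "g5g.xlarge")) // false (arm64 GPU instance)
	fmt.Println(skipOnNonGpuSketch("aws", "t3.medium"))  // true
}
```

Defaulting to "skip" for unknown platforms keeps the test from failing on hardware where no GPU can be present; whether the PR makes the same choice is not visible from this page.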

Reviewed Changes

| File | Description |
| --- | --- |
| kola/tests/misc/nvidia.go | Added new constants, skip logic, GPU operator test implementation, and expanded platform/architecture support |

Copilot reviewed 1 out of 1 changed files in this pull request and generated no comments.

Comments suppressed due to low confidence (2)

kola/tests/misc/nvidia.go:162

  • The multi-line helm installation command is written as a backtick (raw string) literal, which preserves both the trailing backslashes and the newlines exactly as typed. Verify that the remote shell handles these line continuations as intended, or consider converting the command to a single line.
_ = c.MustSSH(m, `curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/master/scripts/get-helm-3 \

kola/tests/misc/nvidia.go:101

  • The SSH check in waitForNvidiaDriver only checks for the substring 'active (exited)', which may be too narrow if nvidia.service can settle in other valid states. Consider broadening the check or adding a comment that documents the expected state.
out, err := c.SSH(*m, "systemctl status nvidia.service")
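One way to broaden the check, sketched under assumptions: instead of matching human-readable `systemctl status` text, the test could parse the `KEY=VALUE` lines printed by `systemctl show --property=ActiveState,SubState nvidia.service`. The helper names `parseUnitState` and `driverReady`, and the choice to accept both `exited` and `running` sub-states, are hypothetical and not part of the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// parseUnitState extracts ActiveState and SubState from KEY=VALUE lines,
// the format printed by `systemctl show --property=ActiveState,SubState <unit>`.
func parseUnitState(out string) (active, sub string) {
	for _, line := range strings.Split(out, "\n") {
		key, value, ok := strings.Cut(strings.TrimSpace(line), "=")
		if !ok {
			continue // not a KEY=VALUE line
		}
		switch key {
		case "ActiveState":
			active = value
		case "SubState":
			sub = value
		}
	}
	return active, sub
}

// driverReady is a hypothetical readiness rule: the unit must be active,
// either as a finished oneshot ("exited") or as a long-running service.
func driverReady(out string) bool {
	active, sub := parseUnitState(out)
	return active == "active" && (sub == "exited" || sub == "running")
}

func main() {
	fmt.Println(driverReady("ActiveState=active\nSubState=exited\n"))    // true
	fmt.Println(driverReady("ActiveState=activating\nSubState=start\n")) // false
}
```

Parsing properties avoids depending on the exact wording of the status line. If only 'active (exited)' is ever valid for this unit, the original substring check is fine and a short comment would be enough.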