updating AMD bootc image #767

Merged: 3 commits merged into containers:main on Aug 29, 2024

yevgeny-shnaidman
Contributor

  1. using OOT driver and firmware instead of in-tree
  2. moving to DKMS and ROCM 6.1.2 version

Signed-off-by: Yevgeny Shnaidman <[email protected]>
@yevgeny-shnaidman
Contributor Author

/cc @fabiendupont

The multi-stage build has too many stages. During the installation of
the `amdgpu-dkms` package, the modules are built and installed in
`/lib/modules/${KERNEL_VERSION}`. If the package is installed in the
`driver-toolkit` image, very few extra dependencies are needed. This
change removes the `source` stage and installs the `amdgpu-dkms`
package on top of `driver-toolkit`.

The `amdgpu-dkms` package installs the modules in
`/lib/modules/${KERNEL_VERSION}/extra` and these are the only modules in
that folder. The `amdgpu-dkms-firmware` package is installed as a
dependency of `amdgpu-dkms` and it installs the firmware files in
`/lib/firmware/updates/amdgpu`. So, this change removes the in-tree
`amdgpu` modules and firmware, then copies the ones generated by DKMS in
the `builder` stage.
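
The swap described above can be sketched as a small shell helper. This is only an illustration, not the code from this change: the function name `replace_amdgpu` and its root-directory arguments are hypothetical, and the in-tree module path under `kernel/drivers/gpu/drm/amd/amdgpu` is an assumption based on the usual kernel layout. The DKMS paths are the ones named in the commit message.

```shell
# Hypothetical helper: replace_amdgpu BUILDER_ROOT TARGET_ROOT KERNEL_VERSION
# BUILDER_ROOT is the filesystem of the builder stage, TARGET_ROOT the final image.
replace_amdgpu() {
    builder_root="$1"
    target_root="$2"
    kver="$3"

    # Drop the in-tree driver and firmware.
    # The module path is an assumption; check it against the target kernel.
    rm -rf "$target_root/lib/modules/$kver/kernel/drivers/gpu/drm/amd/amdgpu" \
           "$target_root/lib/firmware/amdgpu"

    # Copy the DKMS-built modules and the out-of-tree firmware.
    mkdir -p "$target_root/lib/modules/$kver/extra" \
             "$target_root/lib/firmware/updates"
    cp -a "$builder_root/lib/modules/$kver/extra/." \
          "$target_root/lib/modules/$kver/extra/"
    cp -a "$builder_root/lib/firmware/updates/amdgpu" \
          "$target_root/lib/firmware/updates/"

    # In a real image the module index would then be refreshed, e.g.:
    # depmod -b "$target_root" "$kver"
}
```

Taking both roots as arguments keeps the sketch testable against temporary directories; in a Containerfile the same steps would more likely be a `RUN rm -rf ...` followed by `COPY --from=builder`.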

The change also moves the repository definitions to the `repos.d` folder
and adds the AMD public key to verify the signatures of the AMD RPMs.

The users call a wrapper script called `ilab` to hide the `instructlab`
container image and the command line options. This change copies the
file from `nvidia-bootc` and adjusts the logic. The main change is that
`/dev/kfd` and `/dev/dri` devices are passed to the container, instead
of `nvidia.com/gpu=all`. The `ilab` wrapper is copied in the `amd-bootc`
image.
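
To illustrate the device change, here is a minimal sketch of how such a wrapper might select the GPU devices. It is not the actual `ilab` script: the function names `gpu_device_args` and `ilab_command` are hypothetical, the remaining `podman run` options are reduced to a bare minimum, and the default image is assumed to be the `INSTRUCTLAB_IMAGE` value from the Containerfile. The command is printed rather than executed.

```shell
# Assumed default image; override by exporting IMAGE_NAME.
IMAGE_NAME="${IMAGE_NAME:-quay.io/ai-lab/instructlab-amd:latest}"

# Print the podman --device options for a GPU vendor.
gpu_device_args() {
    case "$1" in
        # AMD: pass the compute device and the DRM device directory.
        amd) echo "--device /dev/kfd --device /dev/dri" ;;
        # NVIDIA: pass the CDI device name used by the nvidia-bootc wrapper.
        nvidia) echo "--device nvidia.com/gpu=all" ;;
        *) echo "unknown GPU vendor: $1" >&2; return 1 ;;
    esac
}

# Print (without running) the podman command for an ilab subcommand.
ilab_command() {
    vendor="$1"
    shift
    devices="$(gpu_device_args "$vendor")" || return 1
    # $devices is intentionally unquoted so it splits into separate options.
    echo podman run --rm -it $devices "$IMAGE_NAME" ilab "$@"
}
```

For example, `ilab_command amd chat` prints a `podman run` line carrying both `--device` options; a real wrapper would execute that command instead of echoing it.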

The Makefile is also modified to reflect these changes.

Signed-off-by: Fabien Dupont <[email protected]>
@fabiendupont
Contributor

Instead of commenting here, I submitted a pull request in @yevgeny-shnaidman's repository: yevgeny-shnaidman#1.

@fabiendupont
Contributor

LGTM

ARG INSTRUCTLAB_IMAGE="quay.io/ai-lab/instructlab-amd:latest"
ARG BASEIMAGE="quay.io/centos-bootc/centos-bootc:stream9"
ARG DRIVER_TOOLKIT_IMAGE="quay.io/ai-lab/nvidia-builder:latest"
Member

Does this make sense? Using nvidia-builder on amd-bootc?

Contributor Author

This is a driver-toolkit image, which contains all the packages needed for kernel module compilation. So, it is tied to a kernel version and does not contain any packages that might be specific to NVIDIA. The name is unfortunate, but this is what we have right now.

Contributor

We have discussed moving the bootc images out of the ai-lab-recipes repository. One of the changes would be to publish the driver toolkit as quay.io/<org>/driver-toolkit:<kernel-version>. The driver toolkit images could be hosted in the quay.io/fedora and quay.io/centos orgs.
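
A rough sketch of how a build script could derive such a reference, assuming the proposed naming is adopted as written. The function name `driver_toolkit_image` is hypothetical, and replacing `+` with `_` is an added assumption, since `+` can appear in a kernel release string but is not allowed in an image tag.

```shell
# Hypothetical helper: driver_toolkit_image ORG [KERNEL_VERSION]
# Defaults to the running kernel when no version is given.
driver_toolkit_image() {
    org="$1"
    kver="${2:-$(uname -r)}"
    # Image tags may only contain letters, digits, '_', '.' and '-'.
    kver="$(printf '%s' "$kver" | tr '+' '_')"
    echo "quay.io/${org}/driver-toolkit:${kver}"
}
```

The result could then be passed to the build as `--build-arg DRIVER_TOOLKIT_IMAGE=...`, replacing the current `nvidia-builder` default.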

@rhatdan
Member

rhatdan commented Aug 29, 2024

/lgtm

@rhatdan rhatdan merged commit fc758b5 into containers:main Aug 29, 2024
1 check passed
3 participants