container-images: add a kompute variant #235
Conversation
Would it make more sense to layer this on the original image, from quay.io/ramalama/ramalama:latest?
That's not a bad idea @rhatdan, we can just replace the CPU-only binaries with these kompute ones, less duplication. We do have an issue though: I've been trying to make these ubi9-based as Scott McCarthy requested, which I think makes sense, but UBI images are missing access to a small set of required packages. We can make it work by pulling the CentOS Stream ones via something like:
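(A minimal sketch of that approach, assuming a ubi9 base; the mirror URL and the package names below are illustrative, not the exact snippet from the comment.)

```
FROM registry.access.redhat.com/ubi9/ubi:latest

# Illustrative: add a CentOS Stream 9 repo so dnf can resolve the handful of
# packages that are not published in the UBI repos.
RUN printf '%s\n' \
      '[centos9-appstream]' \
      'name=CentOS Stream 9 - AppStream' \
      'baseurl=https://mirror.stream.centos.org/9-stream/AppStream/$basearch/os/' \
      'gpgcheck=0' \
      'enabled=1' > /etc/yum.repos.d/centos9-appstream.repo && \
    dnf install -y vulkan-loader-devel vulkan-headers && \
    dnf clean all
```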
But then it becomes kind of a hybrid UBI9/Stream image.
Could you also turn on x86_64 EPEL9 builds for this, @slp? https://copr.fedorainfracloud.org/coprs/slp/mesa-krunkit/ I want to try that out with an x86_64 GPU 😄
I think EPEL would be a better solution than CentOS Stream if they are all available.
They aren't in EPEL @rhatdan :'( They are in AppStream/BaseOS, but not UBI.
UBI repos seem to be a subset of the RHEL versions of the repos: https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/x86_64/appstream/os/Packages/ I'm not sure exactly what determines whether a package is RHEL-only or RHEL+UBI accessible.
I'm afraid the situation with the Vulkan-related packages in CentOS Stream 9 is kind of broken, so I think F40 is the only option for now. We can easily switch to Stream 9 once the Vulkan situation is fixed, and eventually to ubi9.
@slp that package in particular looks like it's in EPEL9
This seemed to work reasonably ok:
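Roughly along these lines (a sketch of the idea rather than the exact snippet; an EL9-based parent image is assumed, and the EPEL setup and package list are illustrative):

```
FROM quay.io/ramalama/ramalama:latest

# Assumption: EPEL9 provides the Vulkan/SPIR-V development packages the
# parent image lacks; the exact package list here is illustrative.
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
    dnf install -y vulkan-loader-devel vulkan-tools spirv-headers-devel && \
    dnf clean all
```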
Why are you rebuilding the whisper-server? Isn't the one in the parent layer the same? Or do you have to compile it differently for kompute?
Who is using the dnf install -y spirv-headers-devel package? Should it just be used during the build and removed before the image is finished?
So it would be more like this:
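For example (a hypothetical sketch; only spirv-headers-devel comes from the comment above, the rest of the RUN is illustrative):

```
# Pull the build-only dependency in, build, then remove it again in the same
# layer so it never lands in the finished image.
RUN dnf install -y spirv-headers-devel && \
    cmake -B build -DGGML_KOMPUTE=1 && \
    cmake --build build -j && \
    dnf remove -y spirv-headers-devel && \
    dnf clean all
```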
But I guess there may be other reasons @slp cannot build the latest variant for EPEL9... I notice only an older version is built for EPEL9 aarch64... If RHEL9/RHEL10 won't be suitable for this kind of thing, it would be nice to document somewhere which packages are missing in both RHEL9 and RHEL10 and whether they are in EPEL, etc.
Do RHEL/RHEL AI customers have access to full-RHEL containers, not just the subset of packages in UBI? This is also something I am curious about...
Yes, they have full access to RHEL content.
Install Vulkan dependencies in the container and build llama.cpp with GGML_KOMPUTE=1 to enable the kompute backend. This enables users with a Vulkan-capable GPU to optionally offload part of the workload to it. Signed-off-by: Sergio Lopez <[email protected]>
v2:
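For context, a minimal sketch of the build step the commit message describes (an out-of-tree CMake build; the checkout SHA is the one pinned in the Containerfile below, everything else is illustrative):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 2a24c8caa6d10a7263ca317fa7cb64f0edc72aae
# Enable the kompute (Vulkan) backend, as described in the commit message.
cmake -B build -DGGML_KOMPUTE=1
cmake --build build -j
```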
@@ -6,17 +6,15 @@ ARG HUGGINGFACE_HUB_VERSION=0.25.2
 ARG OMLMD_VERSION=0.1.5
 # renovate: datasource=github-releases depName=tqdm/tqdm extractVersion=^v(?<version>.*)
 ARG TQDM_VERSION=4.66.5
-ARG LLAMA_CPP_SHA=70392f1f81470607ba3afef04aa56c9f65587664
+ARG LLAMA_CPP_SHA=2a24c8caa6d10a7263ca317fa7cb64f0edc72aae
This is a downgrade. Do we need to downgrade?
ramalama/model.py (outdated)
@@ -100,6 +99,18 @@ def run(self, args):
         if not args.ARGS:
             exec_args.append("-cnv")
 
+        if args.gpu:
+            if sys.platform == "darwin":
We introduced -ngl 99 on macOS because Tim deBoer only saw maximum utilization of his GPU with ngl 99.
But... I never saw this on my M3 Pro, it seemed the same regardless, so I suspect your change here is right.
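For reference, -ngl (--n-gpu-layers) tells llama.cpp how many model layers to offload to the GPU, so 99 effectively means "offload everything"; an illustrative invocation (the binary name and model path are placeholders and may differ by llama.cpp version):

```
# Offload up to 99 layers to the GPU (Metal on macOS).
llama-cli -m /path/to/model.Q4_0.gguf -ngl 99 -p "Hello"
```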
I'm curious whether we need the llama.cpp downgrade, and it would be nice to see the total size difference between the cpuonly and cpuonly+kompute container images.
Sadly, we do need the downgrade because the kompute backend is broken in master. This happens frequently with pretty much every backend except CPU. I want to take a look and see if I can fix it in master myself, but I can't make any promises. As for the size, this is what I'm getting here:
Should I add a commit renaming the directory? It feels weird keeping the name cpuonly.
Yes, rename. I would prefer to keep one image for CPU and Kompute, given the limited size difference.
Add a "--gpu" that allows users to request the workload to be offloaded to the GPU. This works natively on macOS using Metal and in containers using Vulkan with llama.cpp's Kompute backend. Signed-off-by: Sergio Lopez <[email protected]>
To be able to properly expose the port outside the container, we need to pass "--host 0.0.0.0" to llama.cpp. Signed-off-by: Sergio Lopez <[email protected]>
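In other words, the server inside the container has to bind to all interfaces for the published port to be reachable from the host; a rough equivalent by hand (the image tag, port, model path, and server binary name are all illustrative):

```
# Bind llama.cpp's server to 0.0.0.0 so the port published by podman is reachable,
# rather than listening only on 127.0.0.1 inside the container.
podman run --rm -p 8080:8080 quay.io/ramalama/ramalama:latest \
    llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```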
cpuonly + kompute (merged image) = generic. Just a suggestion...
Now that it also provides Vulkan support, the "cpuonly" name no longer describes it properly. Let's rename it to "generic". Signed-off-by: Sergio Lopez <[email protected]>
I would rather name it ramalama or vulkan.
                # any additional arguments.
                pass
            elif sys.platform == "linux" and Path("/dev/dri").exists():
                if "q4_0.gguf" not in model_path.lower():
This is interesting... Should this condition apply to all GPU frameworks (CUDA, ROCm and Vulkan/Kompute), or just Kompute?
Only for kompute, so we need to make model.py aware of the variant it's running.
I'd rather we just removed the q4_0.gguf check... If the model fails, it fails...
Many models are not named like this; none of the ollama ones are, for example...
I agree, let's remove the check.
Add a container variant that enables llama.cpp's kompute backend, providing GPU acceleration using Vulkan.
Tested with krunkit on Apple Silicon with mistral-7b-instruct-v0.2.Q4_0.gguf and Wizard-Vicuna-13B-Uncensored.Q4_0.gguf (models >=13B benefit the most from being offloaded to the GPU vs. running them on the CPU).
TODO:
- Teach ramalama to choose the best container image for the context.
- Add some Q4_0 models to shortnames.conf
- Expose shortnames.conf into the container.
- Expose llama.cpp's server port when running in a container.