container-images: add a kompute variant #235
Conversation
Would it make more sense to layer this on the original image, from quay.io/ramalama/ramalama:latest?
That's not a bad idea @rhatdan, we can just replace the CPU-only binaries with these kompute ones, less duplication. We do have an issue though: I've been trying to make these ubi9-based as Scott McCarthy requested, which I think makes sense, but UBI images are missing access to a small set of required packages. We can make it work by pulling the CentOS Stream ones via something like:
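(A minimal sketch of that approach, assuming a ubi9 base; the mirror URL and the package names below are illustrative, not the exact snippet from the comment.)

```
FROM registry.access.redhat.com/ubi9/ubi:latest

# Illustrative: add a CentOS Stream 9 repo so dnf can resolve the handful of
# packages that are not published in the UBI repos.
RUN printf '%s\n' \
      '[centos9-appstream]' \
      'name=CentOS Stream 9 - AppStream' \
      'baseurl=https://mirror.stream.centos.org/9-stream/AppStream/$basearch/os/' \
      'gpgcheck=0' \
      'enabled=1' > /etc/yum.repos.d/centos9-appstream.repo && \
    dnf install -y vulkan-loader-devel vulkan-headers && \
    dnf clean all
```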
But then it becomes kind of a hybrid UBI9/Stream image.
Could you also turn on x86_64 EPEL9 builds for this, @slp? https://copr.fedorainfracloud.org/coprs/slp/mesa-krunkit/ I want to try that out with an x86_64 GPU 😄
I think EPEL would be a better solution than CentOS Stream if they are all available.
They aren't in EPEL @rhatdan :'( They are in AppStream/BaseOS, but not UBI.
UBI repos seem to be a subset of the RHEL versions of the repos: https://cdn-ubi.redhat.com/content/public/ubi/dist/ubi9/9/x86_64/appstream/os/Packages/ I'm not sure exactly what determines whether a package is RHEL-only or RHEL+UBI accessible.
I'm afraid the situation with the Vulkan-related packages in CentOS Stream 9 is kind of broken, so I think F40 is the only option for now. We can easily switch to Stream 9 once the Vulkan situation is fixed, and eventually to ubi9.
@slp that package in particular looks like it's in EPEL9
This seemed to work reasonably ok:
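Roughly along these lines (a sketch of the idea rather than the exact snippet; an EL9-based parent image is assumed, and the EPEL setup and package list are illustrative):

```
FROM quay.io/ramalama/ramalama:latest

# Assumption: EPEL9 provides the Vulkan/SPIR-V development packages the
# parent image lacks; the exact package list here is illustrative.
RUN dnf install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-9.noarch.rpm && \
    dnf install -y vulkan-loader-devel vulkan-tools spirv-headers-devel && \
    dnf clean all
```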
Why are you rebuilding the whisper-server? Isn't the one in the parent layer the same? Or do you have to compile it differently for kompute?
Who is using the dnf install -y spirv-headers-devel package? Should it just be used during the build and removed before the image is finished?
So it would be more like this:
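For example (a hypothetical sketch; only spirv-headers-devel comes from the comment above, the rest of the RUN is illustrative):

```
# Pull the build-only dependency in, build, then remove it again in the same
# layer so it never lands in the finished image.
RUN dnf install -y spirv-headers-devel && \
    cmake -B build -DGGML_KOMPUTE=1 && \
    cmake --build build -j && \
    dnf remove -y spirv-headers-devel && \
    dnf clean all
```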
But I guess there may be other reasons @slp cannot build the latest variant for EPEL9... I notice only an older version is built for EPEL9 aarch64... If RHEL9/RHEL10 won't be suitable for this kind of thing, it would be nice to document somewhere which packages are missing in both RHEL9 and RHEL10 and whether they are in EPEL, etc.
Do RHEL/RHEL AI customers have access to full-RHEL containers, not just the subset of packages in UBI? This is also something I am curious about...
Yes, they have full access to RHEL content.
Install Vulkan dependencies in the container and build llama.cpp with GGML_KOMPUTE=1 to enable the kompute backend. This enables users with a Vulkan-capable GPU to optionally offload part of the workload to it. Signed-off-by: Sergio Lopez <[email protected]>
v2:
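For context, a minimal sketch of the build step the commit message describes (an out-of-tree CMake build; the checkout SHA is the one pinned in the Containerfile below, everything else is illustrative):

```
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
git checkout 2a24c8caa6d10a7263ca317fa7cb64f0edc72aae
# Enable the kompute (Vulkan) backend, as described in the commit message.
cmake -B build -DGGML_KOMPUTE=1
cmake --build build -j
```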
@@ -6,17 +6,15 @@ ARG HUGGINGFACE_HUB_VERSION=0.25.2
 ARG OMLMD_VERSION=0.1.5
 # renovate: datasource=github-releases depName=tqdm/tqdm extractVersion=^v(?<version>.*)
 ARG TQDM_VERSION=4.66.5
-ARG LLAMA_CPP_SHA=70392f1f81470607ba3afef04aa56c9f65587664
+ARG LLAMA_CPP_SHA=2a24c8caa6d10a7263ca317fa7cb64f0edc72aae
This is a downgrade. Do we need to downgrade?
ramalama/model.py (outdated)
@@ -100,6 +99,18 @@ def run(self, args):
         if not args.ARGS:
             exec_args.append("-cnv")
 
+        if args.gpu:
+            if sys.platform == "darwin":
We introduced -ngl 99 on macOS because Tim deBoer only saw maximum utilization of his GPU with ngl 99.
But... I never saw this on my M3 Pro, it seemed the same regardless, so I suspect your change here is right.
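For reference, -ngl (--n-gpu-layers) tells llama.cpp how many model layers to offload to the GPU, so 99 effectively means "offload everything"; an illustrative invocation (the binary name and model path are placeholders and may differ by llama.cpp version):

```
# Offload up to 99 layers to the GPU (Metal on macOS).
llama-cli -m /path/to/model.Q4_0.gguf -ngl 99 -p "Hello"
```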
I'm curious whether we need the llama.cpp downgrade, and it would be nice to see the total size difference between the cpuonly and cpuonly+kompute container images.
Sadly, we do need the downgrade because the kompute backend is broken in master. This happens frequently with pretty much every backend except CPU. I want to take a look and see if I can fix it in master myself, but I can't make any promises. As for the size, this is what I'm getting here:
Should I add a commit renaming the directory? It feels weird keeping the name cpuonly.
Yes, rename. I would prefer to keep one image for CPU and Kompute, given the limited size difference.
Add a "--gpu" that allows users to request the workload to be offloaded to the GPU. This works natively on macOS using Metal and in containers using Vulkan with llama.cpp's Kompute backend. Signed-off-by: Sergio Lopez <[email protected]>
To be able to properly expose the port outside the container, we need to pass "--host 0.0.0.0" to llama.cpp. Signed-off-by: Sergio Lopez <[email protected]>
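In other words, the server inside the container has to bind to all interfaces for the published port to be reachable from the host; a rough equivalent by hand (the image tag, port, model path, and server binary name are all illustrative):

```
# Bind llama.cpp's server to 0.0.0.0 so the port published by podman is reachable,
# rather than listening only on 127.0.0.1 inside the container.
podman run --rm -p 8080:8080 quay.io/ramalama/ramalama:latest \
    llama-server -m /path/to/model.gguf --host 0.0.0.0 --port 8080
```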
cpuonly + kompute (merged image) = generic. Just a suggestion...
Now that it also provides Vulkan support, the "cpuonly" name no longer describes it properly. Let's rename it to "generic". Signed-off-by: Sergio Lopez <[email protected]>
I would rather name it ramalama or vulkan.
                # any additional arguments.
                pass
            elif sys.platform == "linux" and Path("/dev/dri").exists():
                if "q4_0.gguf" not in model_path.lower():
This is interesting... Should this condition apply to all GPU frameworks (CUDA, ROCm and Vulkan/Kompute), or just Kompute?
Only for kompute, so we need to make model.py aware of the variant it's running.
I'd rather we just removed the q4_0.gguf check... If the model fails, it fails...
Many models are not named like this; none of the ollama ones are, for example...
I agree, let's remove the check.
Add a container variant that enables llama.cpp's kompute backend, providing GPU acceleration using Vulkan.
Tested with krunkit on Apple Silicon with mistral-7b-instruct-v0.2.Q4_0.gguf and Wizard-Vicuna-13B-Uncensored.Q4_0.gguf (models >=13B benefit the most from being offloaded to the GPU vs. running them on the CPU).
TODO:
- Teach ramalama to choose the best container image for the context.
- Add some Q4_0 models to shortnames.conf
- Expose shortnames.conf into the container.
- Expose llama.cpp's server port when running in a container.