Add GPU support to ggml #915
Replies: 48 comments 161 replies
-
Would it be possible to use https://github.com/openai/triton/ to generate the backend-specific GPU code? From what I can tell, it generates CUDA code for you. The only drawback right now is that it does not support the pre-Volta series, but there is a PR working on that: triton-lang/triton#1505.
-
On the Python side, SHARK has universal support using Vulkan, and they recently implemented LLaMA (https://github.com/nod-ai/llama). Perhaps a similar approach could be adopted, leveraging Vulkan as a universal backend? Vulkan should be sufficiently universal across all the platforms you are trying to target.
-
What about OpenCL?
-
I'm very interested in this (AMD GPU owner...). I'm mainly using llama.cpp because it runs on WSL: although I have the Python implementation running with ROCm on Linux, I work mainly on Windows with WSL. I have no experience with ML and have been away from C/C++ for many years. For now I'm mostly exploring model inference and the WebGPU documentation. I don't expect to make any contributions in the short term (I lack the technical knowledge at the moment), but I was already planning to test some things on my own. For users with NVIDIA hardware I don't think there is much to gain, except perhaps greater speed due to C and a much "cleaner" environment than Python. For AMD GPU and APU users, including Intel integrated graphics and even integrated devices with GPUs, there can be a lot of gain.
-
In my experience, prompt processing appears to be the main bottleneck for speed. Accelerating prompt processing with cuBLAS on tensor cores could speed up the matrix multiplication considerably. However, transferring the matrices to the GPU appears to be the main bottleneck when using GPU-accelerated prompt processing.
-
Another AMD GPU owner here, wanting Windows support for GPU inference. I've been looking at Kompute (https://github.com/KomputeProject/kompute). The most interesting area of research to me is running larger models that can't fit into VRAM, something like DeepSpeed ZeRO-3.
-
WONNX has WebGPU kernels for its operators, in case one wants to support WebGPU.
-
What about supporting a TPU or NPU, like the ones in the RK chips used in the Khadas boards or the Rock Pi?
-
Have you considered simply exporting to ONNX and using ONNX Runtime? If you have, what were the reasons this would not make sense?
-
One idea is to use something like the bgfx framework. Even though it's designed only for games, you could fork it and use its generic shader layer to compile to platform-specific shaders: Metal on Mac, D3D on Windows, Vulkan on Linux, or even WebGPU.
-
If this happens, please use HIP instead of CUDA. The code should be almost the same, but it will make it much easier for AMD users to run.
-
I'd like to propose using MLIR as the text format. It's a flexible intermediate representation toward which most of the machine learning ecosystem seems to be gravitating right now. The approach could be to define a ggml-specific dialect that maps to the internal graph representation as closely as possible. People could then build transforms from the ggml MLIR dialect to other MLIR dialects, for example:
Obviously, MLIR is a huge dependency, but implementing a minimal MLIR text generator should be possible with just the C++ standard library.
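To make the "minimal text generator" idea a bit more concrete, here is a rough sketch in Python (not C++, and not from the thread) that prints ops in MLIR's generic operation form using only the standard library. The "ggml" dialect, the op names "add" and "relu", and the Value class are all hypothetical, and the output has not been validated against any MLIR tool.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Value:
    """One SSA value in a toy graph (hypothetical structure, not ggml's)."""
    shape: Tuple[int, ...]
    dtype: str = "f32"
    op: str = ""                  # empty string marks a graph input
    inputs: Tuple[int, ...] = ()  # indices of earlier values used as operands


def tensor_type(v: Value) -> str:
    # Builds a ranked tensor type string such as tensor<8x16xf32>.
    dims = "x".join(str(d) for d in v.shape)
    return f"tensor<{dims}x{v.dtype}>"


def emit_ops(values: List[Value]) -> List[str]:
    """Emit one generic-form line per non-input value."""
    lines = []
    for i, v in enumerate(values):
        if not v.op:
            continue  # inputs would become function arguments in a full module
        operands = ", ".join(f"%{j}" for j in v.inputs)
        in_types = ", ".join(tensor_type(values[j]) for j in v.inputs)
        lines.append(
            f'%{i} = "ggml.{v.op}"({operands}) : ({in_types}) -> {tensor_type(v)}'
        )
    return lines


graph = [
    Value((8, 16)),                           # %0: input
    Value((8, 16)),                           # %1: input
    Value((8, 16), op="add", inputs=(0, 1)),  # %2 = add(%0, %1)
    Value((8, 16), op="relu", inputs=(2,)),   # %3 = relu(%2)
]
mlir_lines = emit_ops(graph)
print("\n".join(mlir_lines))
```

A real generator would also have to wrap these lines in a module and function and register the dialect on the MLIR side; the sketch only shows that the text itself is cheap to produce.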
-
Hear me out on something that could make a lot of sense: Intel Arc A770 GPUs are somewhat underappreciated when it comes to performance per $. They're also almost completely ignored by the trendy side of AI research. In some tests with multimodal VQA models, I get about 1/3 the performance of a 4090, but the card only costs $379. I will admit to not having performance/watt benchmarks, but for most consumer/edge situations this isn't a huge deal; it's more about the cost of the card for most folks. With Intel trying to break into the GPU market, the A770 is priced to sell, and did I mention it's 16GB? I would love to see this card be the first platform targeted by a ggml GPU stack. With all the optimizations, it would be fun to see how much that "1/3 of a 4090" number could be improved upon, and at $379 this could be a huge unlock for the community.
-
I feel that what is great about ggml is that it can run on any CPU with a C++ compiler, e.g. a Raspberry Pi. However, on GPUs, I don't know how likely it is that ggml-based inference code can beat PyTorch or Triton in terms of performance.
-
Recently, I tried to use AMP on Windows to do some matrix calculations, but it turned out to be harder than I thought.
-
A coworker and I have been playing around with the cuBLAS version of llama.cpp on a Jetson Xavier NX development kit. We're getting ~600 ms/token on a Xavier NX, but we aren't seeing a significant performance difference between compiling with and without cuBLAS. Thank you for your excellent implementation, @slaren -- I have learned a lot from you! One thought that stuck out to us was this comment from @Dampfinchen.
NVIDIA Tegra devices have shared DRAM between the CPU and GPU (as long as we allocate the memory in the correct mode). So I think this means that on these devices, we don't need to spend time copying memory from the CPU to the GPU, and all of those calls to Is that worthwhile? I'm in new territory (and not entirely sure of the best way to tell whether I'm in a shared-memory context or not), so any feedback will be helpful. Either way, I'll probably start hacking on this tomorrow and see if I can get something working...
-
Would like to note that with #801 merged into main, an NVMe RAID 0 set should be enough to feed a PCIe bus for a future GPU add-on: offloading the GPU's FB would make llama.cpp run ahead of so many others in the field. Yes, the GPU would not be at 100% most of the time, especially on older systems, but such is the case with HPC scaling.
-
Has anyone seen rllama (https://github.com/Noeda/rllama)?
Is that possible with GGML CPP? And does hybrid GPU-CPU inference increase speed, or is it butchered by the memory transfer?
-
This project called MLC-LLM was just released (https://github.com/mlc-ai/mlc-llm), and it can compile LLMs to GPU shaders for CUDA, Vulkan and Metal.
-
Hello, That allows us to split algorithm and implementation: It is also battle-tested (Adobe: Photoshop; Google: Pixel phones, YouTube; Samsung; ...). It is pure C++, can target virtually any LLVM backend, and supports: Write the front end once, then specialize the scheduler per platform and/or use the auto-scheduler to warm up an optimization. My 2 cents.
-
Howdy y'all! I made most of a GUI for LLaMA with the ability to preload onto the GPU! It seems the input/output parameters need some tweaking to function fully, but if anyone can put on the finishing touches (maybe multithreading to let it run fast, letting the user set the folder, etc.), I think it can be sweet! Honestly, what's keeping me from using LLaMA is not having a GUI. I don't like talking in the cmd. So if someone can add the finishing touches to this, I'd be stoked!
-
Hello, I am a little confused by all this. Pardon my ignorance, but
can someone explain this to me? I apologize if these questions seem very basic to experts, but I wasn't able to find answers.
-
I've noticed that the CLBlast version of ggml performs about the same as the CPU version with q4_0-formatted files (I haven't tested other file formats). While debugging it, I noticed that it doesn't use the GPU much, and trying to force it to use the GPU more by modifying the code seemed to make performance worse. So I implemented my own OpenCL kernel that focuses only on mul-mat, and I got a much larger performance increase: on my fastest system, an RTX 4070 Ti shows an 80% increase in performance over its 8-core Ryzen 9 5900HX CPU. On older systems, a GTX 2070 shows about an 11% increase over its i7-9750H, and a GTX 1070 a 22% increase over its i7-6700T. That does not include optimizations to use local memory in the OpenCL kernel, which will likely increase it further.
To get that performance increase, though, I had to reformat the data. I originally tried using the ggml structures block_q8_0 and block_q4_0, but breaking the structure out into two separate arrays, one for qs and another for d, improved performance, and using those arrays as 8-component vectors in the kernel made a large improvement. This is what the vector dot product looks like in OpenCL:
float vec32_dot_q4_q8_raw8(
    __global uchar4 *xqs,   // packed 4-bit quants of x, 16 bytes per block
    __global half *xd,      // per-block scale of x
    __global char8 *yqs,    // 8-bit quants of y, 32 values per block
    __global half *yd,      // per-block scale of y
    const unsigned int nb)  // number of 32-value blocks
{
    float8 fsum = 0;
    for (unsigned int j = 0; j < nb; j++) {
        float8 sum = 0;
        // 4 iterations x 8 values = one 32-value block
        for (int i = 0; i < 4; i++) {
            uchar4 q4 = *xqs;
            float8 _xqs;
            // low nibbles go to the even lanes, high nibbles to the odd lanes
            _xqs.even = convert_float4(q4 & (uchar4)(0xf));
            _xqs.odd = convert_float4(q4 >> (uchar4)(4));
            _xqs -= (float8)(8);  // remove the q4 offset of 8
            _xqs *= convert_float8(*yqs);
            sum += _xqs;
            xqs++; yqs++;
        }
        // apply both per-block scales
        fsum += sum * vload_half(0,xd) * vload_half(0,yd);
        xd++; yd++;
    }
    // horizontal reduction of the 8 lanes
    fsum.s0123 += fsum.s4567;
    fsum.s01 += fsum.s23;
    return fsum.s0 + fsum.s1;
}
I could share the code if anyone is interested, but it is really hacky at the moment, and I don't know whether the performance I am seeing with CLBlast is normal or whether I did something wrong. I am also thinking of trying to optimize it further using OpenCL local memory.
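For anyone who wants to check a kernel like this against a CPU result, here is a small pure-Python reference of the same arithmetic. It is only a sketch under assumptions read off the kernel above: 32 values per block, the low nibble of each byte holding the even-indexed value and the high nibble the odd-indexed one, an offset of 8 on the 4-bit values, and one scale per block for each operand. The helper names pack_q4 and dot_q4_q8 are made up for illustration.

```python
from typing import Sequence

QK = 32  # values per quantization block, matching the kernel above


def pack_q4(quants: Sequence[int]) -> bytes:
    """Pack values in [0, 15] two per byte: even index -> low nibble, odd -> high."""
    assert len(quants) % 2 == 0
    return bytes(
        (quants[i] & 0xF) | ((quants[i + 1] & 0xF) << 4)
        for i in range(0, len(quants), 2)
    )


def dot_q4_q8(xqs: bytes, xd: Sequence[float],
              yqs: Sequence[int], yd: Sequence[float]) -> float:
    """Reference dot product over len(xd) blocks with separate qs and d arrays."""
    total = 0.0
    for j in range(len(xd)):
        acc = 0
        for k in range(QK // 2):
            byte = xqs[j * (QK // 2) + k]
            lo = (byte & 0xF) - 8   # even-indexed value, offset removed
            hi = (byte >> 4) - 8    # odd-indexed value, offset removed
            acc += lo * yqs[j * QK + 2 * k] + hi * yqs[j * QK + 2 * k + 1]
        total += acc * xd[j] * yd[j]  # apply both per-block scales
    return total


# One block: every x quant is 9 (i.e. +1 after the offset), every y quant is 2.
x_packed = pack_q4([9] * QK)
result = dot_q4_q8(x_packed, [0.5], [2] * QK, [0.25])
print(result)  # 32 * (1 * 2) * 0.5 * 0.25 = 8.0
```

The GPU version accumulates in float and reads the scales as half, so a comparison against this reference should allow a small tolerance.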
-
@JohannesGaessler I noticed that with the latest version of llama.cpp at the time of this writing, the VRAM usage while offloading layers to the GPU using CUDA has increased. Before, I had 15 layers (13B model, RTX 2060, ggml 5_1) with a VRAM usage of around 3400 MB. Now, with the same number of layers, it needs 3900 MB. This slows down generation quite a lot, as I have to use fewer layers now.
-
With the HIP SDK for Windows now being public 🎉, does this mean it is now possible to add AMD Windows HIP support to llama? See https://www.amd.com/en/developer/rocm-hub/hip-sdk.html and https://rocm.docs.amd.com/en/latest/release/windows_support.html
-
Has anyone considered transpiling the ggml format to Futhark?
-
Might be a source of inspiration / I'm sure they accept pull requests. I was under the impression ML weights were basically DSP graphs.
On Thu, 10 Aug 2023 at 23:34, Ian Scrivener wrote:
According to the Futhark bio it is an "ongoing research project" from the University of Copenhagen - "a small programming language designed to be compiled to efficient parallel code"... "use the compute power of the GPU to accelerate data-parallel array computations"... "not intended to replace existing general-purpose languages"
-
GGML routines based on Vulkan would likely have been best, as that would be the only single-source, multi-platform-compatible solution. Just look at the plethora of suggestions here, most of which have platform-specific support problems.
-
I'm extremely in favor of a WebGPU backend. Being able to run ggml graphs in WebGPU would mean being able to run them with GPU performance in the browser. Combined with Chrome's ttsEngine API, that means being able to install an extension that runs state-of-the-art text-to-speech models locally as the backend for your screen reader. Compared to the current crappy robot voices you get from your OS and from popular screen reader extensions, I can't overstate how huge an improvement that would be.
-
Intro
This issue is more suitable for the https://github.com/ggerganov/ggml repo, but I'm adding it here for more visibility.
First, I don't see a GPU framework that is tightly integrated with ggml being added anytime soon, because that usually comes with a lot of maintenance drawbacks, architecture changes and issues. However, there is an alternative approach that might be relatively easy to implement, and I think it would be a very cool way for new developers to join in and help.
Description
ggml produces computation graphs, which are basically directed acyclic graphs (DAGs) that can be easily exported, iterated, etc. A graph contains the information about all tensor operations and buffers needed to evaluate the model. The idea is to first add basic ggml functionality for exporting the graphs in some trivial text format that can be parsed as a second step by a separate ggml tool. Having the exported graphs, one can process them and construct hardware-specific code for evaluating them. This way, we keep implementing existing and new transformer models as we currently do - with a focus on CPU execution - but we gain the benefit of being able to export the computation graphs and translate them for GPU execution.
For example, a ggml-cuda tool can parse the exported graph and construct the necessary CUDA kernels and GPU buffers to evaluate it on an NVIDIA GPU. Another tool, for example ggml-mps, can do similar stuff but for Metal Performance Shaders. Or maybe even a ggml-webgpu tool.
This approach preserves the cross-platform nature of ggml and allows custom hardware support via compiler-like translation of the exported computation graphs.
Still, implementing the respective kernels for the targeted backend is the most difficult part and remains the biggest obstacle.
However, I think this decoupled approach would make the development process much easier and can potentially allow for some interesting optimizations. My biggest fear about adding a tightly integrated GPU backend to ggml is that I don't know the important details for supporting the respective backend, which could lead to bad software design decisions that in turn can have negative side effects even on the core CPU implementation.
With the approach proposed in this issue, we eliminate this risk and allow multiple independent implementations to be provided without any negative side effects on the core ggml implementation.
Another cool thing about this idea is that there could be separate leading developers for each backend.
So if you have good knowledge and understanding of a certain hardware architecture, you are one step away from initiating the kernel "translation" process and making a very significant contribution to the project.
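To illustrate the export-then-parse idea, here is a rough Python sketch of what a "trivial text format" round trip could look like. Everything in it is an assumption made for illustration: the one-line-per-node format, the Node fields, and the op names are not an actual ggml export format.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Node:
    """One tensor in the graph (hypothetical layout, not ggml's struct)."""
    name: str
    op: str                      # "input" marks a leaf tensor
    shape: Tuple[int, ...]
    srcs: Tuple[str, ...] = ()   # names of the source tensors


def export_graph(nodes: List[Node]) -> str:
    """Write one line per node: name op shape src0 src1 ..."""
    lines = []
    for n in nodes:
        shape = "x".join(str(d) for d in n.shape)
        lines.append(" ".join([n.name, n.op, shape, *n.srcs]))
    return "\n".join(lines) + "\n"


def parse_graph(text: str) -> List[Node]:
    """Read the format back, checking that every source is defined earlier."""
    nodes: List[Node] = []
    seen = set()
    for line in text.splitlines():
        if not line.strip():
            continue
        name, op, shape, *srcs = line.split()
        missing = [s for s in srcs if s not in seen]
        if missing:
            raise ValueError(f"{name} uses undefined tensors: {missing}")
        dims = tuple(int(d) for d in shape.split("x"))
        nodes.append(Node(name, op, dims, tuple(srcs)))
        seen.add(name)
    return nodes


graph = [
    Node("x", "input", (4096,)),
    Node("w", "input", (4096, 4096)),
    Node("y", "mul_mat", (4096,), ("w", "x")),
    Node("z", "relu", (4096,), ("y",)),
]
text = export_graph(graph)
print(text, end="")
assert parse_graph(text) == graph  # the format round-trips
```

Because the nodes are written in evaluation order, a backend tool can walk the parsed list front to back and emit one kernel launch per non-input node.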
Guiding principles
I don't know all the specifics of a good GPU implementation, but I believe one could try to adopt the fundamental principles of ggml.
For example, there could be a single memory buffer allocated, with all the tensors distributed within that buffer at certain offsets. Each graph operation will correspond to a kernel with source tensors as input and a destination tensor as output, all of which will be part of that single memory buffer allocated at the start of the execution.
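As a rough illustration of the single-buffer layout, the Python sketch below plans an aligned offset for each tensor, allocates one buffer up front, and runs a toy "kernel" that only touches views into that buffer. The 32-byte alignment, the tensor names, and the plan_offsets, view and add_kernel helpers are assumptions made for this example, with a bytearray standing in for device memory.

```python
from typing import Dict, List, Tuple


def plan_offsets(tensors: List[Tuple[str, int]],
                 align: int = 32) -> Tuple[Dict[str, int], int]:
    """Assign each (name, nbytes) tensor an aligned offset in one buffer."""
    offsets: Dict[str, int] = {}
    cursor = 0
    for name, nbytes in tensors:
        cursor = (cursor + align - 1) // align * align  # round up to alignment
        offsets[name] = cursor
        cursor += nbytes
    return offsets, cursor  # cursor is the total buffer size


# Three tensors of four float32 values each (16 bytes).
tensors = [("a", 16), ("b", 16), ("out", 16)]
sizes = dict(tensors)
offsets, total = plan_offsets(tensors)

buffer = bytearray(total)   # the single allocation made at start-up
mem = memoryview(buffer)


def view(name: str) -> memoryview:
    """Float32 view of one tensor inside the shared buffer (no copy)."""
    start = offsets[name]
    return mem[start:start + sizes[name]].cast("f")


def add_kernel(dst: str, src0: str, src1: str) -> None:
    """Toy stand-in for a GPU kernel: reads two sources, writes one destination."""
    d, x, y = view(dst), view(src0), view(src1)
    for i in range(len(d)):
        d[i] = x[i] + y[i]


for i, (va, vb) in enumerate(zip([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0])):
    view("a")[i] = va
    view("b")[i] = vb

add_kernel("out", "a", "b")
print(offsets, total)
print(list(view("out")))
```

On a real backend the offsets would be computed the same way, and each kernel would receive the base pointer plus the offsets of its source and destination tensors.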
Additionally, I think we don't need to explicitly add 3rd party dependencies (e.g. CUDA SDK, OpenCL, etc.) to ggml to achieve that. The new ggml translation tools will simply read a computation graph and generate code for a certain GPU backend, which will be up to the user to compile and run.
The existing CPU code for each tensor operation is your reference implementation. Ideally, you would always want to implement the same computation in the corresponding new kernel and after that, you can try to optimize it for the specifics of the hardware. This is especially true for the 4-bit kernels.
All computations and buffers remain on the GPU. Avoid back-and-forth copies of data to the CPU RAM at all costs.
Taking shortcuts and making custom hacks in favor of better performance is very welcome. "General-purpose" is "bad". For example, we can have a tool like ggml-cuda-llama which is a very custom ggml translator to the CUDA backend which works only with LLaMA graphs and nothing else, but does some very LLaMA-specific optimizations. This is fine.
Keep things minimalistic and don't over-engineer. For example, a CUDA translation tool will output a single C++ (or some other language) file with all the kernels and backend initialization code embedded in it. A simple C-style function for evaluation can be exported so that we can call this from other code bases. The actual translation tool should also be implemented as a single source file in a preferred language. (This guiding principle has to be defined a bit better, but we will figure it out as we go.)
The GPU "translators" will likely remain second-class citizens from ggml's point of view and they will need to adapt to the core CPU implementation - not the other way around.
Why?
Currently, ggml is one of the few ML frameworks that provides efficient 4-bit quantization and demonstrates effective application for quantized transformer evaluation. The code is compact and easily comprehensible, with very little bloat. I think ggml has a slight leading edge in this regard compared to other general-purpose frameworks, and if we utilize it now, it has the potential of becoming a very respectable machine learning framework in the future with a focus on on-device inference.
Note that there is a very large dose of "reinventing the wheel" in the outlined strategy. Therefore, if you want to get involved, it's very important to have the right mindset. Definitely do not approach this with: "this has already been done in another project", "we should do all those things that project X does" or "this is not going to scale well for all those reasons", etc.
I think the right mindset to approach this is: "let's try to hack something fast, small and cool and see where it goes"
Links
Thoughts about Inference at the edge
Starting point for exporting ggml graphs: .dot file of ggml_graph can not be generated to .png file #589 (comment)
Sample computation graph for single-layer LLaMA 7B:
Update 28 May 2023:
This is the pattern that we should follow and try to apply to LLM inference