ggml backends interface, ggml-cuda refactor #2230
Conversation
refactor ggml-cuda
I'll try to review this PR regularly but feel free to @ me if you want me to look at something in particular.
LDFLAGS += -lcublas -lculibos -lcudart -lcublasLt -lpthread -ldl -lrt -L/usr/local/cuda/lib64 -L/opt/cuda/lib64 -L$(CUDA_PATH)/targets/x86_64-linux/lib
OBJS += ggml-cuda.o
NVCC = nvcc
NVCCFLAGS = --forward-unknown-to-host-compiler
ifdef LLAMA_DEBUG
NVCCFLAGS += -lineinfo
Depending on what the intended use of -lineinfo is, it could be added unconditionally (I don't think it makes a difference for performance). When I use Nsight Compute I typically use it without LLAMA_DEBUG, since LLAMA_DEBUG also affects compiler optimizations.
}

tensor->data = (char*)cpu_buffer->data + cpu_buffer->offset;
cpu_buffer->offset = aligned_offset(cpu_buffer->data, cpu_buffer->offset + ggml_nbytes(tensor), TENSOR_ALIGNMENT);
I don't know the GCC equivalent but CUDA has a compiler hint that lets you specify memory alignment. Perhaps something like this could be useful?
That looks interesting, but I imagine that you would have to use it in the kernels for the compiler to actually notice anything. It may be worth checking if adding that to all the data pointers in the kernels improves performance. The tensor allocator in the ggml-cuda backend always aligns pointers to 128 bytes, so it is safe to assume that tensor data pointers are aligned to at least that.
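For illustration, a minimal sketch of the kind of hint being discussed, assuming __builtin_assume_aligned (which both GCC and recent nvcc accept in device code) and relying on the 128-byte alignment guaranteed by the allocator; the kernel itself is a made-up example, not code from this PR:

__global__ void scale_f32(const float * x, float * dst, const float scale, const int k) {
    // the tensor allocator guarantees at least 128-byte alignment, so the hint is safe
    const float * xa = (const float *) __builtin_assume_aligned(x,   128);
    float       * da = (float       *) __builtin_assume_aligned(dst, 128);

    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    da[i] = scale*xa[i];
}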
About using this in the CPU code: there are AVX instructions both for aligned and unaligned loads and stores. We always use the unaligned instructions because, in practice, on current CPUs it doesn't seem to make much of a difference in performance, and depending on the types and row sizes the data may not always be aligned.
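For reference, a standalone host-side sketch (assuming AVX is available; not ggml code) showing the unaligned load/store intrinsics that are used versus the aligned variants:

#include <immintrin.h>

void scale_row(const float * x, float * y, float s, int n) {
    const __m256 vs = _mm256_set1_ps(s);
    for (int i = 0; i + 8 <= n; i += 8) {
        __m256 vx = _mm256_loadu_ps(x + i);             // unaligned load: works for any address
        _mm256_storeu_ps(y + i, _mm256_mul_ps(vx, vs)); // unaligned store
        // _mm256_load_ps / _mm256_store_ps would require 32-byte aligned pointers
    }
    for (int i = n - (n % 8); i < n; ++i) {
        y[i] = s * x[i];                                // scalar tail
    }
}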
You could maybe add __forceinline__ to the helper functions, but I would assume the compiler does it anyway.
I am not convinced that it makes a difference; I trust that the compiler is smart enough to inline device functions. So unless there is a measurable difference, my preference is to leave these decisions to the compiler.
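For reference, this is the kind of annotation being discussed, shown on an illustrative device helper (not the PR's code):

// __forceinline__ asks nvcc to inline the helper; in practice the compiler
// already inlines small __device__ functions like this on its own.
static __device__ __forceinline__ float warp_reduce_sum(float x) {
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        x += __shfl_xor_sync(0xffffffff, x, offset, 32);
    }
    return x;
}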
// reduce warps
T warp_reduction = warp_reduce_all<op_t>(val);

__syncthreads();
I don't think synchronization is needed here. Each warp writes to a different location, so there should be no data race.
There is a weird data race that can happen here; I think it is related to the read of lane_result later. I am not sure if any of the kernels here are affected by this, but it was an issue in a different kernel that is not here yet.
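For context, a generic sketch of the block-wide reduction pattern being discussed (the name lane_result is taken from the comment above; the rest is illustrative, not the PR's warp_reduce_all code). The __syncthreads() is what makes the per-warp writes visible before lane_result is read:

template<typename T>
__device__ T block_reduce_sum(T val) {
    __shared__ T lane_result[32];                 // one partial result per warp

    const int lane_id = threadIdx.x % 32;
    const int warp_id = threadIdx.x / 32;

    // 1. reduce within each warp using shuffles
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset, 32);
    }

    // 2. lane 0 of each warp writes its partial result to shared memory
    if (lane_id == 0) {
        lane_result[warp_id] = val;
    }
    __syncthreads();   // order the writes above before any thread reads lane_result

    // 3. the first warp reduces the per-warp partial results
    //    (assumes blockDim.x is a multiple of 32; the final sum ends up in warp 0)
    val = (threadIdx.x < blockDim.x / 32) ? lane_result[lane_id] : (T) 0;
    #pragma unroll
    for (int offset = 16; offset > 0; offset >>= 1) {
        val += __shfl_xor_sync(0xffffffff, val, offset, 32);
    }
    return val;
}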
const int num_blocks = (k + CUDA_QUANTIZE_BLOCK_SIZE - 1) / CUDA_QUANTIZE_BLOCK_SIZE;
quantize_q8_1<<<num_blocks, CUDA_QUANTIZE_BLOCK_SIZE, 0, stream>>>(x, vy, ndata, k);
quantize_q8_1<<<num_blocks, CUDA_QUANTIZE_BLOCK_SIZE, 0, stream>>>(x, vy, k);
Remember to re-add the 3b fix added in #2144.
ggml_type_name(t0), ggml_type_name(t1), ggml_type_name(t2));
}

GGML_ASSERT(dispatch.d[t0][t1][t2] && "Unsupported type combination");
What does this check do compared to the previous one?
The previous check prints the types so that I can see what the problem is, and it will eventually be removed. The GGML_ASSERT is there to crash the program when that happens, and it will remain. I will probably change this part significantly to account for a different number of arguments to ops.
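For context, a rough sketch of the dispatch pattern shown in the diff; the helper name, function-pointer type, and struct name below are hypothetical, and only dispatch.d, GGML_ASSERT, and ggml_type_name come from the quoted code:

#include <stdio.h>
#include "ggml.h"

// hypothetical function pointer type for a CUDA op implementation
typedef void (*cuda_op_fn_t)(const struct ggml_tensor * src0, const struct ggml_tensor * src1, struct ggml_tensor * dst);

// hypothetical dispatch table indexed by the operand types
struct cuda_op_dispatch {
    cuda_op_fn_t d[GGML_TYPE_COUNT][GGML_TYPE_COUNT][GGML_TYPE_COUNT];
};

static cuda_op_fn_t cuda_op_lookup(const struct cuda_op_dispatch * dispatch,
                                   enum ggml_type t0, enum ggml_type t1, enum ggml_type t2) {
    if (dispatch->d[t0][t1][t2] == NULL) {
        // temporary diagnostic: report the offending combination before crashing
        fprintf(stderr, "unsupported type combination: %s %s %s\n",
            ggml_type_name(t0), ggml_type_name(t1), ggml_type_name(t2));
    }
    GGML_ASSERT(dispatch->d[t0][t1][t2] && "Unsupported type combination");
    return dispatch->d[t0][t1][t2];
}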
GGML_ASSERT(dst->backend != GGML_BACKEND_GPU_SPLIT);
GGML_ASSERT(!use_src1 || src1->backend != GGML_BACKEND_GPU_SPLIT);
Do you intend to actually allow the use of split tensors for src1 and dst?
Yes, all of that code has been removed for now for simplicity, but I will add it again. I think I will do this with a special type of ggml_buffer that allocates split tensors.
What is the exact design for the VRAM scratch buffer? Is it manually reset, as is the case for the RAM scratch buffers?
This will be addressed as proposed here: ggml-org/ggml#288
After that, scratch buffers will not be necessary, since the compute buffers will already be as small as they can be.
Updated the description with more details.
Do you mean to say that you will implement it? I'm asking because I was thinking of doing this myself and would like to avoid duplicate work.
@JohannesGaessler if you want to do it yourself, that would be great! I have plenty to do already.
Please correct me if I'm wrong, but it looks like the OpenCL backend has been removed in this refactor. Do you have a plan on how to accommodate backends in development and "partial backends", so to say?
@0cc4m I covered it briefly in the description:
For now, this is not supported, but I would like to support it eventually.
It may be replaced eventually by a Vulkan backend, or it may stay alongside it. But even the Vulkan backend will stay partial for quite a while. Is your plan to implement this fallback before this PR is ready to merge?
I will think about it. My plan is to do it later, as the scope of this PR is already quite large as it is, but maybe I can support that case without too many changes.
Removing other backends in the name of a CUDA refactor is not acceptable. I support the goal of this PR, but it has to be less disruptive.
Mmmh, the idea really makes a lot of sense and I like it, but I feel like different implementations may have different needs that aren't covered by this and are still going to fall back to
@niansa I would like to avoid that, which is why I am asking for your feedback. If there is something that isn't covered by this API, please let me know.
I think as of right now the OpenCL implementation still has all data except for the weights in RAM. In other words, the results get copied to RAM after each operation. Would it be possible to just forward the non-GPU-accelerated tensors to the CPU backend for the OpenCL backend?
Fallback to the CPU would work exactly like that; the hard part is doing it in a way that doesn't litter the user code with backend-specific details and also keeps the implementation of the backends as simple as possible.
OpenCL is currently relied on by AMD and Intel users for speed increases using their dedicated GPUs. Removing it would block off entire vendors on Windows and should not be considered acceptable. Users with slower CPUs especially would be stuck on older versions of the project. We have a lot of AMD users currently relying on llama.cpp to use their GPU for AI as an easy solution.
What would prevent us from keeping the status quo for OpenCL? From what I can tell, the entry points for the OpenCL kernels are still in the corresponding kernels in
It could work if we keep two versions of the model loading code in llama.cpp, one for the new interface and another for OpenCL. It's something I would prefer to avoid, since ultimately the goal is to simplify the code and remove as many of the special cases as possible, but as a temporary solution it may be better than removing the OpenCL backend.
It would be interesting to see performance on native Linux, and also native Windows. For me, under WSL with an RTX 3080, this is significantly faster than master. I think this is mainly because it requires less synchronization, so kernel launches can be queued, and this helps hide the latency of launching a kernel. I expect that the effect will be minimal, if any, under native Linux, though.
7B q4_0:
Maintaining two versions of the model loading code in llama.cpp may not be ideal, but it would prevent disruptions and provide a smoother transition instead of leaving OpenCL users stuck with older versions of the project.
With all layers offloaded there should only be 2 calls to
The concurrent kernel execution is not enabled currently (see lines 1 to 2 in 83595ec). This is because it currently creates way too many streams and events, and it may actually harm performance. But it will help once that is fixed.
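As a rough sketch of the mechanism (illustrative kernels, not ggml-cuda code): independent branches of a graph are queued on separate streams so their kernels can overlap, and an event lets the joining kernel wait for both without synchronizing the host:

#include <cuda_runtime.h>

__global__ void add_one(float * x, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] += 1.0f;
}

__global__ void join(const float * a, const float * b, float * dst, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) dst[i] = a[i] + b[i];
}

void run_branches(float * a, float * b, float * dst, int n) {
    cudaStream_t s0, s1;
    cudaEvent_t  done_b;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);
    cudaEventCreateWithFlags(&done_b, cudaEventDisableTiming);

    const int block = 256;
    const int grid  = (n + block - 1) / block;

    add_one<<<grid, block, 0, s0>>>(a, n);   // branch A on stream 0
    add_one<<<grid, block, 0, s1>>>(b, n);   // branch B on stream 1, may overlap with A

    cudaEventRecord(done_b, s1);             // mark completion of branch B
    cudaStreamWaitEvent(s0, done_b, 0);      // stream 0 waits for branch B before the join
    join<<<grid, block, 0, s0>>>(a, b, dst, n);

    cudaStreamSynchronize(s0);
    cudaEventDestroy(done_b);
    cudaStreamDestroy(s0);
    cudaStreamDestroy(s1);
}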
Regardless of the reasons, simple testing seems to confirm that the refactor provides a speedup.
Edit: my numbers are for native Linux 6.3.5-2.
I forgot to mention: I can't compile the PR unless I modify it. The problem seems to be line 17 in
Compilation log
Weird, what version of the CUDA Toolkit are you using? It works for me with 12.2:
I'm on version 12.1:
Interestingly, it looks like it was only made
Anyway, it should be fixed now.
I know the ROCm port is not merged, but I also see that it would completely break CUDA on AMD GPUs through ROCm, because the hipBLAS API doesn't support CUBLAS_COMPUTE_16F or CUBLAS_COMPUTE_32F_FAST_TF32.
Just define them to
I will look into migrating the Metal and MPI backends to fit into the proposed backend interface soon and will open a PR to this one.
@slaren Might be a good idea to move the branch to this repo, so that anyone who wants to propose a change can PR it here - otherwise we'll have to keep an eye out for PRs into your fork.
I started implementing a custom cloud CI yesterday, and I hope I am able to finish it today. After that I will start looking into this work in more detail.
Edit: on second thought, maybe keep the PR as it is, because we would lose the discussion if you recreate the PR. Whatever you decide is better.
I think it is more important to keep all the PRs here, so I'll open a new PR to move the branch to this repo. I'll add a link to this PR so that the current discussion isn't lost.
Continued in #2239
This PR adds a common interface to the compute backends.
Breaking changes
- ggml_context allocates memory from a ggml_buffer that contains a buffer in device memory for the tensor data, and a buffer in system memory for the tensor structs (and in the future also other data such as the graphs)
- The data member of ggml_tensor is a backend-specific pointer that should not be accessed directly. To access the data, ggml_backend_set_tensor and ggml_backend_get_tensor must be used instead. Functions such as ggml_new_f32 and ggml_set_f32 can also be used as before. With the CPU backend it is still possible to access the data member directly, but you shouldn't do that if you want to support other backends
- Added a params buffer to ggml_tensor for the op parameters that currently are stored in a tensor. For example, for ggml_rope this buffer is used to store the values n_past, n_dims, mode, n_ctx. The goal is to make these parameters easily accessible from the CPU, and reduce the overhead of creating a new tensor for them.

Brief example:
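A minimal sketch of the flow described above. ggml_backend_set_tensor / ggml_backend_get_tensor are the accessors named in this description; the backend and buffer initialization helpers, their signatures, the header name, and the buffer field of ggml_init_params are assumptions about this branch, not confirmed API:

#include "ggml.h"
#include "ggml-backend.h"   // assumed header for the backend interface

int main(void) {
    // initialize a backend and allocate a ggml_buffer from it (helper names assumed)
    struct ggml_backend * backend = ggml_backend_cpu_init();
    struct ggml_buffer  * buffer  = ggml_backend_alloc_buffer(backend, 16*1024*1024);

    // the context now allocates tensors from the buffer (buffer field assumed)
    struct ggml_init_params params = { 0 };
    params.buffer = buffer;
    struct ggml_context * ctx = ggml_init(params);

    // create a tensor; its data lives in the backend's buffer
    struct ggml_tensor * a = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);

    // tensor->data is backend-specific, so use the accessors instead of writing to it
    const float values[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    ggml_backend_set_tensor(a, values, 0, sizeof(values));

    float out[4];
    ggml_backend_get_tensor(a, out, 0, sizeof(out));

    ggml_free(ctx);
    return 0;
}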
Backend implementation
Backends should implement the functions defined in the ggml_backend_interface struct. Currently there are implementations for the CPU and CUDA backends.
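Purely as an illustration of what a function-pointer backend interface of this kind tends to look like; the member names below are invented for the example and are not the actual fields of ggml_backend_interface:

#include <stddef.h>
#include "ggml.h"

struct ggml_backend;   // opaque backend handle
struct ggml_buffer;    // opaque device buffer

struct ggml_backend_interface_example {
    const char * (*get_name)    (struct ggml_backend * backend);
    void         (*free_backend)(struct ggml_backend * backend);

    // device memory management for tensor data
    struct ggml_buffer * (*alloc_buffer)(struct ggml_backend * backend, size_t size);
    void                 (*free_buffer) (struct ggml_backend * backend, struct ggml_buffer * buffer);

    // host <-> device data transfer, backing the set/get tensor accessors
    void (*set_tensor)(struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
    void (*get_tensor)(const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);

    // run a graph (or a split of one) on this backend
    void (*graph_compute)(struct ggml_backend * backend, struct ggml_cgraph * cgraph);
};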
Computation using multiple backends
It is still possible to offload some parts of the graph to the GPU while keeping others on the CPU. This is done using ggml_graph_splits. See the llama.cpp code for an example; I will update this later with more details.

Notes/limitations