ggml backends interface, ggml-cuda refactor #2239
Conversation
This sentence concerns me a bit: "For prompt processing, it is usually worth to upload the weights every time even when using partial offloading, but that's currently not supported. So prompt processing will use the same CPU/GPU layer split as generation. Eventually I would like to support this again."

Does that mean that when the PR is merged, people using partial offloading (like me, running only 10 layers on the GPU with a 13B model) will experience significantly longer prompt evaluation times? If that's the case, perhaps it would be better to wait until the performance is at least on the same level before merging. IMO llama.cpp's most important feature, the one that sets it apart from GPTQ and BNB, is the ability to run models too large for the hardware at great performance; it would be sad to lose that.

Then of course there is the concern from AMD and Intel GPU owners. OpenCL has been an important backend for them to get decent speed on their hardware. They would no longer benefit from CPU performance improvements and newer quantization formats in future releases if this PR gets merged, which would be a shame. Please keep in mind that the Vulkan backend might take a long time to develop, as it is a very complex task. If you are eager to supersede OpenCL in such a short time, perhaps you could lend a hand!

Aside from these concerns, I am very impressed by the awesome work you and others are doing, and I fully support this PR! I'm just pointing out the effects it may have on people enjoying the software.
I understand your concerns. I will add OpenCL support again in the same way that it works currently. Once we get closer to merging this, I will run some tests to see exactly what the performance impact on prompt processing speed is, and re-evaluate from there.

Don't get me wrong, I agree that these features are important, and I want to continue to support them. The issue is that PRs like this, which affect a lot of code, can have a significant maintenance overhead, because other changes merged in the meanwhile will often cause conflicts that must be resolved manually. Ultimately, it is a matter of whether we want to prioritize development speed or maintaining a stable master branch.
Weighing in on the issue of breaking things versus shipping this feature faster: I lean heavily towards shipping faster and fixing things afterwards. Developers should in general lock in on a commit (GPT4All does this) before this PR is merged, so that they have a stable version until performance is stabilized and bugs are squashed. The problem with trying to cater to these problems, like slower prompt processing or CPU offloading, is that it hinders long-term progress for a short-term benefit. On the plus side, this PR will enable a future where more models can be supported more easily thanks to the backend setup. The faster it's shipped, the faster we get performance optimizations and generally better tech to support our LLMs.
- renamed ggml_backend functions
- changed ggml_buffer and ggml_backend to always be used as pointers
- rename ggml_tensor::params -> op_params
@JohannesGaessler how confident are you in the VRAM scratch buffer size estimation for context sizes > 2048? I have been working on ggml-org/ggml#288 and I get these sizes for the compute buffer (n_batch = 512):
For smaller context sizes, the calculated memory is smaller (which is expected, as this should be more efficient than scratch buffers), but for larger contexts it is bigger. There may be bugs in my implementation, though perplexity seems fine.
I believe that @ikawrakow has also observed an issue with the currently estimated VRAM sizes for larger contexts (#2295; I haven't looked at the details yet, so I could be wrong). But I guess your sizes might actually be the correct ones.
See #2056. I tested the VRAM scratch sizes with a precision of 1 MB, meaning I determined the lowest scratch size that still produces correct results. I tested up to 8192 context where possible. I observed that the amount of VRAM required increases linearly with context. I chose the VRAM scratch sizes on master to be the minimum required scratch size +25% at 2048 context. It's possible that the implementation on master is suboptimal and thus needs more VRAM for that reason.
Thank you. It is not clear to me how you estimated the memory; it may be that we are using different parameters. What I do is construct a graph with these parameters and calculate the size of the compute buffer required to execute it:

```cpp
int n_tokens = std::min((int)hparams.n_ctx, params.n_batch);
int n_past   = hparams.n_ctx - n_tokens;
```

This should be equivalent to computing the last batch of a prompt of n_ctx tokens.
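For concreteness, here is a small worked example of those two parameters; the n_ctx and n_batch values below are illustrative, not measurements from this thread.

```c
#include <stdio.h>

int main(void) {
    // illustrative values only
    int n_ctx   = 8192;
    int n_batch = 512;

    // same computation as in the snippet above: the graph being sized corresponds to
    // evaluating the last batch of a full-context prompt, which is the worst case
    // for the compute buffer
    int n_tokens = n_ctx < n_batch ? n_ctx : n_batch; // 512
    int n_past   = n_ctx - n_tokens;                  // 7680

    printf("n_tokens = %d, n_past = %d\n", n_tokens, n_past);
    return 0;
}
```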
What I did is hard-code a size for the VRAM scratch buffer, compile the code, and then test whether the results are correct. The values in the tables in the linked PR are the minimum hard-coded values that still produced correct results.
I understand that; what I am wondering about is how exactly you ran the tests. If what you did is different from the computation I described above, then different results are expected.
If I remember correctly, I was calculating perplexity on the first 100 lines of Wikitext as a test.
Ok, that should be equivalent when
So it should be a reduction in memory usage over master in all cases.
ggml-backend.c (outdated)

```c
}

void ggml_graph_allocate_tensors_n(struct ggml_cgraph ** graphs, int n_graphs) {
}
static bool ggml_is_view(struct ggml_tensor * t) {
```
Is this name intentional? It seems confusing considering the collision with GGML_OP_VIEW.
I use "view" generally to mean any operation that shares the memory of its parent tensor. I think the name makes sense, but I am open to suggestions.
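To illustrate that definition, a minimal sketch of such a check could look like the following; the exact set of ops handled in the PR may differ, this only lists ggml ops whose output aliases the memory of a parent tensor.

```c
// sketch: "view" here means any op whose result shares the memory of its parent
// tensor; the PR's real ggml_is_view may cover a different set of ops
static bool ggml_is_view(struct ggml_tensor * t) {
    switch (t->op) {
        case GGML_OP_VIEW:
        case GGML_OP_RESHAPE:
        case GGML_OP_TRANSPOSE:
        case GGML_OP_PERMUTE:
            return true;
        default:
            return false;
    }
}
```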
```c
break;
}
// TODO: make a list of operations that can be safely made inplace
if (parent->data != NULL && parent->n_children == 1 && parent->n_views == 0 && ggml_are_same_layout(node, parent) && node->op != GGML_OP_MUL_MAT) {
```
It's not used in LLaMA, but GGML_CONV_2D should also not be done inplace.
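One possible way to address the TODO in the hunk above is an explicit whitelist instead of excluding ops one by one. A hedged sketch follows; the set of ops listed here is an assumption for illustration, not the PR's final list.

```c
// sketch of an explicit list of ops that are safe to compute inplace; ops such as
// GGML_OP_MUL_MAT and GGML_OP_CONV_2D read their inputs repeatedly while the output
// is being written, so they must not reuse an input buffer for the output
static bool ggml_op_can_inplace(enum ggml_op op) {
    switch (op) {
        case GGML_OP_ADD:
        case GGML_OP_SUB:
        case GGML_OP_MUL:
        case GGML_OP_SCALE:
        case GGML_OP_SOFT_MAX:
            return true;
        default:
            return false;
    }
}
```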
@ggerganov or anyone familiar with the training code: is there any way to tell which tensors must be kept for the backwards pass? Maybe checking if

llama.cpp/examples/train-text-from-scratch/train-text-from-scratch.cpp, lines 1604 to 1634 in 24baa54
If you keep all tensors with

For example, let's take a look at the backward pass for addition (lines 15187 to 15195 in 24baa54):

```
# x: src0
# y: src1
# z: tensor
# x': src0->grad
# y': src1->grad
# z': tensor->grad

# forward
z = x + y

# backward
x' += z'
y' += z'
```

I.e. to compute x' and y' we only need z'; the values of x, y and z themselves are not needed. In contrast, for multiplication (lines 15247 to 15263 in 24baa54):

```
# forward
z = x * y

# backward
x' += y*z'
y' += x*z'
```

So here we need x and y to compute the gradients. I think @xaedes has manually determined which tensors need to be kept. Not sure if there is a better solution except for building the backwards graph and checking which tensors from the forward pass are used.
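A rough sketch of that last idea (building the backward graph and flagging every forward-pass tensor it reads, so the allocator knows not to reuse their memory). It assumes the src0/src1 source fields of the ggml version of that time; the helper itself is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include "ggml.h"

// for every node of the forward graph gf, record whether any node of the backward
// graph gb reads it as a source; such tensors must not be freed by the allocator
static void mark_needed_for_backward(const struct ggml_cgraph * gf,
                                     const struct ggml_cgraph * gb,
                                     bool * needed /* size gf->n_nodes */) {
    for (int i = 0; i < gf->n_nodes; i++) {
        needed[i] = false;
    }
    for (int j = 0; j < gb->n_nodes; j++) {
        // src0/src1 assumed; newer ggml versions use a src[] array instead
        struct ggml_tensor * srcs[2] = { gb->nodes[j]->src0, gb->nodes[j]->src1 };
        for (int s = 0; s < 2; s++) {
            if (srcs[s] == NULL) {
                continue;
            }
            for (int i = 0; i < gf->n_nodes; i++) {
                if (gf->nodes[i] == srcs[s]) {
                    needed[i] = true;
                }
            }
        }
    }
}
```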
If it is determined manually, I could add a parameter to the allocator to specify the list of tensors that must not be freed. That should achieve the same goal of minimizing memory usage, while still keeping the data of the important tensors. It could possibly be a list in
This would be interesting for a future refactor of the training code, but I don't understand it well enough to do this at this point.
Ok, we can do that for now - it should be better than the current approach, though it might be possible to improve it even further in the future to do everything automatically. Remind me, at what point does the allocator run - during
Currently, it is called manually, but the idea is that it should be done automatically after building the graph. Essentially, when you create a context, you can specify the tensor allocation mode. For the graph allocator, see lines 1192 to 1197 in d273bfd.
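As a caller-side sketch of the keep-list idea discussed above: ggml_graph_allocate_tensors_keep is a hypothetical variant of the allocator entry point, modeled on the ggml_graph_allocate_tensors_n signature visible in this PR; it is not an actual function.

```c
#include <stddef.h>
#include "ggml.h"

// hypothetical: like the graph allocator in this PR, but the listed tensors are never
// freed or overwritten (e.g. forward-pass tensors needed by the backward pass)
void ggml_graph_allocate_tensors_keep(struct ggml_cgraph * graph,
                                      struct ggml_tensor ** keep, int n_keep);

// possible flow for a training step under that assumption
void example_training_step(struct ggml_context * ctx, struct ggml_tensor * loss) {
    struct ggml_cgraph gf = ggml_build_forward(loss);
    struct ggml_cgraph gb = ggml_build_backward(ctx, &gf, /*keep =*/ true);

    // placeholder keep-list; in practice it would contain the forward tensors that gb
    // reads, determined manually or automatically as sketched earlier in the thread
    struct ggml_tensor * keep[] = { loss };
    ggml_graph_allocate_tensors_keep(&gf, keep, 1);

    (void) gb; // gb would be allocated and computed afterwards
}
```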
Closing this as it is too outdated to continue from this point.
Previous discussion: #2230
This PR adds a common interface to the compute backends.
Breaking changes
- ggml_context allocates memory from a ggml_buffer that contains a buffer in device memory for the tensor data, and a buffer in system memory for the tensor structs (and in the future also other data such as the graphs)
- The data member of ggml_tensor is a backend-specific pointer that should not be accessed directly. To access the data, ggml_backend_set_tensor and ggml_backend_get_tensor must be used instead. Functions such as ggml_new_f32 and ggml_set_f32 can also be used as before. It is still possible to access the data member directly, but you shouldn't do that if you want to support other backends.
- Adds a params buffer to ggml_tensor for the op parameters that are currently stored in a tensor. For example, for ggml_rope this buffer is used to store the values n_past, n_dims, mode, n_ctx. The goal is to make these parameters easily accessible from the CPU, and reduce the overhead of creating a new tensor for them.

Brief example:
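A minimal sketch of what usage of the new interface might look like, based only on the description above; the constructor names ggml_backend_cpu_init and ggml_buffer_alloc, the ggml_init_params layout, and the exact set/get signatures are assumptions rather than the PR's confirmed API.

```c
#include "ggml.h"
#include "ggml-backend.h"

int main(void) {
    // create a backend and a buffer: tensor data lives in backend (device) memory,
    // the tensor structs live in system memory (constructor names are assumptions)
    struct ggml_backend * backend = ggml_backend_cpu_init();
    struct ggml_buffer  * buffer  = ggml_buffer_alloc(backend, /*size =*/ 1024*1024, /*max tensors =*/ 16);

    // contexts now allocate from a ggml_buffer instead of a plain memory block
    struct ggml_init_params params = { /*.buffer =*/ buffer };
    struct ggml_context * ctx = ggml_init(params);

    // do not access t->data directly; go through the backend instead
    struct ggml_tensor * t = ggml_new_tensor_1d(ctx, GGML_TYPE_F32, 4);
    const float values[4] = { 1.0f, 2.0f, 3.0f, 4.0f };
    ggml_backend_set_tensor(t, values, 0, sizeof(values));

    float out[4];
    ggml_backend_get_tensor(t, out, 0, sizeof(out));

    ggml_free(ctx);
    return 0;
}
```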
Backend implementation
Backends should implement the functions defined in the ggml_backend_interface struct. Currently there are implementations for the CPU and CUDA backends.
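For orientation, a backend interface of this kind is typically a table of function pointers along the following lines; this is a hedged sketch, and the actual members of ggml_backend_interface in this PR may differ.

```c
#include <stddef.h>
#include "ggml.h" // struct ggml_tensor, struct ggml_cgraph

struct ggml_backend; // defined by the PR; only used through pointers here
struct ggml_buffer;

// illustrative function-pointer table a backend would fill in
// (member names and signatures are assumptions, not the PR's exact interface)
struct ggml_backend_interface_sketch {
    const char *         (*get_name)     (struct ggml_backend * backend);
    void                 (*free)         (struct ggml_backend * backend);

    // buffer management: allocate/free device memory for tensor data
    struct ggml_buffer * (*alloc_buffer) (struct ggml_backend * backend, size_t size);
    void                 (*free_buffer)  (struct ggml_backend * backend, struct ggml_buffer * buffer);

    // tensor data access: what ggml_backend_set_tensor / ggml_backend_get_tensor dispatch to
    void                 (*set_tensor)   (struct ggml_backend * backend, struct ggml_tensor * tensor, const void * data, size_t offset, size_t size);
    void                 (*get_tensor)   (struct ggml_backend * backend, const struct ggml_tensor * tensor, void * data, size_t offset, size_t size);

    // graph computation
    void                 (*graph_compute)(struct ggml_backend * backend, struct ggml_cgraph * cgraph);
};
```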
Computation using multiple backends

It is still possible to offload some parts of the graph to the GPU while keeping others on the CPU. This is done using ggml_graph_splits. See the llama.cpp code for an example; will update this later with more details.

Notes/limitations