
metal : bug with ggml_cont #3012

Closed · wants to merge 5 commits

Conversation

ggerganov (Member) commented Sep 4, 2023

This is supposed to work, but it breaks the results with Metal enabled:

diff --git a/llama.cpp b/llama.cpp
index c97c146..097de72 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -2418,11 +2418,11 @@ static struct ggml_cgraph * llm_build_llama(
 
             // split cached V into n_head heads
             struct ggml_tensor * V =
-                ggml_view_3d(ctx0, kv_self.v,
+                ggml_cont(ctx0, ggml_view_3d(ctx0, kv_self.v,
                         n_past + N, n_embd_head, n_head_kv,
                         ggml_element_size(kv_self.v)*n_ctx,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
-                        ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il);
+                        ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il));
             offload_func_v(V);
             ggml_set_name(V, "V");

We need to understand why this happens and fix it.
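For context, a minimal illustration (not code from this PR) of what ggml_cont() is expected to do: it materializes a possibly non-contiguous view into a contiguous tensor, so wrapping the ggml_view_3d() result should only change the memory layout, never the computed values. The tensor names below are made up and assume a ggml context ctx0:

// illustrative sketch: ggml_cont() makes a contiguous copy of a strided view
struct ggml_tensor * t  = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 64, 32);
struct ggml_tensor * tt = ggml_transpose(ctx0, t); // non-contiguous (strided) view
struct ggml_tensor * tc = ggml_cont(ctx0, tt);     // contiguous copy, same values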

@ggerganov added the "bug" and "high priority" labels on Sep 4, 2023
ggerganov (Member, Author) commented Sep 4, 2023

@lshzh-ww

Disabling the concurrency optimization makes it work, but I can't figure out what the problem is.
Any ideas?

I suspect it might be related to the recent change to tensor views, but I'm not sure: 06abf8e
cc @slaren for insights

ggerganov (Member, Author) commented:

This fixes it:

--- a/llama.cpp
+++ b/llama.cpp
@@ -2423,6 +2423,7 @@ static struct ggml_cgraph * llm_build_llama(
                         ggml_element_size(kv_self.v)*n_ctx,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il));
+            V->src[1] = KQ_soft_max;
             offload_func_v(V);
             ggml_set_name(V, "V");

It seems the concurrency detection does not take into account the indirect dependency of V on kv_self.v.
It treats kv_self.v as a leaf and basically schedules all GGML_OP_CONT nodes at the start of the graph execution.

What would be a good solution for this problem?
Should we finally obsolete ggml_build_forward() and add ggml_depend()?
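Purely as a sketch of what such a helper could look like (ggml_depend() does not exist in ggml at this point; the name and the use of src[1] are assumptions mirroring the manual V->src[1] = KQ_soft_max; workaround above):

#include "ggml.h"

// hypothetical sketch: record an explicit ordering constraint so that the
// concurrency pass cannot schedule `a` before `b` has been computed
static struct ggml_tensor * ggml_depend(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    (void) ctx; // kept only for symmetry with the other ggml_* builders

    // expose the dependency through a free source slot - the graph scheduler
    // only needs to see `b` among the sources of `a`
    a->src[1] = b;

    return a;
}

With such a helper, the graph code could write V = ggml_depend(ctx0, V, KQ_soft_max); instead of patching src[1] by hand.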

Comment on lines 543 to 544:
 for (int j = n_start; j < i; j++) {
-    if (nodes_unused[j] && gf->nodes[j]->op != GGML_OP_RESHAPE \
-                        && gf->nodes[j]->op != GGML_OP_VIEW \
-                        && gf->nodes[j]->op != GGML_OP_TRANSPOSE \
-                        && gf->nodes[j]->op != GGML_OP_PERMUTE) {
+    if (nodes_unused[j] && gf->nodes[j]->view_src == NULL) {
ggerganov (Member, Author) commented Sep 4, 2023:

I need confirmation that this is correct.

Btw, I see that check_mem is always set to false.
Are there cases where we would still need it?
If it's not needed, should we remove it?

slaren (Member) commented Sep 4, 2023

Makes sense. I had a similar problem when implementing parallel execution with CUDA, and at the time my solution was this:
https://github.com/ggerganov/llama.cpp/blob/d273bfd2c98148fab309f978c44ec2e8c24d1c4d/llama.cpp#L1294-L1305
And then using k and v instead of kv_self.

I don't really like the ggml_depend solution; I think that eventually we should find a way to do this automatically, but for now it should do.
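For readers following along, a rough sketch of that pattern (this is not the code at the linked lines; ctx0 and kv_self are the names used in llm_build_llama, while kv_store is an illustrative placeholder for the node that copies new keys/values into the cache):

// create per-graph views of the KV cache and give them an explicit dependency
// on the node that last wrote into the cache, then use `k`/`v` everywhere
// instead of touching kv_self.k / kv_self.v directly
struct ggml_tensor * k = ggml_view_tensor(ctx0, kv_self.k);
struct ggml_tensor * v = ggml_view_tensor(ctx0, kv_self.v);

k->src[1] = kv_store; // the write-before-read ordering is now visible to the scheduler
v->src[1] = kv_store;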

@ggerganov changed the title from "meta : bug with ggml_cont" to "metal : bug with ggml_cont" on Sep 4, 2023
@ggerganov removed the "high priority" label on Sep 4, 2023
@@ -5213,6 +5213,8 @@ struct ggml_tensor * ggml_view_tensor(
         result->nb[i] = src->nb[i];
     }
 
+    result->op = GGML_OP_VIEW;
ggerganov (Member, Author):

I added your proposal since it better expresses the KV cache dependency, until we figure out a better solution.

For this to work, though, we need to set the ggml_view_tensor op to GGML_OP_VIEW, as you recently noted. I guess this change is fine.

slaren (Member):

In this case, we should also set src[0] to src in ggml_view_tensor, and add the additional dependency as src[1] instead in llm_build_graph. Otherwise, it would break old code that assumes that the source is in src[0], such as this in ggml-cuda:
https://github.com/ggerganov/llama.cpp/blob/bd33e5ab92e7f214205792fc1cd9ca28e810f897/ggml-cuda.cu#L6539-L6544
ggml_graph_import may also be an issue, and I am not sure that it will import the second dependency in src[1] correctly, since it just creates a new tensor calling ggml_view_4d. Newer code should use view_src instead.
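A minimal sketch of how that arrangement could look (an assumption for illustration, not the final upstream change; view_tensor_with_dep is a hypothetical helper name):

#include "ggml.h"

// sketch: keep the viewed tensor in src[0] so older code that assumes the
// source is in src[0] (e.g. ggml-cuda) keeps working, and carry the extra
// ordering dependency in src[1]
static struct ggml_tensor * view_tensor_with_dep(
        struct ggml_context * ctx,
        struct ggml_tensor  * src,
        struct ggml_tensor  * dep) {
    struct ggml_tensor * result = ggml_view_tensor(ctx, src);

    result->op     = GGML_OP_VIEW; // mark it as a real view op
    result->src[0] = src;          // primary source stays in src[0]
    result->src[1] = dep;          // extra dependency (e.g. the node that wrote the KV cache)

    return result;
}

The graph-building code would then create its KV-cache views through a helper like this, passing the node that must complete first as dep.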

ggerganov (Member, Author):

Ah, good points. I'll come back to this later and work out a better solution.
I made an issue so we don't forget: ggml-org/ggml#502

YixinSong-e commented:

> Makes sense. I had a similar problem when implementing parallel execution with CUDA, and at the time my solution was this:
> https://github.com/ggerganov/llama.cpp/blob/d273bfd2c98148fab309f978c44ec2e8c24d1c4d/llama.cpp#L1294-L1305
> And then using k and v instead of kv_self.
> I don't really like the ggml_depend solution; I think that eventually we should find a way to do this automatically, but for now it should do.

Hello! Could you share your code for implementing parallel execution with CUDA?

cebtenzzre (Collaborator) commented:

> Hello! Could you share your code for implementing parallel execution with CUDA?

See PR #2239, which the linked code snippet is from.

@ggerganov closed this on Apr 29, 2024