
metal : bug with ggml_cont #3012

Closed · wants to merge 5 commits

Conversation

ggerganov (Member) commented Sep 4, 2023

This is supposed to work, but it breaks the results with Metal enabled:

diff --git a/llama.cpp b/llama.cpp
index c97c146..097de72 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -2418,11 +2418,11 @@ static struct ggml_cgraph * llm_build_llama(
 
             // split cached V into n_head heads
             struct ggml_tensor * V =
-                ggml_view_3d(ctx0, kv_self.v,
+                ggml_cont(ctx0, ggml_view_3d(ctx0, kv_self.v,
                         n_past + N, n_embd_head, n_head_kv,
                         ggml_element_size(kv_self.v)*n_ctx,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
-                        ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il);
+                        ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il));
             offload_func_v(V);
             ggml_set_name(V, "V");

We need to understand why this happens and fix it.
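For context, a minimal illustration (not code from this PR) of what ggml_cont() is expected to do: it materializes a possibly non-contiguous view into a contiguous tensor, so wrapping the ggml_view_3d() result should only change the memory layout, never the computed values. The tensor names below are made up and assume a ggml context ctx0:

// illustrative sketch: ggml_cont() makes a contiguous copy of a strided view
struct ggml_tensor * t  = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, 64, 32);
struct ggml_tensor * tt = ggml_transpose(ctx0, t); // non-contiguous (strided) view
struct ggml_tensor * tc = ggml_cont(ctx0, tt);     // contiguous copy, same values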

@ggerganov added the "bug" and "high priority" labels on Sep 4, 2023
ggerganov (Member, Author) commented Sep 4, 2023

@lshzh-ww

Disabling the concurrency optimization makes it work, but I can't figure out what the problem is.
Any ideas?

I suspect it might be related to the recent change to tensor views, but I'm not sure: 06abf8e
cc @slaren for insights

ggerganov (Member, Author) commented:

This fixes it:

--- a/llama.cpp
+++ b/llama.cpp
@@ -2423,6 +2423,7 @@ static struct ggml_cgraph * llm_build_llama(
                         ggml_element_size(kv_self.v)*n_ctx,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il));
+            V->src[1] = KQ_soft_max;
             offload_func_v(V);
             ggml_set_name(V, "V");

It seems the concurrency detection does not take into account the indirect dependency of V on kv_self.v.
It treats kv_self.v as a leaf and basically schedules all GGML_OP_CONT nodes at the start of the graph execution.

What would be a good solution for this problem?
Should we finally obsolete ggml_build_forward() and add ggml_depend()?
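Purely as a sketch of what such a helper could look like (ggml_depend() does not exist in ggml at this point; the name and the use of src[1] are assumptions mirroring the manual V->src[1] = KQ_soft_max; workaround above):

#include "ggml.h"

// hypothetical sketch: record an explicit ordering constraint so that the
// concurrency pass cannot schedule `a` before `b` has been computed
static struct ggml_tensor * ggml_depend(
        struct ggml_context * ctx,
        struct ggml_tensor  * a,
        struct ggml_tensor  * b) {
    (void) ctx; // kept only for symmetry with the other ggml_* builders

    // expose the dependency through a free source slot - the graph scheduler
    // only needs to see `b` among the sources of `a`
    a->src[1] = b;

    return a;
}

With such a helper, the graph code could write V = ggml_depend(ctx0, V, KQ_soft_max); instead of patching src[1] by hand.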

Comment on lines 543 to 544:
 for (int j = n_start; j < i; j++) {
-    if (nodes_unused[j] && gf->nodes[j]->op != GGML_OP_RESHAPE \
-                        && gf->nodes[j]->op != GGML_OP_VIEW \
-                        && gf->nodes[j]->op != GGML_OP_TRANSPOSE \
-                        && gf->nodes[j]->op != GGML_OP_PERMUTE) {
+    if (nodes_unused[j] && gf->nodes[j]->view_src == NULL) {
ggerganov (Member, Author) commented Sep 4, 2023:

I need confirmation that this is correct.

Btw, I see that check_mem is always set to false.
Are there cases where we would still need it?
If it's not needed, should we remove it?

slaren (Member) commented Sep 4, 2023

Makes sense. I had a similar problem when implementing parallel execution with CUDA, and at the time my solution was this:
https://github.com/ggerganov/llama.cpp/blob/d273bfd2c98148fab309f978c44ec2e8c24d1c4d/llama.cpp#L1294-L1305
And then using k and v instead of kv_self.

I don't really like the ggml_depend solution; I think that eventually we should find a way to do this automatically, but for now it should do.
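For readers following along, a rough sketch of that pattern (this is not the code at the linked lines; ctx0 and kv_self are the names used in llm_build_llama, while kv_store is an illustrative placeholder for the node that copies new keys/values into the cache):

// create per-graph views of the KV cache and give them an explicit dependency
// on the node that last wrote into the cache, then use `k`/`v` everywhere
// instead of touching kv_self.k / kv_self.v directly
struct ggml_tensor * k = ggml_view_tensor(ctx0, kv_self.k);
struct ggml_tensor * v = ggml_view_tensor(ctx0, kv_self.v);

k->src[1] = kv_store; // the write-before-read ordering is now visible to the scheduler
v->src[1] = kv_store;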

@ggerganov changed the title from "meta : bug with ggml_cont" to "metal : bug with ggml_cont" on Sep 4, 2023
@ggerganov removed the "high priority" label on Sep 4, 2023
@@ -5213,6 +5213,8 @@ struct ggml_tensor * ggml_view_tensor(
         result->nb[i] = src->nb[i];
     }
 
+    result->op = GGML_OP_VIEW;
ggerganov (Member, Author):

I added your proposal since it better expresses the KV cache dependency, until we figure out a better solution.

For this to work, though, we need to set the ggml_view_tensor op to GGML_OP_VIEW, as you recently noted. I guess this change is fine.

slaren (Member):

In this case, we should also set src[0] to src in ggml_view_tensor, and add the additional dependency as src[1] instead in llm_build_graph. Otherwise, it would break old code that assumes that the source is in src[0], such as this in ggml-cuda:
https://github.com/ggerganov/llama.cpp/blob/bd33e5ab92e7f214205792fc1cd9ca28e810f897/ggml-cuda.cu#L6539-L6544
ggml_graph_import may also be an issue, and I am not sure that it will import the second dependency in src[1] correctly, since it just creates a new tensor calling ggml_view_4d. Newer code should use view_src instead.
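A minimal sketch of how that arrangement could look (an assumption for illustration, not the final upstream change; view_tensor_with_dep is a hypothetical helper name):

#include "ggml.h"

// sketch: keep the viewed tensor in src[0] so older code that assumes the
// source is in src[0] (e.g. ggml-cuda) keeps working, and carry the extra
// ordering dependency in src[1]
static struct ggml_tensor * view_tensor_with_dep(
        struct ggml_context * ctx,
        struct ggml_tensor  * src,
        struct ggml_tensor  * dep) {
    struct ggml_tensor * result = ggml_view_tensor(ctx, src);

    result->op     = GGML_OP_VIEW; // mark it as a real view op
    result->src[0] = src;          // primary source stays in src[0]
    result->src[1] = dep;          // extra dependency (e.g. the node that wrote the KV cache)

    return result;
}

The graph-building code would then create its KV-cache views through a helper like this, passing the node that must complete first as dep.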

ggerganov (Member, Author):

Ah, good points. I'll come back to this later and work out a better solution.
I made an issue so we don't forget: ggml-org/ggml#502

YixinSong-e commented:

> Makes sense. I had a similar problem when implementing parallel execution with CUDA, and at the time my solution was this:
> https://github.com/ggerganov/llama.cpp/blob/d273bfd2c98148fab309f978c44ec2e8c24d1c4d/llama.cpp#L1294-L1305
> And then using k and v instead of kv_self.
> I don't really like the ggml_depend solution; I think that eventually we should find a way to do this automatically, but for now it should do.

Hello! Could you share your code for implementing parallel execution with CUDA?

cebtenzzre (Collaborator) commented:

> Hello! Could you share your code for implementing parallel execution with CUDA?

See PR #2239, which the linked code snippet is from.

@ggerganov closed this on Apr 29, 2024