metal : bug with ggml_cont #3012
Conversation
4a6ad90 to ad80e5a (Compare)
This fixes it:

```diff
--- a/llama.cpp
+++ b/llama.cpp
@@ -2423,6 +2423,7 @@ static struct ggml_cgraph * llm_build_llama(
                         ggml_element_size(kv_self.v)*n_ctx,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_head,
                         ggml_element_size(kv_self.v)*n_ctx*n_embd_gqa*il));
+            V->src[1] = KQ_soft_max;
             offload_func_v(V);
             ggml_set_name(V, "V");
```

It seems the concurrency detection does not take into account the indirect dependency of `V`, which is carried through the KV cache rather than through a `src` link. What would be a good solution for this problem?
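For context, here is a minimal sketch of why a purely `src[]`-based scan misses this case. The helper below is hypothetical and is not the actual `ggml-metal` concurrency code; it only illustrates that a dependency carried through the KV-cache buffer, rather than through a `src` pointer, is invisible to such a check:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical helper (not part of ggml): reports whether `node` lists `dep`
// among its direct sources. A scheduler built on checks like this never sees
// that V, a view of kv_self.v, must run after the ops that filled the cache,
// so it may reorder them unless the extra edge (src[1]) is set explicitly.
static bool depends_directly_on(const struct ggml_tensor * node,
                                const struct ggml_tensor * dep) {
    for (int i = 0; i < GGML_MAX_SRC; i++) {
        if (node->src[i] == dep) {
            return true;  // direct src link: dependency is visible
        }
    }
    return false;         // data dependency through kv_self.v is not represented
}
```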
```diff
     for (int j = n_start; j < i; j++) {
-        if (nodes_unused[j] && gf->nodes[j]->op != GGML_OP_RESHAPE \
-                            && gf->nodes[j]->op != GGML_OP_VIEW \
-                            && gf->nodes[j]->op != GGML_OP_TRANSPOSE \
-                            && gf->nodes[j]->op != GGML_OP_PERMUTE) {
+        if (nodes_unused[j] && gf->nodes[j]->view_src == NULL) {
```
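For what it's worth, the reasoning behind such a check (my reading, not a statement from the PR) is that every op the old condition enumerated produces a tensor that merely aliases its source's memory and records that source in `view_src`. A hypothetical helper built on the same idea:

```c
#include <stdbool.h>
#include "ggml.h"

// Hypothetical helper, not code from the PR: tensors created by
// GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_TRANSPOSE and GGML_OP_PERMUTE alias
// their source's memory and have view_src set, so testing that single field
// covers the same cases as enumerating the ops (and any future view-like op).
static bool owns_its_data(const struct ggml_tensor * t) {
    return t->view_src == NULL;
}
```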
Need confirmation if this is correct.

Btw, I see that `check_mem` is always set to `false`. Are there cases where we would still need it? If it's not needed, should we remove it?
Makes sense, I had a similar problem when implementing parallel execution with CUDA, and at the time my solution was this: [linked code snippet]. I don't really like the solution of …
```diff
@@ -5213,6 +5213,8 @@ struct ggml_tensor * ggml_view_tensor(
         result->nb[i] = src->nb[i];
     }

+    result->op = GGML_OP_VIEW;
```
I added your proposal since it better expresses the KV cache dependency, until we figure out a better solution. For this to work though, we need to set the `ggml_view_tensor` op to `GGML_OP_VIEW`, as you have recently noted. I guess this change is fine.
In this case, we should also set `src[0]` to `src` in `ggml_view_tensor`, and add the additional dependency as `src[1]` instead in `llm_build_graph`. Otherwise, it would break old code that assumes that the source is in `src[0]`, such as this in ggml-cuda:

https://github.com/ggerganov/llama.cpp/blob/bd33e5ab92e7f214205792fc1cd9ca28e810f897/ggml-cuda.cu#L6539-L6544

`ggml_graph_import` may also be an issue, and I am not sure that it will import the second dependency in `src[1]` correctly, since it just creates a new tensor by calling `ggml_view_4d`. Newer code should use `view_src` instead.
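To make the proposed layout concrete, here is a rough sketch; the wrapper name and the use of `ggml_view_tensor` are assumptions for illustration, not the actual llama.cpp build code:

```c
#include "ggml.h"

// Sketch of the convention under discussion: the view keeps its real source in
// src[0] (so older backends such as ggml-cuda that read src[0] keep working),
// and the graph-building code attaches the extra KV-cache ordering dependency
// in src[1] for the concurrency pass to see.
static struct ggml_tensor * view_with_dependency(
        struct ggml_context * ctx,
        struct ggml_tensor  * cache,        // e.g. kv_self.v
        struct ggml_tensor  * dependency) { // e.g. KQ_soft_max
    struct ggml_tensor * V = ggml_view_tensor(ctx, cache); // op = GGML_OP_VIEW, src[0] = cache
    V->src[1] = dependency;                                // explicit extra edge
    return V;
}
```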
Ah, good points. I guess I'll come back to this later and work out a better solution. Made an issue so we don't forget: ggml-org/ggml#502
Hello! Could you share your code for implementing parallel execution with CUDA?
See PR #2239, which the linked code snippet is from.
This is supposed to work, but it breaks the results with Metal enabled. Need to understand why and fix it.