ggml : add GPU support for Mamba models #6758

Open
ggerganov opened this issue Apr 19, 2024 · 32 comments
Labels
enhancement (New feature or request) · help wanted (Extra attention is needed) · Nvidia GPU (Issues specific to Nvidia GPUs)

Comments

@ggerganov
Owner

ggerganov commented Apr 19, 2024

Recently, initial Mamba support (CPU-only) has been introduced in #5328 by @compilade

In order to support running these models efficiently on the GPU, we seem to be lacking kernel implementations for the following 2 ops:

  • GGML_OP_SSM_CONV
  • GGML_OP_SSM_SCAN

Creating this issue to keep track of this and to give the feature more visibility. Help with implementing the missing kernels for CUDA and Metal (and potentially other backends) is welcome. We can also discuss whether anything else is required to better support this architecture in llama.cpp

@ggerganov ggerganov added the enhancement, help wanted and Nvidia GPU labels Apr 19, 2024
@ggerganov ggerganov moved this to Todo in ggml : roadmap Apr 19, 2024
@jploski
Contributor

jploski commented Jun 1, 2024

I tried to add both of these operations (in a very naive form: just the boilerplate plus a copy/paste of the CPU implementation, with memcpy replaced by for loops, and using 1 block with WARP_SIZE==32 threads) here:

jploski@677ad0a

While it works in the sense that the implementation generates plausible-looking output, it differs from the CPU version. The initial prompt seems to be processed incorrectly: the first generated token already looks off, and the generation is not influenced by the content of the prompt. Furthermore, changing the batch size (the -b parameter) affects the generation (it shouldn't). So maybe something else needs to be improved besides the new ops implementation.
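For readers who want to picture what such a naive port looks like, here is a minimal sketch of the single-token case, assuming the same per-row loop and tensor layout as the CPU implementation quoted further down in this thread; it is not the actual commit, and the parameter names are illustrative:

// Naive single-block port of the CPU ssm_scan loop: one block, WARP_SIZE (32)
// threads, each thread handling a strided subset of the d_inner rows.
// Assumed layout: s/A are {d_state, d_inner} row-major, x/dt/y are {d_inner},
// B/C are {d_state} for the single token being processed.
__global__ void ssm_scan_naive_f32(
        const float * s_in, const float * x, const float * dt,
        const float * A, const float * B, const float * C,
        float * s_out, float * y,
        const int d_state, const int d_inner) {
    for (int i1 = threadIdx.x; i1 < d_inner; i1 += blockDim.x) {
        const float dt_soft_plus = dt[i1] <= 20.0f ? log1pf(expf(dt[i1])) : dt[i1];
        const float x_dt = x[i1] * dt_soft_plus;
        float sumf = 0.0f;
        for (int i0 = 0; i0 < d_state; ++i0) {
            const int i = i0 + i1*d_state;
            // state = prev_state * dA + dB * x
            const float state = (s_in[i] * expf(dt_soft_plus * A[i])) + (B[i0] * x_dt);
            // y = rowwise_dotprod(state, C)
            sumf += state * C[i0];
            s_out[i] = state;
        }
        y[i1] = sumf;
    }
}

Launched as ssm_scan_naive_f32<<<1, 32>>>(...), this mirrors the CPU code almost line for line, which is also why it leaves most of the GPU idle.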

Sample correct output from CPU version:

/mnt/seagate/dalai/llama.cpp.bak/bin/main -m /mnt/f2fs/mamba/tinystories.gguf -t 1 -p 'Sara and Ben are playing in the snow.' --temp 0.0 -c 1024 --keep -1 --repeat_penalty 1.3 -n 1024
Sara and Ben are playing in the snow. They make a big snowman with a hat, a scarf and a carrot nose. They put on their warm clothes and go inside to get some cold chocolate.
"Look at my snowman!" Sara says. "He is so cool! He can talk like us."
Ben looks at his snowman and smiles. He likes the hat, the scarf and the carrot nose too. He wants to make a new friend.
Sara nods and takes her hat from the closet. She puts it on Ben's head and says, "Hello, Mr. Snowman! Do you want to play with me?"
Ben looks at Sara and smiles. He likes his snowman. He says, "Yes, Mr. Snowman! I like your hat, scarf and carrot nose too."
They hug each other and play together in the snow. They make a big snowman, a blue snowman, a red snowman and a yellow snowman. They are happy to have fun with their new friend. [end of text]

llama_print_timings:        load time =     299.08 ms
llama_print_timings:      sample time =     167.65 ms /   206 runs   (    0.81 ms per token,  1228.74 tokens per second)
llama_print_timings: prompt eval time =     123.71 ms /    11 tokens (   11.25 ms per token,    88.92 tokens per second)
llama_print_timings:        eval time =    5023.01 ms /   205 runs   (   24.50 ms per token,    40.81 tokens per second)
llama_print_timings:       total time =    5394.42 ms /   216 tokens

Sample incorrect output from the GPU version:

Sara and Ben are playing in the snow. not too sure to eat it.swed wasmometimes been
"I am so happy when a room is gone," was were, into them witholidated. They had done something for their family.
They were out at the park andsled down together best best of amidt- covari.
The sun was in his eye not long before they were been to one. He wasmnoted to each other and this is where he wasered been been been been been been been been been done.
He had been been beby, so to all who have been been there been been been been been been taken out of them best part onlys in the long- march with us.
The end. [end of text]

llama_print_timings:        load time =    1359.73 ms
llama_print_timings:      sample time =     117.06 ms /   147 runs   (    0.80 ms per token,  1255.71 tokens per second)
llama_print_timings: prompt eval time =      52.04 ms /    11 tokens (    4.73 ms per token,   211.40 tokens per second)
llama_print_timings:        eval time =     911.71 ms /   146 runs   (    6.24 ms per token,   160.14 tokens per second)
llama_print_timings:       total time =    1135.02 ms /   157 tokens

The GPU output looks especially mangled in the example above, but with a single-token prompt consisting of a space it almost looks "normal" for this model, which makes me suspect that maybe it's something unrelated to the new CUDA kernels:

 avorable, a little girl was walking in the park. She saw something strange and she wanted to take a look. She walked closer and saw that it was a big, round, green frog! The frog hopped around and said "Hello!"
The girl was so surprised but also excited. She asked the frog if he could help her. The frog smiled and said "Of course I can help you!" He took out some food from his bag and gave it to the girl.
"Thank you very much," she said with a big smile on her face. The frog then hopped away, but he was still there when she got home.
The girl was so happy that she had helped the frog. She thanked him again and went back to playing in the park. [end of text]

(The same also happens when I reduce the CUDA kernels to use just a single thread rather than 32.)

@jploski
Contributor

jploski commented Jun 1, 2024

Another observation: with one offloaded layer (-ngl 1) I get the same output as for -ngl 0 (= CPU). But as the number of offloaded layers is increased, the output changes each time.

@slaren
Collaborator

slaren commented Jun 1, 2024

You can add tests for the ops in test-backend-ops to compare the GPU and CPU implementations.

@jploski
Contributor

jploski commented Jun 1, 2024

You can add tests for the ops in test-backend-ops to compare the GPU and CPU implementations.

Thanks. I added a test for each of the new ops here: 35f2f86

The CPU vs. GPU comparison ("test-backend-ops test -o SSM_CONV", "test-backend-ops test -o SSM_SCAN") passes. Maybe because of non-representative input. I set "sq", whatever it means, to zero because initializing it to any other value triggered an assertion on the CPU backend. (I should mention that I have little clue about how the Mamba algorithm works in theory or practice; I just attempted a blind port FWIW.)

@compilade
Collaborator

@jploski Thank you for doing this!

I set "sq", whatever it means, to zero because initializing it to any other value triggered an assertion on the CPU backend

Zero is correct, sq is the seq_id of the tokens corresponding to each part of the input. It was added for simultaneous sequence processing.

But note that this is being changed in #7531. sq no longer exists there, because equal-length simultaneous sequences allow directly using the fourth tensor dimension to separate the states for each sequence, which simplifies SSM_SCAN and SSM_CONV a bit.

llama.cpp/ggml.c, lines 16349 to 16418 in 61200ef:

static void ggml_compute_forward_ssm_scan_f32(
        const struct ggml_compute_params * params,
        struct ggml_tensor * dst) {
    if (params->type == GGML_TASK_TYPE_INIT || params->type == GGML_TASK_TYPE_FINALIZE) {
        return;
    }

    const struct ggml_tensor * src0 = dst->src[0]; // s
    const struct ggml_tensor * src1 = dst->src[1]; // x
    const struct ggml_tensor * src2 = dst->src[2]; // dt
    const struct ggml_tensor * src3 = dst->src[3]; // A
    const struct ggml_tensor * src4 = dst->src[4]; // B
    const struct ggml_tensor * src5 = dst->src[5]; // C

    const int ith = params->ith;
    const int nth = params->nth;

    const int64_t nc  = src0->ne[0]; // d_state
    const int64_t nr  = src0->ne[1]; // d_inner
    const int64_t n_t = src1->ne[1]; // number of tokens per sequence
    const int64_t n_s = src0->ne[2]; // number of sequences in the batch

    GGML_ASSERT(ggml_nelements(src1) == ggml_nelements(dst));
    GGML_ASSERT(src0->nb[0] == sizeof(float));
    GGML_ASSERT(src1->nb[0] == sizeof(float));
    GGML_ASSERT(src2->nb[0] == sizeof(float));
    GGML_ASSERT(src3->nb[0] == sizeof(float));
    GGML_ASSERT(src4->nb[0] == sizeof(float));
    GGML_ASSERT(src5->nb[0] == sizeof(float));
    // required for the dot product between s and C
    GGML_ASSERT(src0->nb[1] == src0->ne[0]*sizeof(float));

    // rows per thread
    const int dr = (nr + nth - 1)/nth;

    // row range for this thread
    const int ir0 = dr*ith;
    const int ir1 = MIN(ir0 + dr, nr);
    const int ir  = ir1 - ir0;

    for (int i3 = 0; i3 < n_s; ++i3) {
        for (int i2 = 0; i2 < n_t; ++i2) {
            float * y  = (float *) ((char *)  dst->data + ir0*( dst->nb[0]) + i2*( dst->nb[1]) + i3*( dst->nb[2])); // {d_inner, n_t, n_s}
            float * s  = (float *) ((char *) src0->data + ir0*(src0->nb[1]) + i3*(src0->nb[2])); // {d_state, d_inner, n_s}
            float * x  = (float *) ((char *) src1->data + ir0*(src1->nb[0]) + i2*(src1->nb[1]) + i3*(src1->nb[2])); // {d_inner, n_t, n_s}
            float * dt = (float *) ((char *) src2->data + ir0*(src2->nb[0]) + i2*(src2->nb[1]) + i3*(src2->nb[2])); // {d_inner, n_t, n_s}
            float * A  = (float *) ((char *) src3->data + ir0*(src3->nb[1])); // {d_state, d_inner}
            float * B  = (float *) ((char *) src4->data + i2*(src4->nb[1]) + i3*(src4->nb[2])); // {d_state, n_t, n_s}
            float * C  = (float *) ((char *) src5->data + i2*(src5->nb[1]) + i3*(src5->nb[2])); // {d_state, n_t, n_s}

            // d_inner
            for (int i1 = 0; i1 < ir; ++i1) {
                // ref: https://github.com/state-spaces/mamba/blob/34076d664838588a3c97727b263478ab9f621a07/mamba_ssm/ops/triton/selective_state_update.py#L78
                float dt_soft_plus = dt[i1] <= 20.0f ? log1pf(expf(dt[i1])) : dt[i1];
                float x_dt = x[i1] * dt_soft_plus;
                float sumf = 0.0f;
                // d_state
                for (int i0 = 0; i0 < nc; ++i0) {
                    int i = i0 + i1*nc;
                    // state = prev_state * dA + dB * x
                    float state = (s[i] * expf(dt_soft_plus * A[i])) + (B[i0] * x_dt);
                    // y = rowwise_dotprod(state, C)
                    sumf += state * C[i0];
                    s[i] = state;
                }
                y[i1] = sumf;
            }
        }
    }
}

The CPU vs. GPU comparison ("test-backend-ops test -o SSM_CONV", "test-backend-ops test -o SSM_SCAN") passes. Maybe because of non-representative input.

Hmm. If both result in the same output with random input (for all tensors apart from sq), maybe the problem isn't with the operators, but with how the data gets into and out of them in practice?

I wonder if using ctx_layer instead of ctx_split for the tensors that get passed to ggml_ssm_conv and ggml_ssm_scan would help?

diff --git a/llama.cpp b/llama.cpp
index 841be1de..0c710f8d 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -6099,7 +6099,7 @@ static bool llm_load_tensors(
 
                         layer.ssm_in = ml.create_tensor(ctx_split, tn(LLM_TENSOR_SSM_IN, "weight", i), {n_embd, 2*d_inner});
 
-                        layer.ssm_conv1d = ml.create_tensor(ctx_split, tn(LLM_TENSOR_SSM_CONV1D, "weight", i), {d_conv, d_inner});
+                        layer.ssm_conv1d = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_SSM_CONV1D, "weight", i), {d_conv, d_inner});
                         layer.ssm_conv1d_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_SSM_CONV1D, "bias", i), {d_inner});
 
                         layer.ssm_x = ml.create_tensor(ctx_split, tn(LLM_TENSOR_SSM_X, "weight", i), {d_inner, dt_rank + 2*d_state});
@@ -6108,7 +6108,7 @@ static bool llm_load_tensors(
                         layer.ssm_dt_b = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_SSM_DT, "bias", i), {d_inner});
 
                         // no "weight" suffix for these
-                        layer.ssm_a = ml.create_tensor(ctx_split, tn(LLM_TENSOR_SSM_A, i), {d_state, d_inner});
+                        layer.ssm_a = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_SSM_A, i), {d_state, d_inner});
                         layer.ssm_d = ml.create_tensor(ctx_layer, tn(LLM_TENSOR_SSM_D, i), {d_inner});
 
                         // out_proj

I have no idea what exactly this would do, but I guess it's worth a try?

@jploski
Contributor

jploski commented Jun 1, 2024

I have no idea what exactly this would do, but I guess it's worth a try?

Replacing ctx_split by ctx_layer did not change anything.

Maybe it's best to hold off on further tests until #7531 is merged (there's a slight hope that the observed issue might go away...)

@jploski
Contributor

jploski commented Jun 2, 2024

I have no idea what exactly this would do, but I guess it's worth a try?

Replacing ctx_split by ctx_layer did not change anything.

Maybe it's best to hold off on further tests until #7531 is merged (there's a slight hope that the observed issue might go away...)

So I was impatient, and applied my changes to #7531 here:

697fab6

The result is as before. The CPU output is the same as before (meaning that #7531 did not break anything.) The CPU vs. GPU backend tests for ssm_conv and ssm_scan with random data pass, but the actual generation derails (even faster than before, namely already with -ngl 1). The differences feel like there's some unintended "latent space exploration" going on, but with more GPU layers they get worse.

CPU:

Sara and Ben are playing in the snow. They make a big snowman with a hat, a scarf and a carrot nose. They put on their warm clothes and go inside to get some cold chocolate.
"Look at my snowman!" Sara says. "He is so cool! He can talk like us."
Ben looks at his snowman and smiles. He likes the hat, the scarf and the carrot nose too. He wants to make a new friend.

GPU (-ngl 1):

Sara and Ben are playing in the snow. They make a big snowman with a hat, a scarf and a carrot nose. They put on their warm clothes and go inside to have some hot chocolate.
"Look at my snowman!" Sara says. "He is so cool! He can do anything."
Ben looks around his room and sees that the snowman has no hat or scarf. He feels sad for him.

@slaren
Collaborator

slaren commented Jun 2, 2024

ctx_split should only be used for matrices. Unless using CUDA with -sm row, it is the same as ctx_layer. If the results look reasonable and the tests pass, there might not be anything wrong; the results from different backends are always slightly different due to small floating point differences. perplexity is good for verifying that.

@ggerganov
Owner Author

Btw, on master when building with CUDA and running with -ngl 0 the ppl is bogus:

LLAMA_CUDA=1 make -j && ./perplexity -m models/mamba-130m/ggml-model-f16.gguf -f build-cublas/wikitext-2-raw/wiki.test.raw -ngl 0

ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 2060 SUPER, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors: offloading 0 repeating layers to GPU
llm_load_tensors: offloaded 0/25 layers to GPU
llm_load_tensors:        CPU buffer size =   256.96 MiB
.................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:  CUDA_Host KV buffer size =    10.69 MiB
llama_new_context_with_model: KV self size  =   10.69 MiB, K (f32):    1.69 MiB, V (f32):    9.00 MiB
llama_new_context_with_model:  CUDA_Host  output buffer size =     0.77 MiB
llama_new_context_with_model:      CUDA0 compute buffer size =   184.98 MiB
llama_new_context_with_model:  CUDA_Host compute buffer size =    10.07 MiB
llama_new_context_with_model: graph nodes  = 896
llama_new_context_with_model: graph splits = 292

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 807.662 ms
perplexity: calculating perplexity over 650 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 1.21 seconds per pass - ETA 3.27 minutes
[1]54658547752546633926049792.0000,[2]179461831369242333449027584.0000,[3]31554836485675804180611072.0000,[4]6706279437817913437847552.0000

If I build without CUDA, it is OK:

make -j && ./perplexity -m models/mamba-130m/ggml-model-f16.gguf -f build-cublas/wikitext-2-raw/wiki.test.raw

llm_load_tensors: ggml ctx size =    0.12 MiB
llm_load_tensors:        CPU buffer size =   256.96 MiB
.................................................
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 2048
llama_new_context_with_model: n_ubatch   = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =    10.69 MiB
llama_new_context_with_model: KV self size  =   10.69 MiB, K (f32):    1.69 MiB, V (f32):    9.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.77 MiB
llama_new_context_with_model:        CPU compute buffer size =    99.71 MiB
llama_new_context_with_model: graph nodes  = 896
llama_new_context_with_model: graph splits = 1

system_info: n_threads = 16 / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | 
perplexity: tokenizing the input ..
perplexity: tokenization took 805.807 ms
perplexity: calculating perplexity over 650 chunks, n_ctx=512, batch_size=2048, n_seq=4
perplexity: 1.43 seconds per pass - ETA 3.85 minutes
[1]17.9926,[2]26.3160,[3]26.9249,[4]27.3857,[5]25.7235,[6]24.7161,[7]24.1991,[8]24.1345

@jploski
Contributor

jploski commented Jun 2, 2024

ctx_split should only be used for matrices. Unless using CUDA with -sm row, it is the same as ctx_layer. If the results look reasonable and the tests pass, there might not be anything wrong; the results from different backends are always slightly different due to small floating point differences. perplexity is good for verifying that.

Unfortunately, since the model output deteriorates to the point of not generating the eos token, I am pretty sure it's not a slight difference in this case (even without inspecting the perplexity metric; as @ggerganov pointed out, there also seems to be something wrong with the perplexity calculation for the CUDA version).

@compilade
Collaborator

compilade commented Jun 2, 2024

The CPU vs. GPU backend tests for ssm_conv and ssm_scan with random data pass

@jploski Try making the tests process more tokens and more sequences at a time.

Patch for the tests:
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index e902a72e..ecfcdbc6 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1562,18 +1562,26 @@ struct test_leaky_relu : public test_case {
 // GGML_OP_SSM_CONV
 struct test_ssm_conv : public test_case {
     const ggml_type type;
+    const int64_t d_conv;
+    const int64_t d_inner;
+    const int64_t n_seq_tokens;
+    const int64_t n_seqs;
 
     std::string vars() override {
-        return VARS_TO_STR4(type, 3, 1536, 4);
+        return VARS_TO_STR5(type, d_conv, d_inner, n_seq_tokens, n_seqs);
     }
 
-    test_ssm_conv(ggml_type type = GGML_TYPE_F32)
-        : type(type) {}
+    test_ssm_conv(ggml_type type = GGML_TYPE_F32,
+            int64_t d_conv = 4,
+            int64_t d_inner = 1536,
+            int64_t n_seq_tokens = 7,
+            int64_t n_seqs = 2)
+        : type(type), d_conv(d_conv), d_inner(d_inner), n_seq_tokens(n_seq_tokens), n_seqs(n_seqs) {}
 
     ggml_tensor * build_graph(ggml_context * ctx) override {
-        ggml_tensor * s = ggml_new_tensor_3d(ctx, type, 3, 1536, 1);
-        ggml_tensor * x = ggml_new_tensor_2d(ctx, type, 1536, 1);
-        ggml_tensor * c = ggml_new_tensor_2d(ctx, type, 4, 1536);
+        ggml_tensor * s = ggml_new_tensor_3d(ctx, type, d_conv - 1, d_inner, n_seqs);
+        ggml_tensor * x = ggml_new_tensor_3d(ctx, type, d_inner, n_seq_tokens, n_seqs);
+        ggml_tensor * c = ggml_new_tensor_2d(ctx, type, d_conv, d_inner);
         ggml_tensor * out = ggml_ssm_conv(ctx, s, x, c);
         return out;
     }
@@ -1582,21 +1590,29 @@ struct test_ssm_conv : public test_case {
 // GGML_OP_SSM_SCAN
 struct test_ssm_scan : public test_case {
     const ggml_type type;
+    const int64_t d_state;
+    const int64_t d_inner;
+    const int64_t n_seq_tokens;
+    const int64_t n_seqs;
 
     std::string vars() override {
-        return VARS_TO_STR4(type, 16, 1536, 2);
+        return VARS_TO_STR5(type, d_state, d_inner, n_seq_tokens, n_seqs);
     }
 
-    test_ssm_scan(ggml_type type = GGML_TYPE_F32)
-        : type(type) {}
+    test_ssm_scan(ggml_type type = GGML_TYPE_F32,
+            int64_t d_state = 16,
+            int64_t d_inner = 1536,
+            int64_t n_seq_tokens = 7,
+            int64_t n_seqs = 2)
+        : type(type), d_state(d_state), d_inner(d_inner), n_seq_tokens(n_seq_tokens), n_seqs(n_seqs) {}
 
     ggml_tensor * build_graph(ggml_context * ctx) override {
-        ggml_tensor * s = ggml_new_tensor_3d(ctx, type, 16, 1536, 1);
-        ggml_tensor * x = ggml_new_tensor_2d(ctx, type, 1536, 2);
-        ggml_tensor * dt = ggml_new_tensor_2d(ctx, type, 1536, 2);
-        ggml_tensor * A = ggml_new_tensor_2d(ctx, type, 16, 1536);
-        ggml_tensor * B = ggml_new_tensor_2d(ctx, type, 16, 2);
-        ggml_tensor * C = ggml_new_tensor_2d(ctx, type, 16, 2);
+        ggml_tensor * s = ggml_new_tensor_3d(ctx, type, d_state, d_inner, n_seqs);
+        ggml_tensor * x = ggml_new_tensor_3d(ctx, type, d_inner, n_seq_tokens, n_seqs);
+        ggml_tensor * dt = ggml_new_tensor_3d(ctx, type, d_inner, n_seq_tokens, n_seqs);
+        ggml_tensor * A = ggml_new_tensor_2d(ctx, type, d_state, d_inner);
+        ggml_tensor * B = ggml_new_tensor_3d(ctx, type, d_state, n_seq_tokens, n_seqs);
+        ggml_tensor * C = ggml_new_tensor_3d(ctx, type, d_state, n_seq_tokens, n_seqs);
         ggml_tensor * out = ggml_ssm_scan(ctx, s, x, dt, A, B, C);
         return out;
     }

Looking at the tests, they don't really compare all outputs, since in #7531 the state inputs are also written back. I think I should change the output of the SSM operators back to concatenated tensors, just to make the output comparisons easier, and to make the graph dependencies more representative of the actual data dependencies.

Btw, on master when building with CUDA and running with -ngl 0 the ppl is bogus:

@ggerganov My hypothesis in this case is that it's probably related to where the KV cache is located.

llama_kv_cache_init:  CUDA_Host KV buffer size =    10.69 MiB
llama_new_context_with_model: KV self size  =   10.69 MiB, K (f32):    1.69 MiB, V (f32):    9.00 MiB

vs

llama_kv_cache_init:        CPU KV buffer size =    10.69 MiB
llama_new_context_with_model: KV self size  =   10.69 MiB, K (f32):    1.69 MiB, V (f32):    9.00 MiB

But maybe CUDA_Host is the CPU? I'm not used to CUDA device names.
(EDIT: that name is defined in ggml-cuda.cu.)

@slaren
Collaborator

slaren commented Jun 2, 2024

CUDA_Host is just a pinned CPU buffer.
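To spell that out for anyone following along: a pinned buffer is ordinary CPU memory that has been page-locked and registered with the CUDA driver, so host/device copies can run asynchronously via DMA. A tiny standalone illustration of the concept (not the ggml-cuda buffer code; sizes are arbitrary):

// Allocate a pinned host buffer and copy it to the device asynchronously.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    const size_t n_bytes = 16u * 1024u * 1024u;
    float * host_buf = nullptr;
    // page-locked allocation: still CPU memory, not VRAM
    if (cudaMallocHost((void **) &host_buf, n_bytes) != cudaSuccess) {
        fprintf(stderr, "cudaMallocHost failed\n");
        return 1;
    }
    float * dev_buf = nullptr;
    cudaMalloc((void **) &dev_buf, n_bytes);
    // copies involving pinned memory can overlap with kernels on other streams
    cudaMemcpyAsync(dev_buf, host_buf, n_bytes, cudaMemcpyHostToDevice, 0);
    cudaStreamSynchronize(0);
    cudaFree(dev_buf);
    cudaFreeHost(host_buf);
    return 0;
}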

@jploski
Contributor

jploski commented Jun 2, 2024

The CPU vs. GPU backend tests for ssm_conv and ssm_scan with random data pass

@jploski Try making the tests process more tokens and more sequences at a time.

Thanks, I applied your patch, and the test for ssm_conv fails now, which sounds like good progress!

/mnt/seagate/dalai/llama.cpp> bin/test-backend-ops test -o SSM_CONV
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:   no
ggml_cuda_init: CUDA_USE_TENSOR_CORES: yes
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes
Testing 2 backends

Backend 1/2 (CPU)
  Skipping CPU backend
Backend 2/2 (CUDA0)
  Backend name: CUDA0
  SSM_CONV(type=f32,d_conv=4,d_inner=1536,n_seq_tokens=7,n_seqs=2): [SSM_CONV] NMSE = 0.207277691 > 0.000000100 FAIL
  1207/1208 tests passed
  Backend CUDA0: FAIL

@slaren
Collaborator

slaren commented Jun 2, 2024

This should fix the CUDA ppl. Not sure if both conts are actually needed.

diff --git a/llama.cpp b/llama.cpp
index 841be1de..b311467a 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -10499,6 +10499,9 @@ struct llm_build_context {
             struct ggml_tensor * x = ggml_view_2d(ctx0, xz, d_inner, xz->ne[1], xz->nb[1], 0);
             struct ggml_tensor * z = ggml_view_2d(ctx0, xz, d_inner, xz->ne[1], xz->nb[1], ggml_element_size(xz)*d_inner);

+            x = ggml_cont(ctx0, x);
+            z = ggml_cont(ctx0, z);
+
             // conv
             {
                 // Custom operator which is needed only to ease simultaneous sequence processing.

@jploski
Contributor

jploski commented Jun 2, 2024

This should fix the CUDA ppl. Not sure if both conts are actually needed.

I tried applying the patch to my PR #7531 branch around here: https://github.com/jploski/llama.cpp/blob/mamba_cuda_pr7531/llama.cpp#L8694 - and as a result the perplexity calculation on CPU appeared ok. But with this fix the GPU-based generation fails as follows (so I did not commit it):

Sara and Ben are playing in the snow. They maketerminate called after throwing an instance of 'std::out_of_range'
  what():  vector::_M_range_check: __n (which is 2) >= this->size() (which is 2)
Aborted

I got the failing ssm_conv test working now - even when I increase n_tokens from 7 to 1024 and n_seqs to 8. The failure was due to a misplaced parenthesis (jploski@982a22e), a mistake I made today.

However, even with the test working, the output qualitatively degrades with an increasing number of GPU layers (as was the case with the original non-PR-7531 version, which did not suffer from that parenthesis bug).

So I'd say the overall state of the GPU port is now as bad as yesterday, but not worse - and we have more representative tests for ssm_conv and ssm_scan in place thanks to @compilade.

@compilade
Collaborator

compilade commented Jun 3, 2024

I managed to do the equivalent of GGML_OP_SSM_CONV with ggml_concat, ggml_im2col, reshapes and a ggml_mul_mat!!! And the memory usage is reasonable!

llama.cpp/llama.cpp, lines 8739 to 8751 in 8fb57ac:

// => { d_conv * d_inner, n_seq_tokens, n_seqs}
x = ggml_im2col(ctx,
        ggml_new_tensor_2d(ctx, GGML_TYPE_F16, d_conv, d_inner),
        conv_x, 1, 0, 0, 0, 1, 0, false, GGML_TYPE_F32);

x = ggml_reshape_4d(ctx, x, d_conv, 1, d_inner, n_seq_tokens * n_seqs);

// => {1, 1, d_inner, n_seq_tokens * n_seqs}
x = ggml_mul_mat(ctx, ggml_reshape_3d(ctx, model.layers[il].ssm_conv1d, d_conv, 1, d_inner), x);

x = ggml_reshape_3d(ctx, x, d_inner, n_seq_tokens, n_seqs);

// Alternatively, this does the same as the above
// x = ggml_ssm_conv(ctx, conv_x, model.layers[il].ssm_conv1d);

But performance is measurably worse for prompt processing (around 85% of what it was with ggml_ssm_conv for pp512 on my CPU with a 100k parameter Mamba model). However, text generation speed is very similar to before.
I think ggml_mul_mat might not be optimized for small row sizes (4 elements per row in this case). Or maybe the overhead is due to im2col. Not sure.

I'm not sure if I should remove GGML_OP_SSM_CONV yet.

@jploski note that I also changed ggml_ssm_scan to make it not modify its inputs. This should make tests more representative, because otherwise they did not check the final state. It's now including the final state in its output, similarly to how it is in master, but a bit simpler, due to the equal-sequence-length batching.

@jploski
Contributor

jploski commented Jun 3, 2024

I managed to do the equivalent of GGML_OP_SSM_CONV with ggml_concat, ggml_im2col, reshapes and a ggml_mul_mat!!! And the memory usage is reasonable!

I updated my branch to catch up, but the non-ssm_conv implementation now fails on the GPU because apparently MUL_MAT is not completely implemented (or maybe you could work around it with some transpose; I didn't examine it further):

ggml_cuda_compute_forward: cannot compute node_31: src0->ne[3] = 1, src1->ne[3] = 2 - fallback to CPU
ggml_backend_cuda_graph_compute: op not supported node_31 (MUL_MAT)
GGML_ASSERT: /mnt/seagate/dalai/llama.cpp/ggml-cuda.cu:2675: ok

I imagine that a version with a more coarse-grained op might offer more potential for optimization on the GPU, keyword "fused kernel" (not claiming that I know how to do it for CUDA specifically, but generally, knowing "what comes next" in a computation pipeline aids optimization).

@compilade
Collaborator

compilade commented Jun 3, 2024

I updated my branch to catch up, but the non-ssm_conv implementation now fails on the GPU because apparently MUL_MAT is not completely implemented (or maybe you could work around it with some transpose; I didn't examine it further)

@jploski Thank you for trying this.

I did not expect it would fail like this. It seems this is also a problem on SYCL and Vulkan, which also expect ne[3] to be the same on both tensors when doing MUL_MAT.

A transpose can't work around this. The first tensor in that MUL_MAT really has to be broadcast on the second tensor over the fourth dimension.

I guess it will likely be simpler (and faster) with GGML_OP_SSM_CONV then.

I'm not sure whether or not I should fuse the concat into it. The implementation is much simpler when it's separate (the shift doesn't need to move the state, it's only +1 on a pointer, and there's no need to shift a temporary buffer), but it's also slightly slower (97% of tg128 and pp512) on CPU, although that might be due to something else (EDIT: when making CONCAT work with a transposed src1, there is practically no performance difference (on CPU) vs the previous SSM_CONV which fused the concat).
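To illustrate the "+1 on a pointer" remark: with the concat kept separate, each inner row of the concatenated tensor already holds the d_conv - 1 old columns followed by the new token columns, so the convolution window for token t simply starts at column t and nothing needs to be moved. A rough CPU-style sketch of the idea (one sequence shown; the layout and names are assumptions for this illustration, not the actual SSM_CONV code):

static void ssm_conv_sketch(
        const float * conv_x,      // {(d_conv - 1) + n_seq_tokens, d_inner}, rows contiguous
        const float * conv_weight, // {d_conv, d_inner}
        float       * out,         // {d_inner, n_seq_tokens}
        int d_conv, int d_inner, int n_seq_tokens) {
    const int n_cols = (d_conv - 1) + n_seq_tokens;
    for (int i1 = 0; i1 < d_inner; ++i1) {
        const float * w = conv_weight + i1*d_conv;
        for (int t = 0; t < n_seq_tokens; ++t) {
            // the window just slides by one column per token - no shifting of state
            const float * window = conv_x + i1*n_cols + t;
            float sum = 0.0f;
            for (int k = 0; k < d_conv; ++k) {
                sum += window[k] * w[k];
            }
            out[i1 + t*d_inner] = sum;
        }
    }
}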

@ggerganov
Owner Author

I did not expect it would fail like this. It seems this is also a problem on SYCL and Vulkan, which also expect ne[3] to be the same on both tensors when doing MUL_MAT.

The op can be extended to support broadcast in all backends. If it's not too much hassle, try to keep the im2col + mul_mat version around (even behind an ifdef if necessary) and I will try to make it work on the GPU at some point

@compilade
Collaborator

compilade commented Jun 3, 2024

I think it's also possible to replace that ggml_mul_mat with a ggml_sum_rows and ggml_mul, but again, it's less performant than fusing it all in ggml_ssm_conv.

(Around 95% tg128 and 81% pp512 on my CPU when comparing MUL+SUM_ROWS with SSM_CONV, while MUL_MAT gets around 94% tg128 and 79% pp512. So, strangely, MUL+SUM_ROWS is slightly faster than MUL_MAT in this case, both with a very small model like https://huggingface.co/delphi-suite/v0-mamba-100k and, I think, with bigger models too.)
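For reference, the MUL + SUM_ROWS variant mentioned above amounts to something like the following, assuming the same im2col layout as the snippet quoted earlier (a sketch, not the committed code):

// x comes out of im2col + reshape as {d_conv, 1, d_inner, n_seq_tokens * n_seqs};
// the reshaped conv weight {d_conv, 1, d_inner} is broadcast over the 4th dimension.
x = ggml_mul(ctx, x, ggml_reshape_3d(ctx, model.layers[il].ssm_conv1d, d_conv, 1, d_inner));
x = ggml_sum_rows(ctx, x); // sum over d_conv => {1, 1, d_inner, n_seq_tokens * n_seqs}
x = ggml_reshape_3d(ctx, x, d_inner, n_seq_tokens, n_seqs);

This decomposition avoids the MUL_MAT ne[3] broadcast that the backends were missing at the time, though whether it is actually faster is a separate question, as the numbers above show.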

@jploski
Contributor

jploski commented Jun 3, 2024

I compared the tensor contents of the CPU and GPU versions to find the source of the first discrepancy. While small rounding errors, which might account for confusing hot chocolate with cold chocolate, are evident from the get-go, the first GPU tensor with a really alarming difference was the one resulting from ggml_silu(ctx, z). I wrapped it with ggml_cont to fix it:

7509b9e

With that (and using ssm_conv for the reasons explained previously) the tensor looks almost identical. Although the facts about Sara's and Ben's encounter with the abominable snowman still do not match between CPU and GPU, I no longer get strange tokens, and the output stays on topic regardless of the CPU/GPU layer split.
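In other words, the fix amounts to materializing the strided view before the element-wise op; roughly (see the commit above for the exact placement, this is just a sketch of the pattern):

// z is a non-contiguous view into the xz projection; make it dense first so
// the CUDA kernel doesn't read it with the wrong strides
y = ggml_mul(ctx0, y, ggml_silu(ctx0, ggml_cont(ctx0, z)));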

@jploski
Contributor

jploski commented Aug 25, 2024

I pushed a new branch https://github.com/jploski/llama.cpp/tree/falcon_mamba_cuda, which is based on the recent master of llama.cpp with reapplied patches from my original attempt in June (https://github.com/jploski/llama.cpp/tree/mamba_cuda_pr7531)

With one small additional fix (fae826f) this implementation is now working. I tested it with https://huggingface.co/tiiuae/falcon-mamba-7b-instruct. It produces coherent output in f16, q8_0 and q5_k_m quantizations.

Maybe someone with CUDA experience could have a look at the ggml/src/ggml-cuda/ssm_scan.cu and ggml/src/ggml-cuda/ssm_conv.cu regarding grid configuration and memory access patterns in those kernels (remember I just copy-pasted the CPU version without regard for what is good for CUDA's parallel execution; so it could probably be optimized for performance).

@piDack
Contributor

piDack commented Aug 26, 2024

I optimized the performance by 10x (on an A100) based on @jploski's work. The PR is #9186.

@uniartisan
Contributor

uniartisan commented Nov 2, 2024

A small plug, maybe: we have implemented initial SYCL and CUDA support for RWKV; everyone is welcome to use and improve it! #10133

@A3shTnT
Contributor

A3shTnT commented Nov 26, 2024

Can someone tell me the typical sizes for input x (B, L, N) and state (B, D, N)? The values of D and N. I am not familiar with the MAMBA model, but I would like to try writing the CUDA kernel for ssm_conv and ssm_scan.

@jploski
Contributor

jploski commented Nov 26, 2024

Can someone tell me the typical sizes for input x (B, L, N) and state (B, D, N)? The values of D and N. I am not familiar with the MAMBA model, but I would like to try writing the CUDA kernel for ssm_conv and ssm_scan.

Those kernels are already implemented in PR #9186

@A3shTnT
Contributor

A3shTnT commented Nov 27, 2024

Can someone tell me the typical sizes for input x (B, L, N) and state (B, D, N)? The values of D and N. I am not familiar with the MAMBA model, but I would like to try writing the CUDA kernel for ssm_conv and ssm_scan.

Those kernels are already implemented in PR #9186

I have looked at the code of PR #9186 and I think the CUDA code in the scan section is more like CPU code than GPU code: it uses one warp in one block and loops within the block. Perhaps we could consider dividing dimension D across multiple blocks and multiple warps and computing them simultaneously. In addition, I saw in the Mamba paper that placing the state in shared memory can also reduce reads and writes of the state, so I believe there are some possibilities for optimization. I tried running mamba-130M, which has a D of 1536 and an N of 16, and mamba-370M, which has a D of 2048 (if I remember correctly; I don't have the run data with me now), so I think splitting D is a reasonable optimization.
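To make the "split D" idea concrete, here is a rough sketch of a row-parallel scan kernel: each thread owns one d_inner row, the recurrence over tokens stays sequential inside the thread, and the row's state is carried in registers. This is an illustration of the suggestion, not the code from PR #9186; the layout and names are assumptions, one sequence is shown, and d_state is assumed to be at most 16 (as in the models mentioned above).

__global__ void ssm_scan_rows_f32(
        const float * s0,   // {d_state, d_inner} initial state
        const float * x,    // {d_inner, n_tokens}
        const float * dt,   // {d_inner, n_tokens}
        const float * A,    // {d_state, d_inner}
        const float * B,    // {d_state, n_tokens}
        const float * C,    // {d_state, n_tokens}
        float       * s,    // {d_state, d_inner} final state
        float       * y,    // {d_inner, n_tokens}
        const int d_state, const int d_inner, const int n_tokens) {
    const int i1 = blockIdx.x*blockDim.x + threadIdx.x; // this thread's inner row
    if (i1 >= d_inner) {
        return;
    }
    float state[16]; // per-row state in registers; assumes d_state <= 16
    for (int i0 = 0; i0 < d_state; ++i0) {
        state[i0] = s0[i0 + i1*d_state];
    }
    for (int t = 0; t < n_tokens; ++t) {
        const float dt_raw = dt[i1 + t*d_inner];
        const float dt_sp  = dt_raw <= 20.0f ? log1pf(expf(dt_raw)) : dt_raw;
        const float x_dt   = x[i1 + t*d_inner] * dt_sp;
        float sumf = 0.0f;
        for (int i0 = 0; i0 < d_state; ++i0) {
            const float st = state[i0]*expf(dt_sp*A[i0 + i1*d_state]) + B[i0 + t*d_state]*x_dt;
            sumf += st * C[i0 + t*d_state];
            state[i0] = st;
        }
        y[i1 + t*d_inner] = sumf;
    }
    for (int i0 = 0; i0 < d_state; ++i0) {
        s[i0 + i1*d_state] = state[i0];
    }
}

A launch along the lines of ssm_scan_rows_f32<<<(d_inner + 127)/128, 128>>>(...) then covers all of d_inner in parallel instead of looping over it inside a single block; moving the per-token B and C (or the state itself, as suggested above) into shared memory would be a further step.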

@david68cu

Hello,

I’d like to take ownership of this issue and work on resolving it. I’ve reviewed the CONTRIBUTING.md file and familiarized myself with the project structure and guidelines.

If there are any specific details or considerations I should keep in mind while working on this, please let me know. Otherwise, I’ll take the necessary steps to address this issue effectively.

Could you please assign this issue to me?

@Gidraulght

I've been trying to run falcon-mamba-7b-instruct-Q4_K_M.gguf
Sometimes it runs for a few paragraphs, sometimes crashes outright.
This is the error I get:
D:\a\llama.cpp\llama.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:4660: GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)) failed

I start web interface with:
llama-server -m "C:\Program Files\llama-b4273-bin-win-vulkan-x64\tensorblock\falcon-mamba-7b-instruct-GGUF\falcon-mamba-7b-instruct-Q4_K_M.gguf" --port 8080 --no-context-shift

I am using AMD GPU and have downloaded the latest release of llama.cpp to date. Version b4273. Would be nice if somebody took a look at it. Sorry if this is the wrong place to post this. Not a developer.

@TimexPeachtree

I've been trying to run falcon-mamba-7b-instruct-Q4_K_M.gguf Sometimes it runs for a few paragraphs, sometimes crashes outright. This is the error I get: D:\a\llama.cpp\llama.cpp\ggml\src\ggml-vulkan\ggml-vulkan.cpp:4660: GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)) failed

I start web interface with: llama-server -m "C:\Program Files\llama-b4273-bin-win-vulkan-x64\tensorblock\falcon-mamba-7b-instruct-GGUF\falcon-mamba-7b-instruct-Q4_K_M.gguf" --port 8080 --no-context-shift

I am using AMD GPU and have downloaded the latest release of llama.cpp to date. Version b4273. Would be nice if somebody took a look at it. Sorry if this is the wrong place to post this. Not a developer.

I have compiled llama.cpp and it's able to run the new Falcon3 Mamba.

This is the command for running from the official release, with the CUDA runtime:

D:\llama-cpp\llama-server -m "D:\Users\TIMEX DEMO\Downloads\Falcon3-Mamba-7B-Instruct-q5_k_m.gguf" --port 8087 -t 8 -ngl 65 --no-context-shift -c 8192 --prio-batch 2

As you can see from the output below, it's using CUDA.

ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4050 Laptop GPU, compute capability 8.9, VMM: yes
build: 4393 (d79d8f3) with MSVC 19.29.30157.0 for
system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

system_info: n_threads = 8 (n_threads_batch = 8) / 16 | CUDA : ARCHS = 520,610,700,750 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 |

main: HTTP server is listening, hostname: 127.0.0.1, port: 8087, http threads: 15
main: loading model
srv load_model: loading model 'D:\Users\TIMEX\Downloads\Falcon3-Mamba-7B-Instruct-q5_k_m.gguf'
llama_load_model_from_file: using device CUDA0 (NVIDIA GeForce RTX 4050 Laptop GPU) - 5075 MiB free
llama_model_loader: loaded meta data with 28 key-value pairs and 643 tensors from D:\Users\TIMEX\Downloads\Falcon3-Mamba-7B-Instruct-q5_k_m.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = mamba
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Falcon3 Mamba 7B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Falcon3-Mamba
llama_model_loader: - kv 5: general.size_label str = 7B
llama_model_loader: - kv 6: mamba.context_length u32 = 1048576
llama_model_loader: - kv 7: mamba.embedding_length u32 = 4096
llama_model_loader: - kv 8: mamba.feed_forward_length u32 = 0
llama_model_loader: - kv 9: mamba.attention.head_count u32 = 0
llama_model_loader: - kv 10: mamba.block_count u32 = 64
llama_model_loader: - kv 11: mamba.ssm.conv_kernel u32 = 4
llama_model_loader: - kv 12: mamba.ssm.inner_size u32 = 8192
llama_model_loader: - kv 13: mamba.ssm.state_size u32 = 16
llama_model_loader: - kv 14: mamba.ssm.time_step_rank u32 = 256
llama_model_loader: - kv 15: mamba.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 16: mamba.ssm.dt_b_c_rms bool = true
llama_model_loader: - kv 17: general.file_type u32 = 17
llama_model_loader: - kv 18: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 19: tokenizer.ggml.pre str = falcon
llama_model_loader: - kv 20: tokenizer.ggml.tokens arr[str,65024] = [">>TITLE<<", ">>ABSTRACT<<", ">>INTR...
llama_model_loader: - kv 21: tokenizer.ggml.token_type arr[i32,65024] = [3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, ...
llama_model_loader: - kv 22: tokenizer.ggml.merges arr[str,64784] = ["─á t", "─á a", "i n", "h e", "r e",...
llama_model_loader: - kv 23: tokenizer.ggml.bos_token_id u32 = 8
llama_model_loader: - kv 24: tokenizer.ggml.eos_token_id u32 = 11
llama_model_loader: - kv 25: tokenizer.ggml.padding_token_id u32 = 0
llama_model_loader: - kv 26: tokenizer.chat_template str = {{bos_token}}{% for message in messag...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - type f32: 385 tensors
llama_model_loader: - type q5_K: 257 tensors
llama_model_loader: - type q6_K: 1 tensors
llm_load_vocab: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
llm_load_vocab: special tokens cache size = 12
llm_load_vocab: token to piece cache size = 0.3884 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = mamba
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 65024
llm_load_print_meta: n_merges = 64784
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 1048576
llm_load_print_meta: n_embd = 4096
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 0
llm_load_print_meta: n_head_kv = 0
llm_load_print_meta: n_rot = 0
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_embd_head_k = 0
llm_load_print_meta: n_embd_head_v = 0
llm_load_print_meta: n_gqa = 0
llm_load_print_meta: n_embd_k_gqa = 0
llm_load_print_meta: n_embd_v_gqa = 0
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-05
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 0
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = -1
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 10000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 1048576
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 4
llm_load_print_meta: ssm_d_inner = 8192
llm_load_print_meta: ssm_d_state = 16
llm_load_print_meta: ssm_dt_rank = 256
llm_load_print_meta: ssm_dt_b_c_rms = 1
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 7.27 B
llm_load_print_meta: model size = 4.73 GiB (5.58 BPW)
llm_load_print_meta: general.name = Falcon3 Mamba 7B Instruct
llm_load_print_meta: BOS token = 8 '<|begin_of_text|>'
llm_load_print_meta: EOS token = 11 '<|end_of_text|>'
llm_load_print_meta: EOT token = 10 '<|im_end|>'
llm_load_print_meta: PAD token = 0 '>>TITLE<<'
llm_load_print_meta: LF token = 138 'Ä'
llm_load_print_meta: EOG token = 10 '<|im_end|>'
llm_load_print_meta: EOG token = 11 '<|end_of_text|>'
llm_load_print_meta: max token length = 130
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading output layer to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: CUDA0 model buffer size = 4626.39 MiB
llm_load_tensors: CPU_Mapped model buffer size = 4563.64 MiB
..............................................................................................
llama_new_context_with_model: n_seq_max = 1
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_ctx_per_seq = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 1
llama_new_context_with_model: n_ctx_per_seq (8192) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 1, offload = 1, type_k = 'f32', type_v = 'f32', n_layer = 64
llama_kv_cache_init: CUDA0 KV buffer size = 38.00 MiB
llama_new_context_with_model: KV self size = 38.00 MiB, K (f32): 6.00 MiB, V (f32): 32.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.25 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 151.59 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 50.16 MiB
llama_new_context_with_model: graph nodes = 3334
llama_new_context_with_model: graph splits = 387 (with bs=512), 259 (with bs=1)
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv init: initializing slots, n_slots = 1
slot init: id 0 | task -1 | new slot n_ctx_slot = 8192
main: model loaded
main: chat template, built_in: 1, chat_example: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
'
main: server is listening on http://127.0.0.1:8087 - starting the main loop
srv update_slots: all slots are idle
slot launch_slot_: id 0 | task 0 | processing task
slot update_slots: id 0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 31
slot update_slots: id 0 | task 0 | kv cache rm [0, end)
slot update_slots: id 0 | task 0 | prompt processing progress, n_past = 31, n_tokens = 31, progress = 1.000000
slot update_slots: id 0 | task 0 | prompt done, n_past = 31, n_tokens = 31
slot release: id 0 | task 0 | stop processing: n_past = 509, truncated = 0
slot print_timing: id 0 | task 0 |
prompt eval time = 1924.82 ms / 31 tokens ( 62.09 ms per token, 16.11 tokens per second)
eval time = 48094.68 ms / 479 tokens ( 100.41 ms per token, 9.96 tokens per second)
total time = 50019.49 ms / 510 tokens
srv update_slots: all slots are idle
request: POST /v1/chat/completions 127.0.0.1 200

@wbruna

wbruna commented Jan 14, 2025

Testing Falcon3-Mamba-7B-Instruct.i1-Q4_K_S.gguf on Vulkan (AMD iGPU on Linux; 8G VRAM) at commit 504af20, I consistently get that assertion failure when any layer is offloaded to the GPU:

$ ./b-dbg/bin/llama-simple -m ./Falcon3-Mamba-7B-Instruct.i1-Q4_K_S.gguf -ngl 1 "Hello my name is"
(...)
ggml_vulkan: Compiling shaders............................................Done!
load_tensors: tensor 'token_embd.weight' (q4_K) (and 634 others) cannot be used with preferred buffer type CPU_AARCH64, using CPU instead
load_tensors: offloading 1 repeating layers to GPU
load_tensors: offloaded 1/65 layers to GPU
load_tensors:      Vulkan0 model buffer size =    56.50 MiB
load_tensors:   CPU_Mapped model buffer size =  3950.80 MiB
llama_init_from_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 64
llama_init_from_model: n_ctx_per_seq = 64
llama_init_from_model: n_batch       = 32
llama_init_from_model: n_ubatch      = 32
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (64) < n_ctx_train (1048576) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 1, offload = 1, type_k = 'f32', type_v = 'f32', n_layer = 64, can_shift = 0
llama_kv_cache_init: layer 0: n_embd_k_gqa = 24576, n_embd_v_gqa = 131072
llama_kv_cache_init: layer 1: n_embd_k_gqa = 24576, n_embd_v_gqa = 131072
(...)
llama_kv_cache_init: layer 62: n_embd_k_gqa = 24576, n_embd_v_gqa = 131072
llama_kv_cache_init: layer 63: n_embd_k_gqa = 24576, n_embd_v_gqa = 131072
llama_kv_cache_init:    Vulkan0 KV buffer size =     0.59 MiB
llama_kv_cache_init:        CPU KV buffer size =    37.41 MiB
llama_init_from_model: KV self size  =   38.00 MiB, K (f32):    6.00 MiB, V (f32):   32.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.25 MiB
llama_init_from_model:    Vulkan0 compute buffer size =   217.80 MiB
llama_init_from_model: Vulkan_Host compute buffer size =     4.11 MiB
llama_init_from_model: graph nodes  = 3334
llama_init_from_model: graph splits = 639 (with bs=32), 9 (with bs=1)
Hello my name is/opt/iaprg/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp:5164: GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)) failed
[New LWP 27789]
[New LWP 28226]
[New LWP 28227]
[New LWP 28228]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f051faf2bd7 in __GI___wait4 (pid=28229, stat_loc=0x7fff8c3bd244, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f051faf2bd7 in __GI___wait4 (pid=28229, stat_loc=0x7fff8c3bd244, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
30      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x00007f0520010f90 in ggml_abort () from /opt/iaprg/llama.cpp/b-dbg/ggml/src/libggml-base.so
#2  0x00007f051f06ed5d in void ggml_vk_op_f32<vk_op_push_constants>(ggml_backend_vk_context*, std::shared_ptr<vk_context_struct>&, ggml_tensor const*, ggml_tensor const*, ggml_tensor const*, ggml_tensor*, ggml_op, vk_op_push_constants&&, bool) [clone .constprop.0] () from /opt/iaprg/llama.cpp/b-dbg/ggml/src/ggml-vulkan/libggml-vulkan.so
#3  0x00007f051f077971 in ggml_vk_build_graph(ggml_backend_vk_context*, ggml_tensor*, int, ggml_tensor*, int, bool, bool, bool) () from /opt/iaprg/llama.cpp/b-dbg/ggml/src/ggml-vulkan/libggml-vulkan.so
#4  0x00007f051f079c4e in ggml_backend_vk_graph_compute(ggml_backend*, ggml_cgraph*) () from /opt/iaprg/llama.cpp/b-dbg/ggml/src/ggml-vulkan/libggml-vulkan.so
#5  0x00007f0520025663 in ggml_backend_sched_graph_compute_async () from /opt/iaprg/llama.cpp/b-dbg/ggml/src/libggml-base.so
#6  0x00007f0520128130 in llama_graph_compute(llama_context&, ggml_cgraph*, int, ggml_threadpool*) () from /opt/iaprg/llama.cpp/b-dbg/src/libllama.so
#7  0x00007f052012d651 in llama_decode_impl(llama_context&, llama_batch) () from /opt/iaprg/llama.cpp/b-dbg/src/libllama.so
#8  0x00007f052012e707 in llama_decode () from /opt/iaprg/llama.cpp/b-dbg/src/libllama.so
#9  0x000055d05be2cd68 in main ()
[Inferior 1 (process 27788) detached]
Aborted (core dumped)

-ngl 0 seems to work fine.

@wbruna

wbruna commented Jan 14, 2025

It seems that llama-simple works only because it's a clean run; with llama-server, I'm also getting the crash after a few messages.

The assertion failure is in:

GGML_ASSERT(ggml_vk_op_supports_incontiguous(op) || ggml_vk_dim01_contiguous(src0)); // NOLINT
