
Bug: GGML_ASSERT((qs.n_attention_wv == n_attn_layer) && "n_attention_wv is unexpected") failed with deepseek2 #9155

Closed
mann1x opened this issue Aug 24, 2024 · 2 comments · Fixed by #9156
Labels
bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable)

Comments

@mann1x

mann1x commented Aug 24, 2024

What happened?

The b3614 release, "simplify Mamba with advanced batch splits" (#8526), broke quantization for the deepseek2 architecture.
Rolling back to b3613 works fine.
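
For reference, a minimal reproduction matching the quantize command in the log below (assuming the llama-quantize binary from that build; the model path is taken from the report):

    ./llama-quantize deepseek-coder-v2-lite-instruct.fp32.bin deepseek-coder-v2-lite-instruct.Q5_0.gguf Q5_0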

Name and Version

llama-cli --version
version: 3614 (a1631e5)
built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

main: build = 3614 (a1631e53)
main: built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu
main: quantizing 'deepseek-coder-v2-lite-instruct.fp32.bin' to 'deepseek-coder-v2-lite-instruct.Q5_0.gguf' as Q5_0
llama_model_loader: loaded meta data with 44 key-value pairs and 377 tensors from deepseek-coder-v2-lite-instruct.fp32.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = ..
llama_model_loader: - kv   3:                           general.finetune str              = ..
llama_model_loader: - kv   4:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   5:                            general.license str              = other
llama_model_loader: - kv   6:                       general.license.name str              = deepseek-license
llama_model_loader: - kv   7:                       general.license.link str              = LICENSE
llama_model_loader: - kv   8:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   9:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  10:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv  11:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv  12:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv  13:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv  14:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  17:                          general.file_type u32              = 0
llama_model_loader: - kv  18:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  19:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  377 tensors
/shared/dev/llama.cpp/src/llama.cpp:16840: GGML_ASSERT((qs.n_attention_wv == n_attn_layer) && "n_attention_wv is unexpected") failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f207e755746 in __GI___wait4 (pid=271293, stat_loc=0x7ffdfaa194c4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f207e755746 in __GI___wait4 (pid=271293, stat_loc=0x7ffdfaa194c4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000055a032cd37a9 in ggml_abort ()
#2  0x000055a032be7197 in llama_model_quantize_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) ()
#3  0x000055a032be74d5 in llama_model_quantize ()
#4  0x000055a032b769fa in main ()
@mann1x mann1x added the bug-unconfirmed and medium severity labels Aug 24, 2024
@compilade
Collaborator

Oh, this is because qs.n_attention_wv is 0 for that model even though it's still a Transformer.

Previously, this worked because 0 was accepted without checking whether that was expected for the model type.

Should it really be 0 for that model?

Sorry to have broken this by making the assertion stricter, but thank you for reporting the problem!

I'll look into this.
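
For readers following along, here is a minimal sketch of the kind of check involved (illustrative only; count_attention_wv is a hypothetical helper, not the actual llama.cpp code, though the tensor-name suffixes mirror GGUF naming). During quantization, attention value tensors are counted by name, and the now-stricter assertion requires that count to match the number of attention layers. deepseek2's MLA attention stores its value projection under a different tensor name (attn_kv_b rather than attn_v), so the count stays at 0 even though the model is a Transformer:

    // Illustrative sketch only -- not the actual llama.cpp implementation.
    #include <cassert>
    #include <string>
    #include <vector>

    // count layers that carry a standard attention value weight tensor
    static int count_attention_wv(const std::vector<std::string> & tensor_names) {
        int n = 0;
        for (const auto & name : tensor_names) {
            // standard Transformer attention value weights end in "attn_v.weight"
            if (name.find("attn_v.weight") != std::string::npos) {
                n++;
            }
        }
        return n;
    }

    int main() {
        // a deepseek2 (MLA) layer carries e.g. "blk.0.attn_kv_b.weight" instead
        // of "blk.0.attn_v.weight", so the count comes out 0
        const std::vector<std::string> deepseek2_tensors = { "blk.0.attn_kv_b.weight" };
        const int n_attn_layer = 1; // one attention layer in this toy example
        assert(count_attention_wv(deepseek2_tensors) == n_attn_layer
               && "n_attention_wv is unexpected"); // fails, as in the log above
    }

The linked fix (#9156) resolves this by making the count aware of this architecture's tensor naming.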

@mann1x
Author

mann1x commented Aug 24, 2024

> Should it really be 0 for that model?

I wouldn't know honestly... hope it's not that hard to fix.

Thank you for the PR, that was really important!

molamooo added a commit to promoe-opensource/llama.cpp that referenced this issue Jan 27, 2025