
Bug: GGML_ASSERT((qs.n_attention_wv == n_attn_layer) && "n_attention_wv is unexpected") failed with deepseek2 #9155

Closed
mann1x opened this issue Aug 24, 2024 · 2 comments · Fixed by #9156
Labels
bug-unconfirmed, medium severity (used to report medium severity bugs in llama.cpp, e.g. malfunctioning features that are still usable)

Comments

@mann1x

mann1x commented Aug 24, 2024

What happened?

The b3614 release, "simplify Mamba with advanced batch splits" (#8526), broke quantization for the deepseek2 architecture.
Rolling back to b3613 works fine.
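
For reference, a minimal reproduction matching the quantize command in the log below (assuming the llama-quantize binary from that build; the model path is taken from the report):

    ./llama-quantize deepseek-coder-v2-lite-instruct.fp32.bin deepseek-coder-v2-lite-instruct.Q5_0.gguf Q5_0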

Name and Version

llama-cli --version
version: 3614 (a1631e5)
built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu

What operating system are you seeing the problem on?

Linux

Relevant log output

main: build = 3614 (a1631e53)
main: built with cc (Debian 10.2.1-6) 10.2.1 20210110 for x86_64-linux-gnu
main: quantizing 'deepseek-coder-v2-lite-instruct.fp32.bin' to 'deepseek-coder-v2-lite-instruct.Q5_0.gguf' as Q5_0
llama_model_loader: loaded meta data with 44 key-value pairs and 377 tensors from deepseek-coder-v2-lite-instruct.fp32.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = ..
llama_model_loader: - kv   3:                           general.finetune str              = ..
llama_model_loader: - kv   4:                         general.size_label str              = 64x1.5B
llama_model_loader: - kv   5:                            general.license str              = other
llama_model_loader: - kv   6:                       general.license.name str              = deepseek-license
llama_model_loader: - kv   7:                       general.license.link str              = LICENSE
llama_model_loader: - kv   8:                      deepseek2.block_count u32              = 27
llama_model_loader: - kv   9:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  10:                 deepseek2.embedding_length u32              = 2048
llama_model_loader: - kv  11:              deepseek2.feed_forward_length u32              = 10944
llama_model_loader: - kv  12:             deepseek2.attention.head_count u32              = 16
llama_model_loader: - kv  13:          deepseek2.attention.head_count_kv u32              = 16
llama_model_loader: - kv  14:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  15: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  16:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  17:                          general.file_type u32              = 0
llama_model_loader: - kv  18:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  19:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  20:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  21:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  22:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  23:       deepseek2.expert_feed_forward_length u32              = 1408
llama_model_loader: - kv  24:                     deepseek2.expert_count u32              = 64
llama_model_loader: - kv  25:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  26:             deepseek2.expert_weights_scale f32              = 1.000000
llama_model_loader: - kv  27:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  28:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  29:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  30: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  31: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.070700
llama_model_loader: - kv  32:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  33:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  34:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  35:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  36:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  37:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  38:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  39:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  40:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  41:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  42:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  43:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  377 tensors
/shared/dev/llama.cpp/src/llama.cpp:16840: GGML_ASSERT((qs.n_attention_wv == n_attn_layer) && "n_attention_wv is unexpected") failed
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
0x00007f207e755746 in __GI___wait4 (pid=271293, stat_loc=0x7ffdfaa194c4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory.
#0  0x00007f207e755746 in __GI___wait4 (pid=271293, stat_loc=0x7ffdfaa194c4, options=0, usage=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:27
27      in ../sysdeps/unix/sysv/linux/wait4.c
#1  0x000055a032cd37a9 in ggml_abort ()
#2  0x000055a032be7197 in llama_model_quantize_internal(std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, llama_model_quantize_params const*) ()
#3  0x000055a032be74d5 in llama_model_quantize ()
#4  0x000055a032b769fa in main ()
@mann1x mann1x added the bug-unconfirmed and medium severity labels Aug 24, 2024
@compilade
Collaborator

Oh, this is because qs.n_attention_wv is 0 for that model even though it's still a Transformer.

Previously, this worked because 0 was accepted without checking whether that was expected for the model type.

Should it really be 0 for that model?

Sorry to have broken this by making the assertion stricter, but thank you for reporting the problem!

I'll look into this.
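
For readers following along, here is a minimal sketch of the kind of check involved (illustrative only; count_attention_wv is a hypothetical helper, not the actual llama.cpp code, though the tensor-name suffixes mirror GGUF naming). During quantization, attention value tensors are counted by name, and the now-stricter assertion requires that count to match the number of attention layers. deepseek2's MLA attention stores its value projection under a different tensor name (attn_kv_b rather than attn_v), so the count stays at 0 even though the model is a Transformer:

    // Illustrative sketch only -- not the actual llama.cpp implementation.
    #include <cassert>
    #include <string>
    #include <vector>

    // count layers that carry a standard attention value weight tensor
    static int count_attention_wv(const std::vector<std::string> & tensor_names) {
        int n = 0;
        for (const auto & name : tensor_names) {
            // standard Transformer attention value weights end in "attn_v.weight"
            if (name.find("attn_v.weight") != std::string::npos) {
                n++;
            }
        }
        return n;
    }

    int main() {
        // a deepseek2 (MLA) layer carries e.g. "blk.0.attn_kv_b.weight" instead
        // of "blk.0.attn_v.weight", so the count comes out 0
        const std::vector<std::string> deepseek2_tensors = { "blk.0.attn_kv_b.weight" };
        const int n_attn_layer = 1; // one attention layer in this toy example
        assert(count_attention_wv(deepseek2_tensors) == n_attn_layer
               && "n_attention_wv is unexpected"); // fails, as in the log above
    }

The linked fix (#9156) resolves this by making the count aware of this architecture's tensor naming.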

@mann1x
Author

mann1x commented Aug 24, 2024

> Should it really be 0 for that model?

I wouldn't know honestly... hope it's not that hard to fix.

Thank you for the PR, that was really important!

molamooo added a commit to promoe-opensource/llama.cpp that referenced this issue Jan 27, 2025