Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

"CPU_AARCH64 model buffer" appears when not using AARCH64 #11204

Open
pt13762104 opened this issue Jan 12, 2025 · 0 comments
Open

"CPU_AARCH64 model buffer" appears when not using AARCH64 #11204

pt13762104 opened this issue Jan 12, 2025 · 0 comments

Comments

@pt13762104
Copy link

pt13762104 commented Jan 12, 2025

Name and Version

build: 4465 (9a48399) with gcc (conda-forge gcc 13.3.0-1) 13.3.0 for x86_64-conda-linux-gnu

Operating systems

Linux

GGML backends

CPU

Hardware

2x Intel Xeon 24 core (Kaggle)

Models

DeepSeek-V2.5: https://huggingface.co/bartowski/DeepSeek-V2.5-GGUF/tree/main/DeepSeek-V2.5-Q4_0

Problem description & steps to reproduce

The problem is that a part of the memory was used for "CPU_AARCH64 model buffer". Normally the model takes only 150GB of RAM, now it takes 260GB and loads much slower. Command line: /root/llama.cpp/build/bin/llama-server -m /dev/shm/DeepSeek-V2.5-Q4_0-00001-of-00004.gguf -t 72. This doesn't appear when using Q4_K_M.

Compile commands:

git clone https://github.com/ggerganov/llama.cpp ~/llama.cpp
cd ~/llama.cpp && cmake -G Ninja -B build && cmake --build build --config Release -j 64

First Bad Commit

No response

Relevant log output

build: 4465 (9a483999) with gcc (conda-forge gcc 13.3.0-1) 13.3.0 for x86_64-conda-linux-gnu
system info: n_threads = 72, n_threads_batch = 72, total_threads = 96

system_info: n_threads = 72 (n_threads_batch = 72) / 96 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | LLAMAFILE = 1 | OPENMP = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 127.0.0.1, port: 8080, http threads: 95
main: loading model
srv    load_model: loading model '/dev/shm/DeepSeek-V2.5-Q4_0-00001-of-00004.gguf'
llama_model_loader: additional 3 GGUFs metadata loaded.
llama_model_loader: loaded meta data with 53 key-value pairs and 959 tensors from /dev/shm/DeepSeek-V2.5-Q4_0-00001-of-00004.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = deepseek2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepSeek V2.5
llama_model_loader: - kv   3:                            general.version str              = V2.5
llama_model_loader: - kv   4:                           general.basename str              = DeepSeek
llama_model_loader: - kv   5:                         general.size_label str              = 160x14B
llama_model_loader: - kv   6:                            general.license str              = other
llama_model_loader: - kv   7:                       general.license.name str              = deepseek
llama_model_loader: - kv   8:                       general.license.link str              = https://github.com/deepseek-ai/DeepSe...
llama_model_loader: - kv   9:                      deepseek2.block_count u32              = 60
llama_model_loader: - kv  10:                   deepseek2.context_length u32              = 163840
llama_model_loader: - kv  11:                 deepseek2.embedding_length u32              = 5120
llama_model_loader: - kv  12:              deepseek2.feed_forward_length u32              = 12288
llama_model_loader: - kv  13:             deepseek2.attention.head_count u32              = 128
llama_model_loader: - kv  14:          deepseek2.attention.head_count_kv u32              = 128
llama_model_loader: - kv  15:                   deepseek2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  16: deepseek2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  17:                deepseek2.expert_used_count u32              = 6
llama_model_loader: - kv  18:                          general.file_type u32              = 2
llama_model_loader: - kv  19:        deepseek2.leading_dense_block_count u32              = 1
llama_model_loader: - kv  20:                       deepseek2.vocab_size u32              = 102400
llama_model_loader: - kv  21:            deepseek2.attention.q_lora_rank u32              = 1536
llama_model_loader: - kv  22:           deepseek2.attention.kv_lora_rank u32              = 512
llama_model_loader: - kv  23:             deepseek2.attention.key_length u32              = 192
llama_model_loader: - kv  24:           deepseek2.attention.value_length u32              = 128
llama_model_loader: - kv  25:       deepseek2.expert_feed_forward_length u32              = 1536
llama_model_loader: - kv  26:                     deepseek2.expert_count u32              = 160
llama_model_loader: - kv  27:              deepseek2.expert_shared_count u32              = 2
llama_model_loader: - kv  28:             deepseek2.expert_weights_scale f32              = 16.000000
llama_model_loader: - kv  29:             deepseek2.rope.dimension_count u32              = 64
llama_model_loader: - kv  30:                deepseek2.rope.scaling.type str              = yarn
llama_model_loader: - kv  31:              deepseek2.rope.scaling.factor f32              = 40.000000
llama_model_loader: - kv  32: deepseek2.rope.scaling.original_context_length u32              = 4096
llama_model_loader: - kv  33: deepseek2.rope.scaling.yarn_log_multiplier f32              = 0.100000
llama_model_loader: - kv  34:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  35:                         tokenizer.ggml.pre str              = deepseek-llm
llama_model_loader: - kv  36:                      tokenizer.ggml.tokens arr[str,102400]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  37:                  tokenizer.ggml.token_type arr[i32,102400]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  38:                      tokenizer.ggml.merges arr[str,99757]   = ["Ġ Ġ", "Ġ t", "Ġ a", "i n", "h e...
llama_model_loader: - kv  39:                tokenizer.ggml.bos_token_id u32              = 100000
llama_model_loader: - kv  40:                tokenizer.ggml.eos_token_id u32              = 100001
llama_model_loader: - kv  41:            tokenizer.ggml.padding_token_id u32              = 100001
llama_model_loader: - kv  42:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  43:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  44:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  45:               general.quantization_version u32              = 2
llama_model_loader: - kv  46:                      quantize.imatrix.file str              = /models_out/DeepSeek-V2.5-GGUF/DeepSe...
llama_model_loader: - kv  47:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  48:             quantize.imatrix.entries_count i32              = 716
llama_model_loader: - kv  49:              quantize.imatrix.chunks_count i32              = 139
llama_model_loader: - kv  50:                                   split.no u16              = 0
llama_model_loader: - kv  51:                                split.count u16              = 4
llama_model_loader: - kv  52:                        split.tensors.count i32              = 959
llama_model_loader: - type  f32:  300 tensors
llama_model_loader: - type q4_0:  645 tensors
llama_model_loader: - type q4_1:   13 tensors
llama_model_loader: - type q6_K:    1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 124.23 GiB (4.53 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 18
load: token to piece cache size = 0.6411 MB
print_info: arch             = deepseek2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 163840
print_info: n_embd           = 5120
print_info: n_layer          = 60
print_info: n_head           = 128
print_info: n_head_kv        = 128
print_info: n_rot            = 64
print_info: n_swa            = 0
print_info: n_embd_head_k    = 192
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 1
print_info: n_embd_k_gqa     = 24576
print_info: n_embd_v_gqa     = 16384
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 12288
print_info: n_expert         = 160
print_info: n_expert_used    = 6
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 0
print_info: rope scaling     = yarn
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 0.025
print_info: n_ctx_orig_yarn  = 4096
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 236B
print_info: model params     = 235.74 B
print_info: general.name     = DeepSeek V2.5
print_info: n_layer_dense_lead   = 1
print_info: n_lora_q             = 1536
print_info: n_lora_kv            = 512
print_info: n_ff_exp             = 1536
print_info: n_expert_shared      = 2
print_info: expert_weights_scale = 16.0
print_info: expert_weights_norm  = 0
print_info: expert_gating_func   = softmax
print_info: rope_yarn_log_mul    = 0.1000
print_info: vocab type       = BPE
print_info: n_vocab          = 102400
print_info: n_merges         = 99757
print_info: BOS token        = 100000 '<|begin▁of▁sentence|>'
print_info: EOS token        = 100001 '<|end▁of▁sentence|>'
print_info: EOT token        = 100001 '<|end▁of▁sentence|>'
print_info: PAD token        = 100001 '<|end▁of▁sentence|>'
print_info: LF token         = 126 'Ä'
print_info: FIM PRE token    = 100003 '<|fim▁begin|>'
print_info: FIM SUF token    = 100002 '<|fim▁hole|>'
print_info: FIM MID token    = 100004 '<|fim▁end|>'
print_info: EOG token        = 100001 '<|end▁of▁sentence|>'
print_info: max token length = 256
load_tensors:   CPU_Mapped model buffer size = 37602.27 MiB
load_tensors:   CPU_Mapped model buffer size = 34353.59 MiB
load_tensors:   CPU_Mapped model buffer size = 36378.63 MiB
load_tensors:   CPU_Mapped model buffer size = 12801.23 MiB
load_tensors:  CPU_AARCH64 model buffer size = 121738.36 MiB
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 4096
llama_init_from_model: n_ctx_per_seq = 4096
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 512
llama_init_from_model: flash_attn    = 0
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 0.025
llama_init_from_model: n_ctx_per_seq (4096) < n_ctx_train (163840) -- the full capacity of the model will not be utilized
llama_kv_cache_init: kv_size = 4096, offload = 1, type_k = 'f16', type_v = 'f16', n_layer = 60, can_shift = 0
llama_kv_cache_init:        CPU KV buffer size = 19200.00 MiB
llama_init_from_model: KV self size  = 19200.00 MiB, K (f16): 11520.00 MiB, V (f16): 7680.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.39 MiB
llama_init_from_model:        CPU compute buffer size =  1174.01 MiB
llama_init_from_model: graph nodes  = 4480
llama_init_from_model: graph splits = 1
common_init_from_params: KV cache shifting is not supported for this model, disabling KV cache shifting
common_init_from_params: setting dry_penalty_last_n to ctx_size = 4096
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 4096
main: model loaded
main: chat template, chat_template: (built-in), example_format: 'You are a helpful assistant

<|User|>Hello<|Assistant|>Hi there<|end▁of▁sentence|><|User|>How are you?<|Assistant|>'
main: server is listening on http://127.0.0.1:8080 - starting the main loop
srv  update_slots: all slots are idle
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

1 participant