[BENCHMARKS] DeepScaleR-1.5B-Preview F16 ollama GGUF vs llama.cpp #11828

Open
loretoparisi opened this issue Feb 12, 2025 · 3 comments

@loretoparisi

Running the brand new DeepScaleR-1.5B-Preview, quantized to F16 (here), on a Mac M1 Pro.

With the ollama GGUF I see an eval rate of 48.19 tokens/s (short prompt):

ollama run deepscaler --verbose
>>> what is the capital of Italy?
<think>

</think>

The capital of Italy is Rome.

total duration:       544.99475ms
load duration:        40.629958ms
prompt eval count:    10 token(s)
prompt eval duration: 253ms
prompt eval rate:     39.53 tokens/s
eval count:           12 token(s)
eval duration:        249ms
eval rate:            48.19 tokens/s
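
As a sanity check, the reported eval rate is simply the eval count divided by the eval duration (values copied from the log above):

# Sanity check of ollama's reported eval rate, using the values from the log above.
eval_count = 12          # "eval count" (tokens generated)
eval_duration_s = 0.249  # "eval duration" in seconds
print(eval_count / eval_duration_s)  # ~48.2 tokens/s, matching the reported eval rate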

and for a more complex prompt

>>> tell me some information about Rome in Italy
<think>
Okay, so the user initially asked for the capital of Italy and I told them it's Rome. Now they're asking to "tell me some information about Rome in Italy." They 
might be interested in getting a more detailed overview or perhaps looking to learn more about the city itself.

I should provide a comprehensive answer that covers different aspects of Rome. Maybe start with its location, history, architecture, culture, and modern 
developments. Also, mentioning its significance as an educational and cultural hub would add value.

I need to make sure the information is accurate and easy to understand, while also giving depth so it answers their query thoroughly.
</think>

Rome, often referred to as the "Capital of Italy," is a historic and vibrant city located in Italy. Here are some key details about Rome:

1. **Location**: Rome is situated in the Italy region of Central Europe, specifically within the province of Emilia-Romagna.

2. **History**: Rome has a rich history dating back over two thousand years. It was the capital of the Roman Republic until it was occupied by the Mongol Empire 
and later by the Ottomans during the Islamic conquest. The city gained its modern identity during the Middle Ages with its transformation into a major European 
metropolis.

3. **Architecture**: Rome boasts a stunning array of historic buildings that have become iconic landmarks, such as the Roman Colosseum, the Roman Forum, the 
Pantheon, and the Vatican Museums (which includes the Sistine Chapel). The city's architecture is renowned for its intricate streets, domes, and towers.

4. **Cultural Significance**: Rome is one of the most important cultural and educational centers in the world. It has been a center of learning, art, and trade 
throughout history. The Accademia Piazzale, a cultural institution, attracts millions of visitors each year.

5. **Modern Developments**: In recent years, Rome has experienced significant growth and development. Modern architecture, such as the Sistine Chapel Cathedral 
and the Vatican Museums, are now more accessible to the public. The city is also home to numerous universities and vibrant entertainment venues.

6. **Tourism**: Rome offers a variety of tourist attractions, including the Roman Forum, Palazzo Vecchio (the city's oldest building), the Colosseum, and the 
Vatican. The city is famous for its stunning views from Roman ruins like the Pantheon and the Great Wall of China.

7. **Economy**: Rome has a diverse economy that includes tourism, finance, technology, and manufacturing. It serves as a hub for business and innovation.

Rome's combination of history, culture, and modern architecture makes it a fascinating and important city to explore. Whether you're a history enthusiast or an 
cultural lover, Rome offers something for everyone.

total duration:       13.643213s
load duration:        33.387458ms
prompt eval count:    32 token(s)
prompt eval duration: 273ms
prompt eval rate:     117.22 tokens/s
eval count:           567 token(s)
eval duration:        13.105s
eval rate:            43.27 tokens/s

while running it with llama-server, using this setup:

./build/bin/llama-server \
    --model .ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e \
    --n-gpu-layers 29 \
    --ctx-size 8192 \
    --cache-type-k q4_0 \
    --cache-type-v f16 \
    --flash-attn \
    --parallel 1 \
    --threads 12 \
    -ub 256 \
    --temp 0.6 \
    --host 0.0.0.0 \
    --port 8000

I'm getting really slow tokens/sec (short prompt):

main: server is listening on http://0.0.0.0:8000 - starting the main loop
srv  update_slots: all slots are idle
slot launch_slot_: id  0 | task 0 | processing task
slot update_slots: id  0 | task 0 | new prompt, n_ctx_slot = 8192, n_keep = 0, n_prompt_tokens = 6
slot update_slots: id  0 | task 0 | kv cache rm [0, end)
slot update_slots: id  0 | task 0 | prompt processing progress, n_past = 6, n_tokens = 6, progress = 1.000000
slot update_slots: id  0 | task 0 | prompt done, n_past = 6, n_tokens = 6
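
To get actual tokens/s numbers out of llama-server, one option is to read the timings object it returns with each completion. A minimal sketch in Python, assuming the /completion endpoint on the port configured above; the exact fields inside "timings" may vary between llama.cpp builds:

# Query the llama-server instance started above and print its timing info.
# Assumes the /completion endpoint on localhost:8000; the exact fields inside
# "timings" may differ between llama.cpp builds.
import json, urllib.request

payload = {
    "prompt": "<|User|>What is the capital of Italy?<|Assistant|>",
    "n_predict": 128,
}
req = urllib.request.Request(
    "http://localhost:8000/completion",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    body = json.load(resp)

print(body.get("timings", {}))  # prompt/generation speeds as reported by the server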

Here are the details of the llama-server load:

build: 4695 (fef0cbea) with Apple clang version 16.0.0 (clang-1600.0.26.6) for arm64-apple-darwin24.0.0
system info: n_threads = 12, n_threads_batch = 12, total_threads = 8

system_info: n_threads = 12 (n_threads_batch = 12) / 8 | Metal : EMBED_LIBRARY = 1 | CPU : NEON = 1 | ARM_FMA = 1 | FP16_VA = 1 | DOTPROD = 1 | LLAMAFILE = 1 | ACCELERATE = 1 | AARCH64_REPACK = 1 | 

main: HTTP server is listening, hostname: 0.0.0.0, port: 8000, http threads: 7
main: loading model
srv    load_model: loading model '.ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e'
llama_model_load_from_file_impl: using device Metal (Apple M1 Pro) - 10922 MiB free
llama_model_loader: loaded meta data with 47 key-value pairs and 339 tensors from .ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = qwen2
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = DeepScaleR 1.5B Preview
llama_model_loader: - kv   3:                       general.organization str              = Agentica Org
llama_model_loader: - kv   4:                           general.finetune str              = Preview
llama_model_loader: - kv   5:                           general.basename str              = DeepScaleR
llama_model_loader: - kv   6:                         general.size_label str              = 1.5B
llama_model_loader: - kv   7:                            general.license str              = mit
llama_model_loader: - kv   8:                   general.base_model.count u32              = 1
llama_model_loader: - kv   9:                  general.base_model.0.name str              = DeepSeek R1 Distill Qwen 1.5B
llama_model_loader: - kv  10:          general.base_model.0.organization str              = Deepseek Ai
llama_model_loader: - kv  11:              general.base_model.0.repo_url str              = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv  12:                      general.dataset.count u32              = 4
llama_model_loader: - kv  13:                     general.dataset.0.name str              = NuminaMath CoT
llama_model_loader: - kv  14:             general.dataset.0.organization str              = AI MO
llama_model_loader: - kv  15:                 general.dataset.0.repo_url str              = https://huggingface.co/AI-MO/NuminaMa...
llama_model_loader: - kv  16:                     general.dataset.1.name str              = Omni MATH
llama_model_loader: - kv  17:             general.dataset.1.organization str              = KbsdJames
llama_model_loader: - kv  18:                 general.dataset.1.repo_url str              = https://huggingface.co/KbsdJames/Omni...
llama_model_loader: - kv  19:                     general.dataset.2.name str              = STILL 3 Preview RL Data
llama_model_loader: - kv  20:             general.dataset.2.organization str              = RUC AIBOX
llama_model_loader: - kv  21:                 general.dataset.2.repo_url str              = https://huggingface.co/RUC-AIBOX/STIL...
llama_model_loader: - kv  22:                     general.dataset.3.name str              = Competition_Math
llama_model_loader: - kv  23:             general.dataset.3.organization str              = Hendrycks
llama_model_loader: - kv  24:                 general.dataset.3.repo_url str              = https://huggingface.co/hendrycks/comp...
llama_model_loader: - kv  25:                          general.languages arr[str,1]       = ["en"]
llama_model_loader: - kv  26:                          qwen2.block_count u32              = 28
llama_model_loader: - kv  27:                       qwen2.context_length u32              = 131072
llama_model_loader: - kv  28:                     qwen2.embedding_length u32              = 1536
llama_model_loader: - kv  29:                  qwen2.feed_forward_length u32              = 8960
llama_model_loader: - kv  30:                 qwen2.attention.head_count u32              = 12
llama_model_loader: - kv  31:              qwen2.attention.head_count_kv u32              = 2
llama_model_loader: - kv  32:                       qwen2.rope.freq_base f32              = 10000.000000
llama_model_loader: - kv  33:     qwen2.attention.layer_norm_rms_epsilon f32              = 0.000001
llama_model_loader: - kv  34:                          general.file_type u32              = 1
llama_model_loader: - kv  35:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  36:                         tokenizer.ggml.pre str              = qwen2
llama_model_loader: - kv  37:                      tokenizer.ggml.tokens arr[str,151936]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  38:                  tokenizer.ggml.token_type arr[i32,151936]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  39:                      tokenizer.ggml.merges arr[str,151387]  = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv  40:                tokenizer.ggml.bos_token_id u32              = 151646
llama_model_loader: - kv  41:                tokenizer.ggml.eos_token_id u32              = 151643
llama_model_loader: - kv  42:            tokenizer.ggml.padding_token_id u32              = 151643
llama_model_loader: - kv  43:               tokenizer.ggml.add_bos_token bool             = true
llama_model_loader: - kv  44:               tokenizer.ggml.add_eos_token bool             = false
llama_model_loader: - kv  45:                    tokenizer.chat_template str              = {% if not add_generation_prompt is de...
llama_model_loader: - kv  46:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:  141 tensors
llama_model_loader: - type  f16:  198 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = F16
print_info: file size   = 3.31 GiB (16.00 BPW) 
load: special_eos_id is not in special_eog_ids - the tokenizer config may be incorrect
load: special tokens cache size = 22
load: token to piece cache size = 0.9310 MB
print_info: arch             = qwen2
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 1536
print_info: n_layer          = 28
print_info: n_head           = 12
print_info: n_head_kv        = 2
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 6
print_info: n_embd_k_gqa     = 256
print_info: n_embd_v_gqa     = 256
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-06
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 8960
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: causal attn      = 1
print_info: pooling type     = 0
print_info: rope type        = 2
print_info: rope scaling     = linear
print_info: freq_base_train  = 10000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn  = 131072
print_info: rope_finetuned   = unknown
print_info: ssm_d_conv       = 0
print_info: ssm_d_inner      = 0
print_info: ssm_d_state      = 0
print_info: ssm_dt_rank      = 0
print_info: ssm_dt_b_c_rms   = 0
print_info: model type       = 1.5B
print_info: model params     = 1.78 B
print_info: general.name     = DeepScaleR 1.5B Preview
print_info: vocab type       = BPE
print_info: n_vocab          = 151936
print_info: n_merges         = 151387
print_info: BOS token        = 151646 '<|begin▁of▁sentence|>'
print_info: EOS token        = 151643 '<|end▁of▁sentence|>'
print_info: EOT token        = 151643 '<|end▁of▁sentence|>'
print_info: PAD token        = 151643 '<|end▁of▁sentence|>'
print_info: LF token         = 198 'Ċ'
print_info: FIM PRE token    = 151659 '<|fim_prefix|>'
print_info: FIM SUF token    = 151661 '<|fim_suffix|>'
print_info: FIM MID token    = 151660 '<|fim_middle|>'
print_info: FIM PAD token    = 151662 '<|fim_pad|>'
print_info: FIM REP token    = 151663 '<|repo_name|>'
print_info: FIM SEP token    = 151664 '<|file_sep|>'
print_info: EOG token        = 151643 '<|end▁of▁sentence|>'
print_info: EOG token        = 151662 '<|fim_pad|>'
print_info: EOG token        = 151663 '<|repo_name|>'
print_info: EOG token        = 151664 '<|file_sep|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 28 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 29/29 layers to GPU
load_tensors: Metal_Mapped model buffer size =  2944.68 MiB
load_tensors:   CPU_Mapped model buffer size =   445.12 MiB
...........................................................................
llama_init_from_model: n_seq_max     = 1
llama_init_from_model: n_ctx         = 8192
llama_init_from_model: n_ctx_per_seq = 8192
llama_init_from_model: n_batch       = 2048
llama_init_from_model: n_ubatch      = 256
llama_init_from_model: flash_attn    = 1
llama_init_from_model: freq_base     = 10000.0
llama_init_from_model: freq_scale    = 1
llama_init_from_model: n_ctx_per_seq (8192) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
ggml_metal_init: allocating
ggml_metal_init: found device: Apple M1 Pro
ggml_metal_init: picking default device: Apple M1 Pro
ggml_metal_init: using embedded metal library
ggml_metal_init: GPU name:   Apple M1 Pro
ggml_metal_init: GPU family: MTLGPUFamilyApple7  (1007)
ggml_metal_init: GPU family: MTLGPUFamilyCommon3 (3003)
ggml_metal_init: GPU family: MTLGPUFamilyMetal3  (5001)
ggml_metal_init: simdgroup reduction   = true
ggml_metal_init: simdgroup matrix mul. = true
ggml_metal_init: has residency sets    = true
ggml_metal_init: has bfloat            = true
ggml_metal_init: use bfloat            = false
ggml_metal_init: hasUnifiedMemory      = true
ggml_metal_init: recommendedMaxWorkingSetSize  = 11453.25 MB
ggml_metal_init: skipping kernel_get_rows_bf16                     (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_1row              (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_f32_l4                (not supported)
ggml_metal_init: skipping kernel_mul_mv_bf16_bf16                  (not supported)
ggml_metal_init: skipping kernel_mul_mv_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_mul_mm_bf16_f32                   (not supported)
ggml_metal_init: skipping kernel_mul_mm_id_bf16_f32                (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h64           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h80           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h96           (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h112          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h128          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_bf16_h256          (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h128      (not supported)
ggml_metal_init: skipping kernel_flash_attn_ext_vec_bf16_h256      (not supported)
ggml_metal_init: skipping kernel_cpy_f32_bf16                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_f32                      (not supported)
ggml_metal_init: skipping kernel_cpy_bf16_bf16                     (not supported)
llama_kv_cache_init: kv_size = 8192, offload = 1, type_k = 'q4_0', type_v = 'f16', n_layer = 28, can_shift = 1
llama_kv_cache_init:      Metal KV buffer size =   143.50 MiB
llama_init_from_model: KV self size  =  143.50 MiB, K (q4_0):   31.50 MiB, V (f16):  112.00 MiB
llama_init_from_model:        CPU  output buffer size =     0.58 MiB
llama_init_from_model:      Metal compute buffer size =   151.38 MiB
llama_init_from_model:        CPU compute buffer size =     9.50 MiB
llama_init_from_model: graph nodes  = 875
llama_init_from_model: graph splits = 58
common_init_from_params: setting dry_penalty_last_n to ctx_size = 8192
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
srv          init: initializing slots, n_slots = 1
slot         init: id  0 | task -1 | new slot n_ctx_slot = 8192
main: model loaded

I will not include a longer prompt because it is extremely slow.

@loretoparisi commented Feb 12, 2025

[UPDATE]

Okay, some improvement after setting --n-gpu-layers 28 and removing --cache-type-k q4_0:

% ./build/bin/llama-cli \
    --model /.ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e \
    --n-gpu-layers 28 \
    --ctx-size 8192 \
    --cache-type-v f16 \
    --flash-attn \
    --parallel 1 \
    --threads 12 -no-cnv --prio 2 \
    -ub 256 \
    --temp 0.6 \
    --prompt "<|User|>What is the capital of Italy?<|Assistant|>"

So I now get 10.95 tokens per second on the Mac M1 Pro:

llama_perf_sampler_print:    sampling time =      34,28 ms /   340 runs   (    0,10 ms per token,  9918,32 tokens per second)
llama_perf_context_print:        load time =     338,53 ms
llama_perf_context_print: prompt eval time =      96,75 ms /    10 tokens (    9,68 ms per token,   103,36 tokens per second)
llama_perf_context_print:        eval time =   30043,21 ms /   329 runs   (   91,32 ms per token,    10,95 tokens per second)
llama_perf_context_print:       total time =   30233,64 ms /   339 tokens
ggml_metal_free: deallocating

which is still roughly 4x slower than ollama.
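
For reference, the ratio of the two measured decode speeds, taken from the runs above:

# Ratio of the measured decode speeds (values copied from the runs above).
ollama_eval_tps = 48.19     # ollama eval rate, short prompt
llama_cli_eval_tps = 10.95  # llama-cli eval rate from this run
print(ollama_eval_tps / llama_cli_eval_tps)  # ~4.4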

@ggerganov (Member)

-ngl 29 -t 1

@loretoparisi

Setting -t 1 and --n-gpu-layers 29:

% ./build/bin/llama-cli \
    --model /Users/musixmatch/.ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e \
    --n-gpu-layers 29 \
    --ctx-size 8192 \
    --cache-type-v f16 \
    --flash-attn \
    --parallel 1 \
    --threads 12 -no-cnv --prio 2 -t 1 \
    -ub 256 \
    --temp 0.6

  • SHORT PROMPT: we get 39.15 tokens per second
llama_perf_sampler_print:    sampling time =      74,55 ms /  1127 runs   (    0,07 ms per token, 15117,37 tokens per second)
llama_perf_context_print:        load time =     420,99 ms
llama_perf_context_print: prompt eval time =      48,24 ms /    10 tokens (    4,82 ms per token,   207,28 tokens per second)
llama_perf_context_print:        eval time =   28508,24 ms /  1116 runs   (   25,55 ms per token,    39,15 tokens per second)
llama_perf_context_print:       total time =   28720,27 ms /  1126 tokens
  • LONG PROMPT: running at 46.52 tokens per second
llama_perf_sampler_print:    sampling time =      80,40 ms /  1272 runs   (    0,06 ms per token, 15821,49 tokens per second)
llama_perf_context_print:        load time =    1507,67 ms
llama_perf_context_print: prompt eval time =      53,25 ms /    11 tokens (    4,84 ms per token,   206,56 tokens per second)
llama_perf_context_print:        eval time =   27082,33 ms /  1260 runs   (   21,49 ms per token,    46,52 tokens per second)
llama_perf_context_print:       total time =   27301,40 ms /  1271 tokens
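
For convenience, a side-by-side of the decode (eval) speeds reported in this thread; a small Python sketch with the values copied from the logs above:

# Decode (eval) speeds reported in this thread, in tokens/s.
results = {
    "ollama, short prompt":               48.19,
    "ollama, long prompt":                43.27,
    "llama-cli, -ngl 28, --threads 12":   10.95,
    "llama-cli, -ngl 29, -t 1, short":    39.15,
    "llama-cli, -ngl 29, -t 1, long":     46.52,
}
for name, tps in results.items():
    print(f"{name:35s} {tps:6.2f} tok/s")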
