[BENCHMARKS] DeepScaleR-1.5B-Preview F16 ollama GGUF vs llama.cpp #11828
Comments
[UPDATE] Okay, some improvements. Setting

% ./build/bin/llama-cli \
--model /.ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e \
--n-gpu-layers 28 \
--ctx-size 8192 \
--cache-type-v f16 \
--flash-attn \
--parallel 1 \
--threads 12 -no-cnv --prio 2 \
-ub 256 \
--temp 0.6 \
--prompt "<|User|>What is the capital of Italy?<|Assistant|>"

So now I get the numbers shown below, which are anyway still about 4x slower.
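For a tighter apples-to-apples comparison than timing llama-cli runs by hand, llama.cpp's bundled llama-bench tool could be pointed at the same ollama blob. This is only a sketch: the -p/-n sizes, -ngl 99 and the thread count are my own assumptions, not settings taken from this report or from ollama.

# sketch: measure prompt-processing (-p) and generation (-n) throughput on the same GGUF blob
./build/bin/llama-bench \
  -m /.ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e \
  -ngl 99 -fa 1 -t 12 \
  -p 512 -n 128

llama-bench prints tokens-per-second rows for the prompt-processing and generation tests, which are easier to compare across flag changes than the llama_perf summaries below.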
Setting

% ./build/bin/llama-cli \
--model /Users/musixmatch/.ollama/models/blobs/sha256-95ff0bccfe6096c58d176bcbe8d0c87ccc4b517c0eade8acaa0797a9e441122e \
--n-gpu-layers 29 \
--ctx-size 8192 \
--cache-type-v f16 \
--flash-attn \
--parallel 1 \
--threads 12 -no-cnv --prio 2 -t 1 \
-ub 256 \
--temp 0.6 \

llama_perf_sampler_print: sampling time = 74,55 ms / 1127 runs ( 0,07 ms per token, 15117,37 tokens per second)
llama_perf_context_print: load time = 420,99 ms
llama_perf_context_print: prompt eval time = 48,24 ms / 10 tokens ( 4,82 ms per token, 207,28 tokens per second)
llama_perf_context_print: eval time = 28508,24 ms / 1116 runs ( 25,55 ms per token, 39,15 tokens per second)
llama_perf_context_print: total time = 28720,27 ms / 1126 tokens

llama_perf_sampler_print: sampling time = 80,40 ms / 1272 runs ( 0,06 ms per token, 15821,49 tokens per second)
llama_perf_context_print: load time = 1507,67 ms
llama_perf_context_print: prompt eval time = 53,25 ms / 11 tokens ( 4,84 ms per token, 206,56 tokens per second)
llama_perf_context_print: eval time = 27082,33 ms / 1260 runs ( 21,49 ms per token, 46,52 tokens per second)
llama_perf_context_print: total time = 27301,40 ms / 1271 tokens
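The eval rate llama.cpp prints is just generated tokens divided by eval time, so the second summary above can be double-checked by hand; at roughly 46.5 tokens/s it is already within a few percent of the 48.19 tokens/s reported for ollama in the issue text below.

# 1260 runs in 27082,33 ms of eval time -> 1260 / 27.08233 s ≈ 46.52 tokens/s
echo "scale=2; 1260 / 27.08233" | bc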
Running on a Mac M1 Pro the brand new DeepScaleR-1.5B-Preview quantized to F16 (here), I see the ollama GGUF quantization running at an eval rate of 48.19 tokens/s for a short prompt, and I also tried a more complex prompt. Running it with llama-server, using this setup, I'm getting a really slow tokens/sec (short prompt).

Here are the details of the llama-server loading. I will not try a longer prompt because it is extremely slow.
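The per-token "eval rate" on the ollama side is typically read from the timing summary printed by ollama run --verbose, so the comparison can be reproduced on the same machine along these lines. The model tag below is a placeholder, since the exact local tag for the DeepScaleR F16 GGUF isn't given here.

# placeholder tag: substitute whatever name the local DeepScaleR-1.5B F16 model was pulled or created under
ollama run deepscaler-1.5b-f16 --verbose
# the summary printed after the response includes "prompt eval rate" and "eval rate" in tokens/s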