
Using T-MAC is slower than original llama.cpp #79

Open · xdd130 opened this issue Dec 18, 2024 · 3 comments

xdd130 commented Dec 18, 2024

Test platform: AMD Ryzen 5 7600X

T-MAC test steps:

Test model:
Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4

Compilation command:

python tools/run_pipeline.py -o Qwen2.5-3B-Instruct-GPTQ-Int4 -m auto-gptq -q int_n

Test command:

./3rdparty/llama.cpp/build/bin/llama-bench -m Qwen2.5-3B-Instruct-GPTQ-Int4/ggml-model.int_n.gguf -p 512 -n 128 -t 4

Result:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B INT_N | 2.53 GiB | 3.40 B | CPU | 4 | pp512 | 58.42 ± 0.21 |
| qwen2 ?B INT_N | 2.53 GiB | 3.40 B | CPU | 4 | tg128 | 20.46 ± 0.07 |

Original llama.cpp:

Test model:

Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q4_k_m.gguf

Compilation commands:

mkdir build && cd build
cmake .. -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
cmake --build . --target llama-cli llama-bench --config Release -- -j6

Test command:

./llama-bench -m qwen2.5-3b-instruct-q4_k_m.gguf -p 512 -n 128 -t 4

Result:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | pp512 | 67.33 ± 0.10 |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | tg128 | 22.72 ± 0.04 |

From these results, T-MAC appears to have no performance advantage on this machine. What could be the reason for this?

BodhiHu commented Dec 23, 2024

Hi @xdd130, what are your test OS and hardware config?


xdd130 commented Dec 25, 2024

Hi @BodhiHu,
Thanks for your reply. Here is my test config:
OS: Ubuntu 20.04
CPU: AMD Ryzen 5 7600X (6 cores / 12 threads)
Memory: 32 GB DDR5-6400

QingtaoLi1 (Contributor) commented:

@xdd130 Since T-MAC uses a different set of instructions (tbl/shuf) from multiply-based methods (mul/madd/...), the performance gap between the two can vary with the CPU. AVX-512 support in Zen 4 may be one of the reasons; a rough illustration of the two instruction styles is sketched below.
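For intuition only, here is a minimal, self-contained sketch (not T-MAC's actual kernel; the data and table values are invented for illustration) contrasting the two styles on AVX2: a multiply-accumulate step in the spirit of the vpmaddubsw used by llama.cpp's Q4 kernels, versus a byte-shuffle table lookup (vpshufb, the x86 counterpart of ARM tbl) of the kind T-MAC's LUT approach relies on. Which style is faster depends on how much shuffle versus multiply throughput the specific core provides.

```c
// lut_vs_mul.c -- toy comparison of the two instruction styles.
// Compile with: gcc -O2 -mavx2 lut_vs_mul.c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Toy inputs: 32 4-bit weight indices (stored one per byte) and 32 int8 activations.
    uint8_t idx_bytes[32];
    int8_t  act_bytes[32];
    for (int i = 0; i < 32; i++) {
        idx_bytes[i] = (uint8_t)(i % 16);
        act_bytes[i] = (int8_t)(i - 16);
    }
    __m256i idx = _mm256_loadu_si256((const __m256i *)idx_bytes);
    __m256i act = _mm256_loadu_si256((const __m256i *)act_bytes);

    // (a) Multiply-based style: vpmaddubsw multiplies unsigned*signed bytes and
    //     adds adjacent pairs into 16-bit lanes -- the "mul/madd" path.
    __m256i mul_acc = _mm256_maddubs_epi16(idx, act);

    // (b) Lookup-based style: precompute the 16 possible contributions of a 4-bit
    //     index once (here simply index * 3 as a stand-in), then fetch them with a
    //     single byte shuffle (vpshufb) -- the "tbl/shuf" path, no multiplier used.
    int8_t table[32];
    for (int i = 0; i < 32; i++) table[i] = (int8_t)((i % 16) * 3);
    __m256i lut     = _mm256_loadu_si256((const __m256i *)table);
    __m256i lut_acc = _mm256_shuffle_epi8(lut, idx);  // per-128-bit-lane 16-entry lookup

    // Store and print one element of each result just to show both paths ran.
    int16_t mul_out[16];
    int8_t  lut_out[32];
    _mm256_storeu_si256((__m256i *)mul_out, mul_acc);
    _mm256_storeu_si256((__m256i *)lut_out, lut_acc);
    printf("madd path [0] = %d, shuffle path [0] = %d\n", mul_out[0], lut_out[0]);
    return 0;
}
```

On a core with fast wide multiply/madd units (such as Zen 4 with AVX-512), the multiply path may already be fast enough that the lookup path loses its edge, which would be consistent with the numbers above.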

By the way, the convert script keeps the embedding/output weights in FP16, while Q4_K uses smaller types for them. You can try running llama-quantize with --token-embedding-type q4_k --output-tensor-type q6_k and quantization type f16 to further compress the model; an example invocation is sketched below.
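A hedged example of such an invocation (the input/output paths are placeholders, and whether this recipe applies cleanly to the INT_N gguf should be verified against llama-quantize --help of the bundled llama.cpp):

./3rdparty/llama.cpp/build/bin/llama-quantize --token-embedding-type q4_k --output-tensor-type q6_k Qwen2.5-3B-Instruct-GPTQ-Int4/ggml-model.int_n.gguf Qwen2.5-3B-Instruct-GPTQ-Int4/ggml-model.int_n.requant.gguf f16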
