
Using T-MAC is slower than original llama.cpp #79

Open · xdd130 opened this issue Dec 18, 2024 · 3 comments

xdd130 commented Dec 18, 2024

Test platform: AMD Ryzen 5 7600X

T-MAC test steps:

Test model:
Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4

Compilation command:

python tools/run_pipeline.py -o Qwen2.5-3B-Instruct-GPTQ-Int4 -m auto-gptq -q int_n

Test command:

./3rdparty/llama.cpp/build/bin/llama-bench -m Qwen2.5-3B-Instruct-GPTQ-Int4/ggml-model.int_n.gguf -p 512 -n 128 -t 4

Result:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 ?B INT_N | 2.53 GiB | 3.40 B | CPU | 4 | pp512 | 58.42 ± 0.21 |
| qwen2 ?B INT_N | 2.53 GiB | 3.40 B | CPU | 4 | tg128 | 20.46 ± 0.07 |

Original llama.cpp:

Test model:

Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q4_k_m.gguf

Compilation commands:

mkdir build && cd build
cmake .. -DCMAKE_VERBOSE_MAKEFILE:BOOL=ON -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_COMPILER=clang -DCMAKE_CXX_COMPILER=clang++
cmake --build . --target llama-cli llama-bench --config Release -- -j6

Test command:

./llama-bench -m qwen2.5-3b-instruct-q4_k_m.gguf -p 512 -n 128 -t 4

Result:

| model | size | params | backend | threads | test | t/s |
| --- | --- | --- | --- | --- | --- | --- |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | pp512 | 67.33 ± 0.10 |
| qwen2 3B Q4_K - Medium | 1.95 GiB | 3.40 B | CPU | 4 | tg128 | 22.72 ± 0.04 |

From these results, T-MAC appears to have no performance advantage on this machine. What could be the reason for this?

BodhiHu commented Dec 23, 2024

Hi @xdd130, what are your test OS and hardware config?


xdd130 commented Dec 25, 2024

Hi @BodhiHu,
Thanks for your reply. Here is my test config:
OS: Ubuntu 20.04
CPU: AMD Ryzen 5 7600X (6 cores / 12 threads)
Memory: 32 GB DDR5-6400

QingtaoLi1 (Contributor) commented:

@xdd130 Since T-MAC uses a different set of instructions (tbl/shuf) from multiply-based methods (mul/madd/...), the performance gap between the two can vary with the CPU. AVX-512 support in Zen 4 may be one of the reasons; a rough illustration of the two instruction styles is sketched below.
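For intuition only, here is a minimal, self-contained sketch (not T-MAC's actual kernel; the data and table values are invented for illustration) contrasting the two styles on AVX2: a multiply-accumulate step in the spirit of the vpmaddubsw used by llama.cpp's Q4 kernels, versus a byte-shuffle table lookup (vpshufb, the x86 counterpart of ARM tbl) of the kind T-MAC's LUT approach relies on. Which style is faster depends on how much shuffle versus multiply throughput the specific core provides.

```c
// lut_vs_mul.c -- toy comparison of the two instruction styles.
// Compile with: gcc -O2 -mavx2 lut_vs_mul.c
#include <immintrin.h>
#include <stdint.h>
#include <stdio.h>

int main(void) {
    // Toy inputs: 32 4-bit weight indices (stored one per byte) and 32 int8 activations.
    uint8_t idx_bytes[32];
    int8_t  act_bytes[32];
    for (int i = 0; i < 32; i++) {
        idx_bytes[i] = (uint8_t)(i % 16);
        act_bytes[i] = (int8_t)(i - 16);
    }
    __m256i idx = _mm256_loadu_si256((const __m256i *)idx_bytes);
    __m256i act = _mm256_loadu_si256((const __m256i *)act_bytes);

    // (a) Multiply-based style: vpmaddubsw multiplies unsigned*signed bytes and
    //     adds adjacent pairs into 16-bit lanes -- the "mul/madd" path.
    __m256i mul_acc = _mm256_maddubs_epi16(idx, act);

    // (b) Lookup-based style: precompute the 16 possible contributions of a 4-bit
    //     index once (here simply index * 3 as a stand-in), then fetch them with a
    //     single byte shuffle (vpshufb) -- the "tbl/shuf" path, no multiplier used.
    int8_t table[32];
    for (int i = 0; i < 32; i++) table[i] = (int8_t)((i % 16) * 3);
    __m256i lut     = _mm256_loadu_si256((const __m256i *)table);
    __m256i lut_acc = _mm256_shuffle_epi8(lut, idx);  // per-128-bit-lane 16-entry lookup

    // Store and print one element of each result just to show both paths ran.
    int16_t mul_out[16];
    int8_t  lut_out[32];
    _mm256_storeu_si256((__m256i *)mul_out, mul_acc);
    _mm256_storeu_si256((__m256i *)lut_out, lut_acc);
    printf("madd path [0] = %d, shuffle path [0] = %d\n", mul_out[0], lut_out[0]);
    return 0;
}
```

On a core with fast wide multiply/madd units (such as Zen 4 with AVX-512), the multiply path may already be fast enough that the lookup path loses its edge, which would be consistent with the numbers above.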

By the way, the convert script keeps the embedding/output weights in FP16, while Q4_K uses smaller types for them. You can try running llama-quantize with --token-embedding-type q4_k --output-tensor-type q6_k and quantization type f16 to further compress the model; an example invocation is sketched below.
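A hedged example of such an invocation (the input/output paths are placeholders, and whether this recipe applies cleanly to the INT_N gguf should be verified against llama-quantize --help of the bundled llama.cpp):

./3rdparty/llama.cpp/build/bin/llama-quantize --token-embedding-type q4_k --output-tensor-type q6_k Qwen2.5-3B-Instruct-GPTQ-Int4/ggml-model.int_n.gguf Qwen2.5-3B-Instruct-GPTQ-Int4/ggml-model.int_n.requant.gguf f16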
