@xdd130 Since T-MAC uses a different set of instructions (tbl/shuf) than multiply-based methods (mul/madd/...), the performance gap between them can vary from CPU to CPU. AVX512 support in Zen4 may be one of the reasons.
BTW, the convert script keeps the embedding/output weights in FP16, while Q4_K uses smaller types for them. You can try running llama-quantize with --token-embedding-type q4_k, --output-tensor-type q6_k, and quant type f16 to further compress the model size.
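A minimal sketch of that invocation, assuming a standard llama.cpp build directory; the GGUF file names below are hypothetical placeholders for your converted model:

```sh
# File names are hypothetical -- point these at your converted T-MAC GGUF.
# Per the suggestion above, quant type f16 plus the two override flags
# re-quantizes only the token-embedding and output tensors.
./build/bin/llama-quantize \
    --token-embedding-type q4_k \
    --output-tensor-type q6_k \
    qwen2.5-3b-instruct-tmac-f16.gguf \
    qwen2.5-3b-instruct-tmac-compact.gguf \
    f16
```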
Test platform: AMD Ryzen 5 7600X
T-MAC test steps:
Test model: Qwen/Qwen2.5-3B-Instruct-GPTQ-Int4
Compilation instructions:
Test instructions:
Result:
Original llama.cpp:
Test model: Qwen/Qwen2.5-3B-Instruct-GGUF/qwen2.5-3b-instruct-q4_k_m.gguf
Compilation instructions:
Test instructions:
Result:
From the test results, it seems that T-MAC has no performance advantage on this machine. What could be the reason for this?