I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.
I build llama.cpp with:

```shell
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
```
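For completeness, the full out-of-tree build looks roughly like this (the directory layout is just habit; the flags above are the ones that matter):

```shell
# Sketch of the full build; LLAMA_CUBLAS / LLAMA_CUDA_FORCE_MMQ are the
# option names the builds of this era used.
mkdir -p build && cd build
cmake .. -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on
cmake --build . --config Release -j
```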
Using a llama-2-70b Q8_0 model, I see good results with release b1842 and earlier. With b1843 and newer (January 12, which includes #4766), I see a ~62% drop in generation speed:

```shell
bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128
```

b1691: 10.76 t/s
b1767: 9.75 t/s
b1808: 9.76 t/s
b1832: 9.77 t/s
b1842: 9.76 t/s
b1843: 3.73 t/s
b2400: 3.83 t/s
b2709: 3.84 t/s
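Each of those numbers comes from the timing lines that main prints at the end of a run; in effect I did the equivalent of the hypothetical loop below by hand (tags and paths abbreviated):

```shell
# Hypothetical harness for the per-build numbers above (done manually in practice).
# "eval time" matches main's llama_print_timings output lines.
for tag in b1842 b1843 b2709; do
  git checkout "$tag"
  cmake --build build -j
  build/bin/main -m Synthia-70b-v1.2.Q8_0.gguf -ngl 99 \
    -p "Why is the sky blue?" -n 128 2>&1 | grep "eval time"
done
```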
Trying the test with some other models, the discrepancy shrinks as the models get smaller, to the point that the 8B model is considerably faster with the latest release:

| Model | b1842 | b1843 | b2709 |
|---|---|---|---|
| Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
| llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
| Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |
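These should be reproducible with the bundled llama-bench tool instead of eyeballing main's timing printout (model path is whatever is local):

```shell
# Sketch: llama-bench reports prompt-processing and generation t/s directly.
bin/llama-bench -m Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -n 128
```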
Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:
| GPUs | b1842 | b1843 | b2709 |
|---|---|---|---|
| 8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| 4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
| 3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |
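For anyone reproducing the table above: one way to limit the GPU count without touching any llama.cpp options is the standard CUDA environment variable, e.g.:

```shell
# Sketch: expose only 4 of the 8 P40s to llama.cpp.
CUDA_VISIBLE_DEVICES=0,1,2,3 bin/main -m Synthia-70b-v1.2.Q8_0.gguf -ngl 99 \
  -p "Why is the sky blue?" -n 128
```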
Changing the CPU thread count (again with the 70b model) shows a slight improvement for each build, but does not resolve the bigger discrepancy:

| Threads | b1842 | b2709 |
|---|---|---|
| -t 1 | 10.05 t/s | 3.90 t/s |
| -t 4 | 10.06 t/s | 3.90 t/s |
| -t 8 | 10.09 t/s | 3.90 t/s |
The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all PCIe 3.0 x16, attached to PLX switches, and have relatively good CPU and P2P bandwidth over PCIe (11-13 GB/s between any pair).
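The topology and the pairwise bandwidth figures can be double-checked with the usual tools (p2pBandwidthLatencyTest is the binary from NVIDIA's cuda-samples repo; output format varies by version):

```shell
# Sketch: inspect the PLX layout and measure P2P bandwidth between GPU pairs.
nvidia-smi topo -m           # link matrix; PIX/PXB entries indicate PCIe bridge hops
./p2pBandwidthLatencyTest    # from cuda-samples: per-pair P2P bandwidth matrix
```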
Any ideas?