Performance degradation with P40 on larger models #6814

Closed
samr7 opened this issue Apr 21, 2024 · 2 comments

samr7 commented Apr 21, 2024

I have a machine with a lot of old parts in it, including 8 P40s and 2 Xeon E5-2667v2 CPUs.

I build llama.cpp using:
cmake -DLLAMA_AVX2=off -DLLAMA_F16C=off -DLLAMA_CUBLAS=on -DLLAMA_CUDA_FORCE_MMQ=on

Using a llama2-70b-Q8_0 model, I see good results with release b1842 and earlier. Starting with b1843 (January 12, the build that includes #4766), throughput drops by ~62%:

bin/main -m ../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf -ngl 99 -p "Why is the sky blue?" -n 128

b1691: 10.76 t/s
b1767: 9.75 t/s
b1808: 9.76 t/s
b1832: 9.77 t/s
b1842: 9.76 t/s
b1843: 3.73 t/s
b2400: 3.83 t/s
b2709: 3.84 t/s

Repeating the test with other models, I find the discrepancy is much smaller for smaller models, to the point that the 8B model is considerably faster with the latest release:

| Model | b1842 | b1843 | b2709 |
| --- | --- | --- | --- |
| Synthia-70b-v1.2.Q8_0 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| phind-codellama-34b-v2.Q8_0 | 16.99 t/s | 7.54 t/s | 7.78 t/s |
| llama-2-13b-Q8_0 | 21.10 t/s | 17.67 t/s | 18.63 t/s |
| Meta-Llama-3-8B-Instruct.Q8_0 | 25.66 t/s | 33.27 t/s | 31.83 t/s |
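
To make the trend across model sizes explicit, the relative change from b1842 to b1843 can be computed directly from the numbers in the table. This is only a sketch of the arithmetic; `rel_change` is a hypothetical helper name, not part of llama.cpp.

```shell
# Hypothetical helper: relative change in tokens/s between two builds.
# Input columns: model name, b1842 t/s, b1843 t/s (copied from the table).
rel_change() {
    awk '{ printf "%s %+.1f%%\n", $1, ($3 - $2) / $2 * 100 }'
}

rel_change <<'EOF'
Synthia-70b-v1.2.Q8_0 9.76 3.73
phind-codellama-34b-v2.Q8_0 16.99 7.54
llama-2-13b-Q8_0 21.10 17.67
Meta-Llama-3-8B-Instruct.Q8_0 25.66 33.27
EOF
```

The 70b row gives -61.8%, which is the ~62% drop quoted earlier, while the 8B row comes out positive.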

Using fewer GPUs for this test (with the 70b model) makes b1842 a bit slower, but otherwise doesn't seem to change the result much:

| GPUs | b1842 | b1843 | b2709 |
| --- | --- | --- | --- |
| 8 | 9.76 t/s | 3.73 t/s | 3.84 t/s |
| 4 | 9.61 t/s | 3.77 t/s | 3.89 t/s |
| 3 | 8.32 t/s | 3.77 t/s | 3.91 t/s |

Changing the CPU thread count (with the 70b model) yields at most marginal gains on b1842 and none on b2709, and does not close the gap between the builds:

| Threads | b1842 | b2709 |
| --- | --- | --- |
| -t 1 | 10.05 t/s | 3.90 t/s |
| -t 4 | 10.06 t/s | 3.90 t/s |
| -t 8 | 10.09 t/s | 3.90 t/s |

The system is similar in topology to a Supermicro SYS-4028GR-TR2. The GPUs are all attached via PCIe 3.0 x16 to PLX switches and have relatively good CPU and P2P bandwidth over PCIe: 11-13 GB/s between any pair.

Any ideas?

slaren (Member) commented Apr 21, 2024

Try -sm row.

samr7 (Author) commented Apr 21, 2024

-sm row seems to improve things a lot! Thanks.
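
For readers comparing the two modes, the suggestion amounts to re-running the original benchmark with the split-mode flag added. The sketch below only prints the two command lines, so they can be inspected before running. It assumes the same binary and model path as in the first comment, and assumes that newer builds split by layer unless told otherwise. `bench_cmd` is a hypothetical helper name.

```shell
# Model path taken from the original report.
MODEL=../text-generation-webui/models/Synthia-70b-v1.2.Q8_0.gguf

# Hypothetical helper: print the benchmark command for a given split mode.
# $1 is the value passed to -sm (--split-mode), e.g. layer or row.
bench_cmd() {
    printf 'bin/main -m %s -ngl 99 -sm %s -p "Why is the sky blue?" -n 128\n' "$MODEL" "$1"
}

# Print one command per split mode; pipe the output to `sh` to execute them.
for mode in layer row; do
    bench_cmd "$mode"
done
```

Comparing the reported t/s of the two runs on the same build shows how much of the regression is due to the split mode.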

samr7 closed this as completed Apr 21, 2024