cuda : speed-up by using CUBLAS_COMPUTE_32F instead of CUBLAS_COMPUTE_16F #3816
Conversation
@slaren Have you noticed this as well? Do you think there is any reason not to switch to F32 compute?
I tested this again on a 3090 Ti, and for me master is faster:

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: c8d6a1f (1431) (master)

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: 3b9ea65 (1432) (PR)
Might be a dumb question, but would these changes affect quantized models at all?
@slaren Indeed, on RTX 3090 I also don't observe a benefit from 32F mode, so it's not a universal thing. @KerfuffleV2 Yes, because for some of the operations we dequantize to F16 and use cuBLAS. There are some numbers in my post above for quantized models.
I restored the
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
build: 0f2498f (1433)

LLAMA_CUBLAS=1 make -j batched-bench && ./batched-bench ./models/llama-2-13b.Q4_K_M.gguf 4096 1 99 1 512,3200 128,800 1
### master
main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.248 | 2060.39 | 1.985 | 64.48 | 2.234 | 286.52 |
| 512 | 800 | 1 | 1312 | 0.240 | 2131.52 | 12.681 | 63.08 | 12.922 | 101.54 |
| 3200 | 128 | 1 | 3328 | 1.946 | 1644.78 | 2.416 | 52.98 | 4.362 | 763.01 |
| 3200 | 800 | 1 | 4000 | 1.929 | 1658.93 | 15.530 | 51.51 | 17.459 | 229.11 |
### PR
main: n_kv_max = 4096, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.243 | 2110.64 | 1.898 | 67.43 | 2.141 | 298.96 |
| 512 | 800 | 1 | 1312 | 0.233 | 2197.75 | 12.323 | 64.92 | 12.556 | 104.49 |
| 3200 | 128 | 1 | 3328 | 1.893 | 1690.68 | 2.452 | 52.21 | 4.344 | 766.03 |
| 3200 | 800 | 1 | 4000 | 1.872 | 1709.50 | 15.783 | 50.69 | 17.655 | 226.57 |

Device 0: NVIDIA RTX A6000, compute capability 8.6
build: 0f2498f (1433)
The performance is the same now for me too, maybe a little bit better than master, but within the margin of error.

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6
build: 0f2498f (1433)
After 0f2498f the performance seems about the same as master. Tested with a Q5_K_M Mistral model, using:

master
main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

PR pre-0f2498f
main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

PR
main: n_kv_max = 4608, is_pp_shared = 1, n_gpu_layers = 99, mmq = 1

PR +
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 512 | 128 | 1 | 640 | 0.983 | 521.02 | 4.016 | 31.87 | 4.999 | 128.04 |
| 512 | 128 | 2 | 768 | 0.982 | 521.25 | 17.280 | 14.81 | 18.263 | 42.05 |
| 512 | 128 | 3 | 896 | 1.016 | 503.79 | 17.448 | 22.01 | 18.465 | 48.53 |
| 512 | 128 | 4 | 1024 | 0.987 | 519.00 | 17.664 | 28.99 | 18.650 | 54.91 |
| 512 | 128 | 5 | 1152 | 1.053 | 486.17 | 17.780 | 36.00 | 18.833 | 61.17 |
| 512 | 128 | 6 | 1280 | 0.988 | 518.00 | 17.942 | 42.80 | 18.930 | 67.62 |
| 512 | 128 | 7 | 1408 | 0.988 | 518.47 | 18.091 | 49.53 | 19.078 | 73.80 |
| 512 | 128 | 8 | 1536 | 0.989 | 517.95 | 18.360 | 55.77 | 19.348 | 79.39 |
| 512 | 128 | 16 | 2560 | 0.989 | 517.65 | 20.503 | 99.89 | 21.492 | 119.12 |
| 512 | 128 | 32 | 4608 | 0.989 | 517.81 | 26.741 | 153.17 | 27.730 | 166.17 |
@KerfuffleV2 Thanks for the results. Likely this PR is not getting merged as the numbers are not convincing.
I had tested FP8 cuBLAS in ggllm; that's for 40-series NVIDIA cards. The kernels for FP8 conversion and the cuBLAS wrapper are here: https://github.com/cmp-nct/ggllm.cpp/blob/ggfalcon_dev/ggml-cuda.cu
For dual P40s on 70B, I started having 107-second replies during prompt processing of about 3k tokens. With this PR, those replies have come down to 25 seconds, which is reasonable. Generation speed itself only went from 8.8 t/s to 8.95 t/s. The model is Q4_K_M on compute 6.1. Testing dual 3090s, there was some performance hit, but it was negligible for me during one-shot generations or chat. I mostly see it in prompt processing, and on such fast GPUs it's fractions of a second and .XX tokens.
Prompt processing results on my Tesla P40:
It was much worse for me without 0f2498f (not that anyone else probably really cares about performance on el-cheapo AMD GPUs).
As expected, Pascal benefits greatly from FP32, the 3090 can go either way, and AMD favors FP16.
Can you explain why exactly that is expected? You can also point me to reading material if that would be easier than explaining.
Pascal (at least the P40) is 1/3 speed in FP16 ops. Nvidia made it this way and released the P100 for accelerated FP16 (but missing 8-bit, I think). They told users to pick one or the other based on the application. It's why it doesn't work well for back-ends like exllama. It doesn't even have tensor cores. 3090 speed for FP16/FP32 is pretty much similar; again, it's how they optimized it for what people were doing at the time. More and more workloads are using lower precision, so Nvidia keeps giving you smaller and smaller tensor datatypes. Hence the 4xxx cards have stuff like FP8. People used to care about double precision at one point and now not a peep. AMD, I think, just came out at a time when people were using FP16, so they accelerated that. In short, every card is optimized for what was popular and demanded by customers at the time. They tend to tout FLOPS at a given precision in the specs too. https://www.techpowerup.com/gpu-specs/tesla-p40.c2878
I saw this and was excited for better performance on my cheapo P40 + 1080 Ti setup (both are cards with a 1:64 FP16:FP32 ratio, so it should be waaayyy faster, I guess?).
What is the explanation for the poor improvement? (Q3 70B is the most I can fit in VRAM)
A seeming 4x speedup on prompt processing is nothing to sneeze at. Go try 3k context now. I think token generation is already FP32 due to the nature of offloading to the CPU, unless you forced that FP16 compile flag.
I guess it's faster in HBM only...
Most of the gains of this seem to have been replicated by #3882.
edit: cannot reproduce these numbers for master.
Force-pushed from 2648647 to c830a05
I don't know what I did wrong in my previous testing, but I'm still seeing a 2.76x prompt processing speedup with this PR for 7B LLaMA on my P40.
The PR seems to have been obsoleted, so I can't try it along with the new changes. I didn't get any speedup right after the fix was made, but now the code is all different.
How come? I merged
When I try to merge it, it has conflicts. edit: I pulled the repo again and re-merged. It worked. I think it was a matter of 99 vs 103 for PP, from 10.x ms per token down to 9.x ms per token.
Hello, does the compilation for old graphics cards (no tensor cores) change THAT MUCH? Have I made a mistake somewhere? I'm shocked :) GTX 1080 Ti power :)
Today's llama.cpp build, Windows 10 + AMD 2990WX + GTX 1080 Ti + 64GB.

Test 1 (llama.cpp, AMD 2990WX + GTX 1080 Ti): -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
- Linux Ubuntu 22, cuBLAS + MMQ

Test 2 (llama.cpp, AMD 2990WX + GTX 1080 Ti):
- Windows 10, cuBLAS

llama-bench, Test 1 (cmake .. -DLLAMA_CUBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON):
- Windows 10, cuBLAS + MMQ
- Windows 10, cuBLAS
- Linux Ubuntu 22, cuBLAS + MMQ
- Ubuntu Linux, cuBLAS
Yeah, MMQ changes a lot. For newer cards it wasn't helpful for single batches either.
For new cards we'll have to wait for cuBLAS FP8 support; that's several times faster than FP32 and the precision is still awesome.
FP8? Sure.. let me just fire up my H100 :P Already beat exllama on Ampere, minus the prompt processing speed. I'm more hopeful for an 8-bit KV cache than FP8. On older GPUs I'm not sure what else can be done.
Any 40-series card and the upcoming Super and 50-series support it.
TIL, 4xxx supports it. In textgen I did testing on exllamav2 with the 8-bit vs 16-bit cache; it didn't appear to make a difference for the same models and wikitext. Hopefully that holds true here. For most "good" models, sadly 24 GB is now not enough.
I had implemented it for Falcon inference in ggllm and it worked very well on my 4090: a significant speed boost when using cuBLAS compared to FP16 or FP32.
True, but that only benefits bleeding-edge cards. I'd rather have a reasonable 103B than an instant 7B. Quality over quantity.
Force-pushed from e75889a to a40f611
There are conflicts since #4606 was merged.
As noted above, FP32 is much faster than FP16 on the Tesla P40, but it's still a capable card otherwise with its 24GB VRAM. Can we have an option to specify the computation floating-point type (and upcast float16 to float32 when necessary)? Besides choosing
People using cutting-edge GPUs can also benefit from this option when they encounter Inf or NaN. See this comment for more details.
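For what it's worth, a minimal sketch of what such an opt-in could look like at the graph level, assuming the `ggml_mul_mat_set_prec()` / `GGML_PREC_F32` mechanism referenced further down in this thread (per-op precision rather than a command-line flag):

```cpp
// Sketch only: opting a single matmul into F32 compute via ggml_mul_mat_set_prec.
#include "ggml.h"

static struct ggml_tensor * mul_mat_f32_prec(
        struct ggml_context * ctx,
        struct ggml_tensor  * w,     // e.g. F16 (or quantized) weight
        struct ggml_tensor  * x) {   // activations
    struct ggml_tensor * y = ggml_mul_mat(ctx, w, x);
    // Ask backends that honor it (e.g. CUDA) to accumulate in F32,
    // i.e. take the CUBLAS_COMPUTE_32F path instead of CUBLAS_COMPUTE_16F.
    ggml_mul_mat_set_prec(y, GGML_PREC_F32);
    return y;
}
```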
Force-pushed from 41f0f44 to 4011f09
Unironically I get no benefit on the 3090 from using FP16 besides somewhat lower memory use. Through SD I have found that the weights can even be loaded as FP16 for these Pascal cards as long as the calculations happen at the correct precision; xformers does this automatically on that end.
This could be related: https://twitter.com/main_horse/status/1742013125090795531
Force-pushed from 4011f09 to 4cc78d3
CUBLAS_COMPUTE_32F,
CUBLAS_GEMM_DEFAULT_TENSOR_OP));
} break;
}
We might want to merge this particular change of ggml_cuda_op_mul_mat_cublas, since it uses less memory than cublasSgemm and still performs the compute in F32, which is needed for models like Phi-2.
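For context, a rough sketch of the call pattern being discussed (names, shapes and leading dimensions are illustrative, not the exact llama.cpp code): the F16 inputs go straight into cublasGemmEx with an F32 output and F32 accumulation, so no F32 copies of src0/src1 are needed (as cublasSgemm would require) and no temporary F16 dst has to be converted afterwards.

```cpp
// Illustrative sketch, not the actual ggml_cuda_op_mul_mat_cublas code.
#include <cublas_v2.h>
#include <cuda_fp16.h>

// C (m x n, F32) = A^T (m x k, F16) * B (k x n, F16), accumulated in F32.
cublasStatus_t gemm_f16_in_f32_out(cublasHandle_t handle,
                                   const half * src0_f16, const half * src1_f16,
                                   float * dst_f32, int m, int n, int k) {
    const float alpha = 1.0f;   // with CUBLAS_COMPUTE_32F the scalars are float
    const float beta  = 0.0f;
    return cublasGemmEx(handle,
                        CUBLAS_OP_T, CUBLAS_OP_N,
                        m, n, k,
                        &alpha,
                        src0_f16, CUDA_R_16F, k,   // F16 input, no F32 copy needed
                        src1_f16, CUDA_R_16F, k,
                        &beta,
                        dst_f32,  CUDA_R_32F, m,   // F32 result written directly
                        CUBLAS_COMPUTE_32F,
                        CUBLAS_GEMM_DEFAULT_TENSOR_OP);
}
```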
I am not sure if I am following all the logic, but I would be concerned about down-converting F32 src0/src1 to F16 despite the user requesting GGML_PREC_32. In the long run, I think it would be better to always respect the user types and do all the type conversions in the graph (ggerganov/ggml#455), since it would give users more control and it would simplify the code in the backends. It would also move the temporary buffer from the pool to the compute buffer, which would result in a more accurate estimation of the VRAM needed to run a model. It should also help with the issue of to_fp32 and to_fp16 in the CUDA backend being unable to deal with non-contiguous tensors, since the conversion would be done in a ggml_cpy instead.
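As a rough illustration of that direction (sketch only; the helper name is hypothetical, not part of ggml), the F32-to-F16 conversion could be expressed as a node in the graph via ggml_cpy, so the backend only ever sees an ordinary copy between tensors of different types:

```cpp
// Hypothetical sketch of "do the type conversion in the graph" via ggml_cpy.
#include "ggml.h"

static struct ggml_tensor * cast_f32_to_f16_in_graph(
        struct ggml_context * ctx,
        struct ggml_tensor  * src) {   // F32, possibly non-contiguous
    // Destination with the same shape but F16 type; the ggml_cpy node performs
    // the conversion (and produces contiguous data), so the CUDA backend does
    // not need an ad-hoc to_fp16() for non-contiguous tensors, and the buffer
    // is accounted for in the compute buffer instead of the backend pool.
    struct ggml_tensor * dst = ggml_new_tensor(ctx, GGML_TYPE_F16,
                                               GGML_MAX_DIMS, src->ne);
    return ggml_cpy(ctx, src, dst);
}
```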
Yet every time I tried Triton kernels on a 3090 they were overall slower.
I noticed significant speed differences from such small changes when using cuBLAS while developing ggllm (the Falcon fork). In general, we should also have a look at the EXL2 implementation. Even without FP8 that kernel is 2 times faster than llama.cpp on modern hardware; it delivers up to 14000 tokens/second prompt processing on a single 4090, while on llama.cpp I top out at 5500.
Heh.. so I have discovered a bad DIMM in my server that was causing memory bandwidth to drop from 60 GB/s down to 10 GB/s. Having fixed that issue, I was able to properly test performance regressions again. Since I merged this PR into main@5a7d312 I have gone from 18.6 t/s down to 15.5 t/s on dual 3090s, using the same kernel settings and also going back to splitting by row. I had merged this PR for the P40s, but I don't think its contents are what returned the performance; I just happened to grab a backup at the right time. So now I have a date and a commit to test against, to find the change that ate my 3 tokens/s. I have other backups from Dec 27th and those also have the regression. edit:
On master I'm getting about 910 t/s with pp512 on a Q4_0 7B LLaMA with or without this change, so it doesn't seem to be necessary for Tesla P40s anymore.
Curious observation: using CUBLAS_COMPUTE_32F is faster than CUBLAS_COMPUTE_16F. Tested on V100 and A6000. It seems to improve TG, PP and batched decoding speed, and we avoid allocating and copying the F16 dst data.

Edit: It leads to improvements on some NVIDIA cards, but not all. For example, on the 3090 the performance is degraded when using CUBLAS_COMPUTE_32F. AMD cards can suffer too. Leaving this PR as a demonstration that people can try for their specific case to see if it helps.
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
build: c8d6a1f (1431) (master)
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
build: 3b9ea65 (1432) (PR)
Device 0: NVIDIA RTX A6000, compute capability 8.6
build: c8d6a1f (1431)
Device 0: NVIDIA RTX A6000, compute capability 8.6
build: 3b9ea65 (1432)
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
build: c8d6a1f (1431)
Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6
build: 3b9ea65 (1432)
Latest benches after the GGML_PREC_F32 addition:

LLAMA_CUBLAS=1 make -j llama-bench && ./llama-bench -m ./models/openllama-7b-v2/ggml-model-f16.gguf -m ./models/openllama-7b-v2/ggml-model-q4_k.gguf -ngl 99
Device 0: Tesla V100-PCIE-16GB, compute capability 7.0
build: a40f611 (1662)