Suspiciously low performance in batched inference compared to single token #3771
#3545 (comment) |
I guess I see how this can be the case, but the fact that this is only present in the quantized CUDA case gives me hope that it can still be fixed somehow. As per my tests, CPU + Q4 is fine. Also, according to #3749, CUDA + FP16 is fine. It is only CUDA + Q4 that is acting weird. |
You might try tweaking some of the defines. Copying the table from that comment for convenience:

64,128,8 (MMQ_X, MMQ_Y, NWARPS) is the default. You can see that it doesn't break even with single-token generation until the batch size is 4 (but the prompt processing speed is great). GG also proposed 8,32,8 for Ampere-architecture Nvidia cards. Modifying the defines is a bit annoying because you have to change all the ones that affect the quantizations your model uses. For example, a Q5_K_M model uses Q5_K and Q6_K, so to test you'd need to change the defines for both. |
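For illustration only, such a change might look roughly like the sketch below. The macro names are hypothetical stand-ins (the real per-quantization defines in the llama.cpp CUDA source may be named differently); the point is simply that both the Q5_K and Q6_K tile sizes would need to be edited for a Q5_K_M model.

```cpp
// Hypothetical per-quantization tile-size defines (illustrative names, not
// necessarily the exact macros in the llama.cpp CUDA source).
// Default 64,128,8 -> proposed 8,32,8 for Ampere, applied to BOTH
// quantization types used by a Q5_K_M model:
#define MMQ_X_Q5_K  8   // was 64
#define MMQ_Y_Q5_K  32  // was 128
#define NWARPS_Q5_K 8

#define MMQ_X_Q6_K  8   // was 64
#define MMQ_Y_Q6_K  32  // was 128
#define NWARPS_Q6_K 8
```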
@Microflame Thank you for looking into this. Please try the new branch in #3776 and let me know your results with RTX 4070 |
So I have done some experiments with different values of MMQ_X and MMQ_Y.

Time to call llama_decode, ms:
MMQ_X / MMQ_Y | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
---|---|---|---|---|---|
4 / 32 | 29.891 | 53.710 | 97.843 | 187.631 | 361.295 |
4 / 64 | 28.461 | 51.782 | 93.105 | 175.487 | 336.617 |
4 / 128 | 29.625 | 54.205 | 96.655 | 180.841 | 348.829 |
8 / 32 | 20.812 | 38.341 | 62.257 | 122.892 | 227.166 |
8 / 64 | 20.604 | 36.354 | 62.756 | 114.453 | 225.396 |
8 / 128 | 24.694 | 36.426 | 61.958 | 112.404 | 217.993 |
16 / 32 | 28.107 | 32.901 | 52.693 | 101.583 | 188.438 |
16 / 64 | 24.269 | 28.112 | 45.716 | 80.278 | 159.500 |
16 / 128 | 25.525 | 27.747 | 41.072 | 75.095 | 148.149 |
32 / 32 | 40.825 | 43.274 | 47.103 | 84.782 | 170.380 |
32 / 64 | 33.508 | 36.028 | 36.941 | 68.523 | 125.096 |
32 / 128 | 33.316 | 34.977 | 35.219 | 59.444 | 110.120 |
64 / 32 | 63.074 | 66.827 | 68.703 | 77.136 | 148.736 |
64 / 64 | 49.107 | 52.348 | 52.767 | 60.961 | 111.457 |
64 / 128 | 51.156 | 52.819 | 53.281 | 57.953 | 96.687 |
Per token time, ms.
MMQ_X / MMQ_Y | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
---|---|---|---|---|---|
4 / 32 | 3.736 | 3.357 | 3.058 | 2.932 | 2.823 |
4 / 64 | 3.558 | 3.236 | 2.910 | 2.742 | 2.630 |
4 / 128 | 3.703 | 3.388 | 3.020 | 2.826 | 2.725 |
8 / 32 | 2.601 | 2.396 | 1.946 | 1.920 | 1.775 |
8 / 64 | 2.576 | 2.272 | 1.961 | 1.788 | 1.761 |
8 / 128 | 3.087 | 2.277 | 1.936 | 1.756 | 1.703 |
16 / 32 | 3.513 | 2.056 | 1.647 | 1.587 | 1.472 |
16 / 64 | 3.034 | 1.757 | 1.429 | 1.254 | 1.246 |
16 / 128 | 3.191 | 1.734 | 1.283 | 1.173 | 1.157 |
32 / 32 | 5.103 | 2.705 | 1.472 | 1.325 | 1.331 |
32 / 64 | 4.189 | 2.252 | 1.154 | 1.071 | 0.977 |
32 / 128 | 4.165 | 2.186 | 1.101 | 0.929 | 0.860 |
64 / 32 | 7.884 | 4.177 | 2.147 | 1.205 | 1.162 |
64 / 64 | 6.138 | 3.272 | 1.649 | 0.953 | 0.871 |
64 / 128 | 6.394 | 3.301 | 1.665 | 0.906 | 0.755 |
My understanding is that PR #3776 proposes MMQ_X / MMQ_Y to be set to 4 / 32, but there seem to be more optimal combinations for ntokens > 4, at least on an RTX 4070.
Also, here are some observations:
- It seems that the optimal MMQ_X is equal to the batch size / ntokens
- In almost all cases the optimal MMQ_Y is 128
- I could not increase MMQ_Y past 128 because I ran out of shared memory (Entry function '_Z12mul_mat_q5_1ILb1EEvPKvS1_Pfiiiii' uses too much shared data (0x12740 bytes, 0xc000 max))
- Larger batch sizes and larger MMQ_X lead to higher throughput (well, this one is not a surprise)
- If we take latency into account, then I believe X / Y / BS = 16 / 128 / 16 is an interesting combination
- Right now I am starting to realize that different combinations of X/Y may be optimal in different scenarios, like smaller X for latency and larger X for throughput. The difference is significant here.
This all begs the question if it is possible to turn constants like MMQ_X
into runtime variables? My understanding is that this will require runtime recompilation of kernels and I don't know how feasible is that. I know that OpenCL
can compile kernels at runtime but not sure about nvcc
stuff.
In general I suppose that llama.cpp
may benefit from something like profiles/presets for various use scenarios / GPUs.
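For what it's worth, CUDA does offer runtime compilation through the NVRTC library, which is the rough equivalent of OpenCL's online kernel compilation. Below is a minimal sketch of the idea using a stand-in kernel; this is not how llama.cpp builds its kernels today, and the kernel source and names are made up for illustration.

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in kernel source: MMQ_X is injected at (runtime) compile time.
static const char * kKernelSrc = R"(
extern "C" __global__ void dummy_mmq(float * dst, const float * src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i] * (float) MMQ_X;  // MMQ_X comes from --define-macro
    }
}
)";

// JIT-compile the kernel for a caller-chosen MMQ_X and load it.
// Requires an initialized CUDA driver context.
CUfunction compile_with_mmq_x(int mmq_x, CUmodule * module_out) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kKernelSrc, "dummy_mmq.cu", 0, nullptr, nullptr);

    const std::string def = "--define-macro=MMQ_X=" + std::to_string(mmq_x);
    const char * opts[] = { def.c_str() };
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t log_size = 0;
        nvrtcGetProgramLogSize(prog, &log_size);
        std::vector<char> log(log_size);
        nvrtcGetProgramLog(prog, log.data());
        fprintf(stderr, "NVRTC error: %s\n", log.data());
        return nullptr;
    }

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUfunction fn;
    cuModuleLoadData(module_out, ptx.data());
    cuModuleGetFunction(&fn, *module_out, "dummy_mmq");
    return fn;
}
```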
Thanks for the data points. I agree with the observations, and adding support for custom profiles/presets is something to consider in the future. At the moment, I think we have to merge some version of this (#3776), even if it is not the optimal one, so that we improve the baseline performance. |
MMQ_X, MMQ_Y, and NWARPS are used as template parameters. Their values need to be known at compile time but it is possible to compile multiple kernels with different values and then select an appropriate kernel at runtime. The tradeoff is that this will result in longer compilations and larger binaries. |
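A sketch of what that could look like, with made-up kernel and function names (the real MMQ kernels are far more involved): each (MMQ_X, MMQ_Y, NWARPS) combination is a separate template instantiation baked into the binary, and the host code picks one at runtime based on the batch size. The thresholds below are only an example loosely inspired by the timings in this thread, not a tuned configuration.

```cpp
#include <cuda_runtime.h>

// Stand-in for an MMQ kernel: the tile sizes are template parameters, so each
// (MMQ_X, MMQ_Y, NWARPS) combination is a separate precompiled kernel.
template <int MMQ_X, int MMQ_Y, int NWARPS>
__global__ void mul_mat_q_stub(const void * vx, const void * vy, float * dst,
                               int ncols_x, int nrows_x, int ncols_y) {
    // ... the actual quantized tile multiplication would go here, working on
    //     MMQ_X x MMQ_Y tiles with NWARPS warps per block ...
    (void) vx; (void) vy; (void) dst; (void) ncols_x; (void) nrows_x; (void) ncols_y;
}

// Host-side dispatch: pick one of the precompiled instantiations at runtime
// based on the batch size (ncols_y).
static void launch_mul_mat_q(const void * vx, const void * vy, float * dst,
                             int ncols_x, int nrows_x, int ncols_y,
                             dim3 grid, dim3 block, cudaStream_t stream) {
    if (ncols_y <= 8) {
        mul_mat_q_stub< 8, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    } else if (ncols_y <= 16) {
        mul_mat_q_stub<16, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    } else if (ncols_y <= 32) {
        mul_mat_q_stub<32, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    } else {
        mul_mat_q_stub<64, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    }
}
```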
More generally, there are currently specialized kernels for the case of batch size == 1 and more general kernels for all other cases. That is why the performance for batch size == 2 in particular is bad. For optimal performance it may be necessary to write kernels specifically for very small batches since I'm not sure that the current MMQ kernels are optimal even with low MMQ_X values. In particular I'm not sure whether it makes sense to use shared memory for very small batches (compared to just using registers). |
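To make the last point concrete, here is a hedged sketch (plain fp32 rather than the quantized case, and not code from llama.cpp) of a small-batch kernel that keeps all of its accumulators in registers and uses warp shuffles instead of shared memory:

```cpp
#include <cuda_runtime.h>

// Each warp (= one block of 32 threads) computes one output row for all
// NCOLS_Y batch columns at once. The per-column partial sums live entirely in
// registers; the final reduction uses warp shuffles, so no shared memory.
// Launch as <<<nrows_x, 32>>>.
template <int NCOLS_Y>
__global__ void mul_mat_small_batch(const float * __restrict__ x,   // [nrows_x, ncols_x]
                                    const float * __restrict__ y,   // [ncols_x, NCOLS_Y]
                                    float       * __restrict__ dst, // [nrows_x, NCOLS_Y]
                                    const int ncols_x) {
    const int row  = blockIdx.x;   // one warp per output row
    const int lane = threadIdx.x;  // 0..31

    float acc[NCOLS_Y] = {0.0f};   // per-thread accumulators, held in registers

    for (int k = lane; k < ncols_x; k += 32) {
        const float xv = x[(size_t) row*ncols_x + k];
#pragma unroll
        for (int j = 0; j < NCOLS_Y; ++j) {
            acc[j] += xv * y[(size_t) k*NCOLS_Y + j];
        }
    }

#pragma unroll
    for (int j = 0; j < NCOLS_Y; ++j) {
        // warp-level sum, still without shared memory
        for (int offset = 16; offset > 0; offset >>= 1) {
            acc[j] += __shfl_down_sync(0xffffffff, acc[j], offset);
        }
        if (lane == 0) {
            dst[(size_t) row*NCOLS_Y + j] = acc[j];
        }
    }
}

// Example instantiation for a batch of 8:
// mul_mat_small_batch<8><<<nrows_x, 32>>>(x, y, dst, ncols_x);
```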
Prerequisites
Current Behavior
llama_decode takes 4x more time to complete for 2 tokens compared to 1 token. Specifically, when I feed a single token to llama_decode it takes ~12 ms to decode on average, while for 2 or more tokens llama_decode takes ~50 ms to complete. Naturally, I would expect at most a 2x increase in the time taken to process twice the number of tokens, but in fact processing 2 tokens takes 4x more time than processing 1 token.

Naively, one could assume that the llama.cpp CUDA code can be tweaked in such a way that llama_decode for 2 tokens completes in at most twice the time it takes to decode 1 token. This would result in the following benefits:

My question

So I was wondering if these are sane considerations and, if so, whether one of the CUDA experts could pull off such an optimization?

Some additional notes
Here are the results of my measurements:
The issue seems to be GPU specific and does not affect the CPU.
Environment and Context
I am running an RTX 4070 under WSL2. The model is llama 7B quantized using Q4_0.
Steps to Reproduce
The code I used to collect stats:
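The original snippet was not preserved in this copy of the thread. As a rough stand-in, a measurement loop along these lines reports the kind of per-call / per-token numbers quoted above; the decode_tokens function below is a placeholder (here just a sleep to keep the harness self-contained), not the llama.cpp API.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder for the real work: in the original setup this would be a call to
// llama_decode with a batch of n_tokens tokens. The sleep only makes the
// harness runnable on its own; it does not model real decode cost.
static void decode_tokens(int n_tokens) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10 + 2 * n_tokens));
}

// Average per-call and per-token latency over n_iters calls, in milliseconds.
static void benchmark_decode(int n_tokens, int n_iters) {
    const auto t_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iters; ++i) {
        decode_tokens(n_tokens);
    }
    const auto t_end = std::chrono::steady_clock::now();

    const double total_ms    = std::chrono::duration<double, std::milli>(t_end - t_start).count();
    const double per_call_ms = total_ms / n_iters;
    printf("n_tokens=%3d  per call: %8.3f ms  per token: %8.3f ms\n",
           n_tokens, per_call_ms, per_call_ms / n_tokens);
}

int main() {
    for (int n_tokens : {1, 2, 8, 16, 32, 64, 128}) {
        benchmark_decode(n_tokens, 16);
    }
    return 0;
}
```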