Suspiciously low performance in batched inference compared to single token #3771
#3545 (comment) |
I guess I see how this can be the case, but the fact that this is only present in the quantized CUDA case gives me hope that it can still be fixed somehow. As per my tests, CPU + Q4 is fine. Also, according to #3749, CUDA + FP16 is fine. It is only CUDA + Q4 that is acting weird. |
You might try tweaking some of the defines. Copying the table from that comment for convenience:

64,128,8 (MMQ_X, MMQ_Y, NWARPS) is the default. You can see that it doesn't break even with single-token generation until the batch size is 4 (but the prompt processing speed is great). GG also proposed 8,32,8 for Ampere-architecture Nvidia cards. Modifying the defines is a bit annoying because you have to change all the ones that affect the quantizations your model uses. For example, a Q5_K_M model uses Q5_K and Q6_K, so to test you'd need to change the defines for both. |
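For illustration only, such a change might look roughly like the sketch below. The macro names are hypothetical stand-ins (the real per-quantization defines in the llama.cpp CUDA source may be named differently); the point is simply that both the Q5_K and Q6_K tile sizes would need to be edited for a Q5_K_M model.

```cpp
// Hypothetical per-quantization tile-size defines (illustrative names, not
// necessarily the exact macros in the llama.cpp CUDA source).
// Default 64,128,8 -> proposed 8,32,8 for Ampere, applied to BOTH
// quantization types used by a Q5_K_M model:
#define MMQ_X_Q5_K  8   // was 64
#define MMQ_Y_Q5_K  32  // was 128
#define NWARPS_Q5_K 8

#define MMQ_X_Q6_K  8   // was 64
#define MMQ_Y_Q6_K  32  // was 128
#define NWARPS_Q6_K 8
```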
@Microflame Thank you for looking into this. Please try the new branch in #3776 and let me know your results with RTX 4070 |
So I have done some experiments with different values of MMQ_X and MMQ_Y.

Time to call llama_decode, ms:
MMQ_X / MMQ_Y | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
---|---|---|---|---|---|
4 / 32 | 29.891 | 53.710 | 97.843 | 187.631 | 361.295 |
4 / 64 | 28.461 | 51.782 | 93.105 | 175.487 | 336.617 |
4 / 128 | 29.625 | 54.205 | 96.655 | 180.841 | 348.829 |
8 / 32 | 20.812 | 38.341 | 62.257 | 122.892 | 227.166 |
8 / 64 | 20.604 | 36.354 | 62.756 | 114.453 | 225.396 |
8 / 128 | 24.694 | 36.426 | 61.958 | 112.404 | 217.993 |
16 / 32 | 28.107 | 32.901 | 52.693 | 101.583 | 188.438 |
16 / 64 | 24.269 | 28.112 | 45.716 | 80.278 | 159.500 |
16 / 128 | 25.525 | 27.747 | 41.072 | 75.095 | 148.149 |
32 / 32 | 40.825 | 43.274 | 47.103 | 84.782 | 170.380 |
32 / 64 | 33.508 | 36.028 | 36.941 | 68.523 | 125.096 |
32 / 128 | 33.316 | 34.977 | 35.219 | 59.444 | 110.120 |
64 / 32 | 63.074 | 66.827 | 68.703 | 77.136 | 148.736 |
64 / 64 | 49.107 | 52.348 | 52.767 | 60.961 | 111.457 |
64 / 128 | 51.156 | 52.819 | 53.281 | 57.953 | 96.687 |
Per token time, ms.
MMQ_X / MMQ_Y | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
---|---|---|---|---|---|
4 / 32 | 3.736 | 3.357 | 3.058 | 2.932 | 2.823 |
4 / 64 | 3.558 | 3.236 | 2.910 | 2.742 | 2.630 |
4 / 128 | 3.703 | 3.388 | 3.020 | 2.826 | 2.725 |
8 / 32 | 2.601 | 2.396 | 1.946 | 1.920 | 1.775 |
8 / 64 | 2.576 | 2.272 | 1.961 | 1.788 | 1.761 |
8 / 128 | 3.087 | 2.277 | 1.936 | 1.756 | 1.703 |
16 / 32 | 3.513 | 2.056 | 1.647 | 1.587 | 1.472 |
16 / 64 | 3.034 | 1.757 | 1.429 | 1.254 | 1.246 |
16 / 128 | 3.191 | 1.734 | 1.283 | 1.173 | 1.157 |
32 / 32 | 5.103 | 2.705 | 1.472 | 1.325 | 1.331 |
32 / 64 | 4.189 | 2.252 | 1.154 | 1.071 | 0.977 |
32 / 128 | 4.165 | 2.186 | 1.101 | 0.929 | 0.860 |
64 / 32 | 7.884 | 4.177 | 2.147 | 1.205 | 1.162 |
64 / 64 | 6.138 | 3.272 | 1.649 | 0.953 | 0.871 |
64 / 128 | 6.394 | 3.301 | 1.665 | 0.906 | 0.755 |
My understanding is that PR #3776 proposes MMQ_X / MMQ_Y to be set to 4 / 32, but there seem to be more optimal combinations for ntokens > 4, at least on an RTX 4070.
Also, here are some observations:
- It seems that the optimal MMQ_X is equal to the batch size / ntokens
- In almost all cases the optimal MMQ_Y is 128
- I could not increase MMQ_Y past 128 because I ran out of shared memory (Entry function '_Z12mul_mat_q5_1ILb1EEvPKvS1_Pfiiiii' uses too much shared data (0x12740 bytes, 0xc000 max))
- Larger batch sizes and larger MMQ_X lead to higher throughput (well, this one is not a surprise)
- If we take latency into account, then I believe X / Y / BS = 16 / 128 / 16 is an interesting combination
- Right now I am starting to realize that different combinations of X/Y may be optimal in different scenarios, like smaller X for latency and larger X for throughput. The difference is significant here.
This all begs the question if it is possible to turn constants like MMQ_X
into runtime variables? My understanding is that this will require runtime recompilation of kernels and I don't know how feasible is that. I know that OpenCL
can compile kernels at runtime but not sure about nvcc
stuff.
In general I suppose that llama.cpp
may benefit from something like profiles/presets for various use scenarios / GPUs.
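For what it's worth, CUDA does offer runtime compilation through the NVRTC library, which is the rough equivalent of OpenCL's online kernel compilation. Below is a minimal sketch of the idea using a stand-in kernel; this is not how llama.cpp builds its kernels today, and the kernel source and names are made up for illustration.

```cpp
#include <cuda.h>
#include <nvrtc.h>
#include <cstdio>
#include <string>
#include <vector>

// Stand-in kernel source: MMQ_X is injected at (runtime) compile time.
static const char * kKernelSrc = R"(
extern "C" __global__ void dummy_mmq(float * dst, const float * src, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        dst[i] = src[i] * (float) MMQ_X;  // MMQ_X comes from --define-macro
    }
}
)";

// JIT-compile the kernel for a caller-chosen MMQ_X and load it.
// Requires an initialized CUDA driver context.
CUfunction compile_with_mmq_x(int mmq_x, CUmodule * module_out) {
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kKernelSrc, "dummy_mmq.cu", 0, nullptr, nullptr);

    const std::string def = "--define-macro=MMQ_X=" + std::to_string(mmq_x);
    const char * opts[] = { def.c_str() };
    if (nvrtcCompileProgram(prog, 1, opts) != NVRTC_SUCCESS) {
        size_t log_size = 0;
        nvrtcGetProgramLogSize(prog, &log_size);
        std::vector<char> log(log_size);
        nvrtcGetProgramLog(prog, log.data());
        fprintf(stderr, "NVRTC error: %s\n", log.data());
        return nullptr;
    }

    size_t ptx_size = 0;
    nvrtcGetPTXSize(prog, &ptx_size);
    std::vector<char> ptx(ptx_size);
    nvrtcGetPTX(prog, ptx.data());
    nvrtcDestroyProgram(&prog);

    CUfunction fn;
    cuModuleLoadData(module_out, ptx.data());
    cuModuleGetFunction(&fn, *module_out, "dummy_mmq");
    return fn;
}
```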
Thanks for the data points. I agree with the observations, and adding support for custom profiles/presets is something to consider in the future. At the moment, I think we have to merge some version of this (#3776), even if it is not the optimal one, so that we improve the baseline performance. |
MMQ_X, MMQ_Y, and NWARPS are used as template parameters. Their values need to be known at compile time but it is possible to compile multiple kernels with different values and then select an appropriate kernel at runtime. The tradeoff is that this will result in longer compilations and larger binaries. |
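A sketch of what that could look like, with made-up kernel and function names (the real MMQ kernels are far more involved): each (MMQ_X, MMQ_Y, NWARPS) combination is a separate template instantiation baked into the binary, and the host code picks one at runtime based on the batch size. The thresholds below are only an example loosely inspired by the timings in this thread, not a tuned configuration.

```cpp
#include <cuda_runtime.h>

// Stand-in for an MMQ kernel: the tile sizes are template parameters, so each
// (MMQ_X, MMQ_Y, NWARPS) combination is a separate precompiled kernel.
template <int MMQ_X, int MMQ_Y, int NWARPS>
__global__ void mul_mat_q_stub(const void * vx, const void * vy, float * dst,
                               int ncols_x, int nrows_x, int ncols_y) {
    // ... the actual quantized tile multiplication would go here, working on
    //     MMQ_X x MMQ_Y tiles with NWARPS warps per block ...
    (void) vx; (void) vy; (void) dst; (void) ncols_x; (void) nrows_x; (void) ncols_y;
}

// Host-side dispatch: pick one of the precompiled instantiations at runtime
// based on the batch size (ncols_y).
static void launch_mul_mat_q(const void * vx, const void * vy, float * dst,
                             int ncols_x, int nrows_x, int ncols_y,
                             dim3 grid, dim3 block, cudaStream_t stream) {
    if (ncols_y <= 8) {
        mul_mat_q_stub< 8, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    } else if (ncols_y <= 16) {
        mul_mat_q_stub<16, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    } else if (ncols_y <= 32) {
        mul_mat_q_stub<32, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    } else {
        mul_mat_q_stub<64, 128, 8><<<grid, block, 0, stream>>>(vx, vy, dst, ncols_x, nrows_x, ncols_y);
    }
}
```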
More generally, there are currently specialized kernels for the case of batch size == 1 and more general kernels for all other cases. That is why the performance for batch size == 2 in particular is bad. For optimal performance it may be necessary to write kernels specifically for very small batches since I'm not sure that the current MMQ kernels are optimal even with low MMQ_X values. In particular I'm not sure whether it makes sense to use shared memory for very small batches (compared to just using registers). |
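To make the last point concrete, here is a hedged sketch (plain fp32 rather than the quantized case, and not code from llama.cpp) of a small-batch kernel that keeps all of its accumulators in registers and uses warp shuffles instead of shared memory:

```cpp
#include <cuda_runtime.h>

// Each warp (= one block of 32 threads) computes one output row for all
// NCOLS_Y batch columns at once. The per-column partial sums live entirely in
// registers; the final reduction uses warp shuffles, so no shared memory.
// Launch as <<<nrows_x, 32>>>.
template <int NCOLS_Y>
__global__ void mul_mat_small_batch(const float * __restrict__ x,   // [nrows_x, ncols_x]
                                    const float * __restrict__ y,   // [ncols_x, NCOLS_Y]
                                    float       * __restrict__ dst, // [nrows_x, NCOLS_Y]
                                    const int ncols_x) {
    const int row  = blockIdx.x;   // one warp per output row
    const int lane = threadIdx.x;  // 0..31

    float acc[NCOLS_Y] = {0.0f};   // per-thread accumulators, held in registers

    for (int k = lane; k < ncols_x; k += 32) {
        const float xv = x[(size_t) row*ncols_x + k];
#pragma unroll
        for (int j = 0; j < NCOLS_Y; ++j) {
            acc[j] += xv * y[(size_t) k*NCOLS_Y + j];
        }
    }

#pragma unroll
    for (int j = 0; j < NCOLS_Y; ++j) {
        // warp-level sum, still without shared memory
        for (int offset = 16; offset > 0; offset >>= 1) {
            acc[j] += __shfl_down_sync(0xffffffff, acc[j], offset);
        }
        if (lane == 0) {
            dst[(size_t) row*NCOLS_Y + j] = acc[j];
        }
    }
}

// Example instantiation for a batch of 8:
// mul_mat_small_batch<8><<<nrows_x, 32>>>(x, y, dst, ncols_x);
```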
Prerequisites
Current Behavior
llama_decode takes 4x more time to complete for 2 tokens compared to 1 token. Specifically, when I feed a single token to llama_decode it takes ~12 ms to decode on average, while for 2 or more tokens llama_decode takes ~50 ms to complete. Naturally, I would expect at most a 2x increase in the time taken to process twice the number of tokens, but in fact processing 2 tokens takes 4x more time than processing 1 token.

Naively, one could assume that the llama.cpp CUDA code can be tweaked in such a way that llama_decode for 2 tokens completes in at most twice the time it takes to decode 1 token. This would result in the following benefits:

My question

So I was wondering if these are sane considerations and, if so, whether one of the CUDA experts could pull off such an optimization?

Some additional notes
Here are the results of my measurements:
The issue seems to be GPU specific and does not affect the CPU.
Environment and Context
I am running an RTX 4070 under WSL2. The model is llama 7B quantized using Q4_0.
Steps to Reproduce
The code I used to collect stats:
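The original snippet was not preserved in this copy of the thread. As a rough stand-in, a measurement loop along these lines reports the kind of per-call / per-token numbers quoted above; the decode_tokens function below is a placeholder (here just a sleep to keep the harness self-contained), not the llama.cpp API.

```cpp
#include <chrono>
#include <cstdio>
#include <thread>

// Placeholder for the real work: in the original setup this would be a call to
// llama_decode with a batch of n_tokens tokens. The sleep only makes the
// harness runnable on its own; it does not model real decode cost.
static void decode_tokens(int n_tokens) {
    std::this_thread::sleep_for(std::chrono::milliseconds(10 + 2 * n_tokens));
}

// Average per-call and per-token latency over n_iters calls, in milliseconds.
static void benchmark_decode(int n_tokens, int n_iters) {
    const auto t_start = std::chrono::steady_clock::now();
    for (int i = 0; i < n_iters; ++i) {
        decode_tokens(n_tokens);
    }
    const auto t_end = std::chrono::steady_clock::now();

    const double total_ms    = std::chrono::duration<double, std::milli>(t_end - t_start).count();
    const double per_call_ms = total_ms / n_iters;
    printf("n_tokens=%3d  per call: %8.3f ms  per token: %8.3f ms\n",
           n_tokens, per_call_ms, per_call_ms / n_tokens);
}

int main() {
    for (int n_tokens : {1, 2, 8, 16, 32, 64, 128}) {
        benchmark_decode(n_tokens, 16);
    }
    return 0;
}
```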