Suspiciously low performance in batched inference compared to single token #3771

Closed
4 tasks done
Microflame opened this issue Oct 25, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@Microflame
Contributor

Microflame commented Oct 25, 2023

Prerequisites

  • I am running the latest code. Development is very rapid so there are no tagged versions as of now.
  • I carefully followed the README.md.
  • I searched using keywords relevant to my issue to make sure that I am creating a new issue that is not already open (or closed).
  • I reviewed the Discussions, and have a new bug or useful enhancement to share.

Current Behavior

llama_decode takes about 4x longer to complete for 2 tokens than for 1 token. Specifically, when I feed a single token to llama_decode it takes ~12 ms to decode on average, while for 2 or more tokens llama_decode takes ~50 ms to complete. Naturally I would expect at most a 2x increase in time for processing twice the number of tokens, but in fact processing 2 tokens takes ~4x longer than processing 1 token.
Naively, one could assume that the llama.cpp CUDA code can be tweaked so that llama_decode for 2 tokens completes in at most twice the time it takes to decode 1 token. This would result in the following benefits:

  • Up to 2x reduction of prompt eval time for single sequence inference
  • Up to 2x decrease in next token prediction time for multi sequence inference

My question

So I was wondering whether these are sane considerations and, if so, whether one of the CUDA experts could pull off such an optimization.

Some additional notes

Here are the results of my measurements:

| n_tokens | llama_decode time, ms |
|---:|---:|
| 1 | 12 |
| 2 | 50 |
| 4 | 51 |
| 8 | 51 |
| 64 | 56 |

Environment and Context

I am running an RTX 4070 under WSL2.
The model is LLaMA 7B quantized with Q4_0.

Steps to Reproduce

The code I used to collect stats:

```cpp
#include <algorithm>
#include <cstdlib>
#include <iostream>
#include <iomanip>
#include <vector>

#include <llama.h>
#include <common.h>


void exit_if_false(bool cond, const char* msg) {
    if (!cond) {
        std::cerr << msg << std::endl;
        exit(1);
    }
}

const int BATCH_SIZE = 2;
const bool GPU = true;

int main(int argc, char* argv[]) {
    exit_if_false(argc > 1, "Usage: <program> <path-to-model.gguf>");
    std::cout << "Testing on " << (GPU ? "GPU" : "CPU") << '\n';
    llama_model_params model_params = llama_model_default_params();
    {
        model_params.n_gpu_layers = GPU ? 1000 : 0;  // offload all layers when testing on GPU
    }

    llama_context_params context_params = llama_context_default_params();
    {
        context_params.n_ctx = 1024;
        context_params.n_batch = BATCH_SIZE;
        context_params.n_threads = GPU ? 1 : 10;
    }

    llama_model* model = llama_load_model_from_file(argv[1], model_params);
    exit_if_false(model, "Can not load model");

    llama_context* ctx = llama_new_context_with_model(model, context_params);
    exit_if_false(ctx, "Can not create context");

    std::string prompt = "In another moment down went Alice after it, never once considering how in the world she was to get out again.";
    std::vector<llama_token> tokens = llama_tokenize(ctx, prompt, true, false);
    std::cout << "Processing " << tokens.size() << " tokens\n";

    // feed the prompt BATCH_SIZE tokens per llama_decode call and time each call
    llama_batch batch = llama_batch_init(BATCH_SIZE, 0, 1);
    double total_dt_ms = 0;
    int num_calls = 0;
    for (size_t start = 0; start < tokens.size(); start += BATCH_SIZE) {
        size_t end = std::min(start + BATCH_SIZE, tokens.size());

        llama_batch_clear(batch);
        for (size_t i = start; i < end; ++i) {
            llama_batch_add(batch, tokens[i], i, {0}, false);  // logits are not requested, only decode time is measured
        }

        double tstart = ggml_time_us();
        llama_decode(ctx, batch);
        double tend = ggml_time_us();
        double dt_ms = (tend - tstart) / 1000;
        std::cout << "llama_decode: " << std::setw(7) << std::fixed << std::setprecision(3) << dt_ms
                  << " ms. for " << std::setw(3) << batch.n_tokens << " token(s)\n";
        total_dt_ms += dt_ms;
        num_calls += 1;
    }
    llama_batch_free(batch);

    std::cout << "Average:\n"
        << (total_dt_ms / num_calls) << " ms. per call\n"
        << (total_dt_ms / tokens.size()) << " ms. per token\n";

    llama_free(ctx);
    llama_free_model(model);
    return 0;
}
```
Microflame added the bug label on Oct 25, 2023
@lxrite

lxrite commented Oct 25, 2023

#3545 (comment)
I think this is the reason.

@Microflame
Contributor Author

> #3545 (comment) I think this is the reason.

I guess I see how this could be the case, but the fact that this only shows up in the quantized CUDA case gives me hope that it can still be fixed somehow. In my tests CPU + Q4 is fine, and according to #3749 CUDA + FP16 is also fine. It is only CUDA + Q4 that is acting weird.

@KerfuffleV2
Collaborator

You might try tweaking some of the defines in ggml-cuda.cu. I played with various permutations via ROCm (which uses most of the CUDA code): #3749 (comment)

Copying the table from that comment for convenience:

| MX | MY | NW | PP1 | PP2 | TG1 | TG2 | TG3 | TG4 | TG5 | TG6 | TG7 |
|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|---:|
| 4 | 32 | 4 | 157.6 | 190.5 | 29.8 | 37.9 | 55.8 | 73.4 | 82.2 | 97.3 | 111.3 |
| 8 | 32 | 8 | 172.4 | 172.2 | 30.0 | 34.0 | 50.1 | 66.0 | 81.3 | 96.0 | 109.6 |
| 16 | 32 | 8 | 252.3 | 251.2 | 31.9 | 27.4 | 41.6 | 54.9 | 68.0 | 80.7 | 92.9 |
| 16 | 32 | 4 | 223.7 | 286.5 | 31.9 | 31.4 | 46.3 | 61.1 | 75.4 | 89.5 | 102.8 |
| 64 | 128 | 8 | 516.0 | 515.8 | 29.2 | 14.4 | 21.9 | 29.1 | 36.2 | 43.3 | 50.1 |
| 32 | 32 | 8 | 284.1 | 283.8 | 31.9 | 17.2 | 21.6 | 34.0 | 42.3 | 50.5 | 58.5 |

64/128/8 is the default. You can see that it doesn't break even with single-token generation until the batch size is 4 (but the prompt processing speed is great). GG also proposed 8/32/8 for Ampere-architecture Nvidia cards.

Modifying the defines is a bit annoying because you have to change all the ones that affect the quantizations your model uses. For example, a Q5_K_M model uses Q5_K and Q6_K, so to test you'd need to change the defines for both.
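
Roughly, the knobs in question look like this (the exact macro names and current default values vary between ggml-cuda.cu revisions and between quant types, so take this as an illustration rather than a patch):

```cpp
// Illustrative only: each quantization type has its own trio of defines.
// Trying the 8/32/8 combination from the table above for Q4_0 on an
// Ampere/Ada card would look something like this:
#define  MMQ_X_Q4_0_AMPERE 8   // "MX" column
#define  MMQ_Y_Q4_0_AMPERE 32  // "MY" column
#define NWARPS_Q4_0_AMPERE 8   // "NW" column
```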

@ggerganov
Owner

@Microflame Thank you for looking into this.

Please try the new branch in #3776 and let me know your results with RTX 4070

@Microflame
Contributor Author

Microflame commented Oct 26, 2023

So I have done some experiments on the RTX 4070, varying MMQ_X and MMQ_Y while leaving NWARPS and WARP_SIZE at their defaults. Here are the results, including the configuration from #3776.

Here I will use the terms batch size / BS / n_tokens interchangeably.

Time to call llama_decode in ms.

| MMQ_X / MMQ_Y | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
|---|---:|---:|---:|---:|---:|
| 4 / 32 | 29.891 | 53.710 | 97.843 | 187.631 | 361.295 |
| 4 / 64 | 28.461 | 51.782 | 93.105 | 175.487 | 336.617 |
| 4 / 128 | 29.625 | 54.205 | 96.655 | 180.841 | 348.829 |
| 8 / 32 | 20.812 | 38.341 | 62.257 | 122.892 | 227.166 |
| 8 / 64 | 20.604 | 36.354 | 62.756 | 114.453 | 225.396 |
| 8 / 128 | 24.694 | 36.426 | 61.958 | 112.404 | 217.993 |
| 16 / 32 | 28.107 | 32.901 | 52.693 | 101.583 | 188.438 |
| 16 / 64 | 24.269 | 28.112 | 45.716 | 80.278 | 159.500 |
| 16 / 128 | 25.525 | 27.747 | 41.072 | 75.095 | 148.149 |
| 32 / 32 | 40.825 | 43.274 | 47.103 | 84.782 | 170.380 |
| 32 / 64 | 33.508 | 36.028 | 36.941 | 68.523 | 125.096 |
| 32 / 128 | 33.316 | 34.977 | 35.219 | 59.444 | 110.120 |
| 64 / 32 | 63.074 | 66.827 | 68.703 | 77.136 | 148.736 |
| 64 / 64 | 49.107 | 52.348 | 52.767 | 60.961 | 111.457 |
| 64 / 128 | 51.156 | 52.819 | 53.281 | 57.953 | 96.687 |

Per token time, ms.

| MMQ_X / MMQ_Y | BS=8 | BS=16 | BS=32 | BS=64 | BS=128 |
|---|---:|---:|---:|---:|---:|
| 4 / 32 | 3.736 | 3.357 | 3.058 | 2.932 | 2.823 |
| 4 / 64 | 3.558 | 3.236 | 2.910 | 2.742 | 2.630 |
| 4 / 128 | 3.703 | 3.388 | 3.020 | 2.826 | 2.725 |
| 8 / 32 | 2.601 | 2.396 | 1.946 | 1.920 | 1.775 |
| 8 / 64 | 2.576 | 2.272 | 1.961 | 1.788 | 1.761 |
| 8 / 128 | 3.087 | 2.277 | 1.936 | 1.756 | 1.703 |
| 16 / 32 | 3.513 | 2.056 | 1.647 | 1.587 | 1.472 |
| 16 / 64 | 3.034 | 1.757 | 1.429 | 1.254 | 1.246 |
| 16 / 128 | 3.191 | 1.734 | 1.283 | 1.173 | 1.157 |
| 32 / 32 | 5.103 | 2.705 | 1.472 | 1.325 | 1.331 |
| 32 / 64 | 4.189 | 2.252 | 1.154 | 1.071 | 0.977 |
| 32 / 128 | 4.165 | 2.186 | 1.101 | 0.929 | 0.860 |
| 64 / 32 | 7.884 | 4.177 | 2.147 | 1.205 | 1.162 |
| 64 / 64 | 6.138 | 3.272 | 1.649 | 0.953 | 0.871 |
| 64 / 128 | 6.394 | 3.301 | 1.665 | 0.906 | 0.755 |

My understanding is that PR #3776 proposes setting MMQ_X / MMQ_Y to 4 / 32, but there seem to be more optimal combinations for n_tokens > 4, at least on the RTX 4070.

Also, here are some observations:

  • It seems that the optimal MMQ_X is equal to the batch size / n_tokens
  • In almost all cases the optimal MMQ_Y is 128
  • I could not increase MMQ_Y past 128 because I ran out of shared memory (Entry function '_Z12mul_mat_q5_1ILb1EEvPKvS1_Pfiiiii' uses too much shared data (0x12740 bytes, 0xc000 max))
  • Larger batch sizes and larger MMQ_X lead to higher throughput (well, this one is no surprise)
  • If we take latency into account, then I believe X / Y / BS = 16 / 128 / 16 is an interesting combination
  • Right now I am starting to realize that different combinations of X / Y may be optimal in different scenarios: smaller X for latency, larger X for throughput. The difference is significant here.

This all raises the question of whether it is possible to turn constants like MMQ_X into runtime variables. My understanding is that this would require runtime recompilation of the kernels, and I don't know how feasible that is. I know that OpenCL can compile kernels at runtime, but I am not sure about the nvcc side.

In general I suppose that llama.cpp may benefit from something like profiles/presets for various use scenarios / GPUs.
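
To make the preset idea a bit more concrete, here is a rough sketch of what I have in mind. None of these names exist in llama.cpp; the numbers are just lifted from my observations above (MMQ_X close to the batch size, MMQ_Y = 128, NWARPS left at the default of 8):

```cpp
// Hypothetical sketch of a "preset" layer -- not existing llama.cpp code.
// The idea: pick tile sizes per batch size (and eventually per GPU),
// instead of one compile-time constant for everything.
struct mmq_preset {
    int mmq_x;
    int mmq_y;
    int nwarps;
};

// Values loosely based on the RTX 4070 tables above.
static mmq_preset pick_mmq_preset(int n_tokens) {
    if (n_tokens <= 8)  { return {  8, 128, 8 }; }
    if (n_tokens <= 16) { return { 16, 128, 8 }; }
    if (n_tokens <= 32) { return { 32, 128, 8 }; }
    return { 64, 128, 8 };
}
```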

@ggerganov
Owner

Thanks for the data points. I agree with the observations, and adding support for custom profiles/presets is something to consider in the future. At the moment, I think we have to merge some version of this (#3776), even if it is not the optimal one, so that we improve the baseline performance, which on master is heavily degraded. Then we should think about the best way to provide a means for tuning the parameters for specific hardware.

@JohannesGaessler
Collaborator

> This all raises the question of whether it is possible to turn constants like MMQ_X into runtime variables. My understanding is that this would require runtime recompilation of the kernels, and I don't know how feasible that is. I know that OpenCL can compile kernels at runtime, but I am not sure about the nvcc side.

MMQ_X, MMQ_Y, and NWARPS are used as template parameters. Their values need to be known at compile time but it is possible to compile multiple kernels with different values and then select an appropriate kernel at runtime. The tradeoff is that this will result in longer compilations and larger binaries.
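
Schematically, the pattern is something like this (a stub only, not the actual mul_mat_q code):

```cuda
#include <cuda_runtime.h>

// Stub illustrating the template-parameter pattern: the tile sizes must be
// compile-time constants (e.g. to size shared memory statically), so one
// instantiation per (MMQ_X, MMQ_Y, NWARPS) combination is compiled and a
// thin runtime switch picks between them.
template <int MMQ_X, int MMQ_Y, int NWARPS>
__global__ void mul_mat_q_stub(const float * x, float * dst, int n) {
    __shared__ float tile[MMQ_Y][MMQ_X];          // sized at compile time

    const int nthreads = 32 * NWARPS;             // blockDim = (32, NWARPS)
    const int tid      = threadIdx.y * 32 + threadIdx.x;

    // Stand-in for the real per-tile work: stage data, then write it back.
    for (int i = tid; i < MMQ_X * MMQ_Y; i += nthreads) {
        tile[i / MMQ_X][i % MMQ_X] = i < n ? x[i] : 0.0f;
    }
    __syncthreads();
    for (int i = tid; i < n && i < MMQ_X * MMQ_Y; i += nthreads) {
        dst[i] = tile[i / MMQ_X][i % MMQ_X];
    }
}

// Each branch below refers to a separately compiled kernel: more branches
// mean longer compile times and a larger binary.
static void launch_mul_mat_q(int n_tokens, const float * x, float * dst, int n,
                             cudaStream_t stream) {
    const dim3 block(32, 8);                      // WARP_SIZE x NWARPS
    if (n_tokens <= 8) {
        mul_mat_q_stub< 8,  32, 8><<<1, block, 0, stream>>>(x, dst, n);
    } else if (n_tokens <= 32) {
        mul_mat_q_stub<32, 128, 8><<<1, block, 0, stream>>>(x, dst, n);
    } else {
        mul_mat_q_stub<64, 128, 8><<<1, block, 0, stream>>>(x, dst, n);
    }
}
```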

@JohannesGaessler
Collaborator

More generally, there are currently specialized kernels for the case of batch size == 1 and more general kernels for all other cases. That is why the performance for batch size == 2 in particular is bad. For optimal performance it may be necessary to write kernels specifically for very small batches since I'm not sure that the current MMQ kernels are optimal even with low MMQ_X values. In particular I'm not sure whether it makes sense to use shared memory for very small batches (compared to just using registers).
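
For illustration, a register-only small-batch kernel could look roughly like this (fp32 instead of quantized data for simplicity, and not llama.cpp code): one warp per output row keeps its partial sums for the NY columns of y in registers and combines them with warp shuffles, with no shared memory involved.

```cuda
// Register-only sketch for very small batches (NY = 1 or 2 columns of y).
// Launch e.g. as: mat_vec_small_batch<2><<<(nrows + 3) / 4, dim3(32, 4)>>>(...)
template <int NY>
__global__ void mat_vec_small_batch(const float * __restrict__ W,   // nrows x ncols
                                    const float * __restrict__ y,   // ncols x NY
                                    float * __restrict__ dst,       // nrows x NY
                                    int nrows, int ncols) {
    const int row  = blockIdx.x * blockDim.y + threadIdx.y;  // one warp per row
    const int lane = threadIdx.x;                            // 0..31
    if (row >= nrows) return;

    // Partial sums live entirely in registers.
    float sum[NY] = {0.0f};
    for (int col = lane; col < ncols; col += 32) {
        const float w = W[(size_t)row * ncols + col];
        #pragma unroll
        for (int j = 0; j < NY; ++j) {
            sum[j] += w * y[(size_t)col * NY + j];
        }
    }

    // Warp-level reduction, still register-to-register.
    #pragma unroll
    for (int j = 0; j < NY; ++j) {
        for (int offset = 16; offset > 0; offset >>= 1) {
            sum[j] += __shfl_down_sync(0xffffffff, sum[j], offset);
        }
        if (lane == 0) {
            dst[(size_t)row * NY + j] = sum[j];
        }
    }
}
```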
