
Faster multi-gpu strategy? #3120

Closed
calvintwr opened this issue Sep 11, 2023 · 10 comments

@calvintwr
calvintwr commented Sep 11, 2023

I am getting a slower TPS when using multiple GPUs than when using 1 GPU (selected via CUDA_VISIBLE_DEVICES).

No. of GPUs | TPS (generation)
----------- | ----------------
1           | 13.48
2           | 10.14
3           | 9.69
4           | 9.23

I have done multiple runs, so the TPS is an average.

The command and output are as follows (omitting the output for the 2- and 3-GPU runs):

Note: --n-gpu-layers is 76 in all runs so that the model fits on a single A100. This should not affect the results; I observed the same slowdown with smaller models where all layers are offloaded to the GPU.

4 GPUs

$ ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76

<truncated>

Log start
ggml_init_cublas: found 4 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0

<truncated>

llama_print_timings:        load time = 11896.31 ms
llama_print_timings:      sample time =   126.25 ms /   128 runs   (    0.99 ms per token,  1013.87 tokens per second)
llama_print_timings: prompt eval time =   570.27 ms /     6 tokens (   95.04 ms per token,    10.52 tokens per second)
llama_print_timings:        eval time = 13757.15 ms /   127 runs   (  108.32 ms per token,     9.23 tokens per second)
llama_print_timings:       total time = 14653.54 ms

1 GPU

$ CUDA_VISIBLE_DEVICES=GPU-0870b5a7-7e03-79d9-d3b2-e1277c9ca547 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76

<truncated>

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0

<truncated>

llama_print_timings:        load time = 11464.86 ms
llama_print_timings:      sample time =   127.03 ms /   128 runs   (    0.99 ms per token,  1007.66 tokens per second)
llama_print_timings: prompt eval time =   584.76 ms /     6 tokens (   97.46 ms per token,    10.26 tokens per second)
llama_print_timings:        eval time =  9420.06 ms /   127 runs   (   74.17 ms per token,    13.48 tokens per second)
llama_print_timings:       total time = 10333.01 ms

I read in the llama.cpp file, around this loop:

for (uint32_t i = 0; i < n_layer; ++i) {

that it seems to split up the tensors of each layer and spread them across the GPUs.

I suppose the slowdown is because of the synchronization steps given the tensor split.

Could it be a faster strategy to load the layers as a whole into the GPUs, and divide all layers across the GPUs?

For example, if there are 83 layers and 4 GPUs, GPU 0 can take 20 layers, and GPUs 1, 2, and 3 can take 21 layers each, as in the sketch below.
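
A minimal standalone C++ sketch of the proposed whole-layer assignment (the helper is hypothetical, not existing llama.cpp code): each GPU gets a contiguous block of whole layers, and the remainder goes to the later GPUs, so 83 layers on 4 GPUs comes out to 20, 21, 21, 21.

#include <cstdio>
#include <vector>

// Hypothetical sketch of a whole-layer split; not existing llama.cpp code.
static std::vector<int> split_whole_layers(int n_layer, int n_gpu) {
    std::vector<int> counts(n_gpu, n_layer / n_gpu);
    // hand the remainder to the last GPUs: 83 layers / 4 GPUs -> 20, 21, 21, 21
    for (int r = 0; r < n_layer % n_gpu; ++r) {
        counts[n_gpu - 1 - r] += 1;
    }
    return counts;
}

int main() {
    const std::vector<int> counts = split_whole_layers(83, 4);
    int first = 0;
    for (size_t g = 0; g < counts.size(); ++g) {
        printf("GPU %zu: layers %d..%d (%d layers)\n", g, first, first + counts[g] - 1, counts[g]);
        first += counts[g];
    }
    return 0;
}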

I will be more than happy to help implement the feature if it makes sense, and if I am pointed in the correct direction.

@Green-Sky
Collaborator

> and divide all layers across the GPUs?

afaik that is what is already happening.

> I suppose the slowdown is because of the synchronization

Yes, you can specify how many layers are assigned to which GPU. In your case, the fewer GPUs you use the better, so make it not use the extra GPUs (you seem to have the VRAM); see the example below.
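
For example (hedged: --tensor-split / -ts is the option controlling the per-GPU proportions in cuBLAS builds from around this time), restricting the run to one GPU, or pinning the split explicitly, would look like:

$ CUDA_VISIBLE_DEVICES=0 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76
$ CUDA_VISIBLE_DEVICES=0,1 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76 --tensor-split 1,1

The first form is the same trick already used in this issue; the second splits the offloaded tensors evenly across GPUs 0 and 1.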

there are also #3110 and #2470 happening. you can help test there :)

@calvintwr
Author

> and divide all layers across the GPUs?
>
> afaik that is what is already happening.

Thanks for your response.

You mean that if there are 80 layers and 4 GPUs, llama.cpp will load the first 20 layers into GPU 0, the next 20 into GPU 1, and so on?

Or does it split up the layers, like layer 0 part 0 into GPU 0 and layer 0 part 1 into GPU 1, and so on?

@JohnnyOpcode

I wonder if this has something to do with memory (tensor) addressing across the VRAM boundaries. I expect any copying across shards (GPU) will be over PCIe unless the setup has NVLink (like in a DGX).

PyTorch handles this pretty well.

https://www.run.ai/guides/multi-gpu/pytorch-multi-gpu-4-techniques-explained

@Green-Sky
Collaborator

> I wonder if this has something to do with memory (tensor) addressing across the VRAM boundaries. I expect any copying across shards (GPU) will be over PCIe unless the setup has NVLink (like in a DGX).

#2470 goes into more details on that.

> Or does it split up the layers, like layer 0 part 0 into GPU 0 and layer 0 part 1 into GPU 1, and so on?

No, it does not split up a layer. There are also extra "layers" beyond the 80 the model has; try -ngl 83, which will offload some more memory to the GPU(s). The KV cache would otherwise have to reside on the CPU and be streamed. But as said before, using multiple GPUs might make it slower.
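
Concretely, using the same model and prompt as above (VRAM permitting on a single 40 GB card), that would be:

$ ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 83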

@Green-Sky
Collaborator

Check out the docs here for some more details on the available command-line options: https://github.com/ggerganov/llama.cpp/tree/master/examples/main#additional-options

@JohannesGaessler
Collaborator

> Could it be a faster strategy to load the layers as a whole into the GPUs, and divide all layers across the GPUs?

Depends on interconnect vs. GPU speed. I think that, given enough optimization, splitting tensors as is currently being done will be faster. As mentioned above, look at #2470. Since A100s should have NVLink, the synchronization overhead should be much lower with peer access enabled.
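
For reference, a small standalone check of whether peer access is available between device pairs, using the plain CUDA runtime API (not llama.cpp code); nvidia-smi topo -m additionally shows whether the links are NVLink or PCIe.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            // 1 if device i can directly access device j's memory (NVLink or PCIe P2P)
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("peer access %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}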

> I will be more than happy to help implement the feature if it makes sense, and if I am pointed in the correct direction.

Write me an email or give me a way to contact you and I will give you the credentials for a Mumble server where I can explain to you what would need to be done. It will take a considerable amount of effort though.

> Or does it split up the layers, like layer 0 part 0 into GPU 0 and layer 0 part 1 into GPU 1, and so on?

Tensors are split by rows across GPUs. So currently the main GPU distributes the hidden state across GPUs, each GPU works on part of the matrix, and then each GPU writes back its result to the main GPU.
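
A conceptual CPU-only sketch of that data flow (not the actual CUDA code): W is split by rows, x is broadcast to every device, each device computes its slice of y = W*x, and the slices are gathered back on the main device.

#include <cstdio>
#include <vector>

int main() {
    const int rows = 8, cols = 4, n_gpu = 2;
    std::vector<float> W(rows * cols, 1.0f), x(cols, 1.0f), y(rows, 0.0f);

    const int rows_per_gpu = rows / n_gpu;
    for (int g = 0; g < n_gpu; ++g) {                  // one iteration stands in for one device
        const int row0 = g * rows_per_gpu;             // this device's row slice of W
        for (int r = row0; r < row0 + rows_per_gpu; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) {
                acc += W[r * cols + c] * x[c];         // x is broadcast; W rows are local
            }
            y[r] = acc;                                // slice written back to the main device
        }
    }
    for (int r = 0; r < rows; ++r) printf("y[%d] = %.1f\n", r, y[r]);
    return 0;
}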

@enn-nafnlaus

enn-nafnlaus commented Oct 2, 2023

I'm experiencing problems with uneven load balancing between GPUs. Here it is putting most of the load on GPU 1, which has half the compute capacity (and half the VRAM) of GPU 0:

[screenshot: GPU load concentrated on GPU 1]

It's not always like this; it just periodically gets into this state, and then progress proceeds at a crawl. The layer ratio between GPU 0 and GPU 1 is 43:17 (out of 60 layers).

@enn-nafnlaus

enn-nafnlaus commented Oct 2, 2023

Here it is getting back out of that state and resuming more balanced operations.

[screenshot: load more evenly balanced across the GPUs]

It'd be nice if there were a way to prevent this sort of imbalance, as it greatly reduces my mean inference speed.
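
One way to catch these episodes is to log per-GPU utilization over time with nvidia-smi (standard query flags, sampling every second):

$ nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1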

@bannsec

bannsec commented Dec 8, 2023

Hey, I'm seeing this exact same problem. In my case I'm not offloading any layers to RAM; everything is fully on the GPUs: two dedicated cards, two running instances of the model (each dedicated to a specific GPU via main_gpu), and I'm seeing the exact same type of slowdown. It's faster for me to use a single GPU and one instance of llama.cpp than two GPUs and two instances of llama.cpp.
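
For illustration, a hedged sketch of that setup using the environment-variable pinning already shown earlier in this thread (the model path and layer count are placeholders; --main-gpu is the in-process alternative):

$ CUDA_VISIBLE_DEVICES=0 ./main --model model.gguf --n-gpu-layers 99   # instance 1 sees only GPU 0
$ CUDA_VISIBLE_DEVICES=1 ./main --model model.gguf --n-gpu-layers 99   # instance 2 sees only GPU 1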

@github-actions github-actions bot added the stale label Mar 20, 2024
Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024