
Faster multi-gpu strategy? #3120

Closed
calvintwr opened this issue Sep 11, 2023 · 10 comments

@calvintwr
calvintwr commented Sep 11, 2023

I am getting a slower TPS when using multiple GPUs than when using 1 GPU (selected via CUDA_VISIBLE_DEVICES).

No. of GPUs | TPS (generation)
----------- | ----------------
1           | 13.48
2           | 10.14
3           | 9.69
4           | 9.23

I have done multiple runs, so the TPS is an average.

The command and output are as follows (omitting the output for the 2- and 3-GPU runs):

Note: --n-gpu-layers is 76 in all runs so that the model fits on a single A100. This should not affect the results; I observed the same slowdown with smaller models where all layers are offloaded to the GPU.

4 GPUs

$ ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76

<truncated>

Log start
ggml_init_cublas: found 4 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0

<truncated>

llama_print_timings:        load time = 11896.31 ms
llama_print_timings:      sample time =   126.25 ms /   128 runs   (    0.99 ms per token,  1013.87 tokens per second)
llama_print_timings: prompt eval time =   570.27 ms /     6 tokens (   95.04 ms per token,    10.52 tokens per second)
llama_print_timings:        eval time = 13757.15 ms /   127 runs   (  108.32 ms per token,     9.23 tokens per second)
llama_print_timings:       total time = 14653.54 ms

1 GPU

$ CUDA_VISIBLE_DEVICES=GPU-0870b5a7-7e03-79d9-d3b2-e1277c9ca547 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76

<truncated>

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0

<truncated>

llama_print_timings:        load time = 11464.86 ms
llama_print_timings:      sample time =   127.03 ms /   128 runs   (    0.99 ms per token,  1007.66 tokens per second)
llama_print_timings: prompt eval time =   584.76 ms /     6 tokens (   97.46 ms per token,    10.26 tokens per second)
llama_print_timings:        eval time =  9420.06 ms /   127 runs   (   74.17 ms per token,    13.48 tokens per second)
llama_print_timings:       total time = 10333.01 ms

I read in the llama.cpp file, around this loop:

for (uint32_t i = 0; i < n_layer; ++i) {

that it seems to split up the tensors of each layer and spread them across the GPUs.

I suppose the slowdown is because of the synchronization steps given the tensor split.

Could it be a faster strategy to load the layers as a whole into the GPUs, and divide all layers across the GPUs?

For example, if there are 83 layers and 4 GPUs, GPU 0 can take 20 layers, and GPUs 1, 2, and 3 can take 21 layers each, as in the sketch below.
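
A minimal standalone C++ sketch of the proposed whole-layer assignment (the helper is hypothetical, not existing llama.cpp code): each GPU gets a contiguous block of whole layers, and the remainder goes to the later GPUs, so 83 layers on 4 GPUs comes out to 20, 21, 21, 21.

#include <cstdio>
#include <vector>

// Hypothetical sketch of a whole-layer split; not existing llama.cpp code.
static std::vector<int> split_whole_layers(int n_layer, int n_gpu) {
    std::vector<int> counts(n_gpu, n_layer / n_gpu);
    // hand the remainder to the last GPUs: 83 layers / 4 GPUs -> 20, 21, 21, 21
    for (int r = 0; r < n_layer % n_gpu; ++r) {
        counts[n_gpu - 1 - r] += 1;
    }
    return counts;
}

int main() {
    const std::vector<int> counts = split_whole_layers(83, 4);
    int first = 0;
    for (size_t g = 0; g < counts.size(); ++g) {
        printf("GPU %zu: layers %d..%d (%d layers)\n", g, first, first + counts[g] - 1, counts[g]);
        first += counts[g];
    }
    return 0;
}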

I will be more than happy to help implement the feature if it makes sense, and if I am pointed in the correct direction.

@Green-Sky
Collaborator

> and divide all layers across the GPUs?

afaik that is what is already happening.

> I suppose the slowdown is because of the synchronization

Yes, you can specify how many layers are assigned to which GPU. In your case, the fewer GPUs you use the better, so make it not use the extra GPUs (you seem to have the VRAM); see the example below.
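
For example (hedged: --tensor-split / -ts is the option controlling the per-GPU proportions in cuBLAS builds from around this time), restricting the run to one GPU, or pinning the split explicitly, would look like:

$ CUDA_VISIBLE_DEVICES=0 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76
$ CUDA_VISIBLE_DEVICES=0,1 ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 76 --tensor-split 1,1

The first form is the same trick already used in this issue; the second splits the offloaded tensors evenly across GPUs 0 and 1.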

there are also #3110 and #2470 happening. you can help test there :)

@calvintwr
Author

> and divide all layers across the GPUs?
>
> afaik that is what is already happening.

Thanks for your response.

You mean that if there are 80 layers and 4 GPUs, llama.cpp will load the first 20 layers into GPU 0, the next 20 into GPU 1, and so on?

Or does it split up the layers, like layer 0 part 0 into GPU 0 and layer 0 part 1 into GPU 1, and so on?

@JohnnyOpcode

I wonder if this has something to do with memory (tensor) addressing across the VRAM boundaries. I expect any copying across shards (GPU) will be over PCIe unless the setup has NVLink (like in a DGX).

PyTorch handles this pretty well.

https://www.run.ai/guides/multi-gpu/pytorch-multi-gpu-4-techniques-explained

@Green-Sky
Collaborator

> I wonder if this has something to do with memory (tensor) addressing across the VRAM boundaries. I expect any copying across shards (GPU) will be over PCIe unless the setup has NVLink (like in a DGX).

#2470 goes into more details on that.

> Or does it split up the layers, like layer 0 part 0 into GPU 0 and layer 0 part 1 into GPU 1, and so on?

No, it does not split up a layer. There are also extra "layers" beyond the 80 the model has; try -ngl 83, which will offload some more memory to the GPU(s). The KV cache would otherwise have to reside on the CPU and be streamed. But as said before, using multiple GPUs might make it slower.
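
Concretely, using the same model and prompt as above (VRAM permitting on a single 40 GB card), that would be:

$ ./main --model ../llama2-70b-chat-q4_1.gguf --prompt "The quick brown fox" --n-predict 128 --ctx-size 4096 --n-gpu-layers 83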

@Green-Sky
Collaborator

Check out the docs here for some more details on the available command-line options: https://github.com/ggerganov/llama.cpp/tree/master/examples/main#additional-options

@JohannesGaessler
Collaborator

> Could it be a faster strategy to load the layers as a whole into the GPUs, and divide all layers across the GPUs?

Depends on interconnect vs. GPU speed. I think that, given enough optimization, splitting tensors as is currently being done will be faster. As mentioned above, look at #2470. Since A100s should have NVLink, the synchronization overhead should be much lower with peer access enabled.
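
For reference, a small standalone check of whether peer access is available between device pairs, using the plain CUDA runtime API (not llama.cpp code); nvidia-smi topo -m additionally shows whether the links are NVLink or PCIe.

#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            // 1 if device i can directly access device j's memory (NVLink or PCIe P2P)
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("peer access %d -> %d: %s\n", i, j, can ? "yes" : "no");
        }
    }
    return 0;
}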

> I will be more than happy to help implement the feature if it makes sense, and if I am pointed in the correct direction.

Write me an email or give me a way to contact you and I will give you the credentials for a Mumble server where I can explain to you what would need to be done. It will take a considerable amount of effort though.

> Or does it split up the layers, like layer 0 part 0 into GPU 0 and layer 0 part 1 into GPU 1, and so on?

Tensors are split by rows across GPUs. So currently the main GPU distributes the hidden state across GPUs, each GPU works on part of the matrix, and then each GPU writes back its result to the main GPU.
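
A conceptual CPU-only sketch of that data flow (not the actual CUDA code): W is split by rows, x is broadcast to every device, each device computes its slice of y = W*x, and the slices are gathered back on the main device.

#include <cstdio>
#include <vector>

int main() {
    const int rows = 8, cols = 4, n_gpu = 2;
    std::vector<float> W(rows * cols, 1.0f), x(cols, 1.0f), y(rows, 0.0f);

    const int rows_per_gpu = rows / n_gpu;
    for (int g = 0; g < n_gpu; ++g) {                  // one iteration stands in for one device
        const int row0 = g * rows_per_gpu;             // this device's row slice of W
        for (int r = row0; r < row0 + rows_per_gpu; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) {
                acc += W[r * cols + c] * x[c];         // x is broadcast; W rows are local
            }
            y[r] = acc;                                // slice written back to the main device
        }
    }
    for (int r = 0; r < rows; ++r) printf("y[%d] = %.1f\n", r, y[r]);
    return 0;
}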

@enn-nafnlaus

enn-nafnlaus commented Oct 2, 2023

I'm experiencing problems with uneven load balancing between GPUs. Here it is putting most of the load on GPU 1, which has half the compute capacity (and half the VRAM) of GPU 0:

[screenshot: GPU load concentrated on GPU 1]

It's not always like this; it just periodically gets into this state, and then progress proceeds at a crawl. The layer ratio between GPU 0 and GPU 1 is 43:17 (out of 60 layers).

@enn-nafnlaus

enn-nafnlaus commented Oct 2, 2023

Here it is getting back out of that state and resuming more balanced operations.

[screenshot: load more evenly balanced across the GPUs]

It'd be nice if there were a way to prevent this sort of imbalance, as it greatly reduces my mean inference speed.
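
One way to catch these episodes is to log per-GPU utilization over time with nvidia-smi (standard query flags, sampling every second):

$ nvidia-smi --query-gpu=index,utilization.gpu,memory.used --format=csv -l 1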

@bannsec

bannsec commented Dec 8, 2023

Hey, I'm seeing this exact same problem. In my case I'm not offloading any layers to RAM; everything is fully on the GPUs: two dedicated cards, two running instances of the model (each dedicated to a specific GPU via main_gpu), and I'm seeing the exact same type of slowdown. It's faster for me to use a single GPU and one instance of llama.cpp than two GPUs and two instances of llama.cpp.
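
For illustration, a hedged sketch of that setup using the environment-variable pinning already shown earlier in this thread (the model path and layer count are placeholders; --main-gpu is the in-process alternative):

$ CUDA_VISIBLE_DEVICES=0 ./main --model model.gguf --n-gpu-layers 99   # instance 1 sees only GPU 0
$ CUDA_VISIBLE_DEVICES=1 ./main --model model.gguf --n-gpu-layers 99   # instance 2 sees only GPU 1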

@github-actions github-actions bot added the stale label Mar 20, 2024
Contributor

github-actions bot commented Apr 3, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 3, 2024