Faster multi-gpu strategy? #3120
Comments
AFAIK that is what is already happening.
Yes, you can specify how many layers are assigned to which GPU. In your case, the fewer GPUs you use the better, so make it not use the extra GPUs (you seem to have the VRAM). There are also #3110 and #2470 in progress; you can help test there :)
Thanks for your response. You mean that if there are 80 layers and 4 GPUs, llama.cpp will load the first 20 layers into GPU 0, the next 20 into GPU 1, and so on? Or does it split up the layers, i.e. layer 0 part 0 into GPU 0, layer 0 part 1 into GPU 1, and so on?
I wonder if this has something to do with memory (tensor) addressing across the VRAM boundaries. I expect any copying across shards (GPUs) will be over PCIe unless the setup has NVLink (like in a DGX). PyTorch handles this pretty well: https://www.run.ai/guides/multi-gpu/pytorch-multi-gpu-4-techniques-explained
#2470 goes into more details on that.
No, it does not split up a layer. There are also extra "layers" beyond the 80 the model has. Try to use
Check out the docs here for some more details on the available command-line options: https://github.com/ggerganov/llama.cpp/tree/master/examples/main#additional-options
Depends on interconnect vs. GPU speed. I think that, given enough optimization, splitting tensors as is currently being done will be faster. As mentioned above, look at #2470. Since A100s should have NVLink, the synchronization overhead should be much lower with peer access enabled.
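To illustrate what "peer access enabled" means at the CUDA level, here is a minimal sketch (not the actual llama.cpp code path; the device IDs and buffer size are made up): it checks whether two GPUs can reach each other directly, enables peer access, and then issues a device-to-device copy that travels over NVLink or PCIe depending on the hardware.

```cpp
// Minimal sketch: enable peer-to-peer access between GPU 0 and GPU 1,
// then copy a buffer directly between them. Without peer access the
// copy is staged through host memory; with it, the transfer goes over
// NVLink or PCIe P2P, whichever the hardware provides.
#include <cuda_runtime.h>
#include <cstdio>

int main() {
    int can_01 = 0, can_10 = 0;
    cudaDeviceCanAccessPeer(&can_01, 0, 1);
    cudaDeviceCanAccessPeer(&can_10, 1, 0);

    if (can_01 && can_10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);  // flags argument must be 0
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);
    }

    const size_t bytes = 1 << 20;  // 1 MiB stand-in for a hidden-state buffer
    float *buf0 = nullptr, *buf1 = nullptr;
    cudaSetDevice(0);
    cudaMalloc(&buf0, bytes);
    cudaSetDevice(1);
    cudaMalloc(&buf1, bytes);

    // Device-to-device copy; the API call is the same with or without P2P.
    cudaMemcpyPeer(buf1, 1, buf0, 0, bytes);
    cudaDeviceSynchronize();
    printf("peer access: %s, copied %zu bytes GPU0 -> GPU1\n",
           (can_01 && can_10) ? "enabled" : "unavailable", bytes);

    cudaFree(buf1);
    cudaSetDevice(0);
    cudaFree(buf0);
    return 0;
}
```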
Write me an email or give me a way to contact you and I will give you the credentials for a Mumble server where I can explain to you what would need to be done. It will take a considerable amount of effort though.
Tensors are split by rows across GPUs. So currently the main GPU distributes the hidden state across GPUs, each GPU works on part of the matrix, and then each GPU writes back its result to the main GPU.
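As a rough illustration of that row-wise split (a plain C++ stand-in with the per-GPU work done in a loop, not the actual CUDA implementation; the matrix sizes are made up): each GPU owns a contiguous block of rows of the weight matrix, multiplies it by the broadcast hidden state, and its partial output is written back into the result on the main GPU.

```cpp
// Sketch of row-wise tensor splitting: each GPU owns a contiguous block
// of rows of the weight matrix W, multiplies it by the broadcast hidden
// state, and the partial outputs are concatenated on the main GPU.
#include <vector>
#include <cstdio>

int main() {
    const int n_gpus = 4;
    const int rows = 8, cols = 4;                  // toy weight matrix W (rows x cols)
    std::vector<float> W(rows * cols, 1.0f);       // all ones for simplicity
    std::vector<float> hidden(cols, 2.0f);         // hidden state, broadcast to every GPU
    std::vector<float> out(rows, 0.0f);            // gathered result on the "main GPU"

    const int rows_per_gpu = rows / n_gpus;        // 2 rows per GPU in this toy case
    for (int g = 0; g < n_gpus; ++g) {
        const int r0 = g * rows_per_gpu;
        const int r1 = r0 + rows_per_gpu;
        // Each "GPU" computes only its rows of W * hidden.
        for (int r = r0; r < r1; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < cols; ++c) acc += W[r * cols + c] * hidden[c];
            out[r] = acc;                          // write-back to the main GPU
        }
    }

    for (int r = 0; r < rows; ++r) printf("out[%d] = %.1f\n", r, out[r]);
    return 0;
}
```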
I'm experiencing problems with uneven balancing of load between GPUs. Here it is putting most of the load on GPU 1, which has half the compute capacity (and half the VRAM) of GPU 0. It's not always like this; it just periodically gets into this state, and then progress continues at a crawl. The layer ratio between GPU 0 and GPU 1 is 43:17 (out of 60 layers).
Hey, I'm seeing this exact same problem. In my case, I'm not offloading any layers to RAM; everything is fully on the GPUs. Two dedicated cards, two running instances of the model (each dedicated to a specific GPU).
This issue was closed because it has been inactive for 14 days since being marked as stale.
I am getting slower TPS when using multiple GPUs, as opposed to using 1 GPU (selected via CUDA_VISIBLE_DEVICES). I have done multiple runs, so the TPS is an average.
The command and output are as follows (omitting the outputs for the 2- and 3-GPU runs):
Note: --n-gpu-layers is 76 in all cases in order to fit the model into a single A100. This should not affect the results, as I observed the same slowdown for smaller models where all layers are offloaded to the GPU.
4 GPUs
1 GPU
I also read the relevant code in llama.cpp (llama.cpp/llama.cpp, line 2041 at commit 6eeb4d9).
I suppose the slowdown is because of the synchronization steps required by the tensor split.
Could it be a faster strategy to load whole layers into the GPUs, and divide all the layers across the GPUs?
For example, if there are 83 layers and 4 GPUs, GPU 0 can take 20 layers, and GPUs 1, 2, and 3 can take 21 layers each (see the sketch below).
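A minimal sketch of that whole-layer assignment (a hypothetical illustration, not existing llama.cpp code): compute a contiguous range of layers per GPU so that each layer lives entirely on one device.

```cpp
// Sketch: divide n_layers whole layers across n_gpus, giving the
// remainder to the later GPUs (83 layers / 4 GPUs -> 20, 21, 21, 21).
// Hypothetical illustration, not an existing llama.cpp function.
#include <cstdio>

int main() {
    const int n_layers = 83;
    const int n_gpus = 4;

    const int base = n_layers / n_gpus;    // 20 layers as the baseline per GPU
    const int extra = n_layers % n_gpus;   // 3 GPUs get one extra layer

    int first = 0;
    for (int g = 0; g < n_gpus; ++g) {
        // Give the extra layers to the last `extra` GPUs.
        const int count = base + (g >= n_gpus - extra ? 1 : 0);
        printf("GPU %d: layers %d..%d (%d layers)\n", g, first, first + count - 1, count);
        first += count;
    }
    return 0;
}
```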
I will be more than happy to help implement the feature if it makes sense and I am pointed in the right direction.