Tensor parallelism is all you need. Run LLMs on weak devices or make powerful devices even more powerful by distributing the workload and dividing the RAM usage. This project proves that it's possible to split the workload of LLMs across multiple devices and achieve a significant speedup. Distributed Llama allows you to run huge LLMs in-house. The project uses TCP sockets to synchronize the state. You can easily configure your AI cluster by using a home router.
Distributed Llama running Llama 2 70B on 8 Raspberry Pi 4B devices
🔥 Run Distributed Llama with a single command
Python and GCC are required. Download this repository and run:
- Llama 3 8B:
python download-model.py llama3
- Llama 3 8B Instruct:
python download-model.py llama3_instruct
- TinyLlama:
python download-model.py tinylama
Supported modes:
- Inference
- Chat
- API Server
Known limitations:
- You can run Distributed Llama only on 1, 2, 4... 2^n nodes.
- The maximum number of nodes is equal to the number of KV heads in the model #70.
- Optimized for (weights format × buffer format):
  - ARM CPUs
    - ✅ F32 × F32
    - ❌ F16 × F32
    - ❌ Q40 × F32
    - ✅ Q40 × Q80
  - x86_64 AVX2 CPUs
    - ❌ F32 × F32
    - ❌ F16 × F32
    - ❌ Q40 × F32
    - ✅ Q40 × Q80
Architecture
The project is split up into two parts:
- Root node - it's responsible for loading the model and weights and forwarding them to the workers. It also synchronizes the state of the neural network. The root node is also a worker: it processes its own slice of the neural network.
- Worker node - it processes its own slice of the neural network. It doesn't require any configuration related to the model.
You always need the root node and you can add 2^n - 1 worker nodes to speed up the inference. The RAM usage of the neural network is split up across all nodes. The root node requires a bit more RAM than worker nodes.
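The node-count rules above can be checked before starting a cluster. The following is a minimal sketch, not part of Distributed Llama: `validate_cluster` and `is_power_of_two` are hypothetical helper names, and `n_kv_heads` has to be read from the configuration of the model you run (the value 8 below is only an illustrative assumption).

```python
def is_power_of_two(n: int) -> bool:
    """Return True for 1, 2, 4, 8, ... and False otherwise."""
    return n > 0 and (n & (n - 1)) == 0


def validate_cluster(n_workers: int, n_kv_heads: int) -> int:
    """Check a planned cluster against the documented limits.

    Hypothetical helper: returns the total node count (root + workers)
    or raises ValueError when a documented limit is violated.
    """
    n_nodes = n_workers + 1  # the root node also processes a slice
    if not is_power_of_two(n_nodes):
        raise ValueError(f"{n_nodes} nodes: total node count must be 2^n")
    if n_nodes > n_kv_heads:
        raise ValueError(
            f"{n_nodes} nodes exceed the {n_kv_heads} KV heads of the model"
        )
    return n_nodes


# Illustrative values only: 3 workers and an assumed model with 8 KV heads.
print(validate_cluster(n_workers=3, n_kv_heads=8))
```

With 3 workers the total is 4 nodes, which satisfies both rules; 2 workers (3 nodes) would be rejected.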
All tests below utilized Q40 weights and a Q80 buffer. The generation time encompasses the inference time, network transfer time, sampling time, and multi-thread synchronization time. Number of samples: 16.
Raspberry Pi 5 8 GB
Model | 1 x RasPi 5 8 GB | 2 x RasPi 5 8 GB | 4 x RasPi 5 8 GB |
---|---|---|---|
Llama 2 7B | 441.09 ms, 2.26 t/s (I: 434.84 ms, T: 5.25 ms) | 341.46 ms, 2.92 t/s (I: 257.78 ms, T: 83.27 ms) | 219.08 ms, 4.56 t/s (I: 163.42 ms, T: 55.25 ms) |
Llama 3 8B | 564.31 ms, 1.77 t/s (I: 556.67 ms, T: 6.17 ms) | 444.27 ms, 2.25 t/s (I: 362.73 ms, T: 80.11 ms) | 331.47 ms, 3.01 t/s (I: 267.62 ms, T: 62.34 ms) |
I - inference time of the root node, T - network transfer time, tested on 0.3.1 version
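The relation between the per-token generation time and the tokens-per-second figures can be reproduced with a few lines. This is a sketch using the Llama 2 7B numbers from the Raspberry Pi 5 table; the function names are hypothetical, and rounded results may differ from the table in the last digit.

```python
def tokens_per_second(generation_ms: float) -> float:
    """Convert the average generation time per token (ms) to tokens/s."""
    return 1000.0 / generation_ms


def speedup(single_node_ms: float, multi_node_ms: float) -> float:
    """Speedup of a multi-node run relative to the single-node run."""
    return single_node_ms / multi_node_ms


# Generation time per token for Llama 2 7B, copied from the table above.
llama2_7b_ms = {1: 441.09, 2: 341.46, 4: 219.08}

for nodes, ms in llama2_7b_ms.items():
    print(
        f"{nodes} node(s): {tokens_per_second(ms):.2f} t/s, "
        f"speedup {speedup(llama2_7b_ms[1], ms):.2f}x"
    )
```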
Raspberry Pi 4B 8 GB
All Raspberry Pi units were connected via Gigabit Ethernet to the TP-Link LS1008G Switch.
Model | 1 x RasPi 4B 8 GB | 2 x RasPi 4B 8 GB | 4 x RasPi 4B 8 GB | 8 x RasPi 4B 8 GB |
---|---|---|---|---|
Llama 2 7B | 1312.50 ms (I: 1307.94 ms, T: 1.81 ms) | 793.69 ms (I: 739.00 ms, T: 52.50 ms) | 494.00 ms 🔥 (I: 458.81 ms, T: 34.06 ms) | 588.19 ms (I: 296.69 ms, T: 289.75 ms) |
Llama 2 13B | Not enough RAM | 1497.19 ms (I: 1465.06 ms, T: 30.88 ms) | 848.19 ms 🔥 (I: 746.88 ms, T: 99.50 ms) | 1114.88 ms (I: 460.8 ms, T: 652.88 ms) |
Llama 2 70B | Not enough RAM | Not enough RAM | Not enough RAM | 4842.81 ms 🔥 (I: 2121.94 ms, T: 2719.62 ms) |
I - inference time of the root node, T - network transfer time, tested on 0.1.0 version
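To see why 8 nodes are slower than 4 for Llama 2 7B in this table, it helps to look at how much of each token's generation time goes to network transfer. The sketch below uses only values from the table; `transfer_share` is a hypothetical helper name.

```python
def transfer_share(total_ms: float, transfer_ms: float) -> float:
    """Fraction of the per-token generation time spent on network transfer."""
    return transfer_ms / total_ms


# (total ms, transfer ms) for Llama 2 7B, copied from the table above.
llama2_7b = {4: (494.00, 34.06), 8: (588.19, 289.75)}

for nodes, (total_ms, transfer_ms) in llama2_7b.items():
    share = transfer_share(total_ms, transfer_ms)
    print(f"{nodes} nodes: {share:.1%} of the generation time is transfer")
```

On this setup the transfer share grows from about 7% with 4 nodes to about 49% with 8 nodes, so the extra network traffic outweighs the shorter inference time.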
x86_64 CPU Cloud Server
All tests below were conducted on c3d-highcpu-30 (30 vCPU, 15 cores, 59 GB memory) VMs in Google Cloud.
Model | 1 x VM | 2 x VM | 4 x VM |
---|---|---|---|
Llama 2 7B | 101.81 ms (I: 101.06 ms, T: 0.19 ms) | 69.69 ms (I: 61.50 ms, T: 7.62 ms) | 53.69 ms 🔥 (I: 40.25 ms, T: 12.81 ms) |
Llama 2 13B | 184.19 ms (I: 182.88 ms, T: 0.69 ms) | 115.38 ms (I: 107.12 ms, T: 7.81 ms) | 86.81 ms 🔥 (I: 66.25 ms, T: 19.94 ms) |
Llama 2 70B | 909.69 ms (I: 907.25 ms, T: 1.75 ms) | 501.38 ms (I: 475.50 ms, T: 25.00 ms) | 293.06 ms 🔥 (I: 264.00 ms, T: 28.50 ms) |
I - inference time of the root node, T - network transfer time, tested on 0.1.0 version
F32 Buffer
Model | 2 devices | 4 devices | 8 devices |
---|---|---|---|
Llama 3 8B | 2048 kB (S: 1024 kB, R: 1024 kB) | 6144 kB (S: 3072 kB, R: 3072 kB) | 14336 kB (S: 7168 kB, R: 7168 kB) |
S - data sent from the root node to workers, R - data received by the root node from workers, tested on 0.7.1 version
Q80 Buffer
Model | 2 devices | 4 devices | 8 devices |
---|---|---|---|
Llama 3 8B | 544 kB (S: 272 kB, R: 272 kB) | 1632 kB (S: 816 kB, R: 816 kB) | 3808 kB (S: 1904 kB, R: 1904 kB) |
S - data sent from the root node to workers, R - data received by the root node from workers, tested on 0.7.1 version
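The ratio between the F32 and Q80 transfer sizes is consistent with a block-quantized buffer. The sketch below assumes a Q80 layout of 32 one-byte values plus one 2-byte scale per block (34 bytes per 32 values); that layout is an assumption made for this illustration and is not stated in the tables, and the helper name is hypothetical.

```python
F32_BYTES_PER_VALUE = 4
Q80_BLOCK_VALUES = 32       # assumed number of values per Q80 block
Q80_BLOCK_BYTES = 32 + 2    # assumed: 32 int8 values + one 16-bit scale


def q80_kb_from_f32_kb(f32_kb: float) -> float:
    """Estimate the Q80 transfer size for the same number of values."""
    n_values = f32_kb * 1024 / F32_BYTES_PER_VALUE
    n_blocks = n_values / Q80_BLOCK_VALUES
    return n_blocks * Q80_BLOCK_BYTES / 1024


# F32 totals per token for Llama 3 8B, copied from the F32 Buffer table.
f32_totals_kb = {2: 2048, 4: 6144, 8: 14336}

for devices, f32_kb in f32_totals_kb.items():
    print(f"{devices} devices: F32 {f32_kb} kB, Q80 {q80_kb_from_f32_kb(f32_kb):.0f} kB")
```

Under this assumption the estimates match the Q80 Buffer table (544 kB, 1632 kB, 3808 kB), which is roughly a quarter of the F32 traffic.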
- Install Raspberry Pi OS Lite (64 bit) on your Raspberry Pi devices. This OS doesn't have a desktop environment.
- Connect all devices to the Gigabit switch.
- Connect to all devices via SSH.
ssh <user>@<device-1-address>
ssh <user>@<device-2-address>
- Install Git:
sudo apt install git
- Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
- Compile Distributed Llama:
make dllama
- Transfer weights and the tokenizer file to the root device.
- Optional: assign static IP addresses.
sudo ip addr add 10.0.0.1/24 dev eth0 # 1st device
sudo ip addr add 10.0.0.2/24 dev eth0 # 2nd device
- Run worker nodes on worker devices:
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
- Run root node on the root device:
sudo nice -n -20 ./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 10.0.0.2:9998
To add more worker nodes, just add more addresses to the --workers argument.
./dllama inference ... --workers 10.0.0.2:9998 10.0.0.3:9998 10.0.0.4:9998
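When the worker list changes often, the root command can be assembled programmatically. The following is a small sketch that only reuses the flags shown in the commands above; `build_root_command` is a hypothetical helper, and omitting `--workers` for a single-node run is an assumption.

```python
import shlex


def build_root_command(model, tokenizer, workers, prompt="Hello world",
                       steps=16, nthreads=4):
    """Build the argument list for the root node (hypothetical helper)."""
    args = [
        "./dllama", "inference",
        "--model", model,
        "--tokenizer", tokenizer,
        "--weights-float-type", "q40",
        "--buffer-float-type", "q80",
        "--prompt", prompt,
        "--steps", str(steps),
        "--nthreads", str(nthreads),
    ]
    if workers:
        # All worker addresses follow a single --workers flag.
        args += ["--workers", *workers]
    return args


workers = ["10.0.0.2:9998", "10.0.0.3:9998", "10.0.0.4:9998"]
cmd = build_root_command(
    "../dllama_llama-2-7b_q40.bin",
    "../dllama-llama2-tokenizer.t",
    workers,
)
# shlex.join quotes arguments such as the prompt for safe copy-pasting.
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run` as-is; the printed string is only for copying into a shell.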
You need an x86_64 CPU with AVX2 support or an ARM CPU. Different devices may have different CPUs. The instructions below are for Debian-based distributions, but you can easily adapt them to your distribution, macOS, or Windows.
- Install Git and G++:
sudo apt install git build-essential
- Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
- Compile Distributed Llama:
make dllama
- Transfer weights and the tokenizer file to the root node.
- Run worker nodes on worker devices:
sudo nice -n -20 ./dllama worker --port 9998 --nthreads 4
- Run root node on the root device:
sudo nice -n -20 ./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
- To run the root node in the chat mode:
sudo nice -n -20 ./dllama chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
- Install Git and MinGW (via Chocolatey):
choco install mingw
- Clone this repository:
git clone https://github.com/b4rtaz/distributed-llama.git
- Compile Distributed Llama:
make dllama
- Transfer weights and the tokenizer file to the root node.
- Run worker nodes on worker devices:
./dllama worker --port 9998 --nthreads 4
- Run root node on the root device:
./dllama inference --model ../dllama_llama-2-7b_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --prompt "Hello world" --steps 16 --nthreads 4 --workers 192.168.0.1:9998
- To run the root node in the chat mode:
./dllama chat --model ../dllama_llama-2-7b-chat_q40.bin --tokenizer ../dllama-llama2-tokenizer.t --weights-float-type q40 --buffer-float-type q80 --nthreads 4 --workers 192.168.0.1:9998
This project is released under the MIT license.
@misc{dllama,
author = {Bartłomiej Tadych},
title = {Distributed Llama},
year = {2024},
publisher = {GitHub},
journal = {GitHub repository},
howpublished = {\url{https://github.com/b4rtaz/distributed-llama}},
commit = {7eb77ca93ec0d502e28d36b6fb20039b449cbea4}
}