Releases: b4rtaz/distributed-llama
0.11.0 🚀
This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.
Key changes:
- All nodes in the Distributed Llama cluster are now interconnected using a mesh topology. Previously, a star topology was used.
- Now, every layer is distributed across all nodes, including the last layer, which previously caused a major bottleneck.
- Norm layers are now calculated redundantly on all nodes. While redundant, this step is very fast and does not impact performance significantly.
Measurement
4 x Raspberry Pi 5 8GB
Model | Token/s - 0.10.6 | Token/s - This version | Acceleration |
---|---|---|---|
Llama 3.2 1B Q40 | 9.90 | 21.42 | 2.1x |
Llama 3.2 3B Q40 | 3.47 | 9.01 | 2.6x 🚀 |
Llama 3 8B Q40 | 2.83 | 4.67 | 1.6x |
2 x Raspberry Pi 5 8GB
Model | Tok/s - 0.10.6 | Tok/s - This version | Acceleration |
---|---|---|---|
Llama 3.2 1B Q40 | 8.44 | 15.31 | 1.8x |
Llama 3.2 3B Q40 | 3.24 | 6.80 | 2.0x 🚀 |
Llama 3 8B Q40 | 2.02 | 3.44 | 1.7x |
TODO
mixtral
model is temporary not supported, it will be fixed in a next release.
0.10.6
0.10.5
0.10.4
0.10.3
0.10.2
This version introduces a new CLI argument: --max-seq-len <n>
. It allows you to reduce the context size and, at the same time, reduce memory consumption. This argument works with the following commands: dllama inference
, dllama chat
, and dllama-api
. You don't need to set it in the worker because the root node will distribute the information to the worker.
Example:
./dllama chat --model ... --nthreads 8 --max-seq-len 1024
0.10.1
0.10.0
This version introduces support for the Llama 3.1 model! 🔥 Additionally, it includes a small improvement that enables you to run the Llama 3.1 8B Q40 on a standard computer with the full context size (131,072 tokens!).
Llama 3.1 8B Q40 on MacBook Pro M1 16GB RAM with full context
The quantized Llama 3.1 8B model to Q40 format requires 6.3 GB GB of RAM. The key-value cache for the full context requires approximately 34 GB of memory (F32). For casual devices, this is definitely too high. That's why this version introduces the --kv-cache-storage disc
argument (Windows is not supported yet). Once set, the key-value cache will be stored on your disk. If you have a fast SSD, the slowdown should be acceptable. This argument works for the dllama inference
, dllama worker
, and dllama-api
commands. An important fact is that the size of the KV cache is split across all nodes in the cluster. So, for example, with 4 nodes, each needs to have ~8.5 GB of memory (RAM or disk) to keep the KV cache.
How to run Llama 3.1 8B
- Download Distributed Llama repository and compile it:
make dllama && make dllama-api
. - Download model
python launch.py llama3_1_8b_instruct_q40
- Run model:
./dllama chat --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
or./dllama-api --model models/llama3_1_8b_instruct_q40/dllama_model_llama3_1_8b_instruct_q40.m --tokenizer models/llama3_1_8b_instruct_q40/dllama_tokenizer_llama3_1_8b_instruct_q40.t --buffer-float-type q80 --kv-cache-storage disc --nthreads 8 --workers 192.168.0.1:9999
If your worker node does not have enough RAM for the KV cache, you can run the worker with the --kv-cache-storage disc
argument.
./dllama worker --port 9999 --kv-cache-storage disc --nthreads 8
TODO
A future version will include the ability to reduce the context size. This should reduce memory consumption when the full context is not needed.
The 0.10.2 version introduced the --max-seq-len <n>
argument.
0.9.2
This version allows to override the chat template. This may be helpful if a model does not have a tokenizer with a chat template.
How to use:
./dllama... --chat-template llama3
./dllama-api ... --chat-template llama3
Supported values:
llama2
llama3
zephyr
chatml