Releases: b4rtaz/distributed-llama

0.12.0 🚀

12 Feb 23:54
121bc8c

This version brings major changes to the project after a month-long refactor. The restructuring improves the project's organization, making maintenance and future development significantly easier. You can find some details about the refactor in this pull request.

List of changes:

  • ✅ Introduced an abstract neural network model with opcodes that describe the network's behavior (see the sketch after this list)
  • ✅ Completely restructured the project
  • ✅ Added batch processing to support evaluation and prediction #138
  • ✅ Sped up the matmul operation for evaluation (using the SGEMM implementation from llamafile)
  • ✅ Improved the tokenizer
  • ✅ Fixed obvious memory leaks (detected by -fsanitize=address)
  • ✅ ARM and AVX2 optimizations for all opcodes
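
To give a rough idea of what an opcode-based description means (the opcode names, the Op struct, and the tiny loop below are hypothetical illustrations, not the actual types used in Distributed Llama), the forward pass becomes data: a flat list of operations that an executor walks through, instead of hard-coded layer code.

```
#include <cstdio>
#include <vector>

// Hypothetical opcode set: each entry describes one step of the forward pass.
enum class OpCode { RMS_NORM, MATMUL, SILU, ADD };

struct Op {
    OpCode code;
    int input;   // index of the input buffer
    int weight;  // index of the weight tensor (-1 if unused)
    int output;  // index of the output buffer
};

int main() {
    // A fragment of a transformer layer expressed as a flat list of ops.
    std::vector<Op> layer = {
        {OpCode::RMS_NORM, 0, 1, 2},
        {OpCode::MATMUL,   2, 3, 4},
        {OpCode::SILU,     4, -1, 4},
        {OpCode::ADD,      4, 0, 0},
    };
    // An executor (or a worker node) can interpret, schedule, or split this list.
    for (const Op &op : layer)
        std::printf("op=%d in=%d w=%d out=%d\n", (int)op.code, op.input, op.weight, op.output);
    return 0;
}
```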

Fixes:

  • Fixed a bug in RoPE scaling
  • Fixed a bug in the tokenizer that caused special tokens to be tokenized incorrectly

Tokenizer

The most important part of this change was delivering a stable version of Distributed Llama that can be used daily. Until now, the project was in an experimental stage and, due to multiple bugs, was not really usable. This version focuses solely on the Llama 3 model family; other models are not supported at the moment. Llama 2 models may still work, but they are not a priority right now.

The tokenizer for Llama 3 has been improved, and the API and chat mode now function correctly.

Chat Example

To paste the output here, I had to replace "```" with "` ` `".

📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128001 (<|end_of_text|>)
📄 ChatEosId: 128009 (<|end_of_text|>)
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 4096
💡 HiddenDim: 14336
💡 VocabSize: 128256
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 4096
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 7329040 kB
🧠 CPU: neon dotprod fp16
💿 Weights loaded
⭐ Chat template: llama3
🛑 Stop: <|eot_id|>
💻 System prompt (optional): 

👱 User
> hello

🤖 Assistant
Hello! How can I help you today?
👱 User
> what is 1+1?

🤖 Assistant
 The answer is:

2!
👱 User
> could you implement a simple application that calculates a sum provided by a user? c++ please. the input is stdin.

🤖 Assistant
Here's a simple C++ program that calculates the sum of two numbers provided by the user:

` ` `
#include <iostream>

int main() {
    int num1, num2, sum;

    // Get input from user
    std::cout << "Enter the first number: ";
    std::cin >> num1;

    std::cout << "Enter the second number: ";
    std::cin >> num2;

    // Calculate sum
    sum = num1 + num2;

    // Print result
    std::cout << "The sum is: " << sum << std::endl;

    return 0;
}
` ` `

This program uses the `std::cin` function to read input from the user, and the `std::cout` function to print the result.

If you want to make the program more robust, you can add error handling to ensure that the user enters valid input. For example, you can use a loop to keep asking for input until the user enters a valid number.

Here's an example of how you could add error handling:

` ` `
#include <iostream>

int main() {
    int num1, num2, sum;

    // Get input from user
    while (!(std::cin >> num1)) {
        std::cout << "Invalid input. Please enter a valid number: ";
        std::cin.clear();
        std::cin.ignore(10000, '\n');
    }

    while (!(std::cin >> num2)) {
        std::cout << "Invalid input. Please enter a valid number: ";
        std::cin.clear();
        std::cin.ignore(10000, '\n');
    }

    // Calculate sum
    sum = num1 + num2;

    // Print result
    std::cout << "The sum is: " << sum << std::endl;

    return 0;
}
` ` `

In this version of the program, the `while` loop keeps asking for input until the user enters a valid number. The `std::cin.clear()` and `std::cin.ignore(10000, '\n')` statements are used to clear the input buffer and ignore any remaining input.
👱 User
> 

🚨 You need to re-download the models and tokenizers using launch.py.

Performance

This version introduces significant performance improvements on the CPU. All operations are optimized for NEON and AVX2. The most important change is that inference is now split into evaluation (processing the prompt tokens in batches) and prediction (generating tokens one at a time). Evaluation is much faster than prediction thanks to the SGEMM operation.
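
To make the distinction concrete (a simplified float sketch, not the project's actual quantized kernels): during evaluation all prompt tokens can be pushed through a weight matrix in one matrix-matrix product, which is exactly the shape SGEMM-style kernels are good at, while prediction handles a single token per step and degenerates to a matrix-vector product.

```
#include <cstddef>
#include <cstdio>
#include <vector>

// Simplified float reference, not the project's quantized SGEMM kernel.
// Evaluation: all `batch` prompt tokens are multiplied against the weight
// matrix at once (a matrix-matrix product, where SGEMM-style kernels shine).
void evalMatmul(const float *w, const float *x, float *y,
                std::size_t d, std::size_t n, std::size_t batch) {
    for (std::size_t b = 0; b < batch; b++)
        for (std::size_t i = 0; i < d; i++) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < n; j++)
                sum += w[i * n + j] * x[b * n + j];
            y[b * d + i] = sum;
        }
}

// Prediction: one new token per step, which degenerates to a matrix-vector
// product and is bound by memory bandwidth rather than compute.
void predictMatmul(const float *w, const float *x, float *y,
                   std::size_t d, std::size_t n) {
    evalMatmul(w, x, y, d, n, 1);
}

int main() {
    const std::size_t d = 4, n = 4, batch = 3;
    std::vector<float> w(d * n, 0.5f), x(batch * n, 1.0f), y(batch * d);
    evalMatmul(w.data(), x.data(), y.data(), d, n, batch);  // prompt evaluation
    predictMatmul(w.data(), x.data(), y.data(), d, n);      // single-token prediction
    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```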

Llama 3.1 8B Q40, MacBook M1 Pro 16 GB RAM   Evaluation       Prediction
Distributed Llama 0.11.2                     -                10.28 tok/s
Distributed Llama 0.12.0                     48.00 tok/s 🚀   19.70 tok/s

On the Raspberry Pi 5 8GB, this version is not as fast as llama.cpp in evaluation, but it is slightly faster in prediction.

Llama 3.1 8B Q40, Raspberry Pi 5 8GB   Evaluation    Prediction
llama.cpp 4667                         12.52 tok/s   2.03 tok/s
Distributed Llama 0.12.0               6.70 tok/s    2.47 tok/s 🚀
Llama 3.1 8B Q40 - llama.cpp
build: 4667 (d2fe216f) with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from ../../../meta-llama-3.1-8b-instruct-q4_0.gguf?download=true (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type q4_0:  224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 5.61 GiB (6.01 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: ca...

0.11.2

06 Feb 15:36
9f04161

This version fixes the issue with dllama-api incorrectly reading HTTP requests #153. Thanks @jkeegan!

0.11.1

09 Dec 22:22
0975af8
  • This version disables CPU pinning #141
  • This version introduces help/usage information accessible via the command line #143 (thanks @jkeegan!)

0.11.0 🚀

21 Nov 18:59
8b1cf89

This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.

Key changes:

  • All nodes in the Distributed Llama cluster are now interconnected using a mesh topology. Previously, a star topology was used.
  • Now, every layer is distributed across all nodes, including the last layer, which previously caused a major bottleneck.
  • Norm layers are now calculated redundantly on all nodes. While redundant, this step is very fast and does not impact performance significantly (see the sketch after this list).
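
For intuition about why the redundant norm is cheap (a purely illustrative float sketch, not the project's actual kernel): an RMS norm over a hidden vector of size d costs O(d) operations per token, while each matmul costs O(d²), so recomputing the norm on every node adds negligible extra work.

```
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative RMS norm in plain float (the normalization used by Llama-style
// models). For hidden size d it does O(d) work per token, versus O(d^2) for a
// matmul, which is why computing it redundantly on every node is cheap.
void rmsNorm(float *out, const float *x, const float *weight, std::size_t d, float eps) {
    float ss = 0.0f;
    for (std::size_t i = 0; i < d; i++) ss += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(ss / (float)d + eps);
    for (std::size_t i = 0; i < d; i++) out[i] = weight[i] * (x[i] * scale);
}

int main() {
    const std::size_t d = 8;
    std::vector<float> x(d, 1.0f), w(d, 1.0f), y(d);
    rmsNorm(y.data(), x.data(), w.data(), d, 1e-5f);
    std::printf("y[0] = %f\n", y[0]);  // ~1.0, since the input is already unit RMS
    return 0;
}
```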

Measurement

4 x Raspberry Pi 5 8GB

Model              Token/s - 0.10.6   Token/s - this version   Acceleration
Llama 3.2 1B Q40   9.90               21.42                    2.1x
Llama 3.2 3B Q40   3.47               9.01                     2.6x 🚀
Llama 3 8B Q40     2.83               4.67                     1.6x

2 x Raspberry Pi 5 8GB

Model              Tok/s - 0.10.6   Tok/s - this version   Acceleration
Llama 3.2 1B Q40   8.44             15.31                  1.8x
Llama 3.2 3B Q40   3.24             6.80                   2.0x 🚀
Llama 3 8B Q40     2.02             3.44                   1.7x

Test details

TODO

  • The Mixtral model is temporarily not supported; it will be fixed in a future release.

0.10.6

17 Nov 13:41
6599db2

This version fixes a bug in the rms function for processors with AVX2 instructions #137.

0.10.5

11 Nov 23:12
c09173b

This version fixes a bug related to releasing memory #134.

0.10.4

13 Oct 14:22

This version adds two new models to the launch.py script:

  • Llama 3.2 1B Instruct Q40
  • Llama 3.2 3B Instruct Q40

0.10.3

10 Aug 22:27
3353d56

This version refactors the code to reduce the use of the writeMany and readMany methods.

0.10.2

29 Jul 12:23
71135e6

This version introduces a new CLI argument: --max-seq-len <n>. It allows you to reduce the context size and, at the same time, reduce memory consumption. This argument works with the following commands: dllama inference, dllama chat, and dllama-api. You don't need to set it for a worker, because the root node distributes this setting to the workers.

Example:

./dllama chat --model ... --nthreads 8 --max-seq-len 1024

0.10.1

28 Jul 14:29

Implemented a fallback for the matmulQ40vQ80 operation. Distributed Llama now supports all CPU architectures, with optimizations specifically for ARM and AVX2 CPUs.
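
As a rough illustration of what such a fallback looks like (a simplified sketch only — the block layouts, field names, and 32-value block size below are assumptions in the style of common Q4/Q8 formats, not the project's actual Q40/Q80 definitions), a portable scalar path unpacks the quantized weights block by block and accumulates the dot product with plain loops, so it compiles on any CPU without NEON or AVX2 intrinsics:

```
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed, simplified block layouts (32 weights per block, float scales).
// The real Q40/Q80 structures in Distributed Llama may differ.
constexpr std::size_t BLOCK = 32;

struct BlockQ40 {                      // 4-bit weights, two packed per byte
    float scale;
    std::uint8_t packed[BLOCK / 2];
};

struct BlockQ80 {                      // 8-bit values
    float scale;
    std::int8_t values[BLOCK];
};

// Portable scalar fallback: dot product of n quantized weights (n % 32 == 0)
// against n quantized activations, with no NEON/AVX2 intrinsics.
float dotQ40Q80(const BlockQ40 *w, const BlockQ80 *x, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t b = 0; b < n / BLOCK; b++) {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < BLOCK / 2; i++) {
            // Unpack two 4-bit weights stored with an offset of 8.
            const std::int32_t w0 = (std::int32_t)(w[b].packed[i] & 0x0F) - 8;
            const std::int32_t w1 = (std::int32_t)(w[b].packed[i] >> 4) - 8;
            acc += w0 * x[b].values[i];
            acc += w1 * x[b].values[i + BLOCK / 2];
        }
        sum += (float)acc * w[b].scale * x[b].scale;
    }
    return sum;
}

int main() {
    // One block: every 4-bit weight decodes to 1, every 8-bit value is 2.
    BlockQ40 w{0.1f, {}};
    BlockQ80 x{0.05f, {}};
    for (std::size_t i = 0; i < BLOCK / 2; i++) w.packed[i] = 0x99;
    for (std::size_t i = 0; i < BLOCK; i++) x.values[i] = 2;
    std::printf("dot = %f\n", dotQ40Q80(&w, &x, BLOCK));  // 32 * 1 * 2 * 0.1 * 0.05 = 0.32
    return 0;
}
```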