Releases: b4rtaz/distributed-llama

0.12.0 🚀

12 Feb 23:54
121bc8c

This version brings major changes to the project after a month-long refactor. The restructuring improves the project's organization, making maintenance and future development significantly easier. You can find some details about the refactor in this pull request.

List of changes:

  • ✅ Introduced an abstract neural network model with opcodes that describe the network's behavior (see the sketch after this list)
  • ✅ Completely restructured the project
  • ✅ Added batch processing to support evaluation and prediction #138
  • ✅ Sped up the matmul operation for evaluation (using the SGEMM implementation from llamafile)
  • ✅ Improved the tokenizer
  • ✅ Fixed obvious memory leaks (detected by -fsanitize=address)
  • ✅ ARM and AVX2 optimizations for all opcodes
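
To give a rough idea of what an opcode-based description means (the opcode names, the Op struct, and the tiny loop below are hypothetical illustrations, not the actual types used in Distributed Llama), the forward pass becomes data: a flat list of operations that an executor walks through, instead of hard-coded layer code.

```
#include <cstdio>
#include <vector>

// Hypothetical opcode set: each entry describes one step of the forward pass.
enum class OpCode { RMS_NORM, MATMUL, SILU, ADD };

struct Op {
    OpCode code;
    int input;   // index of the input buffer
    int weight;  // index of the weight tensor (-1 if unused)
    int output;  // index of the output buffer
};

int main() {
    // A fragment of a transformer layer expressed as a flat list of ops.
    std::vector<Op> layer = {
        {OpCode::RMS_NORM, 0, 1, 2},
        {OpCode::MATMUL,   2, 3, 4},
        {OpCode::SILU,     4, -1, 4},
        {OpCode::ADD,      4, 0, 0},
    };
    // An executor (or a worker node) can interpret, schedule, or split this list.
    for (const Op &op : layer)
        std::printf("op=%d in=%d w=%d out=%d\n", (int)op.code, op.input, op.weight, op.output);
    return 0;
}
```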

Fixes:

  • Fixed a bug in RoPE scaling
  • Fixed a bug in the tokenizer that caused special tokens to be tokenized incorrectly

Tokenizer

The most important part of this change was delivering a stable version of Distributed Llama that can be used daily. Until now, the project was in an experimental stage and, due to multiple bugs, was not really usable. This version focuses solely on the Llama 3 model family; other models are not supported at the moment. Llama 2 models may still work, but they are not a priority right now.

The tokenizer for Llama 3 has been improved, and the API and chat mode now function correctly.

Chat Example

To paste the output here, I had to replace "```" with "` ` `".

📄 BosId: 128000 (<|begin_of_text|>)
📄 EosId: 128001 (<|end_of_text|>)
📄 ChatEosId: 128009 (<|end_of_text|>)
📄 RegularVocabSize: 128000
📄 SpecialVocabSize: 256
💡 Arch: Llama
💡 HiddenAct: Silu
💡 Dim: 4096
💡 HiddenDim: 14336
💡 VocabSize: 128256
💡 nLayers: 32
💡 nHeads: 32
💡 nKvHeads: 8
💡 OrigSeqLen: 131072
💡 SeqLen: 4096
💡 NormEpsilon: 0.000010
💡 RopeType: Llama3.1
💡 RopeTheta: 500000
💡 RopeScaling: f=8.0, l=1.0, h=4.0, o=8192
📀 RequiredMemory: 7329040 kB
🧠 CPU: neon dotprod fp16
💿 Weights loaded
⭐ Chat template: llama3
🛑 Stop: <|eot_id|>
💻 System prompt (optional): 

👱 User
> hello

🤖 Assistant
Hello! How can I help you today?
👱 User
> what is 1+1?

🤖 Assistant
 The answer is:

2!
👱 User
> could you implement a simple application that calculates a sum provided by a user? c++ please. the input is stdin.

🤖 Assistant
Here's a simple C++ program that calculates the sum of two numbers provided by the user:

` ` `
#include <iostream>

int main() {
    int num1, num2, sum;

    // Get input from user
    std::cout << "Enter the first number: ";
    std::cin >> num1;

    std::cout << "Enter the second number: ";
    std::cin >> num2;

    // Calculate sum
    sum = num1 + num2;

    // Print result
    std::cout << "The sum is: " << sum << std::endl;

    return 0;
}
` ` `

This program uses the `std::cin` function to read input from the user, and the `std::cout` function to print the result.

If you want to make the program more robust, you can add error handling to ensure that the user enters valid input. For example, you can use a loop to keep asking for input until the user enters a valid number.

Here's an example of how you could add error handling:

` ` `
#include <iostream>

int main() {
    int num1, num2, sum;

    // Get input from user
    while (!(std::cin >> num1)) {
        std::cout << "Invalid input. Please enter a valid number: ";
        std::cin.clear();
        std::cin.ignore(10000, '\n');
    }

    while (!(std::cin >> num2)) {
        std::cout << "Invalid input. Please enter a valid number: ";
        std::cin.clear();
        std::cin.ignore(10000, '\n');
    }

    // Calculate sum
    sum = num1 + num2;

    // Print result
    std::cout << "The sum is: " << sum << std::endl;

    return 0;
}
` ` `

In this version of the program, the `while` loop keeps asking for input until the user enters a valid number. The `std::cin.clear()` and `std::cin.ignore(10000, '\n')` statements are used to clear the input buffer and ignore any remaining input.
👱 User
> 

🚨 You need to re-download the models and tokenizers using launch.py.

Performance

This version introduces significant performance improvements on the CPU. All operations are optimized for NEON and AVX2. The most important change is that inference is now split into evaluation (processing the prompt tokens in batches) and prediction (generating tokens one at a time). Evaluation is much faster than prediction thanks to the SGEMM operation.
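
To make the distinction concrete (a simplified float sketch, not the project's actual quantized kernels): during evaluation all prompt tokens can be pushed through a weight matrix in one matrix-matrix product, which is exactly the shape SGEMM-style kernels are good at, while prediction handles a single token per step and degenerates to a matrix-vector product.

```
#include <cstddef>
#include <cstdio>
#include <vector>

// Simplified float reference, not the project's quantized SGEMM kernel.
// Evaluation: all `batch` prompt tokens are multiplied against the weight
// matrix at once (a matrix-matrix product, where SGEMM-style kernels shine).
void evalMatmul(const float *w, const float *x, float *y,
                std::size_t d, std::size_t n, std::size_t batch) {
    for (std::size_t b = 0; b < batch; b++)
        for (std::size_t i = 0; i < d; i++) {
            float sum = 0.0f;
            for (std::size_t j = 0; j < n; j++)
                sum += w[i * n + j] * x[b * n + j];
            y[b * d + i] = sum;
        }
}

// Prediction: one new token per step, which degenerates to a matrix-vector
// product and is bound by memory bandwidth rather than compute.
void predictMatmul(const float *w, const float *x, float *y,
                   std::size_t d, std::size_t n) {
    evalMatmul(w, x, y, d, n, 1);
}

int main() {
    const std::size_t d = 4, n = 4, batch = 3;
    std::vector<float> w(d * n, 0.5f), x(batch * n, 1.0f), y(batch * d);
    evalMatmul(w.data(), x.data(), y.data(), d, n, batch);  // prompt evaluation
    predictMatmul(w.data(), x.data(), y.data(), d, n);      // single-token prediction
    std::printf("y[0] = %f\n", y[0]);
    return 0;
}
```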

Llama 3.1 8B Q40, MacBook M1 Pro 16 GB RAM   Evaluation       Prediction
Distributed Llama 0.11.2                     -                10.28 tok/s
Distributed Llama 0.12.0                     48.00 tok/s 🚀   19.70 tok/s

On the Raspberry Pi 5 8GB, this version is not as fast as llama.cpp in evaluation, but it is slightly faster in prediction.

Llama 3.1 8B Q40, Raspberry Pi 5 8GB   Evaluation    Prediction
llama.cpp 4667                         12.52 tok/s   2.03 tok/s
Distributed Llama 0.12.0               6.70 tok/s    2.47 tok/s 🚀
Llama 3.1 8B Q40 - llama.cpp
build: 4667 (d2fe216f) with cc (Debian 12.2.0-14) 12.2.0 for aarch64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 30 key-value pairs and 292 tensors from ../../../meta-llama-3.1-8b-instruct-q4_0.gguf?download=true (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 2
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:             llama.rope.scaling.attn_factor f32              = 1.000000
llama_model_loader: - kv  21:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  22:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  23:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  24:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  25:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  26:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  27:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  28:                    tokenizer.chat_template str              = {% set loop_messages = messages %}{% ...
llama_model_loader: - kv  29:               general.quantization_version u32              = 2
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type  f16:    2 tensors
llama_model_loader: - type q4_0:  224 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_0
print_info: file size   = 5.61 GiB (6.01 BPW) 
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch             = llama
print_info: vocab_only       = 0
print_info: n_ctx_train      = 131072
print_info: n_embd           = 4096
print_info: n_layer          = 32
print_info: n_head           = 32
print_info: n_head_kv        = 8
print_info: n_rot            = 128
print_info: n_swa            = 0
print_info: n_embd_head_k    = 128
print_info: n_embd_head_v    = 128
print_info: n_gqa            = 4
print_info: n_embd_k_gqa     = 1024
print_info: n_embd_v_gqa     = 1024
print_info: f_norm_eps       = 0.0e+00
print_info: f_norm_rms_eps   = 1.0e-05
print_info: f_clamp_kqv      = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale    = 0.0e+00
print_info: n_ff             = 14336
print_info: n_expert         = 0
print_info: n_expert_used    = 0
print_info: ca...

0.11.2

06 Feb 15:36
9f04161

This version fixes the issue with dllama-api incorrectly reading HTTP requests #153. Thanks @jkeegan!

0.11.1

09 Dec 22:22
0975af8
  • This version disables CPU pinning #141
  • This version introduces help/usage information accessible via the command line #143 (thanks @jkeegan!)

0.11.0 🚀

21 Nov 18:59
8b1cf89

This update introduces a significant speed improvement 🚀 in inference for clusters with 2 or more nodes.

Key changes:

  • All nodes in the Distributed Llama cluster are now interconnected using a mesh topology. Previously, a star topology was used.
  • Now, every layer is distributed across all nodes, including the last layer, which previously caused a major bottleneck.
  • Norm layers are now calculated redundantly on all nodes. While redundant, this step is very fast and does not impact performance significantly (see the sketch after this list).
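
For intuition about why the redundant norm is cheap (a purely illustrative float sketch, not the project's actual kernel): an RMS norm over a hidden vector of size d costs O(d) operations per token, while each matmul costs O(d²), so recomputing the norm on every node adds negligible extra work.

```
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

// Illustrative RMS norm in plain float (the normalization used by Llama-style
// models). For hidden size d it does O(d) work per token, versus O(d^2) for a
// matmul, which is why computing it redundantly on every node is cheap.
void rmsNorm(float *out, const float *x, const float *weight, std::size_t d, float eps) {
    float ss = 0.0f;
    for (std::size_t i = 0; i < d; i++) ss += x[i] * x[i];
    const float scale = 1.0f / std::sqrt(ss / (float)d + eps);
    for (std::size_t i = 0; i < d; i++) out[i] = weight[i] * (x[i] * scale);
}

int main() {
    const std::size_t d = 8;
    std::vector<float> x(d, 1.0f), w(d, 1.0f), y(d);
    rmsNorm(y.data(), x.data(), w.data(), d, 1e-5f);
    std::printf("y[0] = %f\n", y[0]);  // ~1.0, since the input is already unit RMS
    return 0;
}
```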

Measurement

4 x Raspberry Pi 5 8GB

Model              Token/s - 0.10.6   Token/s - this version   Acceleration
Llama 3.2 1B Q40   9.90               21.42                    2.1x
Llama 3.2 3B Q40   3.47               9.01                     2.6x 🚀
Llama 3 8B Q40     2.83               4.67                     1.6x

2 x Raspberry Pi 5 8GB

Model              Tok/s - 0.10.6   Tok/s - this version   Acceleration
Llama 3.2 1B Q40   8.44             15.31                  1.8x
Llama 3.2 3B Q40   3.24             6.80                   2.0x 🚀
Llama 3 8B Q40     2.02             3.44                   1.7x

Test details

TODO

  • The Mixtral model is temporarily not supported; it will be fixed in a future release.

0.10.6

17 Nov 13:41
6599db2

This version fixes a bug in the rms function for processors with AVX2 instructions #137.

0.10.5

11 Nov 23:12
c09173b

This version fixes a bug related to releasing memory #134.

0.10.4

13 Oct 14:22

This version adds two new models to the launch.py script:

  • Llama 3.2 1B Instruct Q40
  • Llama 3.2 3B Instruct Q40

0.10.3

10 Aug 22:27
3353d56

This version refactors the code to reduce the use of the writeMany and readMany methods.

0.10.2

29 Jul 12:23
71135e6

This version introduces a new CLI argument: --max-seq-len <n>. It allows you to reduce the context size and, at the same time, reduce memory consumption. This argument works with the following commands: dllama inference, dllama chat, and dllama-api. You don't need to set it for a worker, because the root node distributes this setting to the workers.

Example:

./dllama chat --model ... --nthreads 8 --max-seq-len 1024

0.10.1

28 Jul 14:29

Implemented a fallback for the matmulQ40vQ80 operation. Distributed Llama now supports all CPU architectures, with optimizations specifically for ARM and AVX2 CPUs.
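
As a rough illustration of what such a fallback looks like (a simplified sketch only — the block layouts, field names, and 32-value block size below are assumptions in the style of common Q4/Q8 formats, not the project's actual Q40/Q80 definitions), a portable scalar path unpacks the quantized weights block by block and accumulates the dot product with plain loops, so it compiles on any CPU without NEON or AVX2 intrinsics:

```
#include <cstddef>
#include <cstdint>
#include <cstdio>

// Assumed, simplified block layouts (32 weights per block, float scales).
// The real Q40/Q80 structures in Distributed Llama may differ.
constexpr std::size_t BLOCK = 32;

struct BlockQ40 {                      // 4-bit weights, two packed per byte
    float scale;
    std::uint8_t packed[BLOCK / 2];
};

struct BlockQ80 {                      // 8-bit values
    float scale;
    std::int8_t values[BLOCK];
};

// Portable scalar fallback: dot product of n quantized weights (n % 32 == 0)
// against n quantized activations, with no NEON/AVX2 intrinsics.
float dotQ40Q80(const BlockQ40 *w, const BlockQ80 *x, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t b = 0; b < n / BLOCK; b++) {
        std::int32_t acc = 0;
        for (std::size_t i = 0; i < BLOCK / 2; i++) {
            // Unpack two 4-bit weights stored with an offset of 8.
            const std::int32_t w0 = (std::int32_t)(w[b].packed[i] & 0x0F) - 8;
            const std::int32_t w1 = (std::int32_t)(w[b].packed[i] >> 4) - 8;
            acc += w0 * x[b].values[i];
            acc += w1 * x[b].values[i + BLOCK / 2];
        }
        sum += (float)acc * w[b].scale * x[b].scale;
    }
    return sum;
}

int main() {
    // One block: every 4-bit weight decodes to 1, every 8-bit value is 2.
    BlockQ40 w{0.1f, {}};
    BlockQ80 x{0.05f, {}};
    for (std::size_t i = 0; i < BLOCK / 2; i++) w.packed[i] = 0x99;
    for (std::size_t i = 0; i < BLOCK; i++) x.values[i] = 2;
    std::printf("dot = %f\n", dotQ40Q80(&w, &x, BLOCK));  // 32 * 1 * 2 * 0.1 * 0.05 = 0.32
    return 0;
}
```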