Merge branch 'fix-max-token-and-prompt-remaining'
icppWorld committed Oct 18, 2024
2 parents 7ac41f7 + 83beb29 commit 69d3e1d
Showing 10 changed files with 134 additions and 128 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -12,6 +12,8 @@ secret
models/
.canister_cache
README-WIP.md
Makefile-WIP
main*.log

# Windows
.vs
173 changes: 64 additions & 109 deletions README.md
@@ -1,21 +1,21 @@
[![llama_cpp_canister](https://github.com/onicai/llama_cpp_canister/actions/workflows/cicd-mac.yml/badge.svg)](https://github.com/onicai/llama_cpp_canister/actions/workflows/cicd-mac.yml)

# [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) for the Internet Computer.
# llama.cpp for the Internet Computer.

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)


# [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) for the Internet Computer.
This repo allows you to deploy [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) as a Smart Contract on the Internet Computer.

This repo allows you to deploy llama.cpp as a Smart Contract to the Internet Computer.
# Try it out

You can try out a deployed version at https://icgpt.icpp.world/

# WARNING ⚠️

This repo is under heavy development. 🚧

- An important limitation is that it only works on a `Mac`. (Windows & Linux support is coming soon)
- Only use it if you're brave 💪 and not afraid of digging ⛏️ into the details of C++, ICP & LLMs.
- Things in this README should be mostly correct, though there is no guarantee.
- An important limitation is that you can only build the canister on a `Mac`. (Windows & Linux support is coming soon)
- Everything is moving fast, so refresh your local clone frequently. ⏰
- The canister endpoint APIs are not yet fixed. Expect breaking changes ❗❗❗

@@ -27,11 +27,7 @@ Please join our [OpenChat C++ community](https://oc.app/community/cklkv-3aaaa-aa

# Set up

WARNING: Currently, the canister can only be built on a `mac`!

- VERY IMPORTANT: Use Python 3.11 ❗❗❗

- Install [icpp-pro](https://docs.icpp.world/installation.html), the C++ Canister Development Kit (CDK) for the Internet Computer
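
  A minimal sketch of that install, assuming you do it with pip inside the Python environment described below (see the icpp-pro docs linked above for the authoritative steps):

  ```bash
  # Install the C++ CDK for the Internet Computer (also listed in requirements.txt)
  pip install icpp-pro
  ```
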
WARNING: Currently, the canister can only be built on a `Mac`!

- Clone the repo and its children:

@@ -54,12 +50,15 @@ WARNING: Currently, the canister can only be build on a `mac` !
```
make build-info-cpp-wasm
```
TODO: recipe for Windows.

- Create a Python environment with dependencies installed

❗❗❗ Use Python 3.11 ❗❗❗

_(This is needed for the binaryen.py dependency)_

```bash
# We use MiniConda (Use Python 3.11 ❗❗❗)
# We use MiniConda
conda create --name llama_cpp_canister python=3.11
conda activate llama_cpp_canister

@@ -77,10 +76,7 @@ WARNING: Currently, the canister can only be build on a `mac` !
source "$HOME/.local/share/dfx/env"
```

_(Note 1: On Windows, just install dfx in WSL, and icpp-pro in PowerShell will know where to find it.)_
_(Note 2: It does not yet work on Windows... Stay tuned...)_

- Build & Deploy a pre-trained model to canister `llama_cpp`:
- Build & Deploy the canister `llama_cpp`:

- Compile & link to WebAssembly (wasm):
```bash
@@ -96,41 +92,56 @@ WARNING: Currently, the canister can only be build on a `mac` !
```bash
dfx start --clean
```

- Deploy the wasm to a canister on the local network:
```bash
dfx deploy
# When upgrading the code in the canister, use:
dfx deploy -m upgrade
```

- Check the health endpoint of the `llama_cpp` canister:
```bash
$ dfx canister call llama_cpp health
(variant { Ok = record { status_code = 200 : nat16 } })
```
# Build & Test models
## qwen2.5-0.5b-instruct-q8_0.gguf (676 Mb; ~14 tokens max)
- Upload gguf file
The canister is now up & running, and ready to be loaded with a gguf file. In
this example we use the powerful qwen2.5-0.5b-instruct-q8_0.gguf model.
- Download the model from huggingface: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF
Store it in: `models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf`
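    A sketch of that download step, assuming you have the `huggingface_hub` CLI installed (you can also download the file manually from the model page):
    ```bash
    # Download the gguf file into the expected local path
    huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q8_0.gguf \
      --local-dir models/Qwen/Qwen2.5-0.5B-Instruct-GGUF
    ```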
- Upload the model:
- Upload the gguf file:
```bash
python -m scripts.upload --network local --canister llama_cpp --canister-filename models/qwen2.5-0.5b-instruct-q8_0.gguf models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf
python -m scripts.upload --network local --canister llama_cpp --canister-filename models/model.gguf models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf
```
- Load the model into OP memory (Do once, and note that it is already done by scripts.upload above)
- Only needed after a canister upgrade (`dfx deploy -m upgrade`): re-load the gguf file into Orthogonal Persisted (OP) working memory
This step is already done by scripts.upload above, so you can skip it if you just ran that.
After a canister upgrade, the gguf file in the canister is still there, because it is persisted in
stable memory, but you need to re-load it into Orthogonal Persisted (working) memory, which is erased during a canister upgrade.
```bash
dfx canister call llama_cpp load_model '(record { args = vec {"--model"; "models/qwen2.5-0.5b-instruct-q8_0.gguf";} })'
dfx canister call llama_cpp load_model '(record { args = vec {"--model"; "models/model.gguf";} })'
```
- Set the max_tokens for this model, to avoid hitting the IC's instruction limit
```
dfx canister call llama_cpp set_max_tokens '(record { max_tokens_query = 12 : nat64; max_tokens_update = 12 : nat64 })'
dfx canister call llama_cpp set_max_tokens '(record { max_tokens_query = 10 : nat64; max_tokens_update = 10 : nat64 })'
dfx canister call llama_cpp get_max_tokens
```
- Chat with the LLM
- Ensure the canister is ready for Inference, with the model loaded
```bash
dfx canister call llama_cpp ready
@@ -145,18 +156,21 @@ WARNING: Currently, the canister can only be build on a `mac` !
# Start a new chat - this resets the prompt-cache for this conversation
dfx canister call llama_cpp new_chat '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"} })'
# Repeat this call until the prompt_remaining is empty. KEEP SENDING THE ORIGINAL PROMPT
# Repeat this call until `prompt_remaining` in the response is empty.
# This ingests the prompt into the prompt-cache, using multiple update calls
# Important: KEEP SENDING THE FULL PROMPT
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all"; "-sp"; "-p"; "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n"; "-n"; "512" } })'
...
# Once prompt_remaining is empty, repeat this call, with an empty prompt, until the `generated_eog=true`:
# Once `prompt_remaining` in the response is empty, repeat this call, with an empty prompt, until `generated_eog=true`
# Now the LLM is generating new tokens !
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all"; "-sp"; "-p"; ""; "-n"; "512" } })'
...
# Once generated_eog = true, the LLM is done generating
# Once `generated_eog` in the response is `true`, the LLM is done generating
# this is the output after several update calls and it has reached eog:
# this is the response after several update calls and it has reached eog:
(
variant {
Ok = record {
@@ -170,9 +184,6 @@ WARNING: Currently, the canister can only be build on a `mac` !
},
)
# NOTE: This is the equivalent llama-cli call, when running llama.cpp locally
./llama-cli -m /models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf -sp -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n" -fa -ngl 80 -n 512 --prompt-cache prompt.cache --prompt-cache-all
########################################
# Tip. Add this to the args vec if you #
# want to see how many tokens the #
@@ -184,90 +195,34 @@ WARNING: Currently, the canister can only be build on a `mac` !
```
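  If you prefer to script the repeated `run_update` calls described above, a rough, untested bash sketch of the loop could look like the following; the exact Candid text returned by the canister may differ, so treat the `grep` patterns as assumptions:
  ```bash
  # Keep sending the full prompt until prompt_remaining is empty,
  # then send empty prompts until generated_eog becomes true.
  PROMPT='<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n'
  while true; do
    OUT=$(dfx canister call llama_cpp run_update "(record { args = vec {\"--prompt-cache\"; \"my_cache/prompt.cache\"; \"--prompt-cache-all\"; \"-sp\"; \"-p\"; \"$PROMPT\"; \"-n\"; \"512\" } })")
    echo "$OUT"
    # Once the full prompt has been ingested into the prompt-cache, continue with an empty prompt
    echo "$OUT" | grep -q 'prompt_remaining = ""' && PROMPT=""
    # Stop once the LLM signals end-of-generation
    echo "$OUT" | grep -q 'generated_eog = true' && break
  done
  ```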
- Deployed to mainnet at canister: 6uwoh-vaaaa-aaaag-amema-cai
To be able to upload the model, I had to change the [compute allocation](https://internetcomputer.org/docs/current/developer-docs/smart-contracts/maintain/settings#compute-allocation)
```
# check the settings
dfx canister status --ic llama_cpp
# Set a compute allocation (costs a rental fee)
dfx canister update-settings --ic llama_cpp --compute-allocation 50
```
- Cost of uploading the 676 Mb model, using a compute allocation of 50, was about:
~3 TCycles = $4
- Cost of 10 tokens = 31_868_339_839 cycles = ~0.03 TCycles = ~$0.04
So, 1 token = ~$0.004
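A rough back-of-the-envelope check of those numbers, assuming ~$1.33 per TCycle (1 TCycle ≈ 1 XDR; the rate changes over time):
```bash
# 10 tokens cost 31_868_339_839 cycles
echo "scale=4; 31868339839 / 10^12" | bc   # ~0.032 TCycles for 10 tokens
echo "scale=4; 0.0319 * 1.33" | bc         # ~$0.04 for 10 tokens
echo "scale=4; 0.0424 / 10" | bc           # ~$0.004 per token
```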
---
---
## storiesICP42Mtok4096.gguf (113.0 Mb)
This is a fine-tuned model that generates funny stories about ICP & ckBTC.
The context window for the model is 128 tokens, and that is the maximum length llama.cpp allows for token generation.
The same deployment & test procedures can be used for the really small test models `stories260Ktok512.gguf` & `stories15Mtok4096.gguf`. Those two models are great for fleshing out the deployment, but the LLMs themselves are too small to create comprehensive stories.
- Download the model from huggingface: https://huggingface.co/onicai/llama_cpp_canister_models
Store it in: `models/storiesICP42Mtok4096.gguf`
- Upload the model:
```bash
python -m scripts.upload --network local --canister llama_cpp --canister-filename models/storiesICP42Mtok4096.gguf models/storiesICP42Mtok4096.gguf
Note: The sequence of update calls to the canister is required because the Internet Computer limits
the number of instructions per call. At the moment, only 10 tokens can be generated per call.
This sequence of update calls is equivalent to using the [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
repo directly and running the `llama-cli` locally, with the command:
```
- Load the model into OP memory
This command will load a model into working memory (Orthogonal Persisted):
```bash
dfx canister call llama_cpp load_model '(record { args = vec {"--model"; "models/storiesICP42Mtok4096.gguf";} })'
./llama-cli -m /models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf --prompt-cache prompt.cache --prompt-cache-all -sp -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n" -n 512 -fa -ngl 80
```
In the above command, the `-fa -ngl 80` arguments are only useful on a GPU. We do not use them when calling the IC, because
the canister runs on CPU only.
- Ensure the canister is ready for Inference, with the model loaded
```bash
dfx canister call llama_cpp ready
```
- Chat with the LLM:
- You can download the `main.log` file from the canister with:
```
python -m scripts.download --network local --canister llama_cpp --local-filename main.log main.log
```
```bash
# Start a new chat - this resets the prompt-cache for this conversation
dfx canister call llama_cpp new_chat '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"} })'
## Smoke testing the deployed LLM
# Create 50 tokens from a prompt, with caching
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all";"--samplers"; "top_p"; "--temp"; "0.1"; "--top-p"; "0.9"; "-n"; "50"; "-p"; "Dominic loves writing stories"} })'
You can run a smoketest on the deployed LLM:
# Create another 50 tokens, using the cache - just continue, no new prompt provided
# Repeat until the LLM says it is done
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all";"--samplers"; "top_p"; "--temp"; "0.1"; "--top-p"; "0.9"; "-n"; "50";} })'
- Deploy the Qwen2.5 model as described above
# After a couple of calls, you will get something like this as output, unless you hit the context limit error:
(
variant {
Ok = record {
status_code = 200 : nat16;
output = "";
error = "";
input = " Dominic loves writing stories. He wanted to share his love with others, so he built a fun website on the Internet Computer. With his ckBTC, he bought a cool new book with new characters. Every night before bed, Dominic read his favorite stories with his favorite characters. The end.";
}
},
)
- Run the smoketests for the Qwen2.5 LLM deployed to your local IC network:
########################################
# Tip. Add this to the args vec if you #
# want to see how many tokens the #
# canister can generate before it #
# hits the instruction limit #
# #
# ;"--print-token-count"; "1" #
########################################
```
```
# First test the canister functions, like 'health'
pytest -vv test/test_canister_functions.py
# Then run the inference tests
pytest -vv test/test_qwen2.py
```
2 changes: 1 addition & 1 deletion requirements.txt
@@ -3,6 +3,6 @@

-r scripts/requirements.txt
-r src/llama_cpp_onicai_fork/requirements.txt
icpp-pro==4.2.0
icpp-pro>=4.2.0
ic-py==1.0.1
binaryen.py
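For reference, these Python dependencies are typically installed into the conda environment described in the README with:
```bash
pip install -r requirements.txt
```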
2 changes: 1 addition & 1 deletion scripts/download.py
@@ -77,7 +77,7 @@ def main() -> int:

done = False
offset = 0
with open(local_filename_path, "ab") as f:
with open(local_filename_path, "wb") as f:
while not done:
response = canister_instance.file_download_chunk(
{
4 changes: 2 additions & 2 deletions scripts/qa_deploy_and_pytest.py
@@ -36,13 +36,13 @@ def main() -> int:
tests = [
{
"filename": "models/stories260Ktok512.gguf",
"canister_filename": "models/stories260Ktok512.gguf",
"canister_filename": "models/model.gguf",
"test_path_model": "test/test_tiny_stories.py",
},
# This times out in Github action. Can only be run locally.
# {
# "filename": "models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf", # pylint: disable=line-too-long
# "canister_filename": "models/qwen2.5-0.5b-instruct-q8_0.gguf",
# "canister_filename": "models/model.gguf",
# "test_path_model": "test/test_qwen2.py",
# },
]
2 changes: 2 additions & 0 deletions src/download.cpp
@@ -1,4 +1,5 @@
#include "download.h"
#include "auth.h"
#include "utils.h"

#include <fstream>
@@ -23,6 +24,7 @@ void print_file_download_summary(const std::string &filename,

void file_download_chunk() {
IC_API ic_api(CanisterQuery{std::string(__func__)}, false);
if (!is_caller_a_controller(ic_api)) return;

// Get filename to download and the chunksize
std::string filename{""};
