Merge branch 'fix-max-token-and-prompt-remaining'
icppWorld committed Oct 18, 2024
2 parents 7ac41f7 + 83beb29 commit 69d3e1d
Showing 10 changed files with 134 additions and 128 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -12,6 +12,8 @@ secret
models/
.canister_cache
README-WIP.md
Makefile-WIP
main*.log

# Windows
.vs
173 changes: 64 additions & 109 deletions README.md
@@ -1,21 +1,21 @@
[![llama_cpp_canister](https://github.com/onicai/llama_cpp_canister/actions/workflows/cicd-mac.yml/badge.svg)](https://github.com/onicai/llama_cpp_canister/actions/workflows/cicd-mac.yml)

# [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) for the Internet Computer.
# llama.cpp for the Internet Computer.

![llama](https://user-images.githubusercontent.com/1991296/230134379-7181e485-c521-4d23-a0d6-f7b3b61ba524.png)


# [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) for the Internet Computer.
This repo allows you to deploy [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) as a Smart Contract on the Internet Computer.

This repo allows you to deploy llama.cpp as a Smart Contract to the Internet Computer.
# Try it out

You can try out a deployed version at https://icgpt.icpp.world/

# WARNING ⚠️

This repo is under heavy development. 🚧

- An important limitation is that it only works on a `Mac`. (Windows & Linux support is coming soon)
- Only use it if you're brave 💪 and not afraid of digging ⛏️ into the details of C++, ICP & LLMs.
- Things in this README should be mostly correct, though there is no guarantee.
- An important limitation is that you can only build the canister on a `Mac`. (Windows & Linux support is coming soon)
- Everything is moving fast, so refresh your local clone frequently. ⏰
- The canister endpoint APIs are not yet fixed. Expect breaking changes ❗❗❗

@@ -27,11 +27,7 @@ Please join our [OpenChat C++ community](https://oc.app/community/cklkv-3aaaa-aa

# Set up

WARNING: Currently, the canister can only be built on a `mac`!

- VERY IMPORTANT: Use Python 3.11 ❗❗❗

- Install [icpp-pro](https://docs.icpp.world/installation.html), the C++ Canister Development Kit (CDK) for the Internet Computer
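
  A minimal sketch of that install, assuming you do it with pip inside the Python environment described below (see the icpp-pro docs linked above for the authoritative steps):

  ```bash
  # Install the C++ CDK for the Internet Computer (also listed in requirements.txt)
  pip install icpp-pro
  ```
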
WARNING: Currently, the canister can only be built on a `Mac`!

- Clone the repo and its children:

@@ -54,12 +50,15 @@ WARNING: Currently, the canister can only be build on a `mac` !
```
make build-info-cpp-wasm
```
TODO: recipe for Windows.

- Create a Python environment with dependencies installed

❗❗❗ Use Python 3.11 ❗❗❗

_(This is needed for the binaryen.py dependency)_

```bash
# We use MiniConda (Use Python 3.11 ❗❗❗)
# We use MiniConda
conda create --name llama_cpp_canister python=3.11
conda activate llama_cpp_canister

@@ -77,10 +76,7 @@ WARNING: Currently, the canister can only be build on a `mac` !
source "$HOME/.local/share/dfx/env"
```

_(Note 1: On Windows, just install dfx in WSL, and icpp-pro in PowerShell will know where to find it.)_
_(Note 2: It does not yet work on Windows... Stay tuned...)_

- Build & Deploy a pre-trained model to canister `llama_cpp`:
- Build & Deploy the canister `llama_cpp`:

- Compile & link to WebAssembly (wasm):
```bash
@@ -96,41 +92,56 @@ WARNING: Currently, the canister can only be build on a `mac` !
```bash
dfx start --clean
```

- Deploy the wasm to a canister on the local network:
```bash
dfx deploy
# When upgrading the code in the canister, use:
dfx deploy -m upgrade
```

- Check the health endpoint of the `llama_cpp` canister:
```bash
$ dfx canister call llama_cpp health
(variant { Ok = record { status_code = 200 : nat16 } })
```
# Build & Test models
## qwen2.5-0.5b-instruct-q8_0.gguf (676 Mb; ~14 tokens max)
- Upload gguf file
The canister is now up & running, and ready to be loaded with a gguf file. In
this example we use the powerful qwen2.5-0.5b-instruct-q8_0.gguf model.
- Download the model from huggingface: https://huggingface.co/Qwen/Qwen2.5-0.5B-Instruct-GGUF
Store it in: `models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf`
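    A sketch of that download step, assuming you have the `huggingface_hub` CLI installed (you can also download the file manually from the model page):
    ```bash
    # Download the gguf file into the expected local path
    huggingface-cli download Qwen/Qwen2.5-0.5B-Instruct-GGUF qwen2.5-0.5b-instruct-q8_0.gguf \
      --local-dir models/Qwen/Qwen2.5-0.5B-Instruct-GGUF
    ```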
- Upload the model:
- Upload the gguf file:
```bash
python -m scripts.upload --network local --canister llama_cpp --canister-filename models/qwen2.5-0.5b-instruct-q8_0.gguf models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf
python -m scripts.upload --network local --canister llama_cpp --canister-filename models/model.gguf models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf
```
- Load the model into OP memory (Do once, and note that it is already done by scripts.upload above)
- Only needed after a canister upgrade (`dfx deploy -m upgrade`): re-load the gguf file into Orthogonal Persisted (OP) working memory
This step is already done by scripts.upload above, so you can skip it if you just ran that.
After a canister upgrade, the gguf file in the canister is still there, because it is persisted in
stable memory, but you need to re-load it into Orthogonal Persisted (working) memory, which is erased during a canister upgrade.
```bash
dfx canister call llama_cpp load_model '(record { args = vec {"--model"; "models/qwen2.5-0.5b-instruct-q8_0.gguf";} })'
dfx canister call llama_cpp load_model '(record { args = vec {"--model"; "models/model.gguf";} })'
```
- Set the max_tokens for this model, to avoid hitting the IC's instruction limit
```
dfx canister call llama_cpp set_max_tokens '(record { max_tokens_query = 12 : nat64; max_tokens_update = 12 : nat64 })'
dfx canister call llama_cpp set_max_tokens '(record { max_tokens_query = 10 : nat64; max_tokens_update = 10 : nat64 })'
dfx canister call llama_cpp get_max_tokens
```
- Chat with the LLM
- Ensure the canister is ready for Inference, with the model loaded
```bash
dfx canister call llama_cpp ready
@@ -145,18 +156,21 @@ WARNING: Currently, the canister can only be build on a `mac` !
# Start a new chat - this resets the prompt-cache for this conversation
dfx canister call llama_cpp new_chat '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"} })'
# Repeat this call until the prompt_remaining is empty. KEEP SENDING THE ORIGINAL PROMPT
# Repeat this call until `prompt_remaining` in the response is empty.
# This ingests the prompt into the prompt-cache, using multiple update calls
# Important: KEEP SENDING THE FULL PROMPT
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all"; "-sp"; "-p"; "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n"; "-n"; "512" } })'
...
# Once prompt_remaining is empty, repeat this call, with an empty prompt, until the `generated_eog=true`:
# Once `prompt_remaining` in the response is empty, repeat this call, with an empty prompt, until `generated_eog=true`
# Now the LLM is generating new tokens !
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all"; "-sp"; "-p"; ""; "-n"; "512" } })'
...
# Once generated_eog = true, the LLM is done generating
# Once `generated_eog` in the response is `true`, the LLM is done generating
# this is the output after several update calls and it has reached eog:
# this is the response after several update calls and it has reached eog:
(
variant {
Ok = record {
@@ -170,9 +184,6 @@ WARNING: Currently, the canister can only be build on a `mac` !
},
)
# NOTE: This is the equivalent llama-cli call, when running llama.cpp locally
./llama-cli -m /models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf -sp -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n" -fa -ngl 80 -n 512 --prompt-cache prompt.cache --prompt-cache-all
########################################
# Tip. Add this to the args vec if you #
# want to see how many tokens the #
@@ -184,90 +195,34 @@ WARNING: Currently, the canister can only be build on a `mac` !
```
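  If you prefer to script the repeated `run_update` calls described above, a rough, untested bash sketch of the loop could look like the following; the exact Candid text returned by the canister may differ, so treat the `grep` patterns as assumptions:
  ```bash
  # Keep sending the full prompt until prompt_remaining is empty,
  # then send empty prompts until generated_eog becomes true.
  PROMPT='<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n'
  while true; do
    OUT=$(dfx canister call llama_cpp run_update "(record { args = vec {\"--prompt-cache\"; \"my_cache/prompt.cache\"; \"--prompt-cache-all\"; \"-sp\"; \"-p\"; \"$PROMPT\"; \"-n\"; \"512\" } })")
    echo "$OUT"
    # Once the full prompt has been ingested into the prompt-cache, continue with an empty prompt
    echo "$OUT" | grep -q 'prompt_remaining = ""' && PROMPT=""
    # Stop once the LLM signals end-of-generation
    echo "$OUT" | grep -q 'generated_eog = true' && break
  done
  ```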
- Deployed to mainnet at canister: 6uwoh-vaaaa-aaaag-amema-cai
To be able to upload the model, I had to change the [compute allocation](https://internetcomputer.org/docs/current/developer-docs/smart-contracts/maintain/settings#compute-allocation)
```
# check the settings
dfx canister status --ic llama_cpp
# Set a compute allocation (costs a rental fee)
dfx canister update-settings --ic llama_cpp --compute-allocation 50
```
- Cost of uploading the 676 Mb model, using a compute allocation of 50, was about:
~3 TCycles = $4
- Cost of 10 tokens = 31_868_339_839 cycles = ~0.03 TCycles = ~$0.04
So, 1 token = ~$0.004
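A rough back-of-the-envelope check of those numbers, assuming ~$1.33 per TCycle (1 TCycle ≈ 1 XDR; the rate changes over time):
```bash
# 10 tokens cost 31_868_339_839 cycles
echo "scale=4; 31868339839 / 10^12" | bc   # ~0.032 TCycles for 10 tokens
echo "scale=4; 0.0319 * 1.33" | bc         # ~$0.04 for 10 tokens
echo "scale=4; 0.0424 / 10" | bc           # ~$0.004 per token
```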
---
---
## storiesICP42Mtok4096.gguf (113.0 Mb)
This is a fine-tuned model that generates funny stories about ICP & ckBTC.
The context window for the model is 128 tokens, and that is the maximum length llama.cpp allows for token generation.
The same deployment & test procedures can be used for the really small test models `stories260Ktok512.gguf` & `stories15Mtok4096.gguf`. Those two models are great for fleshing out the deployment, but the LLMs themselves are too small to create comprehensive stories.
- Download the model from huggingface: https://huggingface.co/onicai/llama_cpp_canister_models
Store it in: `models/storiesICP42Mtok4096.gguf`
- Upload the model:
```bash
python -m scripts.upload --network local --canister llama_cpp --canister-filename models/storiesICP42Mtok4096.gguf models/storiesICP42Mtok4096.gguf
Note: The sequence of update calls to the canister is required because the Internet Computer limits
the number of instructions per call. At the moment, only 10 tokens can be generated per call.
This sequence of update calls is equivalent to using the [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp)
repo directly and running the `llama-cli` locally, with the command:
```
- Load the model into OP memory
This command will load a model into working memory (Orthogonal Persisted):
```bash
dfx canister call llama_cpp load_model '(record { args = vec {"--model"; "models/storiesICP42Mtok4096.gguf";} })'
./llama-cli -m /models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf --prompt-cache prompt.cache --prompt-cache-all -sp -p "<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n<|im_start|>user\ngive me a short introduction to LLMs.<|im_end|>\n<|im_start|>assistant\n" -n 512 -fa -ngl 80
```
In the above command, the `-fa -ngl 80` arguments are only useful on a GPU. We do not use them when calling the IC, because
the canister runs on CPU only.
- Ensure the canister is ready for Inference, with the model loaded
```bash
dfx canister call llama_cpp ready
```
- Chat with the LLM:
- You can download the `main.log` file from the canister with:
```
python -m scripts.download --network local --canister llama_cpp --local-filename main.log main.log
```
```bash
# Start a new chat - this resets the prompt-cache for this conversation
dfx canister call llama_cpp new_chat '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"} })'
## Smoke testing the deployed LLM
# Create 50 tokens from a prompt, with caching
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all";"--samplers"; "top_p"; "--temp"; "0.1"; "--top-p"; "0.9"; "-n"; "50"; "-p"; "Dominic loves writing stories"} })'
You can run a smoketest on the deployed LLM:
# Create another 50 tokens, using the cache - just continue, no new prompt provided
# Repeat until the LLM says it is done
dfx canister call llama_cpp run_update '(record { args = vec {"--prompt-cache"; "my_cache/prompt.cache"; "--prompt-cache-all";"--samplers"; "top_p"; "--temp"; "0.1"; "--top-p"; "0.9"; "-n"; "50";} })'
- Deploy the Qwen2.5 model as described above
# After a couple of calls, you will get something like this as output, unless you hit the context limit error:
(
variant {
Ok = record {
status_code = 200 : nat16;
output = "";
error = "";
input = " Dominic loves writing stories. He wanted to share his love with others, so he built a fun website on the Internet Computer. With his ckBTC, he bought a cool new book with new characters. Every night before bed, Dominic read his favorite stories with his favorite characters. The end.";
}
},
)
- Run the smoketests for the Qwen2.5 LLM deployed to your local IC network:
########################################
# Tip. Add this to the args vec if you #
# want to see how many tokens the #
# canister can generate before it #
# hits the instruction limit #
# #
# ;"--print-token-count"; "1" #
########################################
```
```
# First test the canister functions, like 'health'
pytest -vv test/test_canister_functions.py
# Then run the inference tests
pytest -vv test/test_qwen2.py
```
2 changes: 1 addition & 1 deletion requirements.txt
@@ -3,6 +3,6 @@

-r scripts/requirements.txt
-r src/llama_cpp_onicai_fork/requirements.txt
icpp-pro==4.2.0
icpp-pro>=4.2.0
ic-py==1.0.1
binaryen.py
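For reference, these Python dependencies are typically installed into the conda environment described in the README with:
```bash
pip install -r requirements.txt
```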
2 changes: 1 addition & 1 deletion scripts/download.py
@@ -77,7 +77,7 @@ def main() -> int:

done = False
offset = 0
with open(local_filename_path, "ab") as f:
with open(local_filename_path, "wb") as f:
while not done:
response = canister_instance.file_download_chunk(
{
4 changes: 2 additions & 2 deletions scripts/qa_deploy_and_pytest.py
@@ -36,13 +36,13 @@ def main() -> int:
tests = [
{
"filename": "models/stories260Ktok512.gguf",
"canister_filename": "models/stories260Ktok512.gguf",
"canister_filename": "models/model.gguf",
"test_path_model": "test/test_tiny_stories.py",
},
# This times out in Github action. Can only be run locally.
# {
# "filename": "models/Qwen/Qwen2.5-0.5B-Instruct-GGUF/qwen2.5-0.5b-instruct-q8_0.gguf", # pylint: disable=line-too-long
# "canister_filename": "models/qwen2.5-0.5b-instruct-q8_0.gguf",
# "canister_filename": "models/model.gguf",
# "test_path_model": "test/test_qwen2.py",
# },
]
2 changes: 2 additions & 0 deletions src/download.cpp
@@ -1,4 +1,5 @@
#include "download.h"
#include "auth.h"
#include "utils.h"

#include <fstream>
@@ -23,6 +24,7 @@ void print_file_download_summary(const std::string &filename,

void file_download_chunk() {
IC_API ic_api(CanisterQuery{std::string(__func__)}, false);
if (!is_caller_a_controller(ic_api)) return;

// Get filename to download and the chunksize
std::string filename{""};
