Inference of a bunch of models from less than 1B to more than 300B parameters, for real-time chatting with RAG on your computer (CPU). Pure C++ implementation based on @ggerganov's ggml.
Supported Models | Download Quantized Models
What's New:
- 2024-11-29: QwQ-32B
- 2024-11-22: Marco-o1
- 2024-11-21: Continued generation
- 2024-11-01: Granite, generation steering
- 2024-09-29: LlaMA 3.2
- 2024-09-22: Qwen 2.5
- 2024-09-13: OLMoE
- 2024-09-11: MiniCPM3
- 2024-07-14: ggml updated
- 2024-06-15: Tool calling
- 2024-05-29: ggml is forked instead of submodule
- 2024-05-14: OpenAI API, CodeGemma Base & Instruct supported
- 2024-05-08: Layer shuffling
- Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache, and parallel computing;
- Use OOP to address the similarities between different Transformer-based models;
- Streaming generation with typewriter effect;
- Continuous chatting (content length is virtually unlimited): two methods are available, Restart and Shift. See the `--extending` options (a conceptual sketch follows this list);
- Retrieval Augmented Generation (RAG) 🔥;
- LoRA;
- Python/JavaScript/C/Nim Bindings, web demo, and more possibilities.
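The Restart and Shift methods mentioned above can be pictured with a plain token buffer. The following is only a conceptual Python sketch with made-up names and sizes (`max_ctx`, `sys_len`, `step`), not chatllm.cpp's actual implementation; see the `--extending` option for the real behavior.

```python
# Conceptual sketch: two ways to keep generating once the context
# window of max_ctx tokens is full. Not taken from chatllm.cpp.

def extend_restart(tokens, max_ctx, sys_len):
    """Restart: keep the system prompt plus the most recent tokens,
    then re-encode the shortened history from scratch."""
    if len(tokens) <= max_ctx:
        return tokens
    keep = max_ctx // 2
    return tokens[:sys_len] + tokens[-keep:]

def extend_shift(tokens, max_ctx, sys_len, step=64):
    """Shift: slide the window by evicting the oldest non-system tokens,
    so most of the previously computed state can be reused."""
    if len(tokens) <= max_ctx:
        return tokens
    return tokens[:sys_len] + tokens[sys_len + step:]
```

Either way, the visible history stays bounded while the conversation itself can continue indefinitely.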
As simple as `main -i -m :model_id`. Check it out.
Clone the ChatLLM.cpp repository to your local machine:

```sh
git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp
```

If you forgot the `--recursive` flag when cloning the repository, run the following command in the `chatllm.cpp` folder:

```sh
git submodule update --init --recursive
```
Some quantized models can be downloaded on demand.
Install the dependencies of `convert.py`:

```sh
pip install -r requirements.txt
```
Use `convert.py` to transform models into quantized GGML format. For example, to convert an fp16 base model to a q8_0 (quantized int8) GGML model, run:

```sh
# For models such as ChatLLM-6B, ChatLLM2-6B, InternLM, LlaMA, LlaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

# For some models such as CodeLlaMA, the model type should be provided via `-a`.
# Find the `-a ...` option for each model in `docs/models.md`.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA
```
Use `-l` to specify the path of the LoRA model to be merged, for example:

```sh
python3 convert.py -i path/to/model -l path/to/lora/model -o quantized.bin
```
Note: only the HF format is supported (with a few exceptions); the format of the generated `.bin` files is different from the one (GGUF) used by `llama.cpp`.
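For intuition about what `-t q8_0` produces: q8_0 is a block-wise int8 format in ggml, where each small block of weights is stored as int8 values plus one scale. Below is a minimal NumPy sketch of that idea; it is not the converter's actual code, and the block size and on-disk layout are defined by ggml, not by this snippet.

```python
import numpy as np

def quantize_q8_0_like(w: np.ndarray, block: int = 32):
    """Block-wise symmetric int8 quantization in the spirit of q8_0:
    within each block, x ≈ d * q with int8 q in [-127, 127] and one scale d."""
    x = w.astype(np.float32).reshape(-1, block)
    d = np.abs(x).max(axis=1, keepdims=True) / 127.0
    d[d == 0] = 1.0                                # guard all-zero blocks
    q = np.clip(np.round(x / d), -127, 127).astype(np.int8)
    return d, q

def dequantize(d, q):
    """Reconstruct approximate float weights from (scale, int8) blocks."""
    return (d * q.astype(np.float32)).ravel()

w = np.random.randn(4096).astype(np.float32)
d, q = quantize_q8_0_like(w)
print("max abs error:", np.abs(dequantize(d, q) - w).max())
```

The storage cost drops from 32 bits to roughly 8 bits per weight plus a small per-block overhead for the scale, which is where the memory savings in the feature list come from.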
To build this project, you have several options.

- Using `make`:

  Prepare for using `make` on Windows:

  - Download the latest fortran version of w64devkit.
  - Extract `w64devkit` on your PC.
  - Run `w64devkit.exe`, then `cd` to the `chatllm.cpp` folder.

  Then run:

  ```sh
  make
  ```

  The executable is `./obj/main`.

- Using `CMake`:

  ```sh
  cmake -B build
  # On Linux, WSL:
  cmake --build build -j
  # On Windows with MSVC:
  cmake --build build -j --config Release
  ```
  The executable is `./build/bin/main`.
Now you may chat with a quantized model by running:

```sh
./build/bin/main -m chatglm-ggml.bin                 # ChatGLM-6B
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
# (Hello! I'm the AI assistant ChatGLM-6B. Nice to meet you; feel free to ask me anything.)

./build/bin/main -m llama2.bin --seed 100            # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....
```
To run the model in interactive mode, add the `-i` flag. For example:

```sh
# On Windows:
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL):
rlwrap ./build/bin/main -m model.bin -i
```
In interactive mode, your chat history serves as the context for the next round of conversation.
Run `./build/bin/main -h` to explore more options!
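Conceptually, the multi-turn context in interactive mode amounts to feeding earlier rounds back in as part of the next prompt. The sketch below uses a generic `role: text` template purely for illustration; the real prompt format is model-specific and handled by chatllm.cpp internally.

```python
# Conceptual sketch: earlier rounds become part of the next round's input.
# The template here is generic, not the one used for any particular model.

history = []  # list of (role, text) pairs

def build_prompt(history, user_msg):
    lines = [f"{role}: {text}" for role, text in history]
    lines += [f"user: {user_msg}", "assistant:"]
    return "\n".join(lines)

def chat(user_msg, generate):
    prompt = build_prompt(history, user_msg)   # the model sees the whole history
    reply = generate(prompt)
    history.append(("user", user_msg))
    history.append(("assistant", reply))
    return reply
```

This is also why the Restart/Shift handling described earlier matters: without it, the accumulated history would eventually exceed the model's context window.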
- This project started as a refactoring of ChatGLM.cpp, without which it would not have been possible.
- Thanks to those who have released the sources and checkpoints of their models.
This is my hobby project for learning DL & GGML, and it is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcome.