ChatLLM.cpp

中文版 | 日本語

License: MIT

Inference of a wide range of models, from under 1B to over 300B parameters, for real-time chatting with RAG on your computer (CPU); a pure C++ implementation based on @ggerganov's ggml.

| Supported Models | Download Quantized Models |

What's New:

Features

  • Accelerated memory-efficient CPU inference with int4/int8 quantization, optimized KV cache and parallel computing;

  • Use OOP to address the similarities between different Transformer-based models;

  • Streaming generation with typewriter effect;

  • Continuous chatting (content length is virtually unlimited)

    Two methods are available: Restart and Shift. See the --extending options, and the example sketched after this list.

  • Retrieval Augmented Generation (RAG) 🔥

  • LoRA;

  • Python/JavaScript/C/Nim Bindings, web demo, and more possibilities.
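
As noted in the continuous chatting item above, a sketch of such a session using the Shift method could look like the following (the lowercase flag value is an assumption; run ./build/bin/main -h to confirm the exact syntax):

./build/bin/main -m quantized.bin -i --extending shift   # assumed value name; older context is shifted out as the chat grows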

Quick Start

As simple as main -i -m :model_id. Check it out.
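
For instance, once the main executable is built (see Build below), a session could be started from a model id like so (:model_id is a placeholder for an id from the Download Quantized Models list):

./build/bin/main -i -m :model_id   # downloads the quantized model on demand, then starts an interactive chat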

Usage

Preparation

Clone the ChatLLM.cpp repository into your local machine:

git clone --recursive https://github.com/foldl/chatllm.cpp.git && cd chatllm.cpp

If you forgot the --recursive flag when cloning the repository, run the following command in the chatllm.cpp folder:

git submodule update --init --recursive

Quantize Model

Some quantized models can be downloaded on demand.

Install dependencies of convert.py:

pip install -r requirements.txt

Use convert.py to transform models into quantized GGML format. For example, to convert an fp16 base model to a q8_0 (quantized int8) GGML model, run:

# For models such as ChatGLM-6B, ChatGLM2-6B, InternLM, LLaMA, LLaMA-2, Baichuan-2, etc.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin

# For some models such as CodeLlaMA, model type should be provided by `-a`
# Find `-a ...` option for each model in `docs/models.md`.
python3 convert.py -i path/to/model -t q8_0 -o quantized.bin -a CodeLlaMA

Use -l to specify the path of the LoRA model to be merged, such as:

python3 convert.py -i path/to/model -l path/to/lora/model -o quantized.bin

Note: only models in HF format are supported (with a few exceptions); the format of the generated .bin files is different from the GGUF format used by llama.cpp.
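
Once the project is built (see Build below), the converted quantized.bin can be loaded directly by the main executable, for example:

./build/bin/main -m quantized.bin -i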

Build

There are several options for building this project.

  • Using make:

    Prepare for using make on Windows:

    1. Download the latest fortran version of w64devkit.
    2. Extract w64devkit on your PC.
    3. Run w64devkit.exe, then cd to the chatllm.cpp folder.

    Then build by running:

    make

    The executable is ./obj/main.

  • Using CMake:

    cmake -B build
    # On Linux, WSL:
    cmake --build build -j
    # On Windows with MSVC:
    cmake --build build -j --config Release

    The executable is ./build/bin/main.

Run

Now you may chat with a quantized model by running:

./build/bin/main -m chatglm-ggml.bin                            # ChatGLM-6B
# 你好👋!我是人工智能助手 ChatGLM-6B,很高兴见到你,欢迎问我任何问题。
./build/bin/main -m llama2.bin  --seed 100                      # Llama-2-Chat-7B
# Hello! I'm here to help you with any questions or concerns ....

To run the model in interactive mode, add the -i flag. For example:

# On Windows
.\build\bin\Release\main -m model.bin -i

# On Linux (or WSL)
rlwrap ./build/bin/main -m model.bin -i

In interactive mode, your chat history serves as the context for the next round of conversation.

Run ./build/bin/main -h to explore more options!

Acknowledgements

  • This project started as a refactoring of ChatGLM.cpp, without which this project would not have been possible.

  • Thanks to those who have released their model sources and checkpoints.

Note

This is a hobby project for learning DL & GGML, and it is under active development. PRs for new features won't be accepted, while PRs for bug fixes are warmly welcomed.