
Commit

[README] Update readme. (#431)
Duyi-Wang authored Jun 4, 2024
1 parent f5ab97c commit e23c8f8
Showing 6 changed files with 244 additions and 31 deletions.
83 changes: 73 additions & 10 deletions README.md
@@ -20,6 +20,7 @@ xFasterTransformer is an exceptionally optimized solution for large language models
- [Built from source](#built-from-source)
- [Prepare Environment](#prepare-environment)
- [Manually](#manually)
- [Install dependent libraries](#install-dependent-libraries)
- [How to build](#how-to-build)
- [Models Preparation](#models-preparation)
- [API usage](#api-usage)
@@ -34,6 +35,11 @@ xFasterTransformer is an exceptionally optimized solution for large language models
- [C++](#c)
- [Web Demo](#web-demo)
- [Serving](#serving)
- [vLLM](#vllm)
- [Install](#install)
- [OpenAI Compatible Server](#openai-compatible-server)
- [FastChat](#fastchat)
- [MLServer](#mlserver)
- [Benchmark](#benchmark)
- [Support](#support)
- [Q\&A](#qa)
@@ -55,7 +61,8 @@ xFasterTransformer provides a series of APIs, both C++ and Python, for end users
| Llama | ✔ | ✔ | ✔ |
| Llama2 | ✔ | ✔ | ✔ |
| Llama3 | ✔ | ✔ | ✔ |
| Baichuan | ✔ | ✔ | ✔ |
| Baichuan1 | ✔ | ✔ | ✔ |
| Baichuan2 | ✔ | ✔ | ✔ |
| QWen | ✔ | ✔ | ✔ |
| QWen2 | ✔ | ✔ | ✔ |
| SecLLM(YaRN-Llama) | ✔ | ✔ | ✔ |
@@ -114,12 +121,12 @@ docker run -it \
### Built from source
#### Prepare Environment
##### Manually
- [PyTorch](https://pytorch.org/get-started/locally/) v2.0 (required for the PyTorch API; not needed for the C++ API.)
- [PyTorch](https://pytorch.org/get-started/locally/) v2.3 (required for the PyTorch API; not needed for the C++ API.)
```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

- For GPU, xFT needs the ABI=1 build [torch==2.0.1+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.0.1%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=fbe35a5c60aef0c4b5463caab10ba905bdfa07d6d16b7be5d510225c966a0b46) from the [torch-whl-list](https://download.pytorch.org/whl/torch/) because DPC++ requires ABI=1.
- For GPU, xFT needs the ABI=1 build [torch==2.3.0+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=c34512c3e07efe9b7fb5c3a918fef1a7c6eb8969c6b2eea92ee5c16a0583fe12) from the [torch-whl-list](https://download.pytorch.org/whl/torch/) because DPC++ requires ABI=1 (a minimal install sketch follows).
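A minimal install sketch for the ABI=1 wheel linked above (assuming a Python 3.8 environment, since it is a `cp38` build; pick the matching wheel from the list otherwise):
```bash
# Hedged sketch: install the ABI=1 CPU wheel directly from its URL; adjust for your Python version.
pip install "https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl"
```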

##### Install dependent libraries

@@ -229,7 +236,10 @@ std::cout << std::endl;
```
## How to run
Recommend preloading `libiomp5.so` for better performance. The `libiomp5.so` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer successfully.
Recommend preloading `libiomp5.so` for better performance.
- ***[Recommended]*** Run `export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')` if xfastertransformer's Python wheel package is installed (see the sketch after this list).
- If building from source, the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after xFasterTransformer is built successfully.
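The recommended one-liner simply exports whatever environment variables `xft.get_env()` prints (for example an `LD_PRELOAD` entry; the exact contents depend on your installation). A quick way to inspect and then apply them:
```bash
# Hedged sketch: first print what xft.get_env() would export, then apply it in the current shell.
python -c 'import xfastertransformer as xft; print(xft.get_env())'
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
```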
### Single rank
xFasterTransformer will automatically check the MPI environment, or you can set the `SINGLE_INSTANCE=1` environment variable to forcibly disable MPI.
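For example, a hypothetical single-rank launch that forces MPI off (the script path and flags mirror the PyTorch demo later in this README and are illustrative only):
```bash
# Hedged sketch: force single-instance mode even if an MPI environment is present.
SINGLE_INSTANCE=1 python examples/pytorch/demo.py \
    --dtype=bf16 \
    --token_path=${TOKEN_PATH} \
    --model_path=${MODEL_PATH}
```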
@@ -254,7 +264,9 @@ To run in multi-rank mode with MPI, please install oneCCL first.
- Here is an example of running locally (a note on `${RUN_WORKLOAD}` follows the code block).
```bash
OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
# or export LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 ${RUN_WORKLOAD} : \
-n 1 numactl -N 1 -m 1 ${RUN_WORKLOAD}
```
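Here `${RUN_WORKLOAD}` stands for whatever command each rank should execute; a hypothetical definition based on the PyTorch demo elsewhere in this README could be:
```bash
# Hedged sketch: each MPI rank runs this command, pinned to its own NUMA node by numactl above.
RUN_WORKLOAD="python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}"
```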
@@ -300,14 +312,65 @@ A web demo based on [Gradio](https://www.gradio.app/) is provided in the repo.
- Run the script corresponding to the model. After the web server starts, open the output URL in the browser to use the demo. Please specify the model and tokenizer directory paths and the data type. The `transformers` tokenizer is used to encode and decode text, so `${TOKEN_PATH}` refers to the Hugging Face model directory. This demo also supports multi-rank mode.
```bash
# Recommend preloading `libiomp5.so` for better performance.
# The `libiomp5.so` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer.
LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
--dtype=bf16 \
--token_path=${TOKEN_PATH} \
--model_path=${MODEL_PATH}
# or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
python examples/web_demo/ChatGLM.py \
--dtype=bf16 \
--token_path=${TOKEN_PATH} \
--model_path=${MODEL_PATH}
```

## Serving
### vLLM
A fork of vLLM has been created to integrate the xFasterTransformer backend, maintaining compatibility with most of the official vLLM features. Refer to [this link](serving/vllm-xft.md) for more details.
#### Install
```bash
pip install vllm-xft
```
***Notice: Please do not install both `vllm-xft` and `vllm` simultaneously in the environment. Although the package names are different, they will actually overwrite each other.***
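If `vllm` is already installed, one way to switch cleanly (a suggestion, not an official migration path) is to remove it first:
```bash
# Hedged sketch: uninstall upstream vLLM before installing the xFT fork so the two packages cannot clash.
pip uninstall -y vllm
pip install vllm-xft
```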
#### OpenAI Compatible Server
***Notice: Preloading libiomp5.so is required!***
```bash
# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
python -m vllm.entrypoints.openai.api_server \
--model ${XFT_MODEL} \
--tokenizer ${TOKENIZER_DIR} \
--dtype fp16 \
--kv-cache-dtype fp16 \
--served-model-name xft \
--port 8000 \
--trust-remote-code
```
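Once the server is up it can be queried like any OpenAI-compatible endpoint. A minimal sketch, assuming the fork keeps vLLM's standard `/v1/completions` route and the server runs on `localhost:8000` with the served model name `xft` as configured above:
```bash
# Hedged sketch: simple completion request against the OpenAI-compatible server started above.
curl http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "xft", "prompt": "Hello, my name is", "max_tokens": 32, "temperature": 0.7}'
```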
For multi-rank mode, please use `python -m vllm.entrypoints.slave` as the slave and keep the slaves' parameters aligned with the master's.
```bash
# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl --all -C 0-47 -m 0 \
python -m vllm.entrypoints.openai.api_server \
--model ${MODEL_PATH} \
--tokenizer ${TOKEN_PATH} \
--dtype bf16 \
--kv-cache-dtype fp16 \
--served-model-name xft \
--port 8000 \
--trust-remote-code \
: -n 1 numactl --all -C 48-95 -m 1 \
python -m vllm.entrypoints.slave \
--dtype bf16 \
--model ${MODEL_PATH} \
--kv-cache-dtype fp16
```
### FastChat
xFasterTransformer is an official inference backend of [FastChat](https://github.com/lm-sys/FastChat). Please refer to [xFasterTransformer in FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/xFasterTransformer.md) and [FastChat's serving](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) for more details.

### MLServer
[An example MLServer serving setup](serving/mlserver/README.md) is provided, which supports the REST and gRPC interfaces and an adaptive batching feature to group inference requests together on the fly.
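As a rough illustration of the REST side, MLServer exposes the KServe V2 inference protocol; a hypothetical request (the port, model name `xft`, and input tensor name `prompt` are illustrative — follow the linked example for the actual names and payload schema) might look like:
```bash
# Hedged sketch: V2-protocol REST inference call; adjust host, model, and input names to the MLServer example's settings.
curl http://localhost:8080/v2/models/xft/infer \
    -H "Content-Type: application/json" \
    -d '{"inputs": [{"name": "prompt", "shape": [1], "datatype": "BYTES", "data": ["Hello, my name is"]}]}'
```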

## [Benchmark](benchmark/README.md)
88 changes: 79 additions & 9 deletions README_CN.md
@@ -20,6 +20,7 @@ xFasterTransformer provides a deployment solution for large language models (LLMs) on the x86 CPU platform
- [Built from source](#从源码构建)
- [Prepare Environment](#准备环境)
- [Manually](#手动操作)
- [Install dependent libraries](#安装依赖的库)
- [How to build](#如何编译)
- [Models Preparation](#模型准备)
- [API usage](#api-用法)
@@ -34,6 +35,11 @@ xFasterTransformer provides a deployment solution for large language models (LLMs) on the x86 CPU platform
- [C++](#c)
- [Web Demo](#网页示例)
- [Serving](#服务)
- [vLLM](#vllm)
- [Install](#install)
- [OpenAI-API-compatible server](#兼容openai-api的服务)
- [FastChat](#fastchat)
- [MLServer](#mlserver)
- [Benchmark](#性能测试)
- [Support](#技术支持)
- [Q\&A](#问题与回答)
@@ -46,16 +52,19 @@ xFasterTransformer provides a series of C++ and Python application interfaces for end users

### Supported Models

| Models | Framework | | Distributed Support |
| :----------------: | :------: | :------: | :--------: |
| | PyTorch | C++ | |
| ChatGLM | &#10004; | &#10004; | &#10004; |
| ChatGLM2 | &#10004; | &#10004; | &#10004; |
| ChatGLM3 | &#10004; | &#10004; | &#10004; |
| Llama | &#10004; | &#10004; | &#10004; |
| Llama2 | &#10004; | &#10004; | &#10004; |
| Baichuan | &#10004; | &#10004; | &#10004; |
| Llama3 | &#10004; | &#10004; | &#10004; |
| Baichuan1 | &#10004; | &#10004; | &#10004; |
| Baichuan2 | &#10004; | &#10004; | &#10004; |
| QWen | &#10004; | &#10004; | &#10004; |
| QWen2 | &#10004; | &#10004; | &#10004; |
| SecLLM(YaRN-Llama) | &#10004; | &#10004; | &#10004; |
| Opt | &#10004; | &#10004; | &#10004; |
| Deepseek-coder | &#10004; | &#10004; | &#10004; |
@@ -112,11 +121,13 @@ docker run -it \
### Built from source
#### Prepare Environment
##### Manually
- [PyTorch](https://pytorch.org/get-started/locally/) v2.0 (required for the PyTorch API; not needed for the C++ API.)
- [PyTorch](https://pytorch.org/get-started/locally/) v2.3 (required for the PyTorch API; not needed for the C++ API.)
```bash
pip install torch --index-url https://download.pytorch.org/whl/cpu
```

- For the GPU version of xFT, DPC++ requires ABI=1, so install the ABI=1 build [torch==2.3.0+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=c34512c3e07efe9b7fb5c3a918fef1a7c6eb8969c6b2eea92ee5c16a0583fe12) from the [torch-whl-list](https://download.pytorch.org/whl/torch/).

##### Install dependent libraries

Please install the libnuma dependency:
@@ -165,6 +176,7 @@ The model format supported by xFasterTransformer differs from Huggingface's, but is compatible with FasterTransformer's.
- OPTConvert
- BaichuanConvert
- QwenConvert
- Qwen2Convert
- DeepseekConvert
## API usage
@@ -225,7 +237,10 @@ std::cout << std::endl;
```
## How to run
It is recommended to preload `libiomp5.so` for better performance. After building xFasterTransformer successfully, the `libiomp5.so` file will be located in the `3rdparty/mklml/lib` directory.
It is recommended to preload `libiomp5.so` for better performance.
- **[Recommended]** If the xfastertransformer Python wheel package is installed, run `export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')`.
- If building xFasterTransformer from source, the `libiomp5.so` file will be located in the `3rdparty/mkl/lib` directory after a successful build.
### Single rank
xFasterTransformer automatically checks the MPI environment, or you can set the `SINGLE_INSTANCE=1` environment variable to forcibly disable MPI.
@@ -250,7 +265,9 @@ xFasterTransformer automatically checks the MPI environment, or you can set the `SINGLE_INSTANCE=1`
- Below is an example of running in a local environment.
```bash
OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
# or preload manually: export LD_PRELOAD=libiomp5.so
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 ${RUN_WORKLOAD} : \
-n 1 numactl -N 1 -m 1 ${RUN_WORKLOAD}
```
@@ -297,13 +314,66 @@ while (1) {
```bash
# Preloading `libiomp5.so` is recommended for better performance.
# The `libiomp5.so` file will be located in the `3rdparty/mklml/lib` folder after building.
LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
--dtype=bf16 \
--token_path=${TOKEN_PATH} \
--model_path=${MODEL_PATH}
# or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be located in the `3rdparty/mkl/lib` folder after building
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
python examples/web_demo/ChatGLM.py \
--dtype=bf16 \
--token_path=${TOKEN_PATH} \
--model_path=${MODEL_PATH}
```

## Serving

### vLLM
The vllm-xft project is a fork of vLLM that integrates the xFasterTransformer backend for better performance, while remaining compatible with most features of the official vLLM. Refer to [this link](serving/vllm-xft.md) for more details.

#### Install
```bash
pip install vllm-xft
```
***Notice: Do not install both `vllm-xft` and `vllm` in the same environment. Although the package names are different, they will actually overwrite each other.***

#### OpenAI-API-compatible server
***Notice: Preloading `libiomp5.so` is required!***
```bash
# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

python -m vllm.entrypoints.openai.api_server \
--model ${XFT_MODEL} \
--tokenizer ${TOKENIZER_DIR} \
--dtype fp16 \
--kv-cache-dtype fp16 \
--served-model-name xft \
--port 8000 \
--trust-remote-code
```
For multi-rank mode, use `python -m vllm.entrypoints.slave` as the slave node, and make sure the slave's parameters match the master's.
```bash
# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl --all -C 0-47 -m 0 \
python -m vllm.entrypoints.openai.api_server \
--model ${MODEL_PATH} \
--tokenizer ${TOKEN_PATH} \
--dtype bf16 \
--kv-cache-dtype fp16 \
--served-model-name xft \
--port 8000 \
--trust-remote-code \
: -n 1 numactl --all -C 48-95 -m 1 \
python -m vllm.entrypoints.slave \
--dtype bf16 \
--model ${MODEL_PATH} \
--kv-cache-dtype fp16
```

### FastChat
xFasterTransformer is an official inference backend of [FastChat](https://github.com/lm-sys/FastChat). Please refer to [xFasterTransformer in FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/xFasterTransformer.md) and [FastChat's serving](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) for more details.

### MLServer
[An example MLServer serving setup](serving/mlserver/README.md) supports the REST and gRPC interfaces and an adaptive batching feature to group inference requests on the fly.

## [Benchmark](benchmark/README.md)
9 changes: 6 additions & 3 deletions examples/cpp/README.md
@@ -10,11 +10,14 @@ Please refer to [Prepare model](../README.md#prepare-model)
## Step 3: Run binary
```bash
# Recommend preloading `libiomp5.so` for better performance.
# The `libiomp5.so` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer.
LD_PRELOAD=libiomp5.so ./example -m ${MODEL_PATH} -t ${TOKEN_PATH}
# or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
# run single instance like
./example -m ${MODEL_PATH} -t ${TOKEN_PATH}
# run multi-instance like
OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 ./example -m ${MODEL_PATH} -t ${TOKEN_PATH} : \
-n 1 numactl -N 1 -m 1 ./example -m ${MODEL_PATH} -t ${TOKEN_PATH}
```
9 changes: 6 additions & 3 deletions examples/pytorch/README.md
@@ -19,11 +19,14 @@ Please refer to [Prepare model](../README.md#prepare-model)
## Step 4: Run
```bash
# Recommend preloading `libiomp5.so` for better performance.
# The `libiomp5.so` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer.
LD_PRELOAD=libiomp5.so python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
# or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')

# run single instance like
python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}

# run multi-rank like
OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH} : \
-n 1 numactl -N 1 -m 1 python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
```
15 changes: 9 additions & 6 deletions examples/web_demo/README.md
@@ -28,14 +28,17 @@ Please refer to [Prepare model](../README.md#prepare-model)
After the web server starts, open the output URL in the browser to use the demo. Please specify the model and tokenizer directory paths and the data type. The `transformers` tokenizer is used to encode and decode text, so `${TOKEN_PATH}` refers to the Hugging Face model directory.
```bash
# Recommend preloading `libiomp5.so` for better performance.
# The `libiomp5.so` file will be in the `3rdparty/mklml/lib` directory after building xFasterTransformer.
LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
--dtype=bf16 \
--token_path=${TOKEN_PATH} \
--model_path=${MODEL_PATH}
# or set LD_PRELOAD=libiomp5.so manually; the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after building xFasterTransformer.
export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
# run single instance like
python examples/web_demo/ChatGLM.py \
--dtype=bf16 \
--token_path=${TOKEN_PATH} \
--model_path=${MODEL_PATH}
# run multi-rank like
OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
OMP_NUM_THREADS=48 mpirun \
-n 1 numactl -N 0 -m 0 python examples/web_demo/ChatGLM.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH} : \
-n 1 numactl -N 1 -m 1 python examples/web_demo/ChatGLM.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
```
