diff --git a/README.md b/README.md
index bf136fe5..0c0dfa25 100644
--- a/README.md
+++ b/README.md
@@ -20,6 +20,7 @@ xFasterTransformer is an exceptionally optimized solution for large language mod
     - [Built from source](#built-from-source)
       - [Prepare Environment](#prepare-environment)
         - [Manually](#manually)
+        - [Install dependent libraries](#install-dependent-libraries)
       - [How to build](#how-to-build)
   - [Models Preparation](#models-preparation)
   - [API usage](#api-usage)
@@ -34,6 +35,11 @@ xFasterTransformer is an exceptionally optimized solution for large language mod
       - [C++](#c)
   - [Web Demo](#web-demo)
   - [Serving](#serving)
+    - [vLLM](#vllm)
+      - [Install](#install)
+      - [OpenAI Compatible Server](#openai-compatible-server)
+    - [FastChat](#fastchat)
+    - [MLServer](#mlserver)
   - [Benchmark](#benchmark)
   - [Support](#support)
   - [Q\&A](#qa)
@@ -55,7 +61,8 @@ xFasterTransformer provides a series of APIs, both of C++ and Python, for end us
 | Llama | ✔ | ✔ | ✔ |
 | Llama2 | ✔ | ✔ | ✔ |
 | Llama3 | ✔ | ✔ | ✔ |
-| Baichuan | ✔ | ✔ | ✔ |
+| Baichuan1 | ✔ | ✔ | ✔ |
+| Baichuan2 | ✔ | ✔ | ✔ |
 | QWen | ✔ | ✔ | ✔ |
 | QWen2 | ✔ | ✔ | ✔ |
 | SecLLM(YaRN-Llama) | ✔ | ✔ | ✔ |
@@ -114,12 +121,12 @@ docker run -it \
 ### Built from source
 #### Prepare Environment
 ##### Manually
-- [PyTorch](https://pytorch.org/get-started/locally/) v2.0 (When using the PyTorch API, it's required, but it's not needed when using the C++ API.)
+- [PyTorch](https://pytorch.org/get-started/locally/) v2.3 (required when using the PyTorch API; not needed for the C++ API)
   ```bash
   pip install torch --index-url https://download.pytorch.org/whl/cpu
   ```
 
-- For GPU, xFT needs ABI=1 from [torch==2.0.1+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.0.1%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=fbe35a5c60aef0c4b5463caab10ba905bdfa07d6d16b7be5d510225c966a0b46) in [torch-whl-list](https://download.pytorch.org/whl/torch/) due to DPC++ need ABI=1.
+- For GPU, xFT needs the ABI=1 build [torch==2.3.0+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=c34512c3e07efe9b7fb5c3a918fef1a7c6eb8969c6b2eea92ee5c16a0583fe12) from the [torch-whl-list](https://download.pytorch.org/whl/torch/), because DPC++ requires ABI=1.
 
 ##### Install dependent libraries
 
@@ -229,7 +236,10 @@ std::cout << std::endl;
 ```
 
 ## How to run
-Recommend preloading `libiomp5.so` to get a better performance. `libiomp5.so` file will be in `3rdparty/mklml/lib` directory after building xFasterTransformer successfully.
+Preloading `libiomp5.so` is recommended for better performance.
+- ***[Recommended]*** Run `export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')` if the xfastertransformer Python wheel package is installed.
+- If building from source, the `libiomp5.so` file will be in the `3rdparty/mkl/lib` directory after xFasterTransformer is built successfully.
+
 ### Single rank
 FasterTransformer will automatically check the MPI environment, or you can use the `SINGLE_INSTANCE=1` environment variable to forcefully deactivate MPI.
 
@@ -254,7 +264,9 @@ Use MPI to run in the multi-ranks mode, please install oneCCL firstly.
 - Here is a example on local.
   ```bash
-  OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
+  # or manually export LD_PRELOAD=libiomp5.so
+  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+  OMP_NUM_THREADS=48 mpirun \
     -n 1 numactl -N 0 -m 0 ${RUN_WORKLOAD} : \
     -n 1 numactl -N 1 -m 1 ${RUN_WORKLOAD}
   ```
 
@@ -300,14 +312,65 @@ A web demo based on [Gradio](https://www.gradio.app/) is provided in repo. Now s
 - Run the script corresponding to the model. After the web server started, open the output URL in the browser to use the demo. Please specify the paths of model and tokenizer directory, and data type. `transformer`'s tokenizer is used to encode and decode text so `${TOKEN_PATH}` means the huggingface model directory. This demo also support multi-rank.
   ```bash
   # Recommend preloading `libiomp5.so` to get a better performance.
-  # `libiomp5.so` file will be in `3rdparty/mklml/lib` directory after build xFasterTransformer.
-  LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
-                        --dtype=bf16 \
-                        --token_path=${TOKEN_PATH} \
-                        --model_path=${MODEL_PATH}
+  # or preload LD_PRELOAD=libiomp5.so manually; if built from source, `libiomp5.so` will be in the `3rdparty/mkl/lib` directory.
+  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+  python examples/web_demo/ChatGLM.py \
+                        --dtype=bf16 \
+                        --token_path=${TOKEN_PATH} \
+                        --model_path=${MODEL_PATH}
   ```
 
 ## Serving
+### vLLM
+A fork of vLLM has been created to integrate the xFasterTransformer backend, maintaining compatibility with most of the official vLLM features. Refer to [this link](serving/vllm-xft.md) for more details.
+
+#### Install
+```bash
+pip install vllm-xft
+```
+***Notice: Please do not install both `vllm-xft` and `vllm` in the same environment. Although the package names are different, they will overwrite each other.***
+
+#### OpenAI Compatible Server
+***Notice: Preloading `libiomp5.so` is required!***
+```bash
+# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+python -m vllm.entrypoints.openai.api_server \
+        --model ${XFT_MODEL} \
+        --tokenizer ${TOKENIZER_DIR} \
+        --dtype fp16 \
+        --kv-cache-dtype fp16 \
+        --served-model-name xft \
+        --port 8000 \
+        --trust-remote-code
+```
+For multi-rank mode, please use `python -m vllm.entrypoints.slave` for the slaves and keep the slaves' parameters aligned with the master's.
+```bash
+# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+OMP_NUM_THREADS=48 mpirun \
+        -n 1 numactl --all -C 0-47 -m 0 \
+          python -m vllm.entrypoints.openai.api_server \
+            --model ${MODEL_PATH} \
+            --tokenizer ${TOKEN_PATH} \
+            --dtype bf16 \
+            --kv-cache-dtype fp16 \
+            --served-model-name xft \
+            --port 8000 \
+            --trust-remote-code \
+        : -n 1 numactl --all -C 48-95 -m 1 \
+          python -m vllm.entrypoints.slave \
+            --dtype bf16 \
+            --model ${MODEL_PATH} \
+            --kv-cache-dtype fp16
+```
+
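+Once the server is up (single-rank or multi-rank), you can verify the endpoint with the same completion request used in [serving/vllm-xft.md](serving/vllm-xft.md):
+```bash
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "xft",
+        "prompt": "San Francisco is a",
+        "max_tokens": 16,
+        "temperature": 0
+    }'
+```
+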
+### FastChat
+xFasterTransformer is an official inference backend of [FastChat](https://github.com/lm-sys/FastChat). Please refer to [xFasterTransformer in FastChat](https://github.com/lm-sys/FastChat/blob/main/docs/xFasterTransformer.md) and [FastChat's serving](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md) for more details.
+
+### MLServer
 [A example serving of MLServer](serving/mlserver/README.md) is provided which supports REST and gRPC interface and adaptive batching feature to group inference requests together on the fly.
 
 ## [Benchmark](benchmark/README.md)
diff --git a/README_CN.md b/README_CN.md
index 7ba1968b..24973294 100644
--- a/README_CN.md
+++ b/README_CN.md
@@ -20,6 +20,7 @@ xFasterTransformer为大语言模型(LLM)在CPU X86平台上的部署提供
     - [从源码构建](#从源码构建)
       - [准备环境](#准备环境)
        - [手动操作](#手动操作)
+       - [安装依赖的库](#安装依赖的库)
      - [如何编译](#如何编译)
  - [模型准备](#模型准备)
  - [API 用法](#api-用法)
@@ -34,6 +35,11 @@ xFasterTransformer为大语言模型(LLM)在CPU X86平台上的部署提供
       - [C++](#c)
  - [网页示例](#网页示例)
  - [服务](#服务)
+    - [vLLM](#vllm)
+      - [Install](#install)
+      - [兼容OpenAI-API的服务](#兼容openai-api的服务)
+    - [FastChat](#fastchat)
+    - [MLServer](#mlserver)
  - [性能测试](#性能测试)
  - [技术支持](#技术支持)
  - [问题与回答](#问题与回答)
@@ -46,7 +52,7 @@ xFasterTransformer 提供了一系列 C++ 和 Python 应用程序接口,终端
 
 ### 支持的模型
 
-| 模型 | 框架 | | 分布式支持 | 
+| 模型 | 框架 | | 分布式支持 |
 | :----------------: | :------: | :------: | :--------: |
 | | PyTorch | C++ | |
 | ChatGLM | ✔ | ✔ | ✔ |
@@ -54,8 +60,11 @@ | ChatGLM2 | ✔ | ✔ | ✔ |
 | ChatGLM3 | ✔ | ✔ | ✔ |
 | Llama | ✔ | ✔ | ✔ |
 | Llama2 | ✔ | ✔ | ✔ |
-| Baichuan | ✔ | ✔ | ✔ |
+| Llama3 | ✔ | ✔ | ✔ |
+| Baichuan1 | ✔ | ✔ | ✔ |
+| Baichuan2 | ✔ | ✔ | ✔ |
 | QWen | ✔ | ✔ | ✔ |
+| QWen2 | ✔ | ✔ | ✔ |
 | SecLLM(YaRN-Llama) | ✔ | ✔ | ✔ |
 | Opt | ✔ | ✔ | ✔ |
 | Deepseek-coder | ✔ | ✔ | ✔ |
@@ -112,11 +121,13 @@ docker run -it \
 ### 从源码构建
 #### 准备环境
 ##### 手动操作
-- [PyTorch](https://pytorch.org/get-started/locally/) v2.0 (使用 PyTorch API 时需要,但使用 C++ API 时不需要。)
+- [PyTorch](https://pytorch.org/get-started/locally/) v2.3 (使用 PyTorch API 时需要,但使用 C++ API 时不需要。)
   ```bash
   pip install torch --index-url https://download.pytorch.org/whl/cpu
   ```
+- 对于 GPU 版本的 xFT,由于 DPC++ 要求 ABI=1,因此需要安装 [torch-whl-list](https://download.pytorch.org/whl/torch/) 中 ABI=1 的 [torch==2.3.0+cpu.cxx11.abi](https://download.pytorch.org/whl/cpu-cxx11-abi/torch-2.3.0%2Bcpu.cxx11.abi-cp38-cp38-linux_x86_64.whl#sha256=c34512c3e07efe9b7fb5c3a918fef1a7c6eb8969c6b2eea92ee5c16a0583fe12)。
+
 ##### 安装依赖的库
 
 请安装所依赖的libnuma库:
@@ -165,6 +176,7 @@ xFasterTransformer 支持的模型格式与 Huggingface 有所不同,但与 Fa
   - OPTConvert
   - BaichuanConvert
   - QwenConvert
+  - Qwen2Convert
   - DeepseekConvert
 
 ## API 用法
@@ -225,7 +237,10 @@ std::cout << std::endl;
 ```
 
 ## 如何运行
-建议预加载 `libiomp5.so` 以获得更好的性能。成功编译 xFasterTransformer 后,`libiomp5.so` 文件将位于 `3rdparty/mklml/lib` 目录中。
+建议预加载 `libiomp5.so` 以获得更好的性能。
+- **[推荐]** 如果已安装 xfastertransformer 的 Python wheel 包,请运行 `export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')`。
+- 如果从源代码构建 xFasterTransformer,成功构建后 `libiomp5.so` 文件将在 `3rdparty/mkl/lib` 目录下。
+
 ### 单进程
 xFasterTransformer 会自动检查 MPI 环境,或者使用 `SINGLE_INSTANCE=1` 环境变量强制停用 MPI。
 
@@ -250,7 +265,9 @@ xFasterTransformer 会自动检查 MPI 环境,或者使用 `SINGLE_INSTANCE=1`
 - 下面是一个本地环境的运行方式示例。
   ```bash
-  OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
+  # 或者手动预加载 export LD_PRELOAD=libiomp5.so
+  export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+  OMP_NUM_THREADS=48 mpirun \
     -n 1 numactl -N 0 -m 0 ${RUN_WORKLOAD} : \
     -n 1 numactl -N 1 -m 1 ${RUN_WORKLOAD}
   ```
@@ -297,13 +314,66 @@ while (1) {
 ```bash
 # 推荐预加载`libiomp5.so`来获得更好的性能。
 # `libiomp5.so`文件会位于编译后`3rdparty/mklml/lib`文件夹中。
-LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
-                      --dtype=bf16 \
-                      --token_path=${TOKEN_PATH} \
-                      --model_path=${MODEL_PATH}
+# 或者手动预加载 LD_PRELOAD=libiomp5.so,`libiomp5.so` 文件会位于编译后 `3rdparty/mkl/lib` 文件夹中
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+python examples/web_demo/ChatGLM.py \
+                      --dtype=bf16 \
+                      --token_path=${TOKEN_PATH} \
+                      --model_path=${MODEL_PATH}
 ```
 
 ## 服务
+
+### vLLM
+vllm-xft项目创建了vLLM的一个分支版本,该版本集成了xFasterTransformer后端以提高性能,同时保持了与官方vLLM大多数功能的兼容性。详细信息请参考[此链接](serving/vllm-xft.md)。
+
+#### Install
+```bash
+pip install vllm-xft
+```
+***注意:请不要在环境中同时安装 `vllm-xft` 和 `vllm`。虽然包名不同,但实际上它们会互相覆盖。***
+
+#### 兼容OpenAI-API的服务
+***注意:需要预加载 `libiomp5.so`!***
+```bash
+# 通过以下命令或手动设置 LD_PRELOAD=libiomp5.so 预加载 libiomp5.so
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+python -m vllm.entrypoints.openai.api_server \
+        --model ${XFT_MODEL} \
+        --tokenizer ${TOKENIZER_DIR} \
+        --dtype fp16 \
+        --kv-cache-dtype fp16 \
+        --served-model-name xft \
+        --port 8000 \
+        --trust-remote-code
+```
+对于分布式模式,请使用 `python -m vllm.entrypoints.slave` 作为从节点,并确保从节点的参数与主节点一致。
+```bash
+# 通过以下命令或手动设置 LD_PRELOAD=libiomp5.so 预加载 libiomp5.so
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+OMP_NUM_THREADS=48 mpirun \
+        -n 1 numactl --all -C 0-47 -m 0 \
+          python -m vllm.entrypoints.openai.api_server \
+            --model ${MODEL_PATH} \
+            --tokenizer ${TOKEN_PATH} \
+            --dtype bf16 \
+            --kv-cache-dtype fp16 \
+            --served-model-name xft \
+            --port 8000 \
+            --trust-remote-code \
+        : -n 1 numactl --all -C 48-95 -m 1 \
+          python -m vllm.entrypoints.slave \
+            --dtype bf16 \
+            --model ${MODEL_PATH} \
+            --kv-cache-dtype fp16
+```
+
+### FastChat
+xFasterTransformer 是 [FastChat](https://github.com/lm-sys/FastChat)的官方推理后端。详细信息请参考 [FastChat 中的 xFasterTransformer](https://github.com/lm-sys/FastChat/blob/main/docs/xFasterTransformer.md) 和 [FastChat 服务](https://github.com/lm-sys/FastChat/blob/main/docs/openai_api.md)。
+
+### MLServer
 [MLServer 服务示例](serving/mlserver/README.md) 支持 REST 和 gRPC 接口,并具有自适应批处理功能,可即时将推理请求分组。
 
 ## [性能测试](benchmark/README.md)
diff --git a/examples/cpp/README.md b/examples/cpp/README.md
index d1a40b44..ab8f9e4c 100644
--- a/examples/cpp/README.md
+++ b/examples/cpp/README.md
@@ -10,11 +10,14 @@ Please refer to [Prepare model](../README.md#prepare-model)
 ## Step 3: Run binary
 ```bash
 # Recommend preloading `libiomp5.so` to get a better performance.
-# `libiomp5.so` file will be in `3rdparty/mklml/lib` directory after build xFasterTransformer.
-LD_PRELOAD=libiomp5.so ./example -m ${MODEL_PATH} -t ${TOKEN_PATH}
+# or preload LD_PRELOAD=libiomp5.so manually; if built from source, `libiomp5.so` will be in the `3rdparty/mkl/lib` directory.
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+# run single instance like
+./example -m ${MODEL_PATH} -t ${TOKEN_PATH}
 
 # run multi-instance like
-OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
+OMP_NUM_THREADS=48 mpirun \
   -n 1 numactl -N 0 -m 0 ./example -m ${MODEL_PATH} -t ${TOKEN_PATH} : \
   -n 1 numactl -N 1 -m 1 ./example -m ${MODEL_PATH} -t ${TOKEN_PATH}
 ```
diff --git a/examples/pytorch/README.md b/examples/pytorch/README.md
index 0417da94..8482ff97 100644
--- a/examples/pytorch/README.md
+++ b/examples/pytorch/README.md
@@ -19,11 +19,14 @@ Please refer to [Prepare model](../README.md#prepare-model)
 ## Step 4: Run
 ```bash
 # Recommend preloading `libiomp5.so` to get a better performance.
 # `libiomp5.so` file will be in `3rdparty/mklml/lib` directory after build xFasterTransformer.
-LD_PRELOAD=libiomp5.so python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
+# or preload LD_PRELOAD=libiomp5.so manually; if built from source, `libiomp5.so` will be in the `3rdparty/mkl/lib` directory.
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+# run single instance like
+python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
 
 # run multi-rank like
-OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
+OMP_NUM_THREADS=48 mpirun \
   -n 1 numactl -N 0 -m 0 python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH} : \
   -n 1 numactl -N 1 -m 1 python demo.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}
 ```
diff --git a/examples/web_demo/README.md b/examples/web_demo/README.md
index c3f9bdb3..15216bad 100644
--- a/examples/web_demo/README.md
+++ b/examples/web_demo/README.md
@@ -28,14 +28,17 @@ Please refer to [Prepare model](../README.md#prepare-model)
 After the web server started, open the output URL in the browser to use the demo. Please specify the paths of model and tokenizer directory, and data type. `transformer`'s tokenizer is used to encode and decode text so `${TOKEN_PATH}` means the huggingface model directory.
 ```bash
 # Recommend preloading `libiomp5.so` to get a better performance.
-# `libiomp5.so` file will be in `3rdparty/mklml/lib` directory after build xFasterTransformer.
-LD_PRELOAD=libiomp5.so python examples/web_demo/ChatGLM.py \
-                      --dtype=bf16 \
-                      --token_path=${TOKEN_PATH} \
-                      --model_path=${MODEL_PATH}
+# or preload LD_PRELOAD=libiomp5.so manually; if built from source, `libiomp5.so` will be in the `3rdparty/mkl/lib` directory.
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+# run single instance like
+python examples/web_demo/ChatGLM.py \
+                      --dtype=bf16 \
+                      --token_path=${TOKEN_PATH} \
+                      --model_path=${MODEL_PATH}
 
 # run multi-rank like
-OMP_NUM_THREADS=48 LD_PRELOAD=libiomp5.so mpirun \
+OMP_NUM_THREADS=48 mpirun \
   -n 1 numactl -N 0 -m 0 python examples/web_demo/ChatGLM.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}: \
   -n 1 numactl -N 1 -m 1 python examples/web_demo/ChatGLM.py --dtype=bf16 --token_path=${TOKEN_PATH} --model_path=${MODEL_PATH}:
 ```
diff --git a/serving/vllm-xft.md b/serving/vllm-xft.md
new file mode 100644
index 00000000..db1e68ef
--- /dev/null
+++ b/serving/vllm-xft.md
@@ -0,0 +1,71 @@
+# vLLM-xft
+vLLM-xFT is a fork of vLLM that integrates the xfastertransformer backend while maintaining compatibility with most of the official vLLM features.
+
+## Install
+```bash
+pip install vllm-xft
+```
+
+## Usage
+***Notice: Preloading libiomp5.so is required!***
+
+### Serving (OpenAI Compatible Server)
+```shell
+# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+python -m vllm.entrypoints.openai.api_server \
+        --model ${XFT_MODEL} \
+        --tokenizer ${TOKENIZER_DIR} \
+        --dtype fp16 \
+        --kv-cache-dtype fp16 \
+        --served-model-name xft \
+        --port 8000 \
+        --trust-remote-code
+```
+- `--max-num-batched-tokens`: maximum number of batched tokens; defaults to max(MAX_SEQ_LEN_OF_MODEL, 2048).
+- `--max-num-seqs`: maximum number of sequences per batch; defaults to 256.
+
+For more arguments, please refer to the [vllm official docs](https://docs.vllm.ai/en/latest/models/engine_args.html); an example combining the flags above is shown below.
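+For instance, a launch that tightens both limits might look like this (the values here are illustrative placeholders, not tuned recommendations):
+```shell
+python -m vllm.entrypoints.openai.api_server \
+        --model ${XFT_MODEL} \
+        --tokenizer ${TOKENIZER_DIR} \
+        --dtype fp16 \
+        --kv-cache-dtype fp16 \
+        --served-model-name xft \
+        --max-num-batched-tokens 4096 \
+        --max-num-seqs 64 \
+        --port 8000 \
+        --trust-remote-code
+```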
+
+### Query example
+```shell
+curl http://localhost:8000/v1/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "xft",
+        "prompt": "San Francisco is a",
+        "max_tokens": 16,
+        "temperature": 0
+    }'
+```
+
+## Distributed (Multi-rank)
+Use oneCCL's `mpirun` to run the workload. The master (`rank 0`) is the same as the single-rank case above, and the slaves (`rank > 0`) should use the following command:
+```bash
+python -m vllm.entrypoints.slave --dtype bf16 --model ${MODEL_PATH} --kv-cache-dtype fp16
+```
+Please keep the slaves' parameters aligned with the master's.
+
+### Serving (OpenAI Compatible Server)
+Here is an example on a 2-socket platform with 48 cores per socket.
+```bash
+# Preload libiomp5.so with the following command, or set LD_PRELOAD=libiomp5.so manually
+export $(python -c 'import xfastertransformer as xft; print(xft.get_env())')
+
+OMP_NUM_THREADS=48 mpirun \
+        -n 1 numactl --all -C 0-47 -m 0 \
+          python -m vllm.entrypoints.openai.api_server \
+            --model ${MODEL_PATH} \
+            --tokenizer ${TOKEN_PATH} \
+            --dtype bf16 \
+            --kv-cache-dtype fp16 \
+            --served-model-name xft \
+            --port 8000 \
+            --trust-remote-code \
+        : -n 1 numactl --all -C 48-95 -m 1 \
+          python -m vllm.entrypoints.slave \
+            --dtype bf16 \
+            --model ${MODEL_PATH} \
+            --kv-cache-dtype fp16
+```
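+
+### Chat query example
+The fork aims to stay compatible with the official vLLM OpenAI-compatible server, so a chat-style request is expected to work as well. This is a minimal sketch assuming your vllm-xft version exposes the standard `/v1/chat/completions` route and the server above is listening on port 8000:
+```shell
+curl http://localhost:8000/v1/chat/completions \
+    -H "Content-Type: application/json" \
+    -d '{
+        "model": "xft",
+        "messages": [{"role": "user", "content": "Introduce San Francisco in one sentence."}],
+        "max_tokens": 32
+    }'
+```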