The Python code in the `<path_to_dashinfer>/examples/python/0_basic` directory provides examples of how to call the DashInfer Python interface.
Functions include:
- Model Download: By default, models are downloaded from the ModelScope community, but you can also refer to comments in the code to download models from HuggingFace.
- Model Conversion: Convert models from the HuggingFace format to the DashInfer format.
- Model Inference: Use DashInfer for multi-batch parallel inference, and obtain inference results in a streaming manner.
Install the DashInfer Python package before running the example:
# install from pip
pip install dashinfer
# install from local package
pip install dashinfer-<dashinfer-version>-xxx.whl
Run the Python example under `<path_to_dashinfer>/examples/python/0_basic`:
python basic_example_qwen_v10.py
The converted DashInfer-format models are stored in `~/dashinfer_models/` by default. You can specify a different target path by modifying the `model_path` field of the model configuration file.
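For example, this corresponds to the `model_path` entry in the model configuration JSON (the value shown is the default; adjust it as needed):
"model_path": "~/dashinfer_models/",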
DashInfer supports single-NUMA and multi-NUMA inference.
The examples use single-NUMA inference by default. On CPUs with a single NUMA node, no special configuration is required.
On CPUs with multiple NUMA nodes, if only one NUMA node is needed for inference, `numactl` is required for core binding.
This approach requires the `--cap-add SYS_NICE --cap-add SYS_PTRACE --ipc=host` arguments when creating containers with `docker run`.
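For example, a container can be created and the example bound to NUMA node 0 roughly as follows (a sketch; `<dashinfer_docker_image>` is a placeholder, and the binding assumes node 0 is the node you want to use):
# create the container with the capabilities required for core binding
docker run -it --cap-add SYS_NICE --cap-add SYS_PTRACE --ipc=host <dashinfer_docker_image> /bin/bash
# inside the container: bind CPU cores and memory to NUMA node 0, then run the example
numactl --cpunodebind=0 --membind=0 python basic_example_qwen_v10.py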
To use single-NUMA inference, set the `device_ids` field in the model configuration file to the NUMA node number you want to use, and set `multinode_mode` to true.
"device_ids": [
0
],
"multinode_mode": true,
For multi-NUMA inference, use `mpirun` + `numactl` to bind CPU cores for optimal performance.
This approach also requires the `--cap-add SYS_NICE --cap-add SYS_PTRACE --ipc=host` arguments when creating containers with `docker run`.
To use multi-NUMA inference, set the `device_ids` field in the model configuration file to the NUMA node numbers you want to use, and set `multinode_mode` to true.
"device_ids": [
0,
1
],
"multinode_mode": true,
Replacing the model with another one of the same structure requires the following changes in `basic_example_xxx.py`:
- config_file
config_file = "model_config/config_qwen_v10_1_8b.json"
- HuggingFace (or ModelScope) model information
# download model from huggingface
original_model = {
"source": "huggingface",
"model_id": "Qwen/Qwen-1_8B-Chat",
"revision": "",
"model_path": ""
}
# download model from modelscope
original_model = {
"source": "modelscope",
"model_id": "qwen/Qwen-1_8B-Chat",
"revision": "v1.0.0",
"model_path": ""
}
- format of prompt
start_text = "<|im_start|>"
end_text = "<|im_end|>"
system_msg = {"role": "system", "content": "You are a helpful assistant."}
user_msg = {"role": "user", "content": ""}
assistant_msg = {"role": "assistant", "content": ""}
prompt_template = Template(
    "{{start_text}}" + "{{system_role}}\n" + "{{system_content}}" + "{{end_text}}\n" +
    "{{start_text}}" + "{{user_role}}\n" + "{{user_content}}" + "{{end_text}}\n" +
    "{{start_text}}" + "{{assistant_role}}\n\n")
Adding a new model requires the following changes:
- Add the C++ source code and header files of the new model to the `<path_to_dashinfer>/csrc/core/model` directory.
- Write the corresponding model adapter code in the `<path_to_dashinfer>/python/dashinfer/allspark/model` directory.
- Add the new model type to the model_map in the `<path_to_dashinfer>/python/dashinfer/allspark/engine.py` file.
- Import the new model adapter code in the `<path_to_dashinfer>/python/dashinfer/allspark/model/__init__.py` file and add the new model type.
The Python code in the `<path_to_dashinfer>/examples/python/1_performance` directory provides examples of inference performance testing using random numbers.
The difference from the basic example is that the performance test example uses random inputs to test the context performance and generation performance under various batch_size, input_len, and output_len configurations.
batch_size_list = [1, 2, 4, 8]
output_len_list = [128]
input_len_list = [128, 1200]
During these performance evaluations, the `early_stopping` parameter is set to false, meaning that generation will not stop even if an EOS token is produced.
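In the model configuration file, this corresponds to the following fragment of generation_config (see the configuration reference later in this document):
"generation_config": {
    "early_stopping": false
}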
Enter the directory `<path_to_dashinfer>/examples/python/1_performance`, and execute the following command to run the example:
python performance_test_qwen_v15.py
python performance_test_qwen_v15.py --device_ids 0 1 # test multi-NUMA performance
On CPUs with multiple NUMA nodes, please refer to the [Single/Multi-NUMA Inference](examples_python.md#L33) section for best performance.
The code in the `<path_to_dashinfer>/examples/python/2_evaluation` directory is from QwenLM/Qwen. The original code uses transformers for inference; in this repository, the accuracy testing code replaces the inference engine with DashInfer.
For accuracy evaluation, please refer to EVALUATION.md.
The Gradio demo in the `<path_to_dashinfer>/examples/python/3_gradio` directory demonstrates how to deploy a chat service using DashInfer as the backend inference engine.
Run `basic_example_qwen_v10.py` first to get the converted model.
Some network configuration is required to access the created Gradio demo from an external network; this step can be skipped for local deployments.
There are two ways to configure it:
- Using the official Gradio proxy
  - No root permissions required.
  - Anyone who gets the link can access the demo.
  - The free link comes with a 72-hour time limit.
- Using a custom Nginx proxy
  - The link has no time restrictions.
  - Installing Nginx requires root permissions.
Set `share=True` in the `launch()` method in gradio_demo_qwen.py, like this:
demo.queue(api_open=False).launch(height=800, share=True, server_name="127.0.0.1", server_port=7860)
An error like the one below may be reported; just follow the instructions in it.
Don't forget to add executable permission: `chmod +x frpc_linux_amd64_v0.2`
Could not create share link. Missing file: /miniconda/envs/py38/lib/python3.8/site-packages/gradio/frpc_linux_amd64_v0.2.
Please check your internet connection. This can happen if your antivirus software blocks the download of this file. You can install manually by following these steps:
1. Download this file: https://cdn-media.huggingface.co/frpc-gradio-0.2/frpc_linux_amd64
2. Rename the downloaded file to: frpc_linux_amd64_v0.2
3. Move the file to this location: /miniconda/envs/py38/lib/python3.8/site-packages/gradio
The following message will appear at the end of a successful run:
Running on local URL: http://127.0.0.1:7860
Running on public URL: https://xxx.gradio.live
This share link expires in 72 hours. For free permanent hosting and GPU upgrades, run `gradio deploy` from Terminal to deploy to Spaces (https://huggingface.co/spaces)
Then you can access your demo via the public URL.
Check if nginx is already installed:
nginx -v
If nginx is not installed, execute the following command to install it:
Ubuntu:
apt-get install -y nginx
CentOS:
yum install -y nginx
Ubuntu users, please refer to `nginx_ubuntu.conf` under the `<path_to_dashinfer>/examples/nginx_config` directory and modify `/etc/nginx/sites-available/default`.
CentOS users, please refer to `nginx_centos.conf` under the `<path_to_dashinfer>/examples/nginx_config` directory and modify `/etc/nginx/nginx.conf`.
location /gradio-demo/ {
proxy_pass http://127.0.0.1:7860/;
proxy_buffering off;
proxy_redirect off;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection "upgrade";
proxy_set_header Host $host;
}
- The location should be consistent with the `root_path` parameter in the gradio demo's `launch()` method. In the sample code, `root_path="/gradio-demo/"`.
- The port in `proxy_pass` should be consistent with gradio's server port; the default port is 7860.
- If you are deploying multiple gradio demos on the same server, you can configure multiple locations to map to different ports.
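For example, a second demo listening on a different port could be exposed with an additional location block like this (the location path and port are illustrative):
location /gradio-demo-2/ {
    proxy_pass http://127.0.0.1:7861/;
    proxy_buffering off;
    proxy_redirect off;
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
    proxy_set_header Host $host;
}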
Check the configuration file for syntax errors: `nginx -t`
Start the nginx service: `nginx`
Restart the nginx service: `nginx -s reload`
Set `root_path="/gradio-demo/"` in the `launch()` method in gradio_demo_qwen.py, like this:
demo.queue(api_open=False).launch(root_path="/gradio-demo/",
height=800,
share=False,
server_name="127.0.0.1",
server_port=7860)
After successful deployment, you can interact with the application by visiting the URL http://server_ip_address/gradio-demo/.
- For servers within the same local area network (LAN), access can be achieved using the LAN IP address.
- For servers not on the same LAN, access requires the use of the server's public IP address.
Run the Python example under `<path_to_dashinfer>/examples/python/3_gradio`:
python gradio_demo_qwen.py
Once the following output appears in the terminal, you can access the deployed Gradio application through a web browser:
Running on local URL: http://127.0.0.1:7860
To create a public link, set `share=True` in `launch()`.
FastChat is an open-source platform designed for training, serving, and evaluating large language model chatbots. It facilitates integrating an inference engine backend into the platform in a worker-based manner, providing services compatible with the OpenAI API.
In the examples/python/4_fastchat/dashinfer_worker.py file, we supply sample code demonstrating how to implement a worker using FastChat and DashInfer. Users can readily substitute the default `fastchat.serve.model_worker` in the FastChat service component with `dashinfer_worker`, thereby achieving a solution that is not only compatible with the OpenAI API but also optimizes CPU usage for efficient inference.
pip install "fschat[model_worker]"
python -m fastchat.serve.controller
python -m fastchat.serve.openai_api_server --host localhost --port 8000
python dashinfer_worker.py --model-path qwen/Qwen-7B-Chat ../model_config/config_qwen_v10_7b.json
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen-7B-Chat",
"messages": [{"role": "user", "content": "Hello! What is your name?"}]
}'
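The same OpenAI-compatible endpoint can also be called from Python (a minimal sketch using the official openai client; the model name and port follow the commands above, and the api_key value is just a placeholder):
from openai import OpenAI

# point the client at the local FastChat OpenAI-compatible server
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="Qwen-7B-Chat",
    messages=[{"role": "user", "content": "Hello! What is your name?"}],
)
print(response.choices[0].message.content)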
Furthermore, we provide a convenient Docker image, enabling rapid deployment of an HTTP service that integrates dashinfer_worker and is compatible with the OpenAI API. Execute the following command, making sure to replace the bracketed paths with actual paths:
docker run -d \
--network host \
-v <host_path_to_your_model>:<container_path_to_your_model> \
-v <host_path_to_dashinfer_json_config_file>:<container_path_to_dashinfer_json_config_file> \
dashinfer/fschat_ubuntu_x86:v1.2.1 \
<container_path_to_your_model> \
<container_path_to_dashinfer_json_config_file>
- `<host_path_to_your_model>`: The path on the host where ModelScope/HuggingFace models reside.
- `<container_path_to_your_model>`: The path within the container for mounting ModelScope/HuggingFace models.
- `<host_path_to_dashinfer_json_config_file>`: The location of the DashInfer JSON configuration file on the host.
- `<container_path_to_dashinfer_json_config_file>`: The destination path in the container for the DashInfer JSON configuration file.
- The `-m` flag denotes the ModelScope/HuggingFace path within the container, which is determined by the host-to-container path binding specified in `-v`. If this refers to a standard ModelScope/HuggingFace path (e.g., `qwen/Qwen-7B-Chat`), there's no need to bind the model path from the host; the container will automatically download the model for you.
Below is an example of launching a Qwen-7B-Chat model service, with the default host set to localhost and the port to 8000.
docker run -d \
--network host \
-v ~/.cache/modelscope/hub/qwen/Qwen-7B-Chat:/workspace/qwen/Qwen-7B-Chat \
-v examples/python/model_config/config_qwen_v10_7b.json:/workspace/config_qwen_v10_7b.json \
dashinfer/fschat_ubuntu_x86:v1.2.1 \
/workspace/qwen/Qwen-7B-Chat \
/workspace/config_qwen_v10_7b.json
The `<path_to_dashinfer>/examples/python/model_config` directory provides several configuration examples.
Here is an explanation of the parameters within the config:
- `model_name`: Custom name for the DashInfer model.
- `model_type`: The type of the DashInfer model. Options include: LLaMA_v2, ChatGLM_v2, ChatGLM_v3, Qwen_v10, Qwen_v15, Qwen_v20.
- `model_path`: The export path for the DashInfer model.
- `data_type`: The data type of the output. Options include: float32.
- `device_type`: The inference hardware. Options include: CPU.
- `device_ids`: The NUMA node(s) used for inference. NUMA information of your CPU can be viewed with the Linux command `lscpu`.
- `multinode_mode`: Whether or not the engine is running on a multi-NUMA CPU. Options include: true, false.
- `convert_config`: Parameters related to model conversion.
  - `do_dynamic_quantize_convert`: Whether to quantize the weights. Options include: true, false. Currently, only ARM CPUs support quantization.
- `engine_config`: Inference engine parameters.
  - `engine_max_length`: The maximum inference length, <= 11000.
  - `engine_max_batch`: The maximum batch size.
  - `do_profiling`: Whether to profile the inference process. Options include: true, false. To conduct profiling, `do_profiling` should be set to true and the environment variable `AS_PROFILE=ON` configured.
  - `num_threads`: The number of threads. For optimal performance, this should match the number of physical cores in a single NUMA node. If set to 0, EngineHelper will automatically parse `lscpu` output and set the value; if > 0, the set value is used.
  - `matmul_precision`: The computational precision for matrix multiplication. Options include: high, medium. When set to high, fp32 is used for matrix multiplication; when set to medium, bf16 is used.
- `generation_config`: Generation parameters.
  - `temperature`: The temperature for randomness.
  - `early_stopping`: Whether to stop generating after producing a stop word from `stop_words_ids`. Options include: true, false.
  - `top_k`: The top-k parameter for sampling. When top_k = 0, the entire vocabulary is ranked.
  - `top_p`: The top-p parameter for sampling, 0 <= top_p <= 1.0. top_p = 0 means top-p sampling is not used.
  - `repetition_penalty`: The parameter for repetition penalty. 1.0 means no penalty.
  - `presence_penalty`: The parameter for presence penalty. 0.0 means no penalty.
  - `min_length`: The minimum length of input + output. Default is 0, which disables this filter.
  - `max_length`: The maximum length of input + output.
  - `no_repeat_ngram_size`: Controls the generation of repeated n-grams. Default is 0. If set to an int > 0, all n-grams of that size can only occur once.
  - `eos_token_id`: The token id corresponding to EOS, dependent on the model.
  - `seed`: The seed for randomness.
  - `stop_words_ids`: A list of token ids for stop words.
- `quantization_config`: Quantization parameters, required when `do_dynamic_quantize_convert` is set to true.
  - `activation_type`: The data type of the input matrix in matrix multiplication. Options include: bfloat16.
  - `weight_type`: The data type of the weights in matrix multiplication. Options include: uint8.
  - `SubChannel`: Whether to perform sub-channel quantization of the weights. Options include: true, false.
  - `GroupSize`: The granularity of sub-channel quantization. Options include: 64, 128, 256, 512.
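Putting these parameters together, a model configuration file looks roughly like the following (an illustrative sketch rather than a copy of a file in model_config; all values, including the token ids, are examples only):
{
    "model_name": "Qwen-1_8B-Chat",
    "model_type": "Qwen_v10",
    "model_path": "~/dashinfer_models/",
    "data_type": "float32",
    "device_type": "CPU",
    "device_ids": [0],
    "multinode_mode": false,
    "convert_config": {
        "do_dynamic_quantize_convert": false
    },
    "engine_config": {
        "engine_max_length": 2048,
        "engine_max_batch": 8,
        "do_profiling": false,
        "num_threads": 0,
        "matmul_precision": "medium"
    },
    "generation_config": {
        "temperature": 1.0,
        "early_stopping": true,
        "top_k": 1024,
        "top_p": 0.8,
        "repetition_penalty": 1.1,
        "presence_penalty": 0.0,
        "min_length": 0,
        "max_length": 2048,
        "no_repeat_ngram_size": 0,
        "eos_token_id": 151643,
        "seed": 1234,
        "stop_words_ids": [[151643], [151644], [151645]]
    }
}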