-
Notifications
You must be signed in to change notification settings - Fork 65
Qwen‐72B on Ali EMR with eRDMA
EMR is recently being gradually launched on Alibaba Cloud. In the EMR launch media event, Intel demonstrated the live demo of the distributed Qwen-72B, achieving excellent on-site results. This BKM will use the same approach, EMR*4 + eRDMA network, to guide everyone in building a distributed inference demo of Qwen-72B using XFT (Intel/xFasterTransformer).
XFT(xFasterTransformer) is an exceptionally optimized solution for large language models (LLM) on the X86 platform, similar to FasterTransformer on the GPU platform. xFasterTransformer can operate in distributed mode across multiple sockets and nodes to support inference on larger models. Additionally, it provides both C++ and Python APIs, spanning from high-level to low-level interfaces, making it easy to adopt and integrate.
eRDMA (Elastic Remote Direct Memory Access) is Alibaba Cloud's self-developed elastic RDMA network in the cloud. The underlying link reuses the VPC network and employs a self-developed full-stack Congestion Control (CC) algorithm. While enjoying the high throughput and low latency characteristics of traditional RDMA networks, eRDMA can also support large-scale RDMA networking with sub-second latency. It is compatible with traditional HPC applications as well as traditional TCP/IP applications.
Based on eRDMA, you can deploy HPC application software in the cloud to obtain a cost-effective, more elastic high-performance application cluster. Alternatively, you can replace the VPC network with an eRDMA network to accelerate the performance of your other applications.
Purchase cloud instances of the following configuration types on Alibaba Cloud:
name | xftest01 |
---|---|
user | root |
passwd | xxxxxx |
OS | Alibaba Cloud Linux 3.2104 LTS 64位 |
CPU Model | INTEL(R) XEON(R) PLATINUM 8575C |
NUMA node0 CPU(s): | 0-95 |
NIC0 (with eRDMA Capability) | eth0 (192.168.0.1) |
disk | 200GB |
NFS-disk(mount in /mnt) | 1TB |
name | xftest01 |
---|---|
user | root |
passwd | xxxxxx |
OS | Alibaba Cloud Linux 3.2104 LTS 64位 |
CPU Model | INTEL(R) XEON(R) PLATINUM 8575C |
NUMA node0 CPU(s): | 0-95 |
NIC0 (with eRDMA Capability) | eth0 (192.168.0.2) |
disk | 200GB |
NFS-disk(mount in /mnt) | 1TB |
name | xftest01 |
---|---|
user | root |
passwd | xxxxxx |
OS | Alibaba Cloud Linux 3.2104 LTS 64位 |
CPU Model | INTEL(R) XEON(R) PLATINUM 8575C |
NUMA node0 CPU(s): | 0-95 |
NIC0 (with eRDMA Capability) | eth0 (192.168.0.3) |
disk | 200GB |
NFS-disk(mount in /mnt) | 1TB |
name | xftest01 |
---|---|
user | root |
passwd | xxxxxx |
OS | Alibaba Cloud Linux 3.2104 LTS 64位 |
CPU Model | INTEL(R) XEON(R) PLATINUM 8575C |
NUMA node0 CPU(s): | 0-95 |
NIC0 (with eRDMA Capability) | eth0 (192.168.0.4) |
disk | 200GB |
NFS-disk(mount in /mnt) | 1TB |
ID | Node name | IP | Ranks
0 | EMR instance 0 | 192.168.0.1 | 0
1 | EMR instance 1 | 192.168.0.2 | 1
2 | EMR instance 2 | 192.168.0.3 | 2
3 | EMR instance 3 | 192.168.0.4 | 3
ID | 0 1 2 3
0 | shm eRDMA eRDMA eRDMA
1 | eRDMA shm eRDMA eRDMA
2 | eRDMA eRDMA shm eRDMA
3 | eRDMA eRDMA eRDMA shm
Important Notes:
1. Ensure configuration deployment sets(结合部属集策略实现更低的eRDMA时延). Alibaba Cloud ECS provides deployment set strategies that control the physical distribution of ECS instances. Deployment sets support various strategies:
- High Availability Strategy: All ECS instances in the deployment set are strictly distributed across different physical servers within a specified region, ensuring high availability of business on ECS instances and the underlying physical server's disaster recovery capability.
- Low Latency Strategy: In this mode, all ECS instances in the deployment set are deployed as centrally as possible within the same network topology range in the available zone, reducing network communication latency.
We know that RDMA itself has the characteristics of low latency and high throughput. In practical use, it is also influenced by the actual physical network distance: the farther the distance, the greater the latency between nodes. In Alibaba Cloud, we can combine deployment set strategies to enable ECS to provide elastic RDMA acceleration, obtaining lower latency as much as possible.
2. Ensure optimal configuration for eRDMA:
- Disable Delay ACK: Currently, Alibaba support is needed for manual modification of Alibaba Cloud network parameters. Waiting for the new version update, eadm tool can be used for 'Delay ACK' configuration by users themselves.
-
Enable Congestion Control (CC) algorithm: running
$ eadm conf -d erdma_0 -t cc -v 1
3. Ensure the following VMs are distributed on different physical resources. Cloud instances do not have isolated memory bandwidth resources. If deployed on the same physical resource, there may be bandwidth contention issues.
4. Ensure consistent code environment; machines mount files to /mnt directory using NFS for file synchronization.
# Initially, we need to perform some basic environment configuration, such as:
# 1. Install system dependencies;
# 2. Install Python environment dependencies;
# 3. Configure passwordless login between machines;
# ...
$ yum install -y htop tmux git git-lfs python38 python38-pip numactl numactl-devel"
$ pip3.8 install --upgrade pip
$ cd /mnt/xFasterTransformer/
# install torch / transformers dependencies
$ pip3.8 install -r requirements.txt
# install distributed test dependencies.
# (optinal) install jpeg dependencies:
# $ yum -y install libjpeg-turbo-devel
$ cd distributed/
$ pip3.8 install -r requirements.txt
# modify host configuration
# Default user is using root. you can modify from ansible.cfg.
$ vim hosts
[all_hosts:vars]
ansible_ssh_pass=<server password>
# modify host IP address
[all_hosts]
192.168.0.1
192.168.0.2
192.168.0.3
192.168.0.4
# replace the default python + pip version
$ ansible all_hosts -m shell -a "rm -rf /usr/bin/pip /usr/bin/python && ln -s /usr/bin/pip3.8 /usr/bin/pip && ln -s /usr/bin/python3.8 /usr/bin/python"
# using the ansible ping plugin to check the network.
$ ansible all_hosts -m ping
...
192.168.0.1 | SUCCESS => {
"ansible_facts": {
"discovered_interpreter_python": "/usr/bin/python3.6"
},
"changed": false,
"ping": "pong"
}
...
# Configure passwordless login between machines
$ ansible-playbook 000-ssh-wo-passwd-update-hosts.xml
# Download the deployment environment on NFS and
# synchronize code files across multiple machines.
$ cd /mnt
$ git clone -b test/distributed https://github.com/intel/xFasterTransformer.git
$ cd /mnt/xFasterTransformer/
# Compile and install the oneCCL environment.
$ cd 3rdparty/
$ sh prepare_oneccl.sh
# Synchronize the multi-node environment configuration.
$ cd /mnt/xFasterTransformer/distributed/
$ ansible all_hosts -m shell -a "echo 'source /mnt/xFasterTransformer/3rdparty/oneccl/build/_install/env/setvars.sh' >> ~/.bashrc"
# Update environment variables, run benchmarks,
# and if the output is similar to the reference
# content below, then the oneCCL configuration is successful.
$ bash
$ cd /mnt/xFasterTransformer/3rdparty/oneccl/build/
$ mpirun -print-rank-map -prot -n 4 -ppn 1 -hosts 192.168.0.1,192.168.0.2 ./_install/examples/benchmark/benchmark
(192.168.0.1:0)
(192.168.0.2:1)
ID | Node name | IP | Ranks
0 | iZ2ze2krbyeyuro5gcr1uiZ | 192.168.0.1 | 0
1 | iZ2zeb3qtvacqss8re3pcrZ | 192.168.0.2 | 1
ID | 0 1
0 | verbs;ofi_rxm verbs;ofi_rxm
1 | verbs;ofi_rxm verbs;ofi_rxm
options:
processes: 2
backend: host
iters: 16
warmup_iters: 16
iter_policy: auto
buf_count: 1
min_elem_count: 1
max_elem_count: 128
elem_counts: [1 2 4 8 16 32 64 128]
check: last
cache: 1
inplace: 0
collectives: allreduce
datatypes: float32
reductions: sum
extended info: auto
csv_filepath:
datatype: float32
reduction: sum
#------------------------------------------------------------
# Benchmarking: allreduce
# #processes: 2
#------------------------------------------------------------
#bytes #repetitions t_min[usec] t_max[usec] t_avg[usec] stddev[%]
4 16 30.50 30.86 30.68 0.59
8 16 29.70 29.77 29.73 0.11
16 16 29.81 30.08 29.95 0.44
32 16 30.25 30.98 30.62 1.20
64 16 30.41 31.09 30.75 1.12
128 16 30.11 31.14 30.62 1.68
256 16 30.47 30.52 30.49 0.08
512 16 30.28 30.41 30.34 0.21
# All done
$ cd /mnt/xFasterTransformer/
$ mkdir build && cd build
# Compile and install; if there are any issues, feel free to submit an issue.
# https://github.com/intel/xFasterTransformer/issues/new
# cmake -DPython_EXECUTABLE=/usr/bin/python3.8 ..
$ cmake .. && make -j
-
modelscope: https://www.modelscope.cn/models/qwen/Qwen-72B-Chat/summary
-
huggingface: https://huggingface.co/Qwen/Qwen-72B
$ cd /mnt/xFasterTransformer/distributed/
# 修改分布式脚本:
# 1. ############# HW configuration #############
# a. 修改IFACE为指定网卡;
# b. 修改IP_A/IP_B/IP_C/...;
# c. 如果在云上测试则: export is_ali_cloud=1;
# d. TCP or eRDMA 二选一: 打开# enable TCP/eRDMA 注释;
# 2. ############# XFT configuration #############
# a. XFT_COMM_TIME=1: 打印单次allreduce时间日志;
# a. XFT_FAKE_MODEL=1: 模型文件未就位可以使用fake model,
# 此时模型输出没有意义, 仅仅性能测试使用;
$ vim run_benchmark.sh
$ bash run_benchmark.sh