
Chatglm3 6b #705

Merged 3 commits on Aug 20, 2024
47 changes: 47 additions & 0 deletions base/toolkits/interconnect-P2P_intraserver/ascend/README.md
@@ -0,0 +1,47 @@
# Benchmarked AI Chip Information

* Vendor: Ascend
* Product name: Atlas800T A2
* Product model: Atlas800T A2
* TDP: 350W

# Server Configuration

* Number of servers: 1
* Chips used per server: 8
* Server model: Atlas 800T A2 training server
* OS version: Ubuntu 22.04 LTS
* OS kernel: 5.15.0-25-generic
* CPU: Kunpeng 920
* Docker version: this benchmark case does not require a Docker environment
* Memory: 1TiB
* Inter-server AI chip interconnect spec and bandwidth: this benchmark case does not require inter-server communication

# Benchmark Results

## Core Results

| Item | Measured intra-server P2P interconnect bandwidth | Rated intra-server P2P interconnect bandwidth | Measured/rated ratio |
| ---- | ----------- | -------- | ------ |
| Result | 54.68GB/s | / | / |

## Power Monitoring Results

| Item | Avg. system power | Max. system power | System power std. dev. | Server TDP | Avg. per-chip power (avg. of 2 chips) | Max. per-chip power (max of 2 chips) | Per-chip power std. dev. (max of 2 chips) | Per-chip TDP |
| ---- | ------- | ------- | ------- | ----- | ------- | ------ | ------- | ----- |
| Result | / | / | / | / | 96.162W | 102.9W | 4.693W | / |

## Other Monitoring Results

| Item | Avg. system CPU usage | Avg. system memory usage | Avg. per-chip temperature (avg. of 2 chips) | Avg. per-chip device memory usage (avg. of 2 chips) |
| ---- | --------- | -------- | ------- | -------- |
| Result | 6.7% | 0.1% | 34.62°C | 0.155% |

# Vendor Test Tool Description

Uses ascend-dmi to perform intra-server communication between AI chips and computes the intra-server P2P interconnect bandwidth. (The original text said cudaMemcpy, a CUDA API that does not apply to Ascend; the accompanying main.sh runs ascend-dmi.)
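The bandwidth figure is, in essence, bytes moved over elapsed time. A minimal sketch of that calculation, with illustrative names and numbers (not the actual ascend-dmi implementation):

```python
def p2p_bandwidth_gb_s(nbytes: float, seconds: float) -> float:
    """Bandwidth in GB/s (decimal gigabytes) for a timed transfer."""
    return nbytes / seconds / 1e9

# e.g. 512 MiB copied between two devices in 0.01 s
print(round(p2p_bandwidth_gb_s(512 * 1024**2, 0.01), 2))  # 53.69
```

A figure in this ballpark is consistent with the 54.68GB/s reported above.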

4 changes: 4 additions & 0 deletions base/toolkits/interconnect-P2P_intraserver/ascend/main.sh
@@ -0,0 +1,4 @@
#!/bin/bash
source /usr/local/Ascend/toolbox/set_env.sh
# Measure intra-server P2P interconnect bandwidth between AI chips
ascend-dmi --bw -t p2p

48 changes: 48 additions & 0 deletions base/toolkits/interconnect-h2d/ascend/README.md
@@ -0,0 +1,48 @@
# Benchmarked AI Chip Information

* Vendor: Ascend
* Product name: Atlas800T A2
* Product model: Atlas800T A2
* TDP: 350W

# Server Configuration

* Number of servers: 1
* Chips used per server: 8
* Server model: Atlas 800T A2 training server
* OS version: Ubuntu 22.04 LTS
* OS kernel: 5.15.0-25-generic
* CPU: Kunpeng 920
* Docker version: this benchmark case does not require a Docker environment
* Memory: 1TiB
* Inter-server AI chip interconnect spec and bandwidth: this benchmark case does not require inter-server communication

# Benchmark Results

## Core Results

| Item | Measured CPU-chip interconnect bandwidth | Rated CPU-chip interconnect bandwidth | Measured/rated ratio |
| ---- | ----------- | -------- | ------ |
| Result | 24.926GB/s | / | / |

Note: h2d/d2h bandwidth is affected by server modules other than the AI chip (CPU, PCIe, memory, etc.), so no rated value is given.

## Power Monitoring Results

| Item | Avg. system power | Max. system power | System power std. dev. | Server TDP | Avg. per-chip power | Max. per-chip power | Per-chip power std. dev. | Per-chip TDP |
| ---- | ------- | ------- | ------- | ----- | ------- | ------ | ------- | ----- |
| Result | / | / | / | / | 91.512W | 95.7W | 2.0W | / |

## Other Monitoring Results

| Item | Avg. system CPU usage | Avg. system memory usage | Avg. per-chip temperature | Avg. per-chip device memory usage |
| ---- | --------- | -------- | ------- | -------- |
| Result | 6.7% | 0.0% | 34.75°C | 0.5% |

# Vendor Test Tool Description

Uses ascend-dmi to perform host-to-device CPU-to-AI-chip transfers and computes the CPU-chip interconnect bandwidth.
5 changes: 5 additions & 0 deletions base/toolkits/interconnect-h2d/ascend/main.sh
@@ -0,0 +1,5 @@
#!/bin/bash
source /usr/local/Ascend/toolbox/set_env.sh
# Measure host<->device bandwidth on each of the 8 devices
# (512 MiB transfer size, 10 s per run)
for i in {0..7}
do
    ascend-dmi --bw -t d2h -d $i -s 536870912 --et 10
done
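The loop prints one bandwidth figure per device; a minimal post-processing sketch for the per-server average (the eight values here are hypothetical, chosen near the 24.926GB/s reported above):

```python
# Hypothetical per-device h2d bandwidths in GB/s, one per loop iteration
bw = [24.9, 25.1, 24.8, 25.0, 24.9, 25.0, 24.8, 24.9]
avg = sum(bw) / len(bw)
print(f"{avg:.3f} GB/s")  # prints 24.925 GB/s
```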
48 changes: 48 additions & 0 deletions training/ascend/chatglm3_6b-deepspeed/README.md
@@ -0,0 +1,48 @@
### Ascend NPU Configuration and Run Information
#### Environment
- ##### Hardware
  - Machine model: Atlas 800T A2
  - Accelerator model: Atlas 800T A2
  - CPU model: Kunpeng 920
  - Multi-node network type and bandwidth: this benchmark case does not require multi-node networking

- ##### Software
  - OS version: Ubuntu 22.04 LTS
  - OS kernel version: 5.15.0-25-generic
  - Accelerator driver version: 24.1.rc2.b020
  - Docker version: this benchmark case does not require a Docker environment
  - Training framework version: deepspeed 0.13.1

- ##### Parallelism strategy

  - Parallel technique: sharded data parallel
  - Implemented by: deepspeed ZeRO-DP
  - Details: ZeRO-DP stage 3, ZDP_SIZE=8

### Run Details

* Input batch sizes
  1. local_batchsize (micro_batchsize), abbreviated LBS: the actual tensor batch size fed to the model, set in config_A100x1x8.py; defaults to **1** in this case
  2. seqlength (max_position_embedding), abbreviated MPE: the actual sequence length fed to the model, set in config_A100x1x8.py; defaults to **8192** in this case
  3. gradient_accumulate_steps, abbreviated GAS: the number of gradient accumulation steps, set in ds_config.json; defaults to **1** in this case
  4. global_batchsize, abbreviated GBS, always equals local_batchsize\*gradient_accumulate_steps\*data_parallel_size. Only data parallelism is used in this case, so data_parallel_size=world_size.

* General metrics

| Metric | Value | Notes |
| ------------ | -------------------------- | ---------------------------------- |
| Task category | Natural language understanding | |
| Model | chatglm3_6b | |
| Dataset | openwebtext | Unless stated otherwise, trains on the first 100M tokens |
| Precision | amp | |
| Hyperparameter changes | fix_hp, see "Performance metrics" | Special hyperparameters required to run, e.g. reducing seqlength to avoid OOM |
| Hardware | Atlas 800T A2 | |
| Device memory usage | mem, see "Performance metrics" | Commonly called "VRAM", in GiB |
| Compute utilization | MFU, see "Performance metrics" | As defined in the PaLM paper |
| **Throughput** | **token/p/s, see "Performance metrics"** | Average tokens processed per chip per second |
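As a worked example of the GBS formula, using the gradient_accumulation_steps value of 64 set in ds_config.json and the single-node 8-chip setup of this case:

```python
lbs = 1   # local_batchsize (micro_batchsize)
gas = 64  # gradient_accumulation_steps, from ds_config.json
dp = 8    # data_parallel_size = world_size (1 node x 8 chips)
gbs = lbs * gas * dp
print(gbs)  # 512
```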

* Performance metrics

| Config | fix_hp | token/p/s | loss | mem | MFU |
| ------------------- | ---------------- | ------ | ------- | --------- | --------- |
| Atlas 800T A2, 1 node x 8 chips (1x8) | mpe=4096 | 3586.9 | 4.25 | 57/64 | 41.3% |
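The MFU column can be roughly reproduced from the throughput using the common 6·N flops-per-token approximation (forward plus backward, attention flops ignored). N ≈ 6.0e9 parameters is an assumed round figure for chatglm3_6b; theoryflops comes from the training config:

```python
tokens_per_chip_per_s = 3586.9  # token/p/s from the table above
n_params = 6.0e9                # assumed approximate parameter count
theoryflops = 312e12            # per-chip peak flops, from the config file
# MFU = achieved flops / peak flops, with ~6*N flops per token
mfu = tokens_per_chip_per_s * 6 * n_params / theoryflops
print(f"{mfu:.1%}")  # prints 41.4%, close to the 41.3% reported
```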
@@ -0,0 +1,6 @@
seqlength = 4096
batchsize = 1
datafilename = "openwebtext_chatglm3_100M.npy"
theoryflops = 312000000000000.0
epochs = 1
flashattn = True
3 changes: 3 additions & 0 deletions training/ascend/chatglm3_6b-deepspeed/config/ds_config.json
@@ -0,0 +1,3 @@
{
"gradient_accumulation_steps": 64
}
1 change: 1 addition & 0 deletions training/ascend/chatglm3_6b-deepspeed/config/net.sh
@@ -0,0 +1 @@
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_SOCKET_IFNAME=enp
export NCCL_IB_DISABLE=1
2 changes: 2 additions & 0 deletions training/ascend/chatglm3_6b-deepspeed/config/requirements.txt
@@ -0,0 +1,2 @@
sentencepiece
transformers==4.30.2