
Chatglm3 6b #705

Merged 3 commits on Aug 20, 2024
47 changes: 47 additions & 0 deletions base/toolkits/interconnect-P2P_intraserver/ascend/README.md
@@ -0,0 +1,47 @@
# Benchmarked AI Chip Information

* Vendor: Ascend
* Product name: Atlas800T A2
* Product model: Atlas800T A2
* TDP: 350W

# Server Configuration

* Number of servers: 1
* Chips used per server: 8
* Server model: Atlas 800T A2 training server
* OS version: Ubuntu 22.04 LTS
* OS kernel: 5.15.0-25-generic
* CPU: Kunpeng 920
* Docker version: this benchmark case does not require a Docker environment
* Memory: 1TiB
* Inter-server AI chip interconnect spec and bandwidth: this benchmark case does not require inter-server communication

# Benchmark Results

## Core Results

| Item | Measured intra-server P2P interconnect bandwidth | Rated intra-server P2P interconnect bandwidth | Measured/rated ratio |
| ---- | ----------- | -------- | ------ |
| Result | 54.68GB/s | / | / |

## Power Monitoring Results

| Item | Avg. system power | Max. system power | System power std. dev. | Server TDP | Avg. per-chip power (avg. of 2 chips) | Max. per-chip power (max of 2 chips) | Per-chip power std. dev. (max of 2 chips) | Per-chip TDP |
| ---- | ------- | ------- | ------- | ----- | ------- | ------ | ------- | ----- |
| Result | / | / | / | / | 96.162W | 102.9W | 4.693W | / |

## Other Monitoring Results

| Item | Avg. system CPU usage | Avg. system memory usage | Avg. per-chip temperature (avg. of 2 chips) | Avg. per-chip device memory usage (avg. of 2 chips) |
| ---- | --------- | -------- | ------- | -------- |
| Result | 6.7% | 0.1% | 34.62°C | 0.155% |

# Vendor Test Tool Description

Uses ascend-dmi to perform intra-server communication between AI chips and computes the intra-server P2P interconnect bandwidth. (The original text said cudaMemcpy, a CUDA API that does not apply to Ascend; the accompanying main.sh runs ascend-dmi.)
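The bandwidth figure is, in essence, bytes moved over elapsed time. A minimal sketch of that calculation, with illustrative names and numbers (not the actual ascend-dmi implementation):

```python
def p2p_bandwidth_gb_s(nbytes: float, seconds: float) -> float:
    """Bandwidth in GB/s (decimal gigabytes) for a timed transfer."""
    return nbytes / seconds / 1e9

# e.g. 512 MiB copied between two devices in 0.01 s
print(round(p2p_bandwidth_gb_s(512 * 1024**2, 0.01), 2))  # 53.69
```

A figure in this ballpark is consistent with the 54.68GB/s reported above.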

4 changes: 4 additions & 0 deletions base/toolkits/interconnect-P2P_intraserver/ascend/main.sh
@@ -0,0 +1,4 @@
#!/bin/bash
source /usr/local/Ascend/toolbox/set_env.sh
# Measure intra-server P2P interconnect bandwidth between AI chips
ascend-dmi --bw -t p2p

48 changes: 48 additions & 0 deletions base/toolkits/interconnect-h2d/ascend/README.md
@@ -0,0 +1,48 @@
# Benchmarked AI Chip Information

* Vendor: Ascend
* Product name: Atlas800T A2
* Product model: Atlas800T A2
* TDP: 350W

# Server Configuration

* Number of servers: 1
* Chips used per server: 8
* Server model: Atlas 800T A2 training server
* OS version: Ubuntu 22.04 LTS
* OS kernel: 5.15.0-25-generic
* CPU: Kunpeng 920
* Docker version: this benchmark case does not require a Docker environment
* Memory: 1TiB
* Inter-server AI chip interconnect spec and bandwidth: this benchmark case does not require inter-server communication

# Benchmark Results

## Core Results

| Item | Measured CPU-chip interconnect bandwidth | Rated CPU-chip interconnect bandwidth | Measured/rated ratio |
| ---- | ----------- | -------- | ------ |
| Result | 24.926GB/s | / | / |

Note: h2d/d2h bandwidth is affected by server modules other than the AI chip (CPU, PCIe, memory, etc.), so no rated value is given.

## Power Monitoring Results

| Item | Avg. system power | Max. system power | System power std. dev. | Server TDP | Avg. per-chip power | Max. per-chip power | Per-chip power std. dev. | Per-chip TDP |
| ---- | ------- | ------- | ------- | ----- | ------- | ------ | ------- | ----- |
| Result | / | / | / | / | 91.512W | 95.7W | 2.0W | / |

## Other Monitoring Results

| Item | Avg. system CPU usage | Avg. system memory usage | Avg. per-chip temperature | Avg. per-chip device memory usage |
| ---- | --------- | -------- | ------- | -------- |
| Result | 6.7% | 0.0% | 34.75°C | 0.5% |

# Vendor Test Tool Description

Uses ascend-dmi to perform host-to-device CPU-to-AI-chip transfers and computes the CPU-chip interconnect bandwidth.
5 changes: 5 additions & 0 deletions base/toolkits/interconnect-h2d/ascend/main.sh
@@ -0,0 +1,5 @@
#!/bin/bash
source /usr/local/Ascend/toolbox/set_env.sh
# Measure host<->device bandwidth on each of the 8 devices
# (512 MiB transfer size, 10 s per run)
for i in {0..7}
do
    ascend-dmi --bw -t d2h -d $i -s 536870912 --et 10
done
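The loop prints one bandwidth figure per device; a minimal post-processing sketch for the per-server average (the eight values here are hypothetical, chosen near the 24.926GB/s reported above):

```python
# Hypothetical per-device h2d bandwidths in GB/s, one per loop iteration
bw = [24.9, 25.1, 24.8, 25.0, 24.9, 25.0, 24.8, 24.9]
avg = sum(bw) / len(bw)
print(f"{avg:.3f} GB/s")  # prints 24.925 GB/s
```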
48 changes: 48 additions & 0 deletions training/ascend/chatglm3_6b-deepspeed/README.md
@@ -0,0 +1,48 @@
### Ascend NPU Configuration and Run Information
#### Environment
- ##### Hardware
  - Machine model: Atlas 800T A2
  - Accelerator model: Atlas 800T A2
  - CPU model: Kunpeng 920
  - Multi-node network type and bandwidth: this benchmark case does not require multi-node networking

- ##### Software
  - OS version: Ubuntu 22.04 LTS
  - OS kernel version: 5.15.0-25-generic
  - Accelerator driver version: 24.1.rc2.b020
  - Docker version: this benchmark case does not require a Docker environment
  - Training framework version: deepspeed 0.13.1

- ##### Parallelism strategy

  - Parallel technique: sharded data parallel
  - Implemented by: deepspeed ZeRO-DP
  - Details: ZeRO-DP stage 3, ZDP_SIZE=8

### Run Details

* Input batch sizes
  1. local_batchsize (micro_batchsize), abbreviated LBS: the actual tensor batch size fed to the model, set in config_A100x1x8.py; defaults to **1** in this case
  2. seqlength (max_position_embedding), abbreviated MPE: the actual sequence length fed to the model, set in config_A100x1x8.py; defaults to **8192** in this case
  3. gradient_accumulate_steps, abbreviated GAS: the number of gradient accumulation steps, set in ds_config.json; defaults to **1** in this case
  4. global_batchsize, abbreviated GBS, always equals local_batchsize\*gradient_accumulate_steps\*data_parallel_size. Only data parallelism is used in this case, so data_parallel_size=world_size.

* General metrics

| Metric | Value | Notes |
| ------------ | -------------------------- | ---------------------------------- |
| Task category | Natural language understanding | |
| Model | chatglm3_6b | |
| Dataset | openwebtext | Unless stated otherwise, trains on the first 100M tokens |
| Precision | amp | |
| Hyperparameter changes | fix_hp, see "Performance metrics" | Special hyperparameters required to run, e.g. reducing seqlength to avoid OOM |
| Hardware | Atlas 800T A2 | |
| Device memory usage | mem, see "Performance metrics" | Commonly called "VRAM", in GiB |
| Compute utilization | MFU, see "Performance metrics" | As defined in the PaLM paper |
| **Throughput** | **token/p/s, see "Performance metrics"** | Average tokens processed per chip per second |
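As a worked example of the GBS formula, using the gradient_accumulation_steps value of 64 set in ds_config.json and the single-node 8-chip setup of this case:

```python
lbs = 1   # local_batchsize (micro_batchsize)
gas = 64  # gradient_accumulation_steps, from ds_config.json
dp = 8    # data_parallel_size = world_size (1 node x 8 chips)
gbs = lbs * gas * dp
print(gbs)  # 512
```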

* Performance metrics

| Config | fix_hp | token/p/s | loss | mem | MFU |
| ------------------- | ---------------- | ------ | ------- | --------- | --------- |
| Atlas 800T A2, 1 node x 8 chips (1x8) | mpe=4096 | 3586.9 | 4.25 | 57/64 | 41.3% |
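The MFU column can be roughly reproduced from the throughput using the common 6·N flops-per-token approximation (forward plus backward, attention flops ignored). N ≈ 6.0e9 parameters is an assumed round figure for chatglm3_6b; theoryflops comes from the training config:

```python
tokens_per_chip_per_s = 3586.9  # token/p/s from the table above
n_params = 6.0e9                # assumed approximate parameter count
theoryflops = 312e12            # per-chip peak flops, from the config file
# MFU = achieved flops / peak flops, with ~6*N flops per token
mfu = tokens_per_chip_per_s * 6 * n_params / theoryflops
print(f"{mfu:.1%}")  # prints 41.4%, close to the 41.3% reported
```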
@@ -0,0 +1,6 @@
seqlength = 4096
batchsize = 1
datafilename = "openwebtext_chatglm3_100M.npy"
theoryflops = 312000000000000.0
epochs = 1
flashattn = True
3 changes: 3 additions & 0 deletions training/ascend/chatglm3_6b-deepspeed/config/ds_config.json
@@ -0,0 +1,3 @@
{
"gradient_accumulation_steps": 64
}
1 change: 1 addition & 0 deletions training/ascend/chatglm3_6b-deepspeed/config/net.sh
@@ -0,0 +1 @@
export CUDA_DEVICE_MAX_CONNECTIONS=1
export NCCL_SOCKET_IFNAME=enp
export NCCL_IB_DISABLE=1
2 changes: 2 additions & 0 deletions training/ascend/chatglm3_6b-deepspeed/config/requirements.txt
@@ -0,0 +1,2 @@
sentencepiece
transformers==4.30.2