See https://github.com/microsoft/AzureML-BERT/blob/master/docs/dataprep.md for how to prepare the dataset. Please store the wikipedia.segmented.nltk.txt file under the bert_data/ directory.
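For example (assuming the file has already been generated by the steps above and these commands are run from the repository root):

```
mkdir -p bert_data
mv /path/to/wikipedia.segmented.nltk.txt bert_data/
```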
Things to do when you want to start Chimera again after N months:

- Single node (2 GPUs): run the two commands in scripts/manual.sh separately.
- Multi-node (2 GPUs × 2): run the two commands in scripts/manual4.sh separately.
First, start the SSH server on the 10022 node, which is used for code synchronization and for launching the Ray nodes (the Docker cache there has issues).
To profile, set profile to True in the file that launches Ray.
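A minimal sketch of the SSH/Ray startup described above, assuming a standard sshd and the Ray CLI; the port, head-node address, and paths are placeholders to adapt to your setup:

```
# Start sshd inside the container so ssh/rsync-based code sync works
# (10022 is the port used in these notes; requires root and host keys).
/usr/sbin/sshd -p 10022

# Start the Ray head on one node, then join the other node(s) as workers.
ray start --head --port=6379
ray start --address=<head_node_ip>:6379   # run on each worker node
```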
- Single node: both need the micro-batch size changed.

```
# launch the computation
python scripts/prof_steps_ray.py
# plot the call_pipeline timeline
sh scripts/plot_cuda_timeline.sh
```
- Multi-node: note that commands that only need to run once go in pre_cmd.

```
# pkill & run-ray
sh shutdown.sh
# launch the computation
python scripts/prof_steps_ray_mutlinode.py
# plot the call_pipeline timeline
sh scripts/plot_cuda_timeline_mutlinode.sh
```

The change (micro-batch size) is made in pipeline.py.
pip install -r requirements.txt
For training, we use apex.optimizers.FusedLAMB from NVIDIA's Apex library. Please follow the instructions for installing Apex.
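As a rough guide, a source build of Apex with the C++/CUDA extensions (which FusedLAMB needs) has typically looked like the following; the exact flags vary between Apex versions, so defer to the Apex README:

```
git clone https://github.com/NVIDIA/apex
cd apex
# Build the C++/CUDA extensions so that apex.optimizers.FusedLAMB is available.
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```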
For profiling, we use NVIDIA Nsight Systems. Please make sure you can execute the nsys command.
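For example, the following should succeed on every compute node before launching the profiling scripts:

```
nsys --version              # Nsight Systems CLI is on PATH
nsys status --environment   # optional: check that the node can collect traces
```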
Our scripts are intended to run through the SLURM workload manager on a GPU cluster with 1 GPU per node.
sbatch scripts/prof_steps.sh
sh scripts/plot_cuda_timeline.sh
output: bert_prof/bert-large_chimera_8stages_8gpus_microbs32_acc1.pdf
Chimera was published at SC'21 (Best Paper Finalist). See the paper and the video talk for more details. To cite our work:
@inproceedings{li143,
author = {Li, Shigang and Hoefler, Torsten},
title = {Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines},
year = {2021},
isbn = {9781450384421},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3458817.3476145},
doi = {10.1145/3458817.3476145},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
articleno = {27},
numpages = {14},
location = {St. Louis, Missouri},
series = {SC '21}
}
See LICENSE.