See https://github.com/microsoft/AzureML-BERT/blob/master/docs/dataprep.md for how to prepare the dataset. Please store the wikipedia.segmented.nltk.txt file under the bert_data/ directory.
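For example (assuming the file has already been generated by the steps above and these commands are run from the repository root):

```
mkdir -p bert_data
mv /path/to/wikipedia.segmented.nltk.txt bert_data/
```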
Things to do when you want to start Chimera again after N months:

- Single node (2 GPUs): run the two commands in scripts/manual.sh separately.
- Multi-node (2 GPUs × 2): run the two commands in scripts/manual4.sh separately.
First, start the SSH server on the 10022 node, which is used for code synchronization and for launching the Ray nodes (the Docker cache there has issues).
To profile, set profile to True in the file that launches Ray.
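A minimal sketch of the SSH/Ray startup described above, assuming a standard sshd and the Ray CLI; the port, head-node address, and paths are placeholders to adapt to your setup:

```
# Start sshd inside the container so ssh/rsync-based code sync works
# (10022 is the port used in these notes; requires root and host keys).
/usr/sbin/sshd -p 10022

# Start the Ray head on one node, then join the other node(s) as workers.
ray start --head --port=6379
ray start --address=<head_node_ip>:6379   # run on each worker node
```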
- Single node: both need the micro-batch size changed.

```
# launch the computation
python scripts/prof_steps_ray.py
# plot the call_pipeline timeline
sh scripts/plot_cuda_timeline.sh
```
- Multi-node: note that commands that only need to run once go in pre_cmd.

```
# pkill & run-ray
sh shutdown.sh
# launch the computation
python scripts/prof_steps_ray_mutlinode.py
# plot the call_pipeline timeline
sh scripts/plot_cuda_timeline_mutlinode.sh
```

The change (micro-batch size) is made in pipeline.py.
pip install -r requirements.txt
For training, we use apex.optimizers.FusedLAMB from NVIDIA's Apex library. Please follow the instructions for installing Apex.
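As a rough guide, a source build of Apex with the C++/CUDA extensions (which FusedLAMB needs) has typically looked like the following; the exact flags vary between Apex versions, so defer to the Apex README:

```
git clone https://github.com/NVIDIA/apex
cd apex
# Build the C++/CUDA extensions so that apex.optimizers.FusedLAMB is available.
pip install -v --no-cache-dir --global-option="--cpp_ext" --global-option="--cuda_ext" ./
```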
For profiling, we use NVIDIA Nsight Systems. Please make sure you can execute the nsys command.
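For example, the following should succeed on every compute node before launching the profiling scripts:

```
nsys --version              # Nsight Systems CLI is on PATH
nsys status --environment   # optional: check that the node can collect traces
```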
Our scripts are intended to run through the SLURM workload manager on a GPU cluster with 1 GPU per node.
sbatch scripts/prof_steps.sh
sh scripts/plot_cuda_timeline.sh
output: bert_prof/bert-large_chimera_8stages_8gpus_microbs32_acc1.pdf
Chimera was published at SC'21 (Best Paper Finalist). See the paper and the video talk for more details. To cite our work:
@inproceedings{li143,
author = {Li, Shigang and Hoefler, Torsten},
title = {Chimera: Efficiently Training Large-Scale Neural Networks with Bidirectional Pipelines},
year = {2021},
isbn = {9781450384421},
publisher = {Association for Computing Machinery},
address = {New York, NY, USA},
url = {https://doi.org/10.1145/3458817.3476145},
doi = {10.1145/3458817.3476145},
booktitle = {Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis},
articleno = {27},
numpages = {14},
location = {St. Louis, Missouri},
series = {SC '21}
}
See LICENSE.