We can use TensorBoard and the PyTorch Profiler to inspect the trace, GPU kernels, and memory usage of LLM training.
Commands for benchmark.py:
For a single node:
torchrun --standalone --nproc_per_node 8 benchmark.py --profile
For multiple nodes (run one command per node, with the matching --node_rank):
torchrun --nnodes 2 --node_rank=0 --master_addr=10.20.1.170 --nproc_per_node 8 benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile
torchrun --nnodes 2 --node_rank=1 --master_addr=10.20.1.170 --nproc_per_node 8 benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile
or:
colossalai run --nproc_per_node 8 --hostfile hostfile benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile
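For reference, a --profile flag like this typically wraps the training loop in torch.profiler and writes TensorBoard-readable traces (kernel timeline, operator shapes, memory events). The sketch below is only an assumption of how that can look, not the actual benchmark.py implementation; model, dataloader, optimizer, and the ./profile output directory are placeholders.

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

def run_with_profiler(model, dataloader, optimizer, num_steps=10):
    # Record CPU and CUDA activity and dump TensorBoard traces into ./profile,
    # the directory later passed to `tensorboard --logdir=...`.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        on_trace_ready=tensorboard_trace_handler("./profile"),
        record_shapes=True,   # capture operator input shapes
        profile_memory=True,  # capture allocator events for the memory view
    ) as prof:
        for step, batch in enumerate(dataloader):
            if step >= num_steps:
                break
            loss = model(**batch).loss  # placeholder training step
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()  # mark a step boundary in the trace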
Make sure password-less SSH is set up between the machines first.
To view the results, install the TensorBoard profiler plugin and point TensorBoard at the profile directory; if the job runs on a remote machine, forward port 6006 over SSH and open http://localhost:6006 locally:
pip install torch_tb_profiler
tensorboard --logdir=/home/nvme-share/home/wangbinluo/ColossalAI/examples/language/llama/profile/
ssh -L 6006:localhost:6006 [email protected] -p 30956
Note: profiling reduces the training TFLOPS, because the profiler adds extra communication and recording overhead, e.g. from 21.50 down to 18.8.
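If the overhead matters, a common mitigation (an assumption on our side, not necessarily something benchmark.py exposes) is to record only a few iterations with torch.profiler.schedule, so most steps run unprofiled and the measured TFLOPS stays close to the plain run:

from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

# Skip 1 step, warm up for 1, record 3, then stop: only those few iterations
# pay the recording cost.
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profile"),
)
prof.start()
for step in range(20):
    train_one_step()  # hypothetical single training step
    prof.step()       # advance the profiler's schedule
prof.stop()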