We can use TensorBoard and the PyTorch Profiler to inspect the trace, GPU kernels, and memory usage of LLM training.
Commands for benchmark.py:
For a single node:
torchrun --standalone --nproc_per_node 8 benchmark.py --profile
For multiple nodes (run one command per node, with the matching --node_rank):
torchrun --nnodes 2 --node_rank=0 --master_addr=10.20.1.170 --nproc_per_node 8 benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile
torchrun --nnodes 2 --node_rank=1 --master_addr=10.20.1.170 --nproc_per_node 8 benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile
or:
colossalai run --nproc_per_node 8 --hostfile hostfile benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile
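For reference, a --profile flag like this typically wraps the training loop in torch.profiler and writes TensorBoard-readable traces (kernel timeline, operator shapes, memory events). The sketch below is only an assumption of how that can look, not the actual benchmark.py implementation; model, dataloader, optimizer, and the ./profile output directory are placeholders.

import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

def run_with_profiler(model, dataloader, optimizer, num_steps=10):
    # Record CPU and CUDA activity and dump TensorBoard traces into ./profile,
    # the directory later passed to `tensorboard --logdir=...`.
    with profile(
        activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
        on_trace_ready=tensorboard_trace_handler("./profile"),
        record_shapes=True,   # capture operator input shapes
        profile_memory=True,  # capture allocator events for the memory view
    ) as prof:
        for step, batch in enumerate(dataloader):
            if step >= num_steps:
                break
            loss = model(**batch).loss  # placeholder training step
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            prof.step()  # mark a step boundary in the trace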
Make sure password-less SSH is set up between the machines first.
To view the results, install the TensorBoard profiler plugin and point TensorBoard at the profile directory; if the job runs on a remote machine, forward port 6006 over SSH and open http://localhost:6006 locally:
pip install torch_tb_profiler
tensorboard --logdir=/home/nvme-share/home/wangbinluo/ColossalAI/examples/language/llama/profile/
ssh -L 6006:localhost:6006 [email protected] -p 30956
Note: profiling reduces the training TFLOPS, because the profiler adds extra communication and recording overhead, e.g. from 21.50 down to 18.8.
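If the overhead matters, a common mitigation (an assumption on our side, not necessarily something benchmark.py exposes) is to record only a few iterations with torch.profiler.schedule, so most steps run unprofiled and the measured TFLOPS stays close to the plain run:

from torch.profiler import profile, ProfilerActivity, schedule, tensorboard_trace_handler

# Skip 1 step, warm up for 1, record 3, then stop: only those few iterations
# pay the recording cost.
prof = profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    schedule=schedule(wait=1, warmup=1, active=3, repeat=1),
    on_trace_ready=tensorboard_trace_handler("./profile"),
)
prof.start()
for step in range(20):
    train_one_step()  # hypothetical single training step
    prof.step()       # advance the profiler's schedule
prof.stop()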