
Use the PyTorch profiler API to further analyze detailed training information, such as heaps, stacks, and time consumption.


Pytorch-profile

We can use TensorBoard together with the PyTorch profiler to inspect the trace, GPU kernels, and memory usage of LLM training.
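A minimal sketch of how a training loop can emit TensorBoard traces with torch.profiler (the model, optimizer, and output directory here are stand-ins; the actual benchmark.py wires this up behind its --profile flag):

```python
import torch
from torch.profiler import (
    profile, schedule, tensorboard_trace_handler, ProfilerActivity,
)

# Tiny stand-in model and optimizer for illustration only.
model = torch.nn.Linear(64, 64)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)  # capture GPU kernels too

with profile(
    activities=activities,
    # Skip 1 step, warm up 1 step, record 2 steps, then stop.
    schedule=schedule(wait=1, warmup=1, active=2, repeat=1),
    # Write traces TensorBoard can read (needs torch_tb_profiler).
    on_trace_ready=tensorboard_trace_handler("./profile"),
    record_shapes=True,
    profile_memory=True,  # record allocator events (memory view)
    with_stack=True,      # record Python stacks (stack view)
) as prof:
    for step in range(5):
        x = torch.randn(8, 64)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
        prof.step()  # tell the scheduler a training step finished
```

After a run, point tensorboard --logdir at the trace directory to browse the kernel, memory, and stack views.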

Commands for benchmark.py:

For a single node:

torchrun --standalone --nproc_per_node 8 benchmark.py --profile

For multiple nodes:

torchrun --nnodes 2 --node_rank=0 --master_addr=10.20.1.170 --nproc_per_node 8 benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile

torchrun --nnodes 2 --node_rank=1 --master_addr=10.20.1.170 --nproc_per_node 8 benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile

Or with the ColossalAI launcher:

colossalai run --nproc_per_node 8 --hostfile hostfile benchmark.py -p 3d -b 20 -s 10 --zero 2 --use_fp8 -g -x --profile

Make sure passwordless SSH is set up between the machines first.

Environment

pip install torch_tb_profiler

Use TensorBoard:

tensorboard --logdir=/home/nvme-share/home/wangbinluo/ColossalAI/examples/language/llama/profile/

ssh -L 6006:localhost:6006 [email protected] -p 30956

Result:

(TensorBoard profiler screenshots)

Notice: Profiling reduces the training TFLOPS, since the profiler needs extra communication to record events, e.g. from 21.50 down to 18.8.
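The overhead can be estimated by timing the same number of steps with and without the profiler; a hedged sketch (the toy model and step counts are placeholders, and the measured slowdown will vary by model and hardware):

```python
import time
import torch
from torch.profiler import profile, ProfilerActivity

# Tiny stand-in workload for illustration only.
model = torch.nn.Linear(256, 256)
opt = torch.optim.SGD(model.parameters(), lr=0.01)

def run_steps(n):
    for _ in range(n):
        x = torch.randn(32, 256)
        loss = model(x).sum()
        opt.zero_grad()
        loss.backward()
        opt.step()

# Baseline: no profiler attached.
t0 = time.perf_counter()
run_steps(50)
baseline = time.perf_counter() - t0

# Same steps under the profiler (stack/shape recording adds cost).
t0 = time.perf_counter()
with profile(activities=[ProfilerActivity.CPU],
             record_shapes=True, with_stack=True):
    run_steps(50)
profiled = time.perf_counter() - t0

print(f"baseline {baseline:.3f}s, profiled {profiled:.3f}s, "
      f"slowdown {100 * (profiled / baseline - 1):.1f}%")
```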
