We can reproduce this problem using the following command:
torchrun --master_addr=127.0.0.1 --master_port=1234 --nnodes=1 --nproc-per-node=1 --node_rank=0 test_optimizer_state.py --sharding_type $SHARDING_TYPE
Environment: torchrec==0.8.0+cu121, torch==2.4.0+cu121, fbgemm-gpu==0.8.0+cu121
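For reference, here is a minimal sketch of how a script can pin `table_0` to a single sharding type through planner constraints. This is not the attached test_optimizer_state.py; the table sizes, feature name, and helper name are placeholders for illustration only.

```python
# Minimal sketch (not the attached test_optimizer_state.py): pin the sharding
# type of "table_0" via planner constraints and build the sharded model.
# Table sizes and the feature name are placeholders.
import os

import torch
import torch.distributed as dist
import torchrec
from torchrec.distributed.embeddingbag import EmbeddingBagCollectionSharder
from torchrec.distributed.model_parallel import DistributedModelParallel
from torchrec.distributed.planner import EmbeddingShardingPlanner, Topology
from torchrec.distributed.planner.types import ParameterConstraints


def build_sharded_model(sharding_type: str) -> DistributedModelParallel:
    # torchrun sets LOCAL_RANK; one GPU per process.
    rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{rank}")
    torch.cuda.set_device(device)
    dist.init_process_group(backend="nccl")

    ebc = torchrec.EmbeddingBagCollection(
        tables=[
            torchrec.EmbeddingBagConfig(
                name="table_0",
                embedding_dim=64,
                num_embeddings=1000,
                feature_names=["f0"],
            )
        ],
        device=torch.device("meta"),
    )

    # Constrain the planner so table_0 is sharded with the requested type,
    # e.g. "row_wise" or "data_parallel".
    planner = EmbeddingShardingPlanner(
        topology=Topology(world_size=dist.get_world_size(), compute_device="cuda"),
        constraints={"table_0": ParameterConstraints(sharding_types=[sharding_type])},
    )
    plan = planner.collective_plan(
        ebc, [EmbeddingBagCollectionSharder()], dist.GroupMember.WORLD
    )

    return DistributedModelParallel(ebc, device=device, plan=plan)
```

Judging from the printouts below, the row_wise plan exposes the fused (FBGEMM) optimizer state names, while the data_parallel plan exposes the dense torch.optim.Adam names, which appears to be where the key names diverge.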
When SHARDING_TYPE=row_wise, it will print:
When SHARDING_TYPE=data_parallel, it will print:
xxx.weight.table_0.momentum1 -> xxx.weight.exp_avg
xxx.weight.table_0.exp_avg_sq -> xxx.weight.exp_avg_sq
We may load the model to continue training on a cluster with a different scale, which can lead to a different sharding plan and, consequently, the optimizer's parameters not being loaded correctly.
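For illustration, here is a rough sketch of remapping the fused-optimizer key names printed above (momentum1 / exp_avg_sq under table_0) to the dense Adam names (exp_avg / exp_avg_sq). The helper and its handling of the table suffix are assumptions, not part of torchrec.

```python
# Hedged sketch: rename fused-optimizer state keys of the form
# "xxx.weight.table_0.momentum1" to the dense Adam form "xxx.weight.exp_avg".
# Only the two state names printed above are mapped; everything else is
# passed through unchanged. This is an illustration, not a torchrec API.
from typing import Any, Dict

# Fused (FBGEMM) state name -> dense torch.optim.Adam state name.
_FUSED_TO_DENSE = {
    "momentum1": "exp_avg",
    "exp_avg_sq": "exp_avg_sq",
}


def remap_fused_state_keys(
    state: Dict[str, Any], table_name: str = "table_0"
) -> Dict[str, Any]:
    remapped: Dict[str, Any] = {}
    for key, value in state.items():
        prefix, _, last = key.rpartition(".")
        if last in _FUSED_TO_DENSE and prefix.endswith(f".{table_name}"):
            base = prefix[: -(len(table_name) + 1)]  # drop ".table_0"
            remapped[f"{base}.{_FUSED_TO_DENSE[last]}"] = value
        else:
            remapped[key] = value
    return remapped
```

Something like this could be applied to the checkpoint's "state" dict before loading it under the other sharding plan, but a proper fix inside the state_dict translation would be preferable.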
test_optimizer_state.py