Fix batch size computation for multi-node training
hotfix
LeeDoYup authored Apr 12, 2021
1 parent 0f22e7f commit f167d64
Showing 1 changed file with 3 additions and 0 deletions: train.py
@@ -54,6 +54,9 @@ def main(args):
     args.distributed = args.world_size > 1 or args.multiprocessing_distributed
     ngpus_per_node = torch.cuda.device_count()  # number of GPUs on each node

+    # divide the batch_size according to the number of nodes
+    args.batch_size = int(args.batch_size / args.world_size)
+
     if args.multiprocessing_distributed:
         # now, args.world_size means the total number of processes across all nodes
         args.world_size = ngpus_per_node * args.world_size
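The batch-size bookkeeping this hotfix implies can be sketched as a standalone function. This is a minimal illustration, assuming the common PyTorch multi-node launcher pattern in which `args.world_size` initially holds the number of nodes in `main()` (where this hunk divides by it) and each spawned worker later divides again by the per-node GPU count; the function name and the second division are assumptions for illustration, not code from this commit.

```python
def compute_per_process_batch(global_batch_size, num_nodes, gpus_per_node):
    """Sketch of how a global batch size is split across processes.

    Assumes the two-stage division pattern: first by node count (as this
    commit does in main(), where args.world_size still means the number
    of nodes), then by GPUs per node (typically done per worker after
    world_size is expanded to ngpus_per_node * world_size).
    """
    # Stage 1: this commit's added line, args.world_size == number of nodes
    per_node = int(global_batch_size / num_nodes)
    # Stage 2 (assumed): each worker takes an equal share of its node's batch
    per_process = int(per_node / gpus_per_node)
    return per_process

# e.g. a global batch of 256 over 2 nodes with 4 GPUs each
print(compute_per_process_batch(256, 2, 4))  # → 32
```

Without the added division, each node would consume the full global batch size, so the effective batch grows with the node count; dividing by the node count first keeps the global batch size fixed as nodes are added.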
