Adding Distributed Data Parallel #261
Does this need to match node_count?
Each device will call this, so setting num_workers to 0 will create one process on each device (preventing too many processes from being spawned per device, which was leading to CUDA memory errors). However, this really slows down data loading, so another option is to use
int((config.num_dataload_workers + n_gpus_per_node - 1) / n_gpus_per_node)
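For illustration, a minimal sketch of how that per-process split could look when building the data loader; config.num_dataload_workers and n_gpus_per_node are taken from this comment, while the batch size field and pin_memory are placeholder assumptions:

```python
from torch.utils.data import DataLoader

def make_loader(dataset, config, n_gpus_per_node: int) -> DataLoader:
    # Split the requested workers across the processes on one node,
    # rounding up so every process keeps at least one worker.
    workers_per_process = int(
        (config.num_dataload_workers + n_gpus_per_node - 1) / n_gpus_per_node
    )
    return DataLoader(
        dataset,
        batch_size=config.train_batch_size,  # assumed config field
        num_workers=workers_per_process,
        pin_memory=True,
    )
```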
I'd spell out distributed_data_parallel
Just from the names, it is not clear what the difference between use_data_parallel and use_ddp is.
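To make the distinction concrete, a rough sketch (not this PR's actual code) of what each flag implies in PyTorch; the function and argument names here are illustrative:

```python
import torch
from torch.nn.parallel import DataParallel, DistributedDataParallel

def wrap_model(model: torch.nn.Module, use_ddp: bool, local_rank: int):
    if use_ddp:
        # DistributedDataParallel: one process per GPU; requires
        # torch.distributed.init_process_group to have been called.
        model = model.cuda(local_rank)
        return DistributedDataParallel(model, device_ids=[local_rank])
    # DataParallel: a single process that replicates the model across
    # all visible GPUs on each forward pass.
    return DataParallel(model.cuda())
```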
This is almost the same code as in use_data_parallel; DRY?
There is little logic in get_global_size. Normally I'd always go for "put things into functions", but here it could be clearer to handle the world size for offline/AzureML runs right here (expand get_global_size inline, and pass the result as an argument into train()).
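A hedged sketch of what that inlining could look like; config.node_count, is_azureml_run and the train callable are assumptions for illustration, not the PR's actual signatures:

```python
import torch

def launch_training(config, is_azureml_run: bool, train_fn) -> None:
    # World-size logic handled inline, as suggested, rather than hidden
    # inside a separate get_global_size helper.
    n_gpus_per_node = torch.cuda.device_count()
    if is_azureml_run:
        # Multi-node AzureML run: one process per GPU on every node.
        world_size = config.node_count * n_gpus_per_node
    else:
        # Offline run: only the GPUs on the local machine take part.
        world_size = n_gpus_per_node
    train_fn(config, world_size=world_size)
```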
I'm confused. Shouldn't offline runs (outside AzureML) be going through the same codepath as "do not use ddp"?
No, this is for cases such as a machine with multiple GPUs on it.
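For context, a minimal single-machine sketch of that case: one process is spawned per local GPU, each joins the process group and wraps its model in DDP. The address, port and toy model are placeholders, not taken from this PR:

```python
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel

def worker(local_rank: int, world_size: int) -> None:
    # Rendezvous settings for a single-machine run (placeholder values).
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=local_rank, world_size=world_size)
    model = torch.nn.Linear(8, 2).cuda(local_rank)  # toy model
    model = DistributedDataParallel(model, device_ids=[local_rank])
    # ... training loop over this process's shard of the data ...
    dist.destroy_process_group()

if __name__ == "__main__":
    n_gpus = torch.cuda.device_count()
    mp.spawn(worker, args=(n_gpus,), nprocs=n_gpus)
```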