Single-node ImageNet DDP implementation #38
base: main
Conversation
examples/imagenet.py
best_acc1 = 0

@hydra.main(config_name="imagenetconf")
Did you forget to include the config?
I wasn't sure how to use configen, so I was mostly looking at mnist_00.py, but I couldn't find the corresponding configuration files for mnist_00.py. Should I just handcraft one and stick it in the examples folder?
Yes. configen is used to generate configs for libraries. For examples and user code, people should write their own configs.
In mnist_00.py, I included the config at the top of the file.
hydra-torch/examples/mnist_00.py
Lines 20 to 34 in 2656165
@dataclass
class MNISTConf:
    batch_size: int = 64
    test_batch_size: int = 1000
    epochs: int = 14
    no_cuda: bool = False
    dry_run: bool = False
    seed: int = 1
    log_interval: int = 10
    save_model: bool = False
    checkpoint_name: str = "unnamed.pt"
    adadelta: AdadeltaConf = AdadeltaConf()
    steplr: StepLRConf = StepLRConf(
        step_size=1
    )  # we pass a default for step_size since it is required, but missing a default in PyTorch (and consequently in hydra-torch)
See the second code block of https://github.com/pytorch/hydra-torch/blob/master/examples/mnist_00.md#parting-with-argparse for the explanation!
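An analogous handcrafted config for the ImageNet example could look something like the following. This is only a sketch mirroring the MNISTConf pattern above; the field names and defaults here are assumptions, not the PR's actual schema:

```python
from dataclasses import dataclass


# Hypothetical handcrafted config for the ImageNet example
# (field names and defaults are assumptions, not from this PR).
@dataclass
class ImageNetConf:
    data_dir: str = "./data"
    arch: str = "resnet18"
    epochs: int = 90
    batch_size: int = 256
    lr: float = 0.1
    pretrained: bool = False
    world_size: int = 1
    rank: int = 0


# The script would then register it with Hydra's ConfigStore so that
# @hydra.main(config_name="imagenetconf") can find it, e.g.:
#   from hydra.core.config_store import ConfigStore
#   ConfigStore.instance().store(name="imagenetconf", node=ImageNetConf)
```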
Generally, I think it is cleaner to make your own config python files and then import them. I simply included it in one flat file so that the tutorial reader can see it all in one place.
This is the same thing in the draft mnist_01.py, which I have yet to finalize. It introduces some more hierarchical composition, so you'll see multiple 'non-configen' configs defined. Here's the whole 'hydra block':
https://github.com/pytorch/hydra-torch/blob/examples/mnist_01/examples/mnist_01.py#L10-L56
Thanks! I'll do as you suggested.
Additional comments:
- Can you add the example command for running when we are training over 2 nodes (say 16 GPUs)? I think some additional support is needed for running the script on more than 8 GPUs, as setting rank=9 would not work because device("cuda:9") would not exist.
- Do we only care about the DDP use case here? If yes, we should probably remove the logic corresponding to DataParallel and CPU training. I do not feel strongly about removing them, but if we only care about DDP, maybe that logic should go.
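One possible way around the device("cuda:9") issue, sketched here as an assumption rather than something in this PR, is to derive the local device index from the global rank:

```python
def local_device_index(global_rank: int, gpus_per_node: int = 8) -> int:
    """Map a global rank to the GPU index on its own node.

    With 2 nodes x 8 GPUs, global rank 9 lives on the second node and
    should use that node's cuda:1, not a nonexistent cuda:9.
    """
    return global_rank % gpus_per_node


# In the training script one would then select the device with something
# like the following (hypothetical; assumes torch is imported and cfg has
# rank / gpus_per_node fields):
#   torch.cuda.set_device(local_device_index(cfg.rank, cfg.gpus_per_node))
```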
Thank you for the review @shagunsodhani. To reply to your comments:
Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?
We launch one process per GPU. These GPUs can live on any node (the extreme case being one GPU per node). We need to handle two things:
The way ahead will be to add the example config (for single node), and then we will see how the master address is set. That will give us some hint about how to get the master address for multi-node training.
Yeah, so regarding this, IMO a better way is to request n nodes and launch one process per node, which spawns 8 workers. We can probably come back to this point later, as it is orthogonal to the other changes we discussed and should be a straightforward change.
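The rank arithmetic for the "one process per node, spawning 8 workers" scheme can be sketched as follows. The launcher names in the comments are hypothetical; only the rank formula is shown as runnable code:

```python
def global_rank(node_rank: int, local_rank: int, procs_per_node: int = 8) -> int:
    """Global rank of a worker when every node spawns procs_per_node workers."""
    return node_rank * procs_per_node + local_rank


# Per-node launcher sketch (names are hypothetical; assumes PyTorch):
#   import torch.multiprocessing as mp
#   mp.spawn(worker, args=(cfg,), nprocs=cfg.procs_per_node)
# Each worker would compute its global rank as above and pass it, along
# with world_size = num_nodes * procs_per_node, to
# torch.distributed.init_process_group.
```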
examples/imagenet.py
    world_size=cfg.world_size,
    rank=cfg.rank,
)
if cfg.pretrained:
Not that we can handle this now, but just putting this out there:
Statements conditioned solely on the config often show up in main files and IMO reduce the readability of the code.
My wish for the future (once we have the hammers) would be to push this logic into the config. There are many pros:
- main.py is shorter and cleaner, with less nesting or extraneous code.
- main.py can be more general. This implies greater extensibility for users and less duplicate code.
- As the config is resolved in real time through interpolation / logic resolution, you get to see the 'final' config all in one structure before it is run. This is very powerful in the sense that we now see the outcome of all the intermediate logic that happens in main before training begins.
Obviously this won't work for every case, but there are many (like this one) where I think it would be a trivial change on the user's end.
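As a rough illustration of pushing such logic into the config, here is a sketch using OmegaConf's custom-resolver mechanism. The `weights` field, the resolver name, and the string values are all made up for this example; this is one possible pattern, not the PR's approach:

```python
from omegaconf import OmegaConf

# A custom resolver turns the pretrained flag into a concrete value at
# config-resolution time, so main.py never needs `if cfg.pretrained:`.
OmegaConf.register_new_resolver(
    "weights_for",
    lambda pretrained: "pretrained-weights" if pretrained else "random-init",
)

cfg = OmegaConf.create(
    {"pretrained": True, "weights": "${weights_for:${pretrained}}"}
)
print(cfg.weights)
```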
I completely agree with you, and I think it'd be really nice to get rid of as many conditional statements as possible. I think a combination of using configs and refactoring should get the job done.
High level feedback:
I think that's a good idea. I can try to organize and structure the examples where each example highlights something different, e.g. one example explains fundamental distributed processing as demonstrated in the other PR and another example showing ImageNet DDP. I think it'd be better if I open one or two additional issues that separate these implementations. What do you guys think? In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have a separate example for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?
Sounds good. Right now, I'm thinking each of these can be separated into their own issues [listed by priority in my mind]:

Single-node, multi-GPU:
Multi-node, multi-GPU:
I have something in mind for (3). It will be easier to show once (1) has been pushed.
Right now, this PR is blocked by this issue, so I will be focusing more on #42 and fixing the blocking issue.
Implements ImageNet DDP, as mentioned in #33.
Most of the code is the same, and the major differences are the handling of distributed processes and the configuration.
One can use Hydra's multirun capability to launch the distributed processes, instead of using PyTorch or Python's multiprocessing API.
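As an illustration, a multirun launch for 4 GPUs on a single node might look like the command below. This is a hypothetical invocation; the actual override names depend on the config schema this PR ends up with:

```shell
# One Hydra job per rank, each becoming one DDP process (sketch):
python examples/imagenet.py --multirun rank=0,1,2,3 world_size=4
```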