Minimal DDP implementation with multirun #1141
Conversation
This pull request introduces 3 alerts when merging ad576ad into 599205f - view on LGTM.com new alerts:
Nice, thanks!
Can you try to make it work by using the explicit IP address of the parent process instead of localhost? This would translate better to distributed scenarios (using the submitit launcher, for example).
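For illustration only, a minimal sketch of resolving the parent's address; the helper name and the idea of injecting it into the config as master_addr are assumptions, not part of this PR:

import socket

def get_parent_ip() -> str:
    # Resolve the launching machine's IP once in the parent process; workers
    # then rendezvous against this address instead of localhost/127.0.0.1.
    return socket.gethostbyname(socket.gethostname())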
dist.reduce(tensor, dst=0, op=dist.ReduceOp.SUM, group=group)
if cfg.rank == 0:
    tensor /= 4
    print("Rank {} has average: {}".format(cfg.rank, tensor[0]))
- Please use Python 3 f-strings:
print("Rank {} has average: {}".format(cfg.rank, tensor[0]))
print(f"Rank {cfg.rank} has average: {tensor[0]}")
- Please use logging:
log.info(f"Rank {cfg.rank} has average: {tensor[0]}")
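For context, the usual Hydra pattern is a module-level logger; a sketch, where cfg and tensor are the names from the example above:

import logging

# Hydra configures Python logging for every (multi)run job, so a plain
# module-level logger is enough here.
log = logging.getLogger(__name__)

log.info(f"Rank {cfg.rank} has average: {tensor[0]}")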
def main(cfg: DictConfig):
    setup(cfg.master_addr, cfg.master_port, cfg.rank, cfg.world_size, cfg.backend)
    group = dist.new_group(list(range(cfg.world_size)))
    tensor = torch.rand(1)
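setup() itself is not shown in this hunk; a rough sketch of what such a function presumably does (an assumption, not the PR's actual code):

import os
import torch.distributed as dist

def setup(master_addr, master_port, rank, world_size, backend):
    # Export the rendezvous address for the default env:// init method,
    # then join the default process group.
    os.environ["MASTER_ADDR"] = str(master_addr)
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(backend, rank=rank, world_size=world_size)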
You can fill the tensors according to the rank and make sure the average is what you expect (not random).
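A sketch of that suggestion, reusing the names from the example above (cfg, group) and the logging suggestion; with world_size = 4 the rank-0 result should be (0 + 1 + 2 + 3) / 4 = 1.5:

tensor = torch.ones(1) * cfg.rank  # deterministic, derived from the rank
dist.reduce(tensor, dst=0, op=dist.ReduceOp.SUM, group=group)
if cfg.rank == 0:
    tensor /= cfg.world_size
    log.info(f"Rank {cfg.rank} has average: {tensor[0]}")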
Thanks for the PR.
- It is not guaranteed that all the jobs in Joblib will start at the same time (or around the same time). This is probably outside the scope of this PR, but it is something to think about in general.
- It is not guaranteed that the different processes will be scheduled on the same node. As Omry mentioned, it will be useful to pass the parent's IP (I think that means the parent process participates in the rendezvous).
- A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example would be easy to port here, or maybe a simpler example, but with a model and dataloader etc.
def cleanup():
    """Cleans up distributed backend resources."""
    dist.destroy_process_group()
We generally do not need to do this. Did you see any error without doing this?
A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example will be easy to port here or maybe a simpler example, but with a model and dataloader etc.
A more complete example can live in https://github.com/pytorch/hydra-torch (which is still WIP).
I am also considering if this example belongs there.
A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example will be easy to port here or maybe a simpler example, but with a model and dataloader etc.

I completely agree with you. The reason why I implemented a torch.distributed example for now is to create a prototypical, minimal example that demonstrates multirun and Joblib being used to launch the distributed processes. I'm in the middle of fixing my GPU cluster, after which I plan to commit a full DDP example.
It is not guaranteed that all the jobs in Joblib will start at the same time (or around the same time). This is probably outside the scope of this PR but something to think about in general.

Is there a way to guarantee that the jobs start together?

It is not guaranteed that the different processes will be scheduled on the same node. As Omry mentioned, it will be useful to pass the parent's IP (I think that means the parent process participates in the rendezvous).

I intended this PR to cover the single-node case. I will definitely try to work on the multi-node case, but I'm less familiar with it than with the single-node case.
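Not an authoritative answer, but for reference: the rendezvous in init_process_group already blocks each process until all world_size peers have joined (or a timeout expires), so moderate start-up skew from Joblib is absorbed, and pointing init_method at the parent's address rather than localhost is what lets peers on other nodes join. A sketch with placeholder values:

import datetime
import torch.distributed as dist

master_addr, master_port = "10.0.0.1", 29500  # hypothetical; taken from cfg in the example
rank, world_size = 0, 4                       # hypothetical; taken from cfg in the example

dist.init_process_group(
    backend="gloo",
    init_method=f"tcp://{master_addr}:{master_port}",
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(minutes=30),   # explicit rendezvous timeout (this is the default)
)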
@omry should I create a PR in https://github.com/pytorch/hydra-torch for the ImageNet DDP example instead?
@briankosw,
If you want to create a DDP example based on ImageNet - yes.
I think this example should probably go there as well (the hydra-torch repository did not exist when I first created the issue).
#394 would be super helpful for this, since I need to manually specify multirun every time I want to run the script.
Any consensus on what to do with this particular PR @omry @shagunsodhani? I'll open a separate issue and PR in hydra-torch, but I'll let you guys decide whether this PR (or something similar in spirit) belongs to this repo as well.
The reason hydra-torch is called
Sounds good! Closing this PR then.
Thanks @briankosw!
Motivation
Created a minimal implementation of using distributed computation with multirun for a single DDP group, as mentioned in #951.
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
(How should this PR be tested? Do you require special setup to run the test or repro the fixed bug?)