Minimal DDP implementation with multirun #1141

Closed
wants to merge 3 commits into from
Conversation

@briankosw (Contributor) commented Nov 16, 2020

Motivation

This PR adds a minimal example of distributed computation with multirun for a single DDP group, as mentioned in #951.

$ python main.py -m rank=0,1,2,3

[2020-11-16 19:19:22,512][HYDRA] Joblib.Parallel(n_jobs=-1,backend=loky,prefer=processes,require=None,verbose=0,timeout=None,pre_dispatch=2*n_jobs,batch_size=auto,temp_folder=None,max_nbytes=None,mmap_mode=r) is launching 4 jobs
[2020-11-16 19:19:22,513][HYDRA] Launching jobs, sweep output dir : multirun/2020-11-16/19-19-21
[2020-11-16 19:19:22,513][HYDRA]        #0 : rank=0
[2020-11-16 19:19:22,513][HYDRA]        #1 : rank=1
[2020-11-16 19:19:22,513][HYDRA]        #2 : rank=2
[2020-11-16 19:19:22,513][HYDRA]        #3 : rank=3
Rank 1 - 0.035238564014434814
Rank 0 - 0.7433204650878906
Rank 2 - 0.836401641368866
Rank 3 - 0.3215029239654541
Rank 0 has average: 0.4841158986091614

Have you read the Contributing Guidelines on pull requests?

(Yes)/No

Test Plan

(How should this PR be tested? Do you require special setup to run the test or repro the fixed bug?)

@facebook-github-bot added the CLA Signed label (managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed) on Nov 16, 2020
@briankosw changed the title from "Implemented minimal distributed implementation with multirun" to "Minimal DDP implementation with multirun" on Nov 16, 2020
@lgtm-com bot commented Nov 16, 2020

This pull request introduces 3 alerts when merging ad576ad into 599205f - view on LGTM.com

new alerts:

  • 3 for Unused import

@omry (Collaborator) left a comment

Nice, thanks!

Can you try to make it work by using the explicit IP address of the parent process instead of localhost? This would translate better to distributed scenarios (using the submitit launcher, for example).
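One way to do that (a sketch of the reviewer's suggestion, not code from the PR; the helper name is made up) is to resolve the launching process's IP once and pass it to the workers through the config:

```python
import socket


def resolve_parent_ip() -> str:
    """Hypothetical helper: resolve this host's primary IPv4 address so
    worker processes can rendezvous on it instead of on localhost."""
    return socket.gethostbyname(socket.gethostname())
```

The parent could then set master_addr to this value before launching, so the rendezvous address stays valid even when workers land on other nodes (e.g. with the submitit launcher).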

dist.reduce(tensor, dst=0, op=dist.ReduceOp.SUM, group=group)
if cfg.rank == 0:
    tensor /= 4
    print("Rank {} has average: {}".format(cfg.rank, tensor[0]))
Collaborator:
  1. Please use Python 3 f-strings.
     Suggested change:
     print("Rank {} has average: {}".format(cfg.rank, tensor[0]))
     print(f"Rank {cfg.rank} has average: {tensor[0]}")
  2. Please use logging:
     log.info(f"Rank {cfg.rank} has average: {tensor[0]}")

def main(cfg: DictConfig):
    setup(cfg.master_addr, cfg.master_port, cfg.rank, cfg.world_size, cfg.backend)
    group = dist.new_group(list(range(cfg.world_size)))
    tensor = torch.rand(1)
Collaborator:
You can fill the tensors according to the rank and make sure the average is what you expect (not random).
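Concretely (a sketch of this suggestion, not PR code): filling each rank's tensor with its rank number makes the reduced average exactly (0 + 1 + 2 + 3) / 4 = 1.5 for world_size=4, which rank 0 can then assert. The single-process simulation below stands in for dist.reduce with ReduceOp.SUM:

```python
import torch

world_size = 4
# Deterministic per-rank tensors: rank r holds the value r.
tensors = [torch.full((1,), float(rank)) for rank in range(world_size)]

# Single-process stand-in for dist.reduce(tensor, dst=0, op=ReduceOp.SUM)
# followed by the division rank 0 performs.
total = sum(tensors)
average = total / world_size
assert average.item() == 1.5
```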

@shagunsodhani (Contributor) left a comment

Thanks for the PR.

  1. It is not guaranteed that all the jobs in Joblib will start at the same time (or around the same time). This is probably outside the scope of this PR, but something to think about in general.
  2. It is not guaranteed that the different processes will be scheduled on the same node. As Omry mentioned, it will be useful to pass the parent's IP (I think that means the parent process participates in the rendezvous).
  3. A nit-picky comment: the example shows how to use torch.distributed (not DDP). Maybe we can change the scope of the PR to focus on a torch.distributed example? Alternatively, the ImageNet example would be easy to port here, or maybe a simpler example, but with a model, dataloader, etc.


def cleanup():
    """Cleans up distributed backend resources."""
    dist.destroy_process_group()
Contributor:
We generally do not need to do this. Did you see any error without doing this?

@omry (Collaborator) commented Nov 16, 2020
> A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example will be easy to port here or maybe a simpler example, but with a model and dataloader etc.

A more complete example can live in https://github.com/pytorch/hydra-torch (which is still WIP).
I am also considering whether this example belongs there.

@briankosw (Author) commented Nov 17, 2020

> A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example will be easy to port here or maybe a simpler example, but with a model and dataloader etc.

I completely agree with you. I implemented a torch.distributed example for now to serve as a prototypical, minimal example of multirun and Joblib launching the distributed processes. I'm in the middle of fixing my GPU cluster, after which I plan to commit a full DDP example.

@briankosw (Author):

> It is not guaranteed that all the jobs in JobLib will start at the same time (or around the same time). This is probably outside the scope of this PR but something to think about in general.

Is there a way to guarantee that the jobs start together?

> It is not guaranteed that the different processes will be scheduled on the same node. As Omry mentioned, it will be useful to pass the parent's IP (I think that means that the parent process participates in the rendezvous).

I intended this PR to cover the single-node case. I will definitely try to work on the multi-node case, but I'm less familiar with it than with the single-node case.

@briankosw (Author):

@omry should I create a PR in https://github.com/pytorch/hydra-torch instead for the ImageNet DDP example?

Collaborator:

@briankosw,
If you want to create a DDP example based on ImageNet - yes.
I also think this example should probably go there. (The hydra-torch repository did not exist when I first created the issue.)

@briankosw (Author) commented:

#394 would be super helpful for this, since I need to manually specify multirun every time I want to run the script.

@briankosw (Author) commented:

Any consensus on what to do with this particular PR, @omry @shagunsodhani? I'll open a separate issue and PR in hydra-torch, but I'll let you decide whether this PR (or something similar in spirit) belongs in this repo as well.

@omry (Collaborator) commented Nov 17, 2020

The reason hydra-torch is called hydra-torch and not hydra-torch-configs is exactly to give space for things like DDP examples :).
This example belongs there. Once the repo there is ready we will be able to link to it from the Hydra website.

@briankosw (Author) commented:

Sounds good! Closing this PR then.

@briankosw briankosw closed this Nov 18, 2020
@omry (Collaborator) commented Nov 18, 2020

Thanks @briankosw!
