Minimal DDP implementation with multirun #1141
Conversation
This pull request introduces 3 alerts when merging ad576ad into 599205f - view on LGTM.com new alerts:
Nice, thanks!
Can you try to make it work by using the explicit IP address of the parent process instead of localhost? This would translate better to distributed scenarios (using the submitit launcher, for example).
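For illustration only, a minimal sketch of resolving the parent's address; the helper name and the idea of injecting it into the config as master_addr are assumptions, not part of this PR:

import socket

def get_parent_ip() -> str:
    # Resolve the launching machine's IP once in the parent process; workers
    # then rendezvous against this address instead of localhost/127.0.0.1.
    return socket.gethostbyname(socket.gethostname())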
dist.reduce(tensor, dst=0, op=dist.ReduceOp.SUM, group=group)
if cfg.rank == 0:
    tensor /= 4
    print("Rank {} has average: {}".format(cfg.rank, tensor[0]))
- Please use Python 3 f-strings:
print("Rank {} has average: {}".format(cfg.rank, tensor[0]))
print(f"Rank {cfg.rank} has average: {tensor[0]}")
- Please use logging:
log.info(f"Rank {cfg.rank} has average: {tensor[0]}")
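For context, the usual Hydra pattern is a module-level logger; a sketch, where cfg and tensor are the names from the example above:

import logging

# Hydra configures Python logging for every (multi)run job, so a plain
# module-level logger is enough here.
log = logging.getLogger(__name__)

log.info(f"Rank {cfg.rank} has average: {tensor[0]}")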
def main(cfg: DictConfig):
    setup(cfg.master_addr, cfg.master_port, cfg.rank, cfg.world_size, cfg.backend)
    group = dist.new_group(list(range(cfg.world_size)))
    tensor = torch.rand(1)
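setup() itself is not shown in this hunk; a rough sketch of what such a function presumably does (an assumption, not the PR's actual code):

import os
import torch.distributed as dist

def setup(master_addr, master_port, rank, world_size, backend):
    # Export the rendezvous address for the default env:// init method,
    # then join the default process group.
    os.environ["MASTER_ADDR"] = str(master_addr)
    os.environ["MASTER_PORT"] = str(master_port)
    dist.init_process_group(backend, rank=rank, world_size=world_size)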
You can fill the tensors according to the rank and make sure the average is what you expect (not random).
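A sketch of that suggestion, reusing the names from the example above (cfg, group) and the logging suggestion; with world_size = 4 the rank-0 result should be (0 + 1 + 2 + 3) / 4 = 1.5:

tensor = torch.ones(1) * cfg.rank  # deterministic, derived from the rank
dist.reduce(tensor, dst=0, op=dist.ReduceOp.SUM, group=group)
if cfg.rank == 0:
    tensor /= cfg.world_size
    log.info(f"Rank {cfg.rank} has average: {tensor[0]}")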
Thanks for the PR.
- It is not guaranteed that all the jobs in Joblib will start at the same time (or around the same time). This is probably outside the scope of this PR, but it is something to think about in general.
- It is not guaranteed that the different processes will be scheduled on the same node. As Omry mentioned, it will be useful to pass the parent's IP (I think that means the parent process participates in the rendezvous).
- A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example would be easy to port here, or maybe a simpler example, but with a model and dataloader etc.
def cleanup():
    """Cleans up distributed backend resources."""
    dist.destroy_process_group()
We generally do not need to do this. Did you see any error without doing this?
A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example will be easy to port here or maybe a simpler example, but with a model and dataloader etc.
A more complete example can live in https://github.com/pytorch/hydra-torch (which is still WIP).
I am also considering if this example belongs there.
A nit-picky comment - the example shows how to use torch.distributed (and not DDP). Maybe we can change the scope of the PR to focus on an example of torch.distributed? Alternatively, the ImageNet example will be easy to port here or maybe a simpler example, but with a model and dataloader etc.

I completely agree with you. The reason why I implemented a torch.distributed example for now is to create a prototypical, minimal example that demonstrates multirun and Joblib being used to launch the distributed processes. I'm in the middle of fixing my GPU cluster, after which I plan to commit a full DDP example.
It is not guaranteed that all the jobs in Joblib will start at the same time (or around the same time). This is probably outside the scope of this PR but something to think about in general.

Is there a way to guarantee that the jobs start together?

It is not guaranteed that the different processes will be scheduled on the same node. As Omry mentioned, it will be useful to pass the parent's IP (I think that means the parent process participates in the rendezvous).

I intended this PR to cover the single-node case. I will definitely try to work on the multi-node case, but I'm less familiar with it than with the single-node case.
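Not an authoritative answer, but for reference: the rendezvous in init_process_group already blocks each process until all world_size peers have joined (or a timeout expires), so moderate start-up skew from Joblib is absorbed, and pointing init_method at the parent's address rather than localhost is what lets peers on other nodes join. A sketch with placeholder values:

import datetime
import torch.distributed as dist

master_addr, master_port = "10.0.0.1", 29500  # hypothetical; taken from cfg in the example
rank, world_size = 0, 4                       # hypothetical; taken from cfg in the example

dist.init_process_group(
    backend="gloo",
    init_method=f"tcp://{master_addr}:{master_port}",
    rank=rank,
    world_size=world_size,
    timeout=datetime.timedelta(minutes=30),   # explicit rendezvous timeout (this is the default)
)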
@omry should I create a PR in https://github.com/pytorch/hydra-torch for the ImageNet DDP example instead?
@briankosw,
If you want to create a DDP example based on ImageNet - yes.
I think this example should probably go there as well (the hydra-torch repository did not exist when I first created the issue).
#394 would be super helpful for this, since I need to manually specify multirun every time I want to run the script.
Any consensus on what to do with this particular PR @omry @shagunsodhani? I'll open a separate issue and PR in hydra-torch, but I'll let you guys decide whether this PR (or something similar in spirit) belongs to this repo as well.
The reason hydra-torch is called
Sounds good! Closing this PR then.
Thanks @briankosw!
Motivation
Created a minimal implementation of using distributed computation with multirun for a single DDP group, as mentioned in #951.
Have you read the Contributing Guidelines on pull requests?
Yes
Test Plan
(How should this PR be tested? Do you require special setup to run the test or repro the fixed bug?)