
Basic distributed processing with Hydra #43

Draft: wants to merge 1 commit into main

Conversation

@briankosw commented Dec 6, 2020

Implemented a basic script that demonstrates distributed processing with Hydra, as mentioned in #42.
The command to run the script is:

python ddp_00.py -m rank=... init_method=...

where rank is a list of ranks (either a comma-separated list of integers or range(start, stop)) and init_method is a string that selects one of the two supported initialization methods: TCP initialization or shared file-system initialization (environment-variable initialization is a separate mechanism and is not configured through init_method).

I'll add documentation (Markdown) that explains distributed processing in PyTorch as well as how Hydra can kick off distributed processes, as demonstrated in the script.
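
For reference, here is a minimal sketch of what such a script could look like. The rank field, the config store name, and the exact defaults are assumptions for illustration and may differ from the actual ddp_00.py:

from dataclasses import dataclass

import hydra
import torch.distributed as dist
from hydra.core.config_store import ConfigStore
from omegaconf import MISSING


@dataclass
class DDPConf:
    backend: str = "gloo"
    # e.g. "tcp://127.0.0.1:29500" (TCP) or "file:///tmp/ddp_init" (shared file system)
    init_method: str = MISSING
    world_size: int = 4
    rank: int = MISSING  # supplied per job by the multirun sweep


cs = ConfigStore.instance()
cs.store(name="ddp", node=DDPConf)


@hydra.main(config_name="ddp")
def main(cfg: DDPConf) -> None:
    # Each multirun job joins the same process group under its own rank.
    dist.init_process_group(
        backend=cfg.backend,
        init_method=cfg.init_method,
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Since init_process_group() blocks until world_size processes have joined, the ranks need to run in parallel, which is why the script configures the joblib launcher. A sweep over four ranks would then look like:

python ddp_00.py -m init_method=tcp://127.0.0.1:29500 "rank=range(0,4)"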

@facebook-github-bot added the CLA Signed label on Dec 6, 2020. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@omry (Contributor) commented Dec 6, 2020

rank can also be a range(start,stop).

Review thread on:

@dataclass
class DDPConf:

Contributor:

If we plan to use general structured configs like this one often, we should probably provide them somewhere in this repo. Although they're not configen projects, we could still have a directory for them with a namespace package to import from. Perhaps hydra-configs-torchdistributed or -torchtools?

Alternatively, maybe there are some classes in torch.distributed we could include in hydra-configs-torch.

@briankosw Could you take a look and see if there are any classes in there that make sense for generating confs from?

Author:

I'll look into it. I feel like most of PyTorch's distributed processing APIs aren't user-facing, but there are certain ones that come up over and over again, and it could be useful to have generated configs for those.

Contributor:

I would defer any attempt at creating reusable distributed-related configs until after we have a solid understanding of what we can support and how.

Contributor:

I agree. Just something to keep in mind as we move forward.

@shagunsodhani commented:

> rank can also be a range(start,stop).

Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.

@omry (Contributor) commented Dec 7, 2020

> rank can also be a range(start,stop).
>
> Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.

I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this (it's not obvious that it does).
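
For what it's worth, even without callbacks one could infer the rank from the job index that Hydra assigns to each job in a multirun sweep. A sketch (not part of this PR, and it assumes one process per rank):

from hydra.core.hydra_config import HydraConfig


def infer_rank() -> int:
    # hydra.job.num is the sequential index of the current job within a
    # multirun sweep; with one process per rank it can stand in for the rank.
    return HydraConfig.get().job.num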

@romesco linked an issue on Dec 17, 2020 that may be closed by this pull request.


Review thread on:

defaults = [
    {"hydra/launcher": "joblib"}

Contributor:

Suggested change:
-    {"hydra/launcher": "joblib"}
+    {"override hydra/launcher": "joblib"}

I believe this will be required moving forward, correct @omry?

Contributor:

Yes.

Contributor:

As of Hydra 1.1, which we are now officially shooting for.
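
Putting the snippet and the suggestion together, the structured config under Hydra 1.1 would look roughly like this (a sketch; only the launcher entry comes from the suggestion above):

from dataclasses import dataclass, field
from typing import Any, List

from omegaconf import MISSING

# In Hydra 1.1, replacing the default launcher requires the override keyword.
defaults = [
    {"override hydra/launcher": "joblib"},
]


@dataclass
class DDPConf:
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
    world_size: int = 4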

Review thread on:

class DDPConf:
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING

Contributor:

Suggested change:
-    init_method: str = MISSING
+    MASTER_ADDR: str = "127.0.0.1"
+    MASTER_PORT: str = str(random.randint(48620, 49150))
+    init_method: str = "tcp://" + MASTER_ADDR + ":" + MASTER_PORT

How about something like this?

Contributor:

What happens if you override the MASTER_PORT externally?
(my guess is that it will not work correctly)

Contributor:

What do you mean by externally? As in changing the OS environment variable? I'm not sure. I think they recommend either handling this fully within init_method or via environment variables, but not both.
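
If the intent is for init_method to track MASTER_ADDR and MASTER_PORT even when they are overridden on the command line, one option would be OmegaConf interpolation rather than Python-level string concatenation. A sketch, assuming DDPConf is the root of the composed config:

from dataclasses import dataclass


@dataclass
class DDPConf:
    MASTER_ADDR: str = "127.0.0.1"
    MASTER_PORT: str = "29500"
    # Interpolations are resolved when the value is accessed, so overrides of
    # MASTER_ADDR or MASTER_PORT are reflected in init_method.
    init_method: str = "tcp://${MASTER_ADDR}:${MASTER_PORT}"

(Note that all ranks must agree on the same address and port for the rendezvous to succeed.)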

Review thread on:

    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
    world_size: int = 4

Contributor:

Suggested change:
-    world_size: int = 4
+    nodes: int = 1
+    gpus_per_node: int = 2
+    world_size: int = nodes * gpus_per_node

Similarly, what about doing this?

Reply:

My only concern would be that gpus_per_node is an implementation detail :) I need 8 GPUs. Whether they come from 2 nodes (4 GPUs each) or 3 nodes (2, 2, and 4 GPUs) doesn't make any functional difference to DDP.

@romesco (Contributor) commented Jan 16, 2021

Good point.

For some reason, in the past I thought it was helpful to have more control over the distribution of GPUs per node, but I can't recall why, and it seems to make no difference to init_process_group().

So you can probably disregard this particular suggestion!

@romesco (Contributor) commented Jan 13, 2021

> rank can also be a range(start,stop).
>
> Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.
>
> I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this (it's not obvious that it does).

@shagunsodhani What would it take to implement a callback?

@shagunsodhani commented:

> rank can also be a range(start,stop).
>
> Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.
>
> I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this (it's not obvious that it does).
>
> @shagunsodhani What would it take to implement a callback?

Oops, sorry, I missed this comment :) @omry will have better insight about that. cc'ing @jieru-hu, who is working on callbacks.

@omry (Contributor) commented Jan 15, 2021

Callbacks will likely be pushed back to Hydra 1.2.

@romesco (Contributor) commented Apr 22, 2021

@briankosw I'd love to help you push this forward. How are you doing? Bogged down in other work? I know that feeling, haha! Let me know how I can help.

@omry (Contributor) commented Apr 22, 2021

> Callbacks will likely be pushed back to Hydra 1.2.

We will have callbacks in 1.1, but I am no longer sure we should leverage them here. Callbacks are not documented yet, but they will be before 1.1 is released (as an example app), and it should now be possible to play with them on master.

Successfully merging this pull request may close the following issue: Single-node distributed processing with Hydra.