Basic distributed processing with Hydra #43

Draft · wants to merge 1 commit into main
50 changes: 50 additions & 0 deletions examples/ddp_00.py
@@ -0,0 +1,50 @@
from dataclasses import dataclass, field
import logging
import os
from typing import Any, List

import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import DictConfig, MISSING
import torch
import torch.distributed as dist

log = logging.getLogger(__name__)


defaults = [
{"hydra/launcher": "joblib"}
Contributor:

Suggested change:
-    {"hydra/launcher": "joblib"}
+    {"override hydra/launcher": "joblib"}

I believe this will be required moving forward, correct @omry?

Contributor:

Yes.

Contributor:

As of Hydra 1.1, which we are now officially shooting for.

]

@dataclass
class DDPConf:
Contributor:

If we plan to use general structured configs like this one often, we should probably provide them somewhere in this repo. Although they're not configen projects, we could still have a directory for them with a namespace package to import from. Perhaps hydra-configs-torchdistributed or -torchtools?

Alternatively, maybe there are some classes in torch.distributed we could include in hydra-configs-torch.

@briankosw Could you take a look and see if there are any classes in there that make sense for generating confs from?

Author:

I'll look into it. I feel like most of PyTorch's distributed processing API isn't user-facing, but certain parts come up over and over again, and it could be useful to have generated configs for them.

Contributor:

I would defer any attempt at creating reusable distributed-related configs until we have a solid understanding of what we can support and how.

Contributor:

I agree. Just something to keep in mind as we move forward.
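
For illustration, a minimal sketch of what such a structured config could look like, mirroring the main arguments of torch.distributed.init_process_group. This is not something the PR or any hydra-configs package currently provides; the class name and config-store group are made up:

from dataclasses import dataclass

from hydra.core.config_store import ConfigStore
from omegaconf import MISSING


@dataclass
class InitProcessGroupConf:
    # Hypothetical config class; names and defaults are illustrative only.
    backend: str = "gloo"       # "gloo", "nccl", or "mpi"
    init_method: str = MISSING  # e.g. "tcp://127.0.0.1:29500" or "env://"
    world_size: int = -1        # -1 defers to the init method / environment
    rank: int = -1              # -1 defers to the init method / environment


cs = ConfigStore.instance()
cs.store(group="torch/distributed", name="init_process_group", node=InitProcessGroupConf)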

    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
Contributor:

Suggested change:
-    init_method: str = MISSING
+    MASTER_ADDR: str = "127.0.0.1"
+    MASTER_PORT: str = str(random.randint(48620, 49150))
+    init_method: str = "tcp://" + MASTER_ADDR + ":" + MASTER_PORT

How about something like this?

Contributor:

What happens if you override MASTER_PORT externally? (My guess is that it will not work correctly.)

Contributor:

What do you mean by externally? As in changing the os environment variable? I'm not sure. I think the recommendation is to handle this either fully within init_method or by setting environment variables, but not both.
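
For illustration, a minimal sketch (not part of this PR) of one way to reconcile the two approaches: prefer MASTER_ADDR/MASTER_PORT from the environment when they are set, and otherwise fall back to a local tcp:// address with a randomly chosen port:

import os
import random


def default_init_method() -> str:
    # Hypothetical helper: prefer an externally provided rendezvous address,
    # otherwise fall back to localhost and a port from the dynamic range.
    addr = os.environ.get("MASTER_ADDR", "127.0.0.1")
    port = os.environ.get("MASTER_PORT", str(random.randint(48620, 49150)))
    # Note: every rank must agree on the same address and port, so a random
    # default only works if it is resolved once and shared with all processes.
    return f"tcp://{addr}:{port}"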

    world_size: int = 4
Contributor:

Suggested change:
-    world_size: int = 4
+    nodes: int = 1
+    gpus_per_node: int = 2
+    world_size: int = nodes * gpus_per_node

Similarly, what about doing this?

Contributor:

My only concern is that gpus_per_node is an implementation detail :) I need 8 GPUs. Whether they come from 2 nodes (4 GPUs each) or 3 nodes (2, 2, 4 GPUs) doesn't make any functional difference to DDP.

Contributor (@romesco, Jan 16, 2021):

Good point.

For some reason, in the past I thought it was helpful to have more control over the distribution of GPUs per node, but I can't recall why, and it seems to make no difference to init_process_group().

So you can probably disregard this particular suggestion!

    rank: int = 0


cs = ConfigStore.instance()
cs.store(name="ddp", node=DDPConf)


@hydra.main(config_name="ddp")
def main(cfg: DictConfig):
    dist.init_process_group(
        backend=cfg.backend,
        init_method=cfg.init_method,
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    # Each process contributes its rank; the values are summed onto rank 0.
    group = dist.new_group(list(range(cfg.world_size)))
    value = torch.tensor([cfg.rank])
    log.info(f"Rank {cfg.rank} - Value: {value}")
    dist.reduce(value, dst=0, op=dist.ReduceOp.SUM, group=group)
    if cfg.rank == 0:
        # Divide by the actual world size rather than a hard-coded 4.0.
        average = value / cfg.world_size
        log.info(f"Rank {cfg.rank} - Average: {average}")


if __name__ == "__main__":
    main()
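
For context, a minimal usage sketch, assuming the hydra-joblib-launcher plugin is installed and the override keyword fix discussed above is applied: the example would be launched as a multirun sweep over rank, so the joblib launcher starts one local job per rank value, for example:

    python examples/ddp_00.py --multirun init_method=tcp://127.0.0.1:29500 rank=0,1,2,3

The tcp:// value may need quoting depending on the shell and Hydra version, and port 29500 is just an example. Provided the launcher runs the jobs concurrently (init_process_group blocks until all world_size ranks have joined), each job contributes its rank, the values are summed onto rank 0 with dist.reduce, and rank 0 logs the average.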