Basic distributed processing with Hydra #43
base: main
@@ -0,0 +1,50 @@
from dataclasses import dataclass, field
import logging
import os
from typing import Any, List

import hydra
from hydra.core.config_store import ConfigStore
from omegaconf import DictConfig, MISSING
import torch
import torch.distributed as dist

log = logging.getLogger(__name__)


defaults = [
    {"hydra/launcher": "joblib"}
]


@dataclass
class DDPConf:

Review thread (class DDPConf):

Comment: If we plan to use general structured configs like this one often, we should probably provide them somewhere in this repo. Although they're not configen projects, we could still have a directory for them with a namespace package to import from. Alternatively, maybe there are some classes in torch.distributed we could include there. @briankosw, could you take a look and see if there are any classes in there that make sense for generating confs from?

Reply: I'll look into it. I feel like most of PyTorch's distributed processing API isn't user-facing, but there are certain ones that come up over and over again, and it could be useful to have generated configs for them.

Reply: I would defer any attempt at creating reusable distributed-related configs until after we have a solid understanding of what we can support and how.

Reply: I agree. Just something to keep in mind as we move forward.
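
To make the idea concrete, here is a rough sketch of what a shared structured config mirroring torch.distributed.init_process_group could look like; the class name, the field set, and where it would live (for example a hydra_configs.torch.distributed namespace package) are assumptions for illustration, not anything that exists in this repo:

    from dataclasses import dataclass

    from omegaconf import MISSING


    @dataclass
    class InitProcessGroupConf:
        # Mirrors a subset of torch.distributed.init_process_group's parameters.
        backend: str = "gloo"       # "gloo", "nccl", or "mpi"
        init_method: str = MISSING  # e.g. "tcp://host:port" or "env://"
        world_size: int = MISSING   # total number of processes
        rank: int = MISSING         # global rank of the current process
        group_name: str = ""
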
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING

Review thread (init_method):

Comment: Suggested change: how about something like this?

Reply: What happens if you override the MASTER_PORT externally?

Reply: What do you mean, externally? As in changing the os environment variable? I'm not sure. I think they recommend either handling this fully within
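
The suggested replacement itself is not reproduced above, but one common pattern in this direction is to rendezvous through the MASTER_ADDR/MASTER_PORT environment variables with init_method="env://". The sketch below is purely illustrative (the function name is made up, and this is not the reviewer's actual suggestion); using setdefault also speaks to the question about external overrides, since an externally set MASTER_PORT would take precedence:

    import os

    import torch.distributed as dist


    def init_from_env(backend: str, world_size: int, rank: int) -> None:
        # Keep any MASTER_ADDR/MASTER_PORT already present in the environment;
        # fall back to these defaults only if nothing was set externally.
        os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
        os.environ.setdefault("MASTER_PORT", "29500")
        dist.init_process_group(
            backend=backend,
            init_method="env://",  # reads MASTER_ADDR/MASTER_PORT at init time
            world_size=world_size,
            rank=rank,
        )
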
    world_size: int = 4

Review thread (world_size):

Comment: Suggested change: similarly, what about doing this?

Reply: My only concern would be that gpus_per_node is an implementation detail :) I need 8 GPUs; whether they come from 2 nodes (4 GPUs each) or 3 nodes (2, 2, 4 GPUs) doesn't make any functional difference to DDP.

Reply: Good point. For some reason in the past I thought it was helpful to have more control over the distribution of GPUs per node, but I can't recall why, and it seems it makes no difference. So you can probably disregard this particular suggestion!
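
To illustrate the point (the names num_nodes, gpus_per_node, node_rank and local_rank below are illustrative, not part of this PR): DDP only consumes a global world_size and rank, and how a launcher derives those from nodes and GPUs is its own business:

    def global_rank(node_rank: int, gpus_per_node: int, local_rank: int) -> int:
        # One way a launcher might assign a global rank to each process.
        return node_rank * gpus_per_node + local_rank


    def total_world_size(num_nodes: int, gpus_per_node: int) -> int:
        return num_nodes * gpus_per_node


    # 2 nodes x 4 GPUs and 4 nodes x 2 GPUs both give world_size == 8, which is
    # all that init_process_group ultimately sees.
    assert total_world_size(2, 4) == total_world_size(4, 2) == 8
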
    rank: int = 0


cs = ConfigStore.instance()
cs.store(name="ddp", node=DDPConf)


@hydra.main(config_name="ddp")
def main(cfg: DictConfig):
    dist.init_process_group(
        backend=cfg.backend,
        init_method=cfg.init_method,
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    group = dist.new_group(list(range(cfg.world_size)))
    value = torch.tensor([cfg.rank])
    log.info(f"Rank {cfg.rank} - Value: {value}")
    dist.reduce(value, dst=0, op=dist.ReduceOp.SUM, group=group)
    if cfg.rank == 0:
        average = value / 4.0
        log.info(f"Rank {cfg.rank} - Average: {average}")


if __name__ == "__main__":
    main()
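
For reference, when this runs with the default config (world_size=4, one process per rank 0 through 3, each contributing its own rank as the value), the reduce on rank 0 sums 0 + 1 + 2 + 3 = 6, so the logged average is 6 / 4.0 = 1.5.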

Review thread (the if __name__ == "__main__" guard):

Comment: I believe this will be required moving forward, correct @omry?

Reply: Yes.

Reply: As of Hydra 1.1, which we are now officially shooting for.
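
Not shown in the diff itself: with the joblib launcher pulled in by the defaults list, one plausible way to start the four processes is a multirun sweep over rank, along the lines of

    python ddp_example.py --multirun rank=0,1,2,3 init_method=tcp://127.0.0.1:29500

where the script name is hypothetical and init_method (which defaults to MISSING) still has to be supplied. Each job then runs in its own worker process, and the four ranks rendezvous into a single process group.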