Basic distributed processing with Hydra #43
base: main
Conversation
rank can also be a …
@dataclass
class DDPConf:
If we plan to use general structured configs like this one often, we should probably provide them somewhere in this repo. Although they're not configen projects, we could still have a directory for them with a namespace package to import from. Perhaps `hydra-configs-torchdistributed` or `-torchtools`?
Alternatively, maybe there are some classes in `torch.distributed` we could include in `hydra-configs-torch`.
@briankosw Could you take a look and see if there are any classes in there that make sense for generating confs from?
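To make the idea concrete, here is a rough sketch of what a hand-written structured config for `torch.distributed.init_process_group` could look like. This is not code from the PR, and the ConfigStore group and name are made-up placeholders; a real `hydra-configs-torchdistributed` package would pick its own conventions.

```python
from dataclasses import dataclass

from hydra.core.config_store import ConfigStore
from omegaconf import MISSING


@dataclass
class InitProcessGroupConf:
    # Mirrors the most commonly used arguments of torch.distributed.init_process_group.
    backend: str = "gloo"
    init_method: str = MISSING
    world_size: int = MISSING
    rank: int = MISSING


cs = ConfigStore.instance()
# Group and name below are illustrative placeholders only.
cs.store(group="torch/distributed", name="init_process_group", node=InitProcessGroupConf)
```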
I'll look into it. I feel like most of PyTorch's distributed processing APIs aren't user-facing, but there are certain ones that come up over and over again, and it could be useful to have generated configs for them.
I would defer any attempt at creating reusable distributed-related configs until after we have a solid understanding of what we can support and how.
I agree. Just something to keep in mind as we move forward.
Not sure if the user even needs to provide the …
I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we'd need to make sure the design would actually support this. (It's not obvious that this is the case.)
defaults = [
    {"hydra/launcher": "joblib"}
{"hydra/launcher": "joblib"} | |
{"override hydra/launcher": "joblib"} |
I believe this will be required moving forward, correct @omry?
Yes.
As of Hydra 1.1, which we are now officially shooting for.
class DDPConf:
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
Suggested change:
-    init_method: str = MISSING
+    MASTER_ADDR: str = "127.0.0.1"
+    MASTER_PORT: str = str(random.randint(48620, 49150))
+    init_method: str = "tcp://" + MASTER_ADDR + ":" + MASTER_PORT
How about something like this?
What happens if you override the MASTER_PORT externally?
(my guess is that it will not work correctly)
What do you mean by externally? As in changing the OS environment variable? I'm not sure. I think they recommend either handling this fully within `init_method` or by setting ENV variables, but not both.
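To make that trade-off concrete, here is a small illustrative sketch of the two styles: driving everything through `init_method`, versus driving everything through environment variables with `init_method="env://"`. The addresses and ports are made up, not values from this PR.

```python
import os

import torch.distributed as dist


def init_via_tcp(rank: int, world_size: int) -> None:
    # Everything goes through init_method; no environment variables are touched.
    dist.init_process_group(
        backend="gloo",
        init_method="tcp://127.0.0.1:29500",
        rank=rank,
        world_size=world_size,
    )


def init_via_env(rank: int, world_size: int) -> None:
    # Everything goes through environment variables; "env://" tells PyTorch to read them.
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group(
        backend="gloo",
        init_method="env://",
        rank=rank,
        world_size=world_size,
    )
```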
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
    world_size: int = 4
Suggested change:
-    world_size: int = 4
+    nodes: int = 1
+    gpus_per_node: int = 2
+    world_size: int = nodes * gpus_per_node
Similarly, what about doing this?
My only concern is that gpus_per_node is an implementation detail :) I need 8 GPUs; whether they come from 2 nodes (4 GPUs each) or 3 nodes (2, 2, and 4 GPUs) doesn't make any functional difference to DDP.
Good point. For some reason, in the past I thought it was helpful to have more control over the distribution of GPUs per node, but I can't recall why, and it seems to make no difference to `init_process_group()`. So you can probably disregard this particular suggestion!
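As a small aside illustrating the point resolved above (not code from this PR): only the product matters to DDP, not the per-node layout.

```python
# Two different node layouts that are indistinguishable from DDP's point of view.
layout_a = [4, 4]      # 2 nodes with 4 GPUs each
layout_b = [2, 2, 4]   # 3 nodes with 2, 2, and 4 GPUs
assert sum(layout_a) == sum(layout_b) == 8  # same world_size either way
```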
@shagunsodhani What would it take to implement a callback?
Oops, sorry, I missed this comment :) @omry will have better insight about that. cc'ing @jieru-hu, who is working on callbacks.
Callbacks will likely be pushed back to Hydra 1.2.
@briankosw I'd love to help you push this forward. How are you doing? Bogged down in other work? Because I know that feeling, haha! Let me know how I can help.
We will have callbacks in 1.1, but I am no longer sure we should leverage them here. It should now be possible to play with them on master.
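For anyone who wants to experiment on master, a minimal sketch of Hydra's experimental callback API is below. The class name and the idea of doing distributed setup in a hook are hypothetical, and it remains an open question in this thread whether this is the right place for it.

```python
from typing import Any

from hydra.experimental.callback import Callback
from omegaconf import DictConfig


class DistributedSetupCallback(Callback):
    """Hypothetical callback; not part of this PR."""

    def on_job_start(self, config: DictConfig, **kwargs: Any) -> None:
        # A real implementation might read rank/world_size from `config` here
        # and call torch.distributed.init_process_group before the task runs.
        print("on_job_start: job is about to run")
```

If I recall correctly, callbacks are wired up under the `hydra.callbacks` node of the config, with `_target_` pointing at a class like the one above.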
Implemented a basic script that demonstrates distributed processing with Hydra, as mentioned in #42.
The command to run the script is:
where `rank` is a list of the ranks (either a comma-separated list of integers or `range(start, stop)`) and `init_method` is a string that specifies one of the two possible initialization methods: TCP initialization and shared file-system initialization (environment variable initialization is not related to `init_method`).

I'll add documentation (Markdown) that explains distributed processing in PyTorch, as well as how Hydra can kick off distributed processes as demonstrated in the script.
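To make the description above concrete, here is a rough, self-contained sketch of the pattern. The file name, default values, and the exact CLI below are my own assumptions rather than what this PR necessarily ships: each job in a Hydra multirun sweep over `rank` becomes one process of the group, and the joblib launcher runs the jobs in parallel on the local machine.

```python
from dataclasses import dataclass, field
from typing import Any, List

import hydra
import torch.distributed as dist
from hydra.core.config_store import ConfigStore
from omegaconf import MISSING

# Hydra 1.1 defaults-list syntax: overriding an existing default (the launcher)
# requires the "override" keyword. The joblib launcher plugin must be installed.
defaults = [
    {"override hydra/launcher": "joblib"},
]


@dataclass
class DDPConf:
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING  # e.g. "tcp://127.0.0.1:29500" or "file:///tmp/ddp_init"
    world_size: int = 4
    rank: int = MISSING  # supplied per job by the multirun sweep


cs = ConfigStore.instance()
cs.store(name="ddp_conf", node=DDPConf)


@hydra.main(config_name="ddp_conf")
def main(cfg: DDPConf) -> None:
    # Each Hydra job joins the same process group as one rank.
    dist.init_process_group(
        backend=cfg.backend,
        init_method=cfg.init_method,
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    print(f"rank {dist.get_rank()} of {dist.get_world_size()} initialized")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```

A hypothetical invocation, assuming the file is saved as `my_app.py`, could look like `python my_app.py -m rank='range(0,4)' init_method=tcp://127.0.0.1:29500`; the exact command for the script in this PR may differ.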