
Basic distributed processing with Hydra #43

Draft: wants to merge 1 commit into main

Conversation

@briankosw commented Dec 6, 2020

Implemented a basic script that demonstrates distributed processing with Hydra, as mentioned in #42.
The command to run the script is:

python ddp_00.py -m rank=... init_method=...

where rank is a list of ranks (either a comma-separated list of integers or range(start, stop)) and init_method is a string that selects one of the two supported initialization methods: TCP initialization or shared file-system initialization (environment-variable initialization is a separate mechanism and is not configured through init_method).

I'll add documentation (Markdown) that explains distributed processing in PyTorch as well as how Hydra can kick off distributed processes, as demonstrated in the script.
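
For reference, here is a minimal sketch of what such a script could look like. The rank field, the config store name, and the exact defaults are assumptions for illustration and may differ from the actual ddp_00.py:

from dataclasses import dataclass

import hydra
import torch.distributed as dist
from hydra.core.config_store import ConfigStore
from omegaconf import MISSING


@dataclass
class DDPConf:
    backend: str = "gloo"
    # e.g. "tcp://127.0.0.1:29500" (TCP) or "file:///tmp/ddp_init" (shared file system)
    init_method: str = MISSING
    world_size: int = 4
    rank: int = MISSING  # supplied per job by the multirun sweep


cs = ConfigStore.instance()
cs.store(name="ddp", node=DDPConf)


@hydra.main(config_name="ddp")
def main(cfg: DDPConf) -> None:
    # Each multirun job joins the same process group under its own rank.
    dist.init_process_group(
        backend=cfg.backend,
        init_method=cfg.init_method,
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    print(f"rank {dist.get_rank()} / world size {dist.get_world_size()}")
    dist.destroy_process_group()


if __name__ == "__main__":
    main()

Since init_process_group() blocks until world_size processes have joined, the ranks need to run in parallel, which is why the script configures the joblib launcher. A sweep over four ranks would then look like:

python ddp_00.py -m init_method=tcp://127.0.0.1:29500 "rank=range(0,4)"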

@facebook-github-bot added the CLA Signed label on Dec 6, 2020. (This label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed.)
@omry (Contributor) commented Dec 6, 2020

rank can also be a range(start,stop).

Review thread on:

@dataclass
class DDPConf:

Contributor:

If we plan to use general structured configs like this one often, we should probably provide them somewhere in this repo. Although they're not configen projects, we could still have a directory for them with a namespace package to import from. Perhaps hydra-configs-torchdistributed or -torchtools?

Alternatively, maybe there are some classes in torch.distributed we could include in hydra-configs-torch.

@briankosw Could you take a look and see if there are any classes in there that make sense for generating confs from?

Author:

I'll look into it. I feel like most of PyTorch's distributed processing APIs aren't user-facing, but there are certain ones that come up over and over again, and it could be useful to have generated configs for those.

Contributor:

I would defer any attempt at creating reusable distributed-related configs until after we have a solid understanding of what we can support and how.

Contributor:

I agree. Just something to keep in mind as we move forward.

@shagunsodhani commented:

> rank can also be a range(start,stop).

Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.

@omry (Contributor) commented Dec 7, 2020

> rank can also be a range(start,stop).
>
> Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.

I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this (it's not obvious that it does).
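
For what it's worth, even without callbacks one could infer the rank from the job index that Hydra assigns to each job in a multirun sweep. A sketch (not part of this PR, and it assumes one process per rank):

from hydra.core.hydra_config import HydraConfig


def infer_rank() -> int:
    # hydra.job.num is the sequential index of the current job within a
    # multirun sweep; with one process per rank it can stand in for the rank.
    return HydraConfig.get().job.num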

@romesco linked an issue on Dec 17, 2020 that may be closed by this pull request.


Review thread on:

defaults = [
    {"hydra/launcher": "joblib"}

Contributor:

Suggested change:
-    {"hydra/launcher": "joblib"}
+    {"override hydra/launcher": "joblib"}

I believe this will be required moving forward, correct @omry?

Contributor:

Yes.

Contributor:

As of Hydra 1.1, which we are now officially shooting for.
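
Putting the snippet and the suggestion together, the structured config under Hydra 1.1 would look roughly like this (a sketch; only the launcher entry comes from the suggestion above):

from dataclasses import dataclass, field
from typing import Any, List

from omegaconf import MISSING

# In Hydra 1.1, replacing the default launcher requires the override keyword.
defaults = [
    {"override hydra/launcher": "joblib"},
]


@dataclass
class DDPConf:
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
    world_size: int = 4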

Review thread on:

class DDPConf:
    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING

Contributor:

Suggested change:
-    init_method: str = MISSING
+    MASTER_ADDR: str = "127.0.0.1"
+    MASTER_PORT: str = str(random.randint(48620, 49150))
+    init_method: str = "tcp://" + MASTER_ADDR + ":" + MASTER_PORT

How about something like this?

Contributor:

What happens if you override the MASTER_PORT externally?
(my guess is that it will not work correctly)

Contributor:

What do you mean by externally? As in changing the OS environment variable? I'm not sure. I think they recommend either handling this fully within init_method or via environment variables, but not both.
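
If the intent is for init_method to track MASTER_ADDR and MASTER_PORT even when they are overridden on the command line, one option would be OmegaConf interpolation rather than Python-level string concatenation. A sketch, assuming DDPConf is the root of the composed config:

from dataclasses import dataclass


@dataclass
class DDPConf:
    MASTER_ADDR: str = "127.0.0.1"
    MASTER_PORT: str = "29500"
    # Interpolations are resolved when the value is accessed, so overrides of
    # MASTER_ADDR or MASTER_PORT are reflected in init_method.
    init_method: str = "tcp://${MASTER_ADDR}:${MASTER_PORT}"

(Note that all ranks must agree on the same address and port for the rendezvous to succeed.)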

Review thread on:

    defaults: List[Any] = field(default_factory=lambda: defaults)
    backend: str = "gloo"
    init_method: str = MISSING
    world_size: int = 4

Contributor:

Suggested change:
-    world_size: int = 4
+    nodes: int = 1
+    gpus_per_node: int = 2
+    world_size: int = nodes * gpus_per_node

Similarly, what about doing this?

Reply:

My only concern would be that gpus_per_node is an implementation detail :) I need 8 GPUs. Whether they come from 2 nodes (4 GPUs each) or 3 nodes (2, 2, and 4 GPUs) doesn't make any functional difference to DDP.

@romesco (Contributor) commented Jan 16, 2021

Good point.

For some reason, in the past I thought it was helpful to have more control over the distribution of GPUs per node, but I can't recall why, and it seems to make no difference to init_process_group().

So you can probably disregard this particular suggestion!

@romesco (Contributor) commented Jan 13, 2021

> rank can also be a range(start,stop).
>
> Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.
>
> I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this (it's not obvious that it does).

@shagunsodhani What would it take to implement a callback?

@shagunsodhani commented:

> rank can also be a range(start,stop).
>
> Not sure the user even needs to provide the rank argument explicitly. rank has to vary from 0 to num_gpus - 1 (for standard use cases), so we might just infer it ourselves. I understand why it has to be specified manually right now, but this could be a useful example for callbacks.
>
> I agree that it's not the best user experience, but we don't have callbacks right now, and even if we did, we would need to make sure the design actually supports this (it's not obvious that it does).
>
> @shagunsodhani What would it take to implement a callback?

Oops, sorry, I missed this comment :) @omry will have better insight about that. cc'ing @jieru-hu, who is working on callbacks.

@omry (Contributor) commented Jan 15, 2021

Callbacks will likely be pushed back to Hydra 1.2.

@romesco (Contributor) commented Apr 22, 2021

@briankosw I'd love to help you push this forward. How are you doing? Bogged down in other work? I know that feeling, haha! Let me know how I can help.

@omry (Contributor) commented Apr 22, 2021

> Callbacks will likely be pushed back to Hydra 1.2.

We will have callbacks in 1.1, but I am no longer sure we should leverage them here. Callbacks are not documented yet, but they will be before 1.1 is released (as an example app), and it should now be possible to play with them on master.

Successfully merging this pull request may close the following issue: Single-node distributed processing with Hydra.