Single-node ImageNet DDP implementation #38

Draft · wants to merge 3 commits into main

Conversation

@briankosw commented Dec 1, 2020

Implements ImageNet DDP, as mentioned in #33.

Most of the code is the same, and the major differences are the handling of distributed processes and the configuration.

One can use Hydra's multirun capability to launch the distributed processes, instead of PyTorch's or Python's multiprocessing APIs.

python imagenet.py -m rank=0,1,2,3
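
For context, here is a minimal sketch of what each multirun job could do with its rank value; the DDPConf fields, the rendezvous address, and the backend below are illustrative assumptions, not the exact contents of this PR.

import hydra
import torch
import torch.distributed as dist
from dataclasses import dataclass
from hydra.core.config_store import ConfigStore


@dataclass
class DDPConf:
    rank: int = 0        # overridden per job by `-m rank=0,1,2,3`
    world_size: int = 4  # one process per GPU on a single node
    backend: str = "nccl"


cs = ConfigStore.instance()
cs.store(name="ddpconf", node=DDPConf)


@hydra.main(config_name="ddpconf")
def main(cfg: DDPConf) -> None:
    # Each multirun job becomes one DDP process, identified by its rank.
    dist.init_process_group(
        backend=cfg.backend,
        init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    torch.cuda.set_device(cfg.rank)


if __name__ == "__main__":
    main()

With a parallel launcher such as the Joblib plugin configured, `-m rank=0,1,2,3` then produces four such processes that rendezvous into one process group.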

@facebook-github-bot added the CLA Signed label Dec 1, 2020
@briankosw changed the title from "Transcribed imagenet code" to "ImageNet DDP implementation" Dec 1, 2020
examples/imagenet.py (outdated)
best_acc1 = 0


@hydra.main(config_name="imagenetconf")
Contributor

Did you forget to include the config?

Author

I wasn't sure how to use configen, and I was mostly looking at mnist_00.py, but I couldn't find its corresponding configuration files. Should I just handcraft one and put it in the examples folder?

Contributor

Yes. configen is used to generate configs for libraries. For examples and user code, people should write their own configs.

Contributor

@briankosw

In mnist_00.py, I included the config at the top of the file.

@dataclass
class MNISTConf:
    batch_size: int = 64
    test_batch_size: int = 1000
    epochs: int = 14
    no_cuda: bool = False
    dry_run: bool = False
    seed: int = 1
    log_interval: int = 10
    save_model: bool = False
    checkpoint_name: str = "unnamed.pt"
    adadelta: AdadeltaConf = AdadeltaConf()
    steplr: StepLRConf = StepLRConf(
        step_size=1
    )  # we pass a default for step_size since it is required, but missing a default in PyTorch (and consequently in hydra-torch)

See the second code block of https://github.com/pytorch/hydra-torch/blob/master/examples/mnist_00.md#parting-with-argparse for the explanation!
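
For completeness, the way such a dataclass typically gets wired up in the same file looks roughly like the sketch below (building on the MNISTConf above; this is not the verbatim mnist_00.py code, and the config name "mnistconf" is an assumption):

import hydra
from hydra.core.config_store import ConfigStore

cs = ConfigStore.instance()
cs.store(name="mnistconf", node=MNISTConf)  # registers the dataclass as a named config


@hydra.main(config_name="mnistconf")
def main(cfg: MNISTConf) -> None:
    print(cfg.batch_size)  # any field can be overridden on the command line, e.g. batch_size=128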

Contributor

Generally, I think it is cleaner to make your own config Python files and then import them. I simply included it in one flat file so that the tutorial reader can see it all in one place.

Contributor

The draft mnist_01.py, which I have yet to finalize, does the same thing. It introduces some more hierarchical composition, so you'll see multiple 'non-configen' configs defined. Here's the whole 'hydra block':
https://github.com/pytorch/hydra-torch/blob/examples/mnist_01/examples/mnist_01.py#L10-L56

Author

Thanks! I'll do as you suggested.

examples/imagenet.py (outdated)
@shagunsodhani left a comment

Additional comments:

  1. Can you add the example command for running when we are training over 2 nodes (say 16 GPUs)? I think some additional support is needed for running the script on more than 8 GPUs, as setting rank=9 would not work because device("cuda:9") would not exist (see the sketch after this list).

  2. Do we only care about the DDP use case here? If yes, we should probably remove the logic corresponding to DataParallel and CPU training. I do not feel strongly about removing them, but if we care only about DDP, maybe that logic should go.
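
To make the concern in item 1 concrete, here is a small sketch of the kind of rank-to-device mapping that would be needed; the helper name is hypothetical and not part of this PR.

import torch


def local_device_for(global_rank: int) -> torch.device:
    # Hypothetical helper: with 8 GPUs per node, global rank 9 maps to
    # cuda:1 on the second node rather than to a nonexistent cuda:9.
    local_rank = global_rank % torch.cuda.device_count()
    return torch.device(f"cuda:{local_rank}")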

examples/imagenet.py (outdated)
examples/imagenet.py (outdated)
@briankosw (Author) commented Dec 1, 2020

Thank you for the review @shagunsodhani. To reply to your comments:

  1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers on those, I'd appreciate it, but I'll look into them myself. I'm more familiar with single-node multiprocessing, which is why I've been inclined to write single-node code.
  2. I've only included those so that it's consistent with the training script in PyTorch. I do agree with you that it is extraneous to this example, so I will clean those up!

Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?

@shagunsodhani

> Thank you for the review @shagunsodhani. To reply to your comments:
>
> 1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers on those, I'd appreciate it, but I'll look into them myself. I'm more familiar with single-node multiprocessing, which is why I've been inclined to write single-node code.

We launch one process per GPU. These GPUs can live on any node (the extreme case being one GPU per node). We need to handle two things:

  1. How do the nodes discover each other (i.e., how do we know the master address)? This relates to the comment on the previous PR.

  2. I think the second change is smaller: we need to set the device (cfg.gpu) correctly. This is easy to fix once we know how many nodes are participating and how many GPUs each node has.

The way ahead will be to add the example config (for single node) and then we will see how the master address is set. That will give us some hint about how to get the master address for multi-node training.
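
For reference, with the env:// rendezvous the master address typically enters through environment variables, along the lines of the sketch below; the address, port, world size, and rank values are placeholders.

import os

import torch.distributed as dist

# Placeholder rendezvous settings; on a real multi-node run MASTER_ADDR must
# point at a host every worker can reach.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # read MASTER_ADDR / MASTER_PORT from the environment
    world_size=4,          # total processes across all nodes
    rank=0,                # this process's global rank
)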

> 2. I've only included those so that it's consistent with the training script in PyTorch. I do agree with you that it is extraneous to this example, so I will clean those up!

> Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?

Yeah, so regarding this: IMO a better way is to request n nodes and launch one process per node, which then spawns 8 workers. We can probably come back to this point later, as it is orthogonal to the other changes we discussed and should be a straightforward change.
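
A rough sketch of that per-node pattern, assuming 8 GPUs per node and using torch.multiprocessing.spawn; the worker and launch_on_node names are hypothetical, not part of this PR.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

GPUS_PER_NODE = 8  # assumption: every node exposes 8 GPUs


def worker(local_rank: int, node_rank: int, world_size: int) -> None:
    # Hypothetical worker: derive the global rank from the node rank and the
    # local rank, join the process group, and pin this process to its GPU.
    global_rank = node_rank * GPUS_PER_NODE + local_rank
    dist.init_process_group("nccl", init_method="env://",
                            world_size=world_size, rank=global_rank)
    torch.cuda.set_device(local_rank)


def launch_on_node(node_rank: int, num_nodes: int) -> None:
    # One launcher process per node spawns one worker per GPU.
    world_size = num_nodes * GPUS_PER_NODE
    mp.spawn(worker, args=(node_rank, world_size), nprocs=GPUS_PER_NODE)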

world_size=cfg.world_size,
rank=cfg.rank,
)
if cfg.pretrained:
@romesco (Contributor) Dec 1, 2020

Not that we can handle this now, but just putting this out there:

Statements conditioned solely on the config often show up in main files and IMO reduce the readability of the code.

My wish for the future (once we have the hammers) would be to push this logic into the config. There are many pros:

  1. main.py is shorter and cleaner with less nesting or extraneous code.
  2. main.py can be more general. This implies greater extensibility for users and less duplicate code.
  3. As the config is resolved in real time through interpolation / logic resolution, you get to see the 'final' config all in one structure before it is run. This is very powerful in the sense that we now see the outcome of all the intermediate logic that happens in main before training begins.

Obviously this won't work for every case, but there are many (like this one) where I think it would be a trivial change on the user's end.
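
One possible shape of that config-driven style, sketched with hydra.utils.instantiate; the resnet18 target and the pretrained flag are illustrative (matching the torchvision API of the time), not what this PR has to use.

from dataclasses import dataclass, field

import hydra
from hydra.core.config_store import ConfigStore
from hydra.utils import instantiate
from omegaconf import DictConfig


@dataclass
class ModelConf:
    # The "pretrained or not" decision lives in the config, not in main().
    _target_: str = "torchvision.models.resnet18"
    pretrained: bool = False


@dataclass
class Conf:
    model: ModelConf = field(default_factory=ModelConf)


cs = ConfigStore.instance()
cs.store(name="conf", node=Conf)


@hydra.main(config_name="conf")
def main(cfg: DictConfig) -> None:
    model = instantiate(cfg.model)  # no `if cfg.pretrained:` branch needed here
    print(type(model))


if __name__ == "__main__":
    main()

Overriding it then becomes something like `python main.py model.pretrained=true`, and the resolved config shows the final choice before training begins.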

@briankosw (Author) Dec 3, 2020

I completely agree with you, and I think it'd be really nice to get rid of as many conditional statements as possible. I think a combination of using configs and refactoring should get the job done.

@omry (Contributor) commented Dec 1, 2020

High level feedback:
We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
It's good to group them together and have a top level page explaining what each one is to help users navigate.

@briankosw (Author)

> High level feedback:
> We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
> It's good to group them together and have a top level page explaining what each one is to help users navigate.

I think that's a good idea. I can try to organize and structure the examples so that each one highlights something different, e.g. one example explaining fundamental distributed processing as demonstrated in the other PR and another showing ImageNet DDP. I think it'd be better if I open one or two additional issues that separate these implementations. What do you guys think?

In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have separate examples for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?

examples/imagenetconf.yaml (outdated)
@romesco (Contributor) commented Dec 4, 2020

>> High level feedback:
>> We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
>> It's good to group them together and have a top level page explaining what each one is to help users navigate.
>
> I think that's a good idea. I can try to organize and structure the examples so that each one highlights something different, e.g. one example explaining fundamental distributed processing as demonstrated in the other PR and another showing ImageNet DDP. I think it'd be better if I open one or two additional issues that separate these implementations. What do you guys think?
>
> In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have separate examples for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?

Sounds good. Right now, I'm thinking each of these can be separated into their own issues [listed by priority in my mind]:

Single-node, multi-GPU:

  1. Fundamentals for DDP via hydra while limiting extraneous code (minimum viable example)
  2. DDP Imagenet example (this issue/PR)

Multi-node, multi-GPU:

  3. Turn example (1) into multi-node?

@shagunsodhani

I have something in mind for (3). It will be easier to show once (1) has been pushed.

@briankosw changed the title from "ImageNet DDP implementation" to "Single-node ImageNet DDP implementation" Dec 5, 2020
@briankosw (Author)

Right now, this PR is blocked by this issue, so I will be focusing more on #42 and fixing the blocking issue.

@romesco linked an issue Dec 17, 2020 that may be closed by this pull request
Labels: CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Single-node ImageNet DDP
5 participants