Single-node ImageNet DDP implementation #38

Draft · wants to merge 3 commits into main

Conversation

@briankosw commented Dec 1, 2020

Implements ImageNet DDP, as mentioned in #33.

Most of the code is the same, and the major differences are the handling of distributed processes and the configuration.

One can use Hydra's multirun capability to launch the distributed processes, instead of PyTorch's or Python's multiprocessing APIs.

python imagenet.py -m rank=0,1,2,3
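
For context, here is a minimal sketch of what each multirun job could do with its rank value; the DDPConf fields, the rendezvous address, and the backend below are illustrative assumptions, not the exact contents of this PR.

import hydra
import torch
import torch.distributed as dist
from dataclasses import dataclass
from hydra.core.config_store import ConfigStore


@dataclass
class DDPConf:
    rank: int = 0        # overridden per job by `-m rank=0,1,2,3`
    world_size: int = 4  # one process per GPU on a single node
    backend: str = "nccl"


cs = ConfigStore.instance()
cs.store(name="ddpconf", node=DDPConf)


@hydra.main(config_name="ddpconf")
def main(cfg: DDPConf) -> None:
    # Each multirun job becomes one DDP process, identified by its rank.
    dist.init_process_group(
        backend=cfg.backend,
        init_method="tcp://127.0.0.1:29500",  # placeholder rendezvous address
        world_size=cfg.world_size,
        rank=cfg.rank,
    )
    torch.cuda.set_device(cfg.rank)


if __name__ == "__main__":
    main()

With a parallel launcher such as the Joblib plugin configured, `-m rank=0,1,2,3` then produces four such processes that rendezvous into one process group.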

@facebook-github-bot added the CLA Signed label Dec 1, 2020
@briankosw changed the title from "Transcribed imagenet code" to "ImageNet DDP implementation" Dec 1, 2020
examples/imagenet.py (outdated)
best_acc1 = 0


@hydra.main(config_name="imagenetconf")
Contributor

Did you forget to include the config?

Author

I wasn't sure how to use configen, and I was mostly looking at mnist_00.py, but I couldn't find its corresponding configuration files. Should I just handcraft one and put it in the examples folder?

Contributor

Yes. configen is used to generate configs for libraries. For examples and user code, people should write their own configs.

Contributor

@briankosw

In mnist_00.py, I included the config at the top of the file.

@dataclass
class MNISTConf:
    batch_size: int = 64
    test_batch_size: int = 1000
    epochs: int = 14
    no_cuda: bool = False
    dry_run: bool = False
    seed: int = 1
    log_interval: int = 10
    save_model: bool = False
    checkpoint_name: str = "unnamed.pt"
    adadelta: AdadeltaConf = AdadeltaConf()
    steplr: StepLRConf = StepLRConf(
        step_size=1
    )  # we pass a default for step_size since it is required, but missing a default in PyTorch (and consequently in hydra-torch)

See the second code block of https://github.com/pytorch/hydra-torch/blob/master/examples/mnist_00.md#parting-with-argparse for the explanation!
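
For completeness, the way such a dataclass typically gets wired up in the same file looks roughly like the sketch below (building on the MNISTConf above; this is not the verbatim mnist_00.py code, and the config name "mnistconf" is an assumption):

import hydra
from hydra.core.config_store import ConfigStore

cs = ConfigStore.instance()
cs.store(name="mnistconf", node=MNISTConf)  # registers the dataclass as a named config


@hydra.main(config_name="mnistconf")
def main(cfg: MNISTConf) -> None:
    print(cfg.batch_size)  # any field can be overridden on the command line, e.g. batch_size=128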

Contributor

Generally, I think it is cleaner to make your own config Python files and then import them. I simply included it in one flat file so that the tutorial reader can see it all in one place.

Contributor

The draft mnist_01.py, which I have yet to finalize, does the same thing. It introduces some more hierarchical composition, so you'll see multiple 'non-configen' configs defined. Here's the whole 'hydra block':
https://github.com/pytorch/hydra-torch/blob/examples/mnist_01/examples/mnist_01.py#L10-L56

Author

Thanks! I'll do as you suggested.

examples/imagenet.py (outdated)
@shagunsodhani left a comment

Additional comments:

  1. Can you add the example command for running when we are training over 2 nodes (say 16 GPUs)? I think some additional support is needed for running the script on more than 8 GPUs, as setting rank=9 would not work because device("cuda:9") would not exist (see the sketch after this list).

  2. Do we only care about the DDP use case here? If yes, we should probably remove the logic corresponding to DataParallel and CPU training. I do not feel strongly about removing them, but if we care only about DDP, maybe that logic should go.
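
To make the concern in item 1 concrete, here is a small sketch of the kind of rank-to-device mapping that would be needed; the helper name is hypothetical and not part of this PR.

import torch


def local_device_for(global_rank: int) -> torch.device:
    # Hypothetical helper: with 8 GPUs per node, global rank 9 maps to
    # cuda:1 on the second node rather than to a nonexistent cuda:9.
    local_rank = global_rank % torch.cuda.device_count()
    return torch.device(f"cuda:{local_rank}")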

examples/imagenet.py (outdated)
examples/imagenet.py (outdated)
@briankosw (Author) commented Dec 1, 2020

Thank you for the review @shagunsodhani. To reply to your comments:

  1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers on those, I'd appreciate it, but I'll look into them myself. I'm more familiar with single-node multiprocessing, which is why I've been inclined to write single-node code.
  2. I've only included those so that it's consistent with the training script in PyTorch. I do agree with you that it is extraneous to this example, so I will clean those up!

Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?

@shagunsodhani

> Thank you for the review @shagunsodhani. To reply to your comments:
>
> 1. I'm not exactly sure how multirun and Joblib interact with multi-node multiprocessing. If you have any helpful pointers on those, I'd appreciate it, but I'll look into them myself. I'm more familiar with single-node multiprocessing, which is why I've been inclined to write single-node code.

We launch one process per GPU. These GPUs can live on any node (the extreme case being one GPU per node). We need to handle two things:

  1. How do the nodes discover each other (i.e., how do we know the master address)? This relates to the comment on the previous PR.

  2. I think the second change is smaller: we need to set the device (cfg.gpu) correctly. This is easy to fix once we know how many nodes are participating and how many GPUs each node has.

The way ahead will be to add the example config (for single node) and then we will see how the master address is set. That will give us some hint about how to get the master address for multi-node training.
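
For reference, with the env:// rendezvous the master address typically enters through environment variables, along the lines of the sketch below; the address, port, world size, and rank values are placeholders.

import os

import torch.distributed as dist

# Placeholder rendezvous settings; on a real multi-node run MASTER_ADDR must
# point at a host every worker can reach.
os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
os.environ.setdefault("MASTER_PORT", "29500")

dist.init_process_group(
    backend="nccl",
    init_method="env://",  # read MASTER_ADDR / MASTER_PORT from the environment
    world_size=4,          # total processes across all nodes
    rank=0,                # this process's global rank
)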

> 2. I've only included those so that it's consistent with the training script in PyTorch. I do agree with you that it is extraneous to this example, so I will clean those up!

> Another thing that comes to mind is your comment on the other PR about how Joblib doesn't guarantee that all the subprocesses are launched simultaneously. Would that have any implications for the multi-node setup?

Yeah, so regarding this: IMO a better way is to request n nodes and launch one process per node, which then spawns 8 workers. We can probably come back to this point later, as it is orthogonal to the other changes we discussed and should be a straightforward change.
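
A rough sketch of that per-node pattern, assuming 8 GPUs per node and using torch.multiprocessing.spawn; the worker and launch_on_node names are hypothetical, not part of this PR.

import torch
import torch.distributed as dist
import torch.multiprocessing as mp

GPUS_PER_NODE = 8  # assumption: every node exposes 8 GPUs


def worker(local_rank: int, node_rank: int, world_size: int) -> None:
    # Hypothetical worker: derive the global rank from the node rank and the
    # local rank, join the process group, and pin this process to its GPU.
    global_rank = node_rank * GPUS_PER_NODE + local_rank
    dist.init_process_group("nccl", init_method="env://",
                            world_size=world_size, rank=global_rank)
    torch.cuda.set_device(local_rank)


def launch_on_node(node_rank: int, num_nodes: int) -> None:
    # One launcher process per node spawns one worker per GPU.
    world_size = num_nodes * GPUS_PER_NODE
    mp.spawn(worker, args=(node_rank, world_size), nprocs=GPUS_PER_NODE)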

world_size=cfg.world_size,
rank=cfg.rank,
)
if cfg.pretrained:
@romesco (Contributor) Dec 1, 2020

Not that we can handle this now, but just putting this out there:

Statements conditioned solely on the config often show up in main files and IMO reduce the readability of the code.

My wish for the future (once we have the hammers) would be to push this logic into the config. There are many pros:

  1. main.py is shorter and cleaner with less nesting or extraneous code.
  2. main.py can be more general. This implies greater extensibility for users and less duplicate code.
  3. As the config is resolved in real time through interpolation / logic resolution, you get to see the 'final' config all in one structure before it is run. This is very powerful in the sense that we now see the outcome of all the intermediate logic that happens in main before training begins.

Obviously this won't work for every case, but there are many (like this one) where I think it would be a trivial change on the user's end.
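
One possible shape of that config-driven style, sketched with hydra.utils.instantiate; the resnet18 target and the pretrained flag are illustrative (matching the torchvision API of the time), not what this PR has to use.

from dataclasses import dataclass, field

import hydra
from hydra.core.config_store import ConfigStore
from hydra.utils import instantiate
from omegaconf import DictConfig


@dataclass
class ModelConf:
    # The "pretrained or not" decision lives in the config, not in main().
    _target_: str = "torchvision.models.resnet18"
    pretrained: bool = False


@dataclass
class Conf:
    model: ModelConf = field(default_factory=ModelConf)


cs = ConfigStore.instance()
cs.store(name="conf", node=Conf)


@hydra.main(config_name="conf")
def main(cfg: DictConfig) -> None:
    model = instantiate(cfg.model)  # no `if cfg.pretrained:` branch needed here
    print(type(model))


if __name__ == "__main__":
    main()

Overriding it then becomes something like `python main.py model.pretrained=true`, and the resolved config shows the final choice before training begins.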

@briankosw (Author) Dec 3, 2020

I completely agree with you, and I think it'd be really nice to get rid of as many conditional statements as possible. I think a combination of using configs and refactoring should get the job done.

@omry (Contributor) commented Dec 1, 2020

High level feedback:
We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
It's good to group them together and have a top level page explaining what each one is to help users navigate.

@briankosw (Author)

> High level feedback:
> We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
> It's good to group them together and have a top level page explaining what each one is to help users navigate.

I think that's a good idea. I can try to organize and structure the examples so that each one highlights something different, e.g. one example explaining fundamental distributed processing as demonstrated in the other PR and another showing ImageNet DDP. I think it'd be better if I open one or two additional issues that separate these implementations. What do you guys think?

In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have separate examples for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?

examples/imagenetconf.yaml (outdated)
@romesco (Contributor) commented Dec 4, 2020

>> High level feedback:
>> We will probably have multiple examples for distributed data parallel, with different limitations and advantages.
>> It's good to group them together and have a top level page explaining what each one is to help users navigate.
>
> I think that's a good idea. I can try to organize and structure the examples so that each one highlights something different, e.g. one example explaining fundamental distributed processing as demonstrated in the other PR and another showing ImageNet DDP. I think it'd be better if I open one or two additional issues that separate these implementations. What do you guys think?
>
> In addition, I've given some thought to handling multi-node distributed processing, and I think it's easier if I have separate examples for single-node multi-GPU and multi-node multi-GPU. Thoughts on that as well?

Sounds good. Right now, I'm thinking each of these can be separated into their own issues [listed by priority in my mind]:

Single-node, multi-GPU:

  1. Fundamentals for DDP via hydra while limiting extraneous code (minimum viable example)
  2. DDP Imagenet example (this issue/PR)

Multi-node, multi-GPU:

  3. Turn example (1) into multi-node?

@shagunsodhani

I have something in mind for (3). It will be easier to show once (1) has been pushed.

@briankosw changed the title from "ImageNet DDP implementation" to "Single-node ImageNet DDP implementation" Dec 5, 2020
@briankosw (Author)

Right now, this PR is blocked by this issue, so I will be focusing more on #42 and fixing the blocking issue.

@romesco linked an issue Dec 17, 2020 that may be closed by this pull request
Labels: CLA Signed (this label is managed by the Facebook bot; authors need to sign the CLA before a PR can be reviewed)
Projects: None yet
Development: Successfully merging this pull request may close these issues: Single-node ImageNet DDP
5 participants