This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Adding Distributed Data Parallel #261

Closed
wants to merge 52 commits

Conversation

ant0nsc (Contributor) commented Sep 30, 2020

No description provided.

mebristo marked this pull request as ready for review October 23, 2020 09:20
InnerEye/Azure/azure_runner.py (outdated review thread, resolved)
return DataLoader(self,
                  batch_size=batch_size,
                  shuffle=False,
                  num_workers=0,
Contributor Author:

Does this need to match node_count?

Member:

Each device will call this, so setting num_workers to 0 creates only one process per device (preventing too many processes being spawned on each device, which was leading to CUDA memory errors). However, this really slows down data loading, so another option is to use int((config.num_dataload_workers + n_gpus_per_node - 1) / n_gpus_per_node).
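A minimal sketch of the suggested per-device worker split, using the names from the comment above (config.num_dataload_workers, n_gpus_per_node); dataset and batch_size are placeholders, and this is not the exact InnerEye code:

import torch
from torch.utils.data import DataLoader

# Number of GPUs on this node; fall back to 1 on CPU-only machines.
n_gpus_per_node = max(torch.cuda.device_count(), 1)

# Split the configured worker count across the GPUs, rounding up so each
# per-device DataLoader gets at least one worker.
workers_per_gpu = (config.num_dataload_workers + n_gpus_per_node - 1) // n_gpus_per_node

loader = DataLoader(dataset,
                    batch_size=batch_size,
                    shuffle=False,
                    num_workers=workers_per_gpu)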

InnerEye/ML/deep_learning_config.py (outdated review thread, resolved)
InnerEye/ML/deep_learning_config.py (outdated review thread, resolved)
InnerEye/ML/deep_learning_config.py (outdated review thread, resolved)
InnerEye/ML/model_training_steps.py (review thread, resolved)
InnerEye/ML/utils/metrics_util.py (outdated review thread, resolved)
@@ -243,14 +267,14 @@ def generate_and_print_model_summary(config: ModelConfigBase, model: DeviceAware
     # when another model is later built on the CPU (for example, before loading from a checkpoint)
     # https://github.com/NVIDIA/apex/issues/694
     # Hence, move the model to the GPU before doing model summary.
-    if config.use_gpu:
+    if config.use_gpu and not config.use_ddp:
Contributor Author:

I don't get that: even if there's no DDP, we'd still have to move the model to the GPU?

Member:

This is for data parallel: the addition of not config.use_ddp was to ensure that the model doesn't get moved here if we're running DDP.
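A small sketch of the behaviour described above, assuming config exposes use_gpu and use_ddp flags as in this PR; the helper name is hypothetical, not the InnerEye function:

def prepare_model_for_summary(model, config):
    # Plain single-process data parallel: move the model to the GPU before
    # generating the summary. Under DDP each spawned process handles its own
    # device placement later, so we leave the model where it is.
    if config.use_gpu and not config.use_ddp:
        model = model.cuda()
    return model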

Comment on lines 320 to 325
# Model was stored with DistributedDataParallel which stores the model in module, now loading without
new_state_dict = OrderedDict()
for k, v in checkpoint['state_dict'].items():
    name = k.replace('module.', '')  # remove `module.`
    new_state_dict[name] = v
model.load_state_dict(new_state_dict)
Contributor Author:

Or do we want to unify that at the point where we store the model?

Member:

I think this is pretty standard for DDP, but I can look at moving it if you think it's better.
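A hedged sketch of the alternative raised in the question above: save the unwrapped module's weights at checkpoint time so that loading never has to strip the module. prefix. The save_checkpoint helper and its arguments are illustrative, not the InnerEye API:

import torch
from torch.nn.parallel import DistributedDataParallel

def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    # DistributedDataParallel wraps the original network in .module; saving the
    # inner model keeps the state_dict keys free of the 'module.' prefix.
    if isinstance(model, DistributedDataParallel):
        state = model.module.state_dict()
    else:
        state = model.state_dict()
    torch.save({'state_dict': state}, path)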

InnerEye/train_variables.yml (outdated review thread, resolved)
InnerEye/ML/deep_learning_config.py (review thread, resolved)
azure_runner.yml (outdated review thread, resolved)
Tests/ML/test_model_training.py (outdated review thread, resolved)
InnerEye/settings.yml (outdated review thread, resolved)
@@ -127,6 +127,8 @@ class AzureConfig(GenericConfig):
     _workspace: Workspace = param.ClassSelector(class_=Workspace,
                                                 doc="The cached workspace object that has been created in the first"
                                                     "call to get_workspace")
+    workers_per_node: int = param.Integer(1, doc="The number of workers to assign per machine")
Contributor Author:

Can we simplify the setup and populate this from the number of GPUs that are available? Or a hybrid: if the value is set to 0, auto-populate; otherwise, use the given number of workers.
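A rough sketch of the hybrid suggested above, assuming the workers_per_node field added in this diff, with 0 meaning "derive from the visible GPUs"; the helper name is illustrative:

import torch

def resolve_workers_per_node(workers_per_node: int) -> int:
    if workers_per_node == 0:
        # Auto-populate from the GPUs visible on this machine; fall back to a
        # single worker on CPU-only nodes.
        return max(torch.cuda.device_count(), 1)
    return workers_per_node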

Comment on lines 301 to 302
source_config.script_params.update({'--dist_backend': 'nccl',
                                    '--init_method': 'tcp://' + '$AZ_BATCH_MASTER_NODE'})
Contributor Author:

You are defining backend and init method as fields in AzureConfig, but you don't use them here?

Contributor Author:

Maybe we don't need those config fields at all? Can we always go with nccl and tcp?

Member:

dist_backend already defaults to nccl in deep_learning_config, so it is true that we can remove it here, but the init method should be 'env://' by default (reading from environment variables) and only 'tcp://' if it's an AML MPI job.

Regarding why I set it here: it's so that it will be passed as an arg to model_config in runner.parse_and_load_model. Is there a better place to set it?

Member:

See is_aml_mpi_run (this function is hacky; it relies on the fact that AML updates the init method to TCP, whereas for our local runs we are using environment variables).
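A hedged sketch of the env:// vs tcp:// distinction discussed above, mirroring the --dist_backend and --init_method arguments in this PR: on an AML MPI run the init method becomes a tcp:// address built from $AZ_BATCH_MASTER_NODE, while local runs rely on the standard env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE). The helper is illustrative, not the InnerEye code:

import torch.distributed as dist

def init_distributed(dist_backend: str = 'nccl',
                     init_method: str = 'env://',
                     rank: int = 0,
                     world_size: int = 1) -> None:
    # env:// reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the
    # environment; a tcp://host:port value points at the master node directly.
    dist.init_process_group(backend=dist_backend,
                            init_method=init_method,
                            rank=rank,
                            world_size=world_size)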

InnerEye/Azure/azure_runner.py (outdated review thread, resolved)
max_run_duration_seconds=max_run_duration,
node_count=azure_config.node_count,
distributed_training=distributed_training_backend,
pip_packages=['azureml-dataprep[pandas,fuse]'],
Contributor Author:

why do we need those packages here?

Member:

If I try to just include it in the environment.yml file, I get the error ModuleNotFoundError: No module named 'azureml.dataprep'.

Contributor Author:

That's odd; I think we should understand why. It is not great if our dependencies are spread across env.yml and the code itself.

InnerEye/Common/generic_parsing.py (outdated review thread, resolved)
InnerEye/ML/model_training.py (outdated review thread, resolved)
InnerEye/ML/models/architectures/base_model.py (outdated review thread, resolved)
InnerEye/ML/pipelines/scalar_inference.py (outdated review thread, resolved)
InnerEye/ML/utils/metrics_util.py (outdated review thread, resolved)
InnerEye/ML/utils/model_util.py (outdated review thread, resolved)
ant0nsc (Contributor Author) commented Jan 29, 2021

Superseded by #323. Will need to pick up some parts of that for multi-node training.

ant0nsc closed this Jan 29, 2021
ant0nsc deleted the mebristo/ddp branch July 15, 2021 19:56