This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Adding Distributed Data Parallel #261

Closed
wants to merge 52 commits

Conversation

ant0nsc (Contributor) commented Sep 30, 2020

No description provided.

mebristo marked this pull request as ready for review October 23, 2020 09:20
InnerEye/Azure/azure_runner.py (outdated review thread, resolved)
return DataLoader(self,
                  batch_size=batch_size,
                  shuffle=False,
                  num_workers=0,
Contributor Author:

Does this need to match node_count?

Member:

Each device will call this, so setting num_workers to 0 creates only one process per device (preventing too many processes being spawned on each device, which was leading to CUDA memory errors). However, this really slows down data loading, so another option is to use int((config.num_dataload_workers + n_gpus_per_node - 1) / n_gpus_per_node).
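A minimal sketch of the suggested per-device worker split, using the names from the comment above (config.num_dataload_workers, n_gpus_per_node); dataset and batch_size are placeholders, and this is not the exact InnerEye code:

import torch
from torch.utils.data import DataLoader

# Number of GPUs on this node; fall back to 1 on CPU-only machines.
n_gpus_per_node = max(torch.cuda.device_count(), 1)

# Split the configured worker count across the GPUs, rounding up so each
# per-device DataLoader gets at least one worker.
workers_per_gpu = (config.num_dataload_workers + n_gpus_per_node - 1) // n_gpus_per_node

loader = DataLoader(dataset,
                    batch_size=batch_size,
                    shuffle=False,
                    num_workers=workers_per_gpu)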

InnerEye/ML/deep_learning_config.py (outdated review thread, resolved)
InnerEye/ML/deep_learning_config.py (outdated review thread, resolved)
InnerEye/ML/deep_learning_config.py (outdated review thread, resolved)
InnerEye/ML/model_training_steps.py (review thread, resolved)
InnerEye/ML/utils/metrics_util.py (outdated review thread, resolved)
@@ -243,14 +267,14 @@ def generate_and_print_model_summary(config: ModelConfigBase, model: DeviceAware
     # when another model is later built on the CPU (for example, before loading from a checkpoint)
     # https://github.com/NVIDIA/apex/issues/694
     # Hence, move the model to the GPU before doing model summary.
-    if config.use_gpu:
+    if config.use_gpu and not config.use_ddp:
Contributor Author:

I don't get that: even if there's no DDP, we'd still have to move the model to the GPU?

Member:

This is for data parallel: the addition of not config.use_ddp was to ensure that the model doesn't get moved here if we're running DDP.
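A small sketch of the behaviour described above, assuming config exposes use_gpu and use_ddp flags as in this PR; the helper name is hypothetical, not the InnerEye function:

def prepare_model_for_summary(model, config):
    # Plain single-process data parallel: move the model to the GPU before
    # generating the summary. Under DDP each spawned process handles its own
    # device placement later, so we leave the model where it is.
    if config.use_gpu and not config.use_ddp:
        model = model.cuda()
    return model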

Comment on lines 320 to 325
# Model was stored with DistributedDataParallel which stores the model in module, now loading without
new_state_dict = OrderedDict()
for k, v in checkpoint['state_dict'].items():
    name = k.replace('module.', '')  # remove `module.`
    new_state_dict[name] = v
model.load_state_dict(new_state_dict)
Contributor Author:

Or do we want to unify that at the point where we store the model?

Member:

I think this is pretty standard for DDP, but I can look at moving it if you think it's better.
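A hedged sketch of the alternative raised in the question above: save the unwrapped module's weights at checkpoint time so that loading never has to strip the module. prefix. The save_checkpoint helper and its arguments are illustrative, not the InnerEye API:

import torch
from torch.nn.parallel import DistributedDataParallel

def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    # DistributedDataParallel wraps the original network in .module; saving the
    # inner model keeps the state_dict keys free of the 'module.' prefix.
    if isinstance(model, DistributedDataParallel):
        state = model.module.state_dict()
    else:
        state = model.state_dict()
    torch.save({'state_dict': state}, path)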

InnerEye/train_variables.yml (outdated review thread, resolved)
InnerEye/ML/deep_learning_config.py (review thread, resolved)
azure_runner.yml (outdated review thread, resolved)
Tests/ML/test_model_training.py (outdated review thread, resolved)
InnerEye/settings.yml (outdated review thread, resolved)
@@ -127,6 +127,8 @@ class AzureConfig(GenericConfig):
     _workspace: Workspace = param.ClassSelector(class_=Workspace,
                                                 doc="The cached workspace object that has been created in the first"
                                                     "call to get_workspace")
+    workers_per_node: int = param.Integer(1, doc="The number of workers to assign per machine")
Contributor Author:

Can we simplify the setup and populate this from the number of GPUs that are available? Or a hybrid: if the value is set to 0, auto-populate; otherwise, use the given number of workers.
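A rough sketch of the hybrid suggested above, assuming the workers_per_node field added in this diff, with 0 meaning "derive from the visible GPUs"; the helper name is illustrative:

import torch

def resolve_workers_per_node(workers_per_node: int) -> int:
    if workers_per_node == 0:
        # Auto-populate from the GPUs visible on this machine; fall back to a
        # single worker on CPU-only nodes.
        return max(torch.cuda.device_count(), 1)
    return workers_per_node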

Comment on lines 301 to 302
source_config.script_params.update({'--dist_backend': 'nccl',
                                    '--init_method': 'tcp://' + '$AZ_BATCH_MASTER_NODE'})
Contributor Author:

You are defining backend and init method as fields in AzureConfig, but you don't use them here?

Contributor Author:

Maybe we don't need those config fields at all? Can we always go with nccl and tcp?

Member:

dist_backend already defaults to nccl in deep_learning_config, so it is true that we can remove it here, but the init method should be 'env://' by default (reading from environment variables) and only 'tcp://' if it's an AML MPI job.

Regarding why I set it here: it's so that it will be passed as an arg to model_config in runner.parse_and_load_model. Is there a better place to set it?

Member:

See is_aml_mpi_run (this function is hacky; it relies on the fact that AML updates the init method to TCP, whereas for our local runs we are using environment variables).
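A hedged sketch of the env:// vs tcp:// distinction discussed above, mirroring the --dist_backend and --init_method arguments in this PR: on an AML MPI run the init method becomes a tcp:// address built from $AZ_BATCH_MASTER_NODE, while local runs rely on the standard env:// variables (MASTER_ADDR, MASTER_PORT, RANK, WORLD_SIZE). The helper is illustrative, not the InnerEye code:

import torch.distributed as dist

def init_distributed(dist_backend: str = 'nccl',
                     init_method: str = 'env://',
                     rank: int = 0,
                     world_size: int = 1) -> None:
    # env:// reads MASTER_ADDR / MASTER_PORT / RANK / WORLD_SIZE from the
    # environment; a tcp://host:port value points at the master node directly.
    dist.init_process_group(backend=dist_backend,
                            init_method=init_method,
                            rank=rank,
                            world_size=world_size)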

InnerEye/Azure/azure_runner.py (outdated review thread, resolved)
max_run_duration_seconds=max_run_duration,
node_count=azure_config.node_count,
distributed_training=distributed_training_backend,
pip_packages=['azureml-dataprep[pandas,fuse]'],
Contributor Author:

why do we need those packages here?

Member:

If I try to just include it in the environment.yml file, I get the error ModuleNotFoundError: No module named 'azureml.dataprep'.

Contributor Author:

That's odd; I think we should understand why. It is not great if our dependencies are spread across env.yml and the code itself.

InnerEye/Common/generic_parsing.py (outdated review thread, resolved)
InnerEye/ML/model_training.py (outdated review thread, resolved)
InnerEye/ML/models/architectures/base_model.py (outdated review thread, resolved)
InnerEye/ML/pipelines/scalar_inference.py (outdated review thread, resolved)
InnerEye/ML/utils/metrics_util.py (outdated review thread, resolved)
InnerEye/ML/utils/model_util.py (outdated review thread, resolved)
ant0nsc (Contributor Author) commented Jan 29, 2021

Superseded by #323. Will need to pick up some parts of that for multi-node training.

ant0nsc closed this Jan 29, 2021
ant0nsc deleted the mebristo/ddp branch July 15, 2021 19:56