Conversation
```python
return DataLoader(self,
                  batch_size=batch_size,
                  shuffle=False,
                  num_workers=0,
```
Does this need to match `node_count`?
Each device will call this, so setting `num_workers` to 0 creates one process per device (preventing too many processes being spawned on each device, which was leading to CUDA memory errors). However, this really slows down the data loading, so another option is to use `int((config.num_dataload_workers + n_gpus_per_node - 1) / n_gpus_per_node)`.
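A minimal sketch of that calculation, assuming each DDP process builds its own loader; `num_dataload_workers` and `n_gpus_per_node` follow the names used in this comment, and `build_per_device_loader` is just an illustrative wrapper, not a function in the repo:

```python
import torch
from torch.utils.data import DataLoader, Dataset


def build_per_device_loader(dataset: Dataset, batch_size: int, num_dataload_workers: int) -> DataLoader:
    # Each DDP process calls this, so split the configured worker count across
    # the GPUs on the node (ceiling division) instead of hard-coding num_workers=0.
    n_gpus_per_node = max(torch.cuda.device_count(), 1)
    workers_per_device = (num_dataload_workers + n_gpus_per_node - 1) // n_gpus_per_node
    return DataLoader(dataset,
                      batch_size=batch_size,
                      shuffle=False,
                      num_workers=workers_per_device)
```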
InnerEye/ML/utils/model_util.py
```diff
@@ -243,14 +267,14 @@ def generate_and_print_model_summary(config: ModelConfigBase, model: DeviceAware
     # when another model is later built on the CPU (for example, before loading from a checkpoint)
     # https://github.com/NVIDIA/apex/issues/694
     # Hence, move the model to the GPU before doing model summary.
-    if config.use_gpu:
+    if config.use_gpu and not config.use_ddp:
```
I don't get that - even if there's no DDP, we'd still have to move the model to the GPU?
This is for DataParallel - the addition of `not config.use_ddp` was to ensure that the model doesn't get moved if we're running DDP.
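A minimal sketch of that guard, with an illustrative helper name; the idea is that under DDP each rank moves its own replica to its local device later, so the eager move is only needed for the single-process / DataParallel path:

```python
import torch


def prepare_model_for_summary(model: torch.nn.Module, use_gpu: bool, use_ddp: bool) -> torch.nn.Module:
    # Only move the model eagerly when not running DistributedDataParallel:
    # with DDP, each rank places its own replica on its local device before
    # wrapping it, so moving the model here would be redundant.
    if use_gpu and not use_ddp:
        model = model.cuda()
    return model
```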
InnerEye/ML/utils/model_util.py
```python
# The model was stored with DistributedDataParallel, which nests it under `module.`;
# strip that prefix so the weights can be loaded without the wrapper.
new_state_dict = OrderedDict()
for k, v in checkpoint['state_dict'].items():
    name = k.replace('module.', '')  # remove the `module.` prefix
    new_state_dict[name] = v
model.load_state_dict(new_state_dict)
```
Or do we want to unify that at the point where we store the model?
I think this is pretty standard for DDP but can look at moving it if you think it's better
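A sketch of the alternative raised above, i.e. unifying at save time so the stored keys never carry the `module.` prefix; `save_checkpoint` is an illustrative name, not the function in the repo:

```python
import torch


def save_checkpoint(model: torch.nn.Module, path: str) -> None:
    # If the model is wrapped in (Distributed)DataParallel, store the inner
    # module's state dict so the keys carry no "module." prefix and can be
    # loaded directly into an unwrapped model.
    wrapped = (torch.nn.DataParallel, torch.nn.parallel.DistributedDataParallel)
    state_dict = model.module.state_dict() if isinstance(model, wrapped) else model.state_dict()
    torch.save({'state_dict': state_dict}, path)
```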
```diff
@@ -127,6 +127,8 @@ class AzureConfig(GenericConfig):
     _workspace: Workspace = param.ClassSelector(class_=Workspace,
                                                 doc="The cached workspace object that has been created in the first"
                                                     "call to get_workspace")
+    workers_per_node: int = param.Integer(1, doc="The number of workers to assign per machine")
```
Can we simplify the setup and populate that from the number of GPUs that are available? Or a hybrid: if the value is set to 0, auto-populate; otherwise use the given number of workers.
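A sketch of that hybrid, assuming 0 means "auto-populate from the visible GPUs"; the helper name is illustrative:

```python
import torch


def resolve_workers_per_node(workers_per_node: int) -> int:
    # 0 means "auto": one worker process per visible GPU on the node,
    # falling back to 1 on a CPU-only machine. Any positive value is used as given.
    if workers_per_node == 0:
        return max(torch.cuda.device_count(), 1)
    return workers_per_node
```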
InnerEye/Azure/azure_runner.py
```python
source_config.script_params.update({'--dist_backend': 'nccl',
                                    '--init_method': 'tcp://' + '$AZ_BATCH_MASTER_NODE'})
```
You are defining `dist_backend` and `init_method` as fields in AzureConfig, but you don't use them here?
Maybe we don't need those config fields at all? Can we always go with NCCL and TCP?
`dist_backend` is already set to `nccl` by default in deep_learning_config, so it is true that we can remove that here, but the init method should be `'env://'` by default (reading from environment variables) and only `'tcp://'` if it's an AML MPI job.
Regarding why I set it here: it's so that it will be passed as an arg to model_config in runner.parse_and_load_model. Is there a better place to set it?
See `is_aml_mpi_run` (this function is hacky; it relies on the fact that AML updates the init method to TCP, whereas for our local runs we are using environment variables).
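For reference, a sketch of how the two init methods would be consumed at process-group setup time; this is not the repo's code, just the standard `torch.distributed` call under the defaults discussed above:

```python
import os

import torch.distributed as dist


def init_distributed(dist_backend: str = "nccl", init_method: str = "env://") -> None:
    # Local runs: "env://" reads MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE
    # from the environment. AML MPI jobs: the runner rewrites init_method to
    # "tcp://$AZ_BATCH_MASTER_NODE", which is what an is_aml_mpi_run-style
    # check can key off.
    dist.init_process_group(backend=dist_backend,
                            init_method=init_method,
                            rank=int(os.environ.get("RANK", "0")),
                            world_size=int(os.environ.get("WORLD_SIZE", "1")))
```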
```python
max_run_duration_seconds=max_run_duration,
node_count=azure_config.node_count,
distributed_training=distributed_training_backend,
pip_packages=['azureml-dataprep[pandas,fuse]'],
```
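For context, a sketch of how these arguments typically fit together in the (v1) AzureML SDK's `PyTorch` estimator; `source_config`, `azure_config` and `compute_target` are placeholders for objects this code already has, and the exact keyword set here is an assumption rather than the repo's actual call:

```python
from azureml.core.runconfig import MpiConfiguration
from azureml.train.dnn import PyTorch

# One MPI process per worker requested for each node.
distributed_training_backend = MpiConfiguration()
distributed_training_backend.process_count_per_node = azure_config.workers_per_node

estimator = PyTorch(source_directory=source_config.root_folder,
                    entry_script=source_config.entry_script,
                    compute_target=compute_target,
                    node_count=azure_config.node_count,
                    distributed_training=distributed_training_backend,
                    # Listing azureml-dataprep here (rather than only in environment.yml)
                    # is what avoided the ModuleNotFoundError discussed below.
                    pip_packages=['azureml-dataprep[pandas,fuse]'],
                    use_gpu=True)
```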
Why do we need those packages here?
If I try to just include it in the environment.yml file, I get the error `ModuleNotFoundError: No module named 'azureml.dataprep'`.
That's odd. I think we should understand why. It is not great if our dependencies are spread across environment.yml and the code itself.
Superseded by #323. Will need to pick up some parts of that for multi-node training.