
Simplify multiprocessing logic in DDPSpawn plugins #10059

Closed
awaelchli opened this issue Oct 20, 2021 · 7 comments · Fixed by #10896

Labels: design (Includes a design discussion), discussion (In a discussion stage), refactor
Milestone: v1.6

Comments

@awaelchli
Contributor

awaelchli commented Oct 20, 2021

Proposed refactoring or deprecation

When working on #9987 and the corresponding refactors around the spawn plugins, I realized that the logic around the multiprocessing queue, and how results are handled and returned to the trainer, is quite outdated and overly complicated. This logic has outlived many changes, but we never saw a way to make it simpler. With this issue, I propose several steps toward a clean, easy-to-follow, easy-to-debug code path for

  1. spawning processes
  2. moving results from the subprocess back to the main process
  3. returning the result back to the trainer

Motivation

There are several components in the DDPSpawn plugin around spawning processes and handling results that are obscure and not well documented.
On top of that, result handling bleeds into the TrainingTypePlugin base class

https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/plugins/training_type/training_type_plugin.py#L38

and also into the trainer:

https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/trainer/trainer.py#L1123-L1125

This is quite confusing to anyone not familiar with the peculiarities of DDP spawn. But it does not have to be that way; the situation can be drastically improved!

Pitch

Step 1

Remove the self.mp_queue attribute from the plugin. It is not required and can be created locally and used within the recently introduced DDPSpawnPlugin.spawn method (#10018); the sketch under Step 2 shows this.

Step 2

Instead of adding items like last_path, best_path, or results to the queue one by one, add all data at once as one result tuple to the queue.

This logic

https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/plugins/training_type/ddp_spawn.py#L224-L238

becomes

# assumes: import torch.multiprocessing as mp; from torch.multiprocessing import SimpleQueue

# inside the spawned processes
def new_process(self, ...):
    # ...
    return best_path, last_path, results

# in the spawn wrapper:
def _wrapped_function(self, process_idx, function, args, kwargs, queue):
    result = function(*args, **kwargs)
    if self.is_global_zero:
        queue.put(move_data_to_device(result, "cpu"))  # the tuple of data

# in the main process:
def spawn(self, function, *args, **kwargs):
    queue = SimpleQueue()
    mp.spawn(self._wrapped_function, args=(function, args, kwargs, queue), nprocs=self.num_processes)
    return queue.get()  # the tuple of data

This allows us to standardize on a single put() and, correspondingly, a single get() on the queue. This is less error prone and easier to understand for everyone working with custom plugins. The only part with a steep learning curve for the reader is the snippet above; everything else becomes dramatically simpler.

Step 3

With 1) and 2) in place, we can directly return the results from the spawned function instead of caching them in the self._results attribute.
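
A rough sketch of the before/after from the plugin's perspective (the method names used here, e.g. start_training and trainer.run_stage, are illustrative assumptions, not the actual API):

# Before: the result is cached on the plugin inside the spawned process ...
#     self._results = trainer.run_stage()
# ... and later fetched in the main process via
#     return self.training_type_plugin.results

# After: spawn() hands the (CPU-moved) return value of the spawned
# function straight back to the caller; no intermediate attribute needed
def start_training(self, trainer):
    return self.spawn(trainer.run_stage)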

Step 4

Finally, we can get rid of dispatch and post_dispatch

https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/trainer/trainer.py#L1102-L1107

and combine them into a single plugin.run call or similar:

self.training_type_plugin.run(self.train)  # or whatever other trainer method

This then generalizes cleanly across all plugins, and the confusing concept of dispatch and post_dispatch is gone.
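
A minimal sketch of what such a run method could look like, assuming the spawn method from Step 2 (the exact name and signature are assumptions, not a settled API):

# on the base TrainingTypePlugin: plugins that don't spawn simply call through
def run(self, function, *args, **kwargs):
    return function(*args, **kwargs)

# on DDPSpawnPlugin: same entry point, but routed through spawn() from Step 2
def run(self, function, *args, **kwargs):
    return self.spawn(function, *args, **kwargs)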

Step 5

Proposed by @ananthsub, the next step would be to spawn processes directly with the Trainer.fit() call. This is how we do it in Lite as well:

https://github.com/PyTorchLightning/pytorch-lightning/blob/412d507a73c79f5e4f7ef14289cefe2eb2af94a7/pytorch_lightning/lite/lite.py#L387-L396
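
For the Trainer this could look roughly as follows (a sketch only, mirroring the Lite snippet linked above; the exact dispatch point and signatures are assumptions):

def fit(self, model, ...):
    # ...
    if isinstance(self.training_type_plugin, DDPSpawnPlugin):
        # spawn the worker processes right at the entry point; _run then
        # executes identically in every process
        return self.training_type_plugin.spawn(self._run, model)
    return self._run(model)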

The benefits of this last step are ultimately (#10059 (comment)):

  • Consistent execution flow between the spawn and non-spawn versions, since _run is shared across both.
    • It is no longer required to exclude some hooks or change the hook order based on the plugin's spawn type (example: the setup() hook).
    • No more regressions from loggers being accidentally created on the main process for spawn plugins. With step 5, we no longer need to artificially delay any logger.experiment calls.
  • We can move the distributed initialization for spawn plugins earlier because processes are spawned earlier. This means no subtle differences in the implementations of collectives, which @four4fish found out the tedious and difficult way with "Deprecate LightningDistributed and keep logic in ddp/ddpSpawn directly" (#9691).

Next steps

A quick and dirty draft is available in the form of a PR (excluding some steps): #10034

Additional context

The add_to_queue and get_from_queue methods were introduced recently, initially on the LightningModule, and are now in a deprecation phase. We would need to incorporate them into this design as well. Suggestions welcome.
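
One way to fold them into this design while keeping the single put()/get() from Step 2 could be a small list-backed helper (purely a sketch; the _ExtrasQueue name and the bundling into the result tuple are assumptions, not a settled design):

from collections import UserList

# A list-backed stand-in exposing the put()/get() interface that
# add_to_queue()/get_from_queue() expect; unlike a real multiprocessing
# queue, it can be pickled and shipped as part of the single result tuple.
class _ExtrasQueue(UserList):
    def put(self, item):
        self.append(item)

    def get(self):
        return self.pop(0)

# in the spawn wrapper (rank 0 only):
extras = _ExtrasQueue()
self.lightning_module.add_to_queue(extras)               # user-defined extra state
queue.put((move_data_to_device(result, "cpu"), extras))  # still a single put()

# in the main process, after spawn() returns:
result, extras = queue.get()
self.lightning_module.get_from_queue(extras)             # restore extra state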



@awaelchli awaelchli added this to the v1.6 milestone Oct 20, 2021
@awaelchli awaelchli added discussion In a discussion stage design Includes a design discussion labels Oct 20, 2021
@awaelchli
Contributor Author

awaelchli commented Oct 20, 2021

cc @four4fish and @ananthsub who might be interested in this proposal.
This would touch on several issues pointed out in the audit document you put together: https://docs.google.com/document/d/1xHU7-iQSpp9KJTjI3As2EM0mfNHHr37WZYpDpwLkivA

I would like to work with you on this after 1.5.

@awaelchli awaelchli changed the title Simplify multiprocessing queue logic in DDPSpawn plugins Simplify multiprocessing logic in DDPSpawn plugins Oct 20, 2021
@aazzolini

This is great. Question: do we still need DDP vs. DDPSpawn as 2 separate plugins? And the same for the others?

@awaelchli
Contributor Author

Good question. This proposal would definitely bring these two plugin types closer together in terms of the API to start processes and how results are returned. However, some fundamental differences between DDP and DDPSpawn remain. Mainly:

  1. The time/place where process groups get initialized is different.
  2. The place where certain hooks need to be called is not consistent (e.g., the setup hook for the LightningModule and DataModule).

There may be a way to bring these under one roof as a single plugin type, but it requires a careful and thorough study of the differences we have today.

@ananthsub
Contributor

ananthsub commented Oct 29, 2021

@awaelchli regarding both those points and the introduction of plugin.run - why do we need this when we could spawn processes immediately, at either the trainer entry point or whenever trainer._run is called?

The huge advantages this offers are:

  • Consistent execution flow between the spawn and non-spawn versions, since _run is shared across both.
  • We can move the distributed initialization for spawn plugins earlier because processes are spawned earlier. This means no subtle differences in the implementations of collectives, which @four4fish found out the tedious and difficult way with "Deprecate LightningDistributed and keep logic in ddp/ddpSpawn directly" (#9691).
  • We can still get rid of pre_dispatch and dispatch.

@awaelchli
Contributor Author

awaelchli commented Oct 29, 2021

Yes, I agree, and that's also what I did for Lite:

    def _run_impl(self, run_method: Callable, *args: Any, **kwargs: Any) -> Any:
        self._set_plugin_specific_precision_variables()
        self._accelerator.setup_environment()

        # apply sharded context to prevent OOM
        run_method = partial(self._run_with_sharded_context, run_method)

        if isinstance(self._strategy, DDPSpawnPlugin):
            return self._strategy.spawn(run_method, *args, return_result=True, **kwargs)
        else:
            return run_method(*args, **kwargs)

Here, run_method would be the trainer's main fit implementation.

The issue posted here was more about the queue handling and less about the Trainer. Would you say the responsibility for spawning processes should be on the trainer, not the plugin? If so, the simplification steps here would benefit that as well: we would only have to move the spawn method, and the rest is already disentangled.

@carmocca
Contributor

This looks awesome. Since you worked on the initial plugin/accelerator revamp, do you know what motivated the structure we have in master today? Why wasn't the structure proposed here adopted back then?

@awaelchli
Contributor Author

When we did the initial revamp a year ago, we didn't question the structure of how processes get launched and how results are handled. We also didn't fully understand that part. The origins go all the way back to the beginning of Lightning, when the Trainer carried most of the responsibility that the accelerator has today.
