Simplify multiprocessing logic in DDPSpawn plugins #10059
Comments
cc @four4fish and @ananthsub who might be interested in this proposal. I would like to work with you on this after 1.5
this is great. Question: do we still need DDP vs DDPSpawn as two separate plugins? And the same for others?
Good question. This proposal here would definitely bring these two plugin types closer together, in terms of the API to start processes and how results are returned. However, some fundamental differences between DDP and DDPSpawn remain, mainly in how the processes get created and how results get back to the main process.
There may be a way to bring these under one roof as a single plugin type. It requires a careful and thorough study of the differences we have today.
@awaelchli regarding both those points, the huge advantages this offers are:
Yes I agree, and that's also what I did for Lite:

```python
def _run_impl(self, run_method: Callable, *args: Any, **kwargs: Any) -> Any:
    self._set_plugin_specific_precision_variables()
    self._accelerator.setup_environment()

    # apply sharded context to prevent OOM
    run_method = partial(self._run_with_sharded_context, run_method)

    if isinstance(self._strategy, DDPSpawnPlugin):
        return self._strategy.spawn(run_method, *args, return_result=True, **kwargs)
    else:
        return run_method(*args, **kwargs)
```

Here, run_method would be the trainer's main fit implementation. The issue posted here was more about the queue handling and less about the Trainer. Would you say the responsibility to spawn processes should be on the trainer, not the plugin? If so, the simplification steps here would benefit that; then we only have to move the spawn method, and the rest is already disentangled.
This looks awesome. Since you worked on the initial plugin/accel revamp, do you know what motivated the structure we have in master today? Why wasn't this proposed structure done that way back then?
When we did the initial revamp a year ago, we didn't question the structure of how processes get called and how results are handled. We also didn't fully understand this part. The origins go all the way back to the beginning of Lightning, when the Trainer carried most of the responsibility that the accelerator has today.
Proposed refactoring or deprecation
When working on #9987 and the corresponding refactors around the spawn plugins, I realized that the logic around the multiprocessing queue, and how results are handled and returned to the trainer, is quite outdated and overly complicated. This logic has outlived many changes, but we never saw a way to make it simpler. With this issue I propose several steps towards a clean, easy-to-follow, and easy-to-debug code path for the DDPSpawn plugins.
Motivation
There are several components in the DDPSpawn plugin around spawning processes and handling of results that are obscure and not well documented.
On top of that, result handling bleeds into the TrainingTypePlugin base class
https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/plugins/training_type/training_type_plugin.py#L38
and also into the trainer:
https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/trainer/trainer.py#L1123-L1125
This is quite confusing to anyone not familiar with the peculiarities of ddp spawn. But it does not have to be that way. The situation can be drastically improved!
Pitch
Step 1
Remove the self.mp_queue attribute from the plugin. It is not required and can be created and used locally within the recently introduced DDPSpawnPlugin.spawn method (#10018). A rough sketch of the idea is shown below.
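As an illustration only, here is a minimal, self-contained sketch of a spawn function that owns its queue locally; the names and the setup details are assumptions for illustration, not the actual Lightning implementation:

```python
import torch.multiprocessing as mp


def _wrapped_function(process_idx, function, args, kwargs, return_queue):
    # a real plugin would also initialize the process group, set ranks, devices, etc.
    result = function(*args, **kwargs)
    if process_idx == 0:
        return_queue.put(result)


def spawn(function, *args, num_processes: int = 2, **kwargs):
    context = mp.get_context("spawn")
    return_queue = context.SimpleQueue()  # created and consumed locally, no self.mp_queue
    mp.spawn(
        _wrapped_function,
        args=(function, args, kwargs, return_queue),
        nprocs=num_processes,
    )
    return return_queue.get()
```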
Step 2
Instead of adding items like last_path, best_path, or results to the queue one by one, add all data at once as one result tuple to the queue.
This logic
https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/plugins/training_type/ddp_spawn.py#L224-L238
becomes a single put of one combined result object. This allows us to standardize and limit the queue to a single put() and correspondingly a single get(). This is less error prone and easier to understand for everyone working with custom plugins. The only really complicated part, where the learning curve is steep for the reader, is the code snippet linked above; everything else becomes dramatically simpler. A sketch of the single-put version is shown below.
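As a hedged sketch of the single put/get: the _SpawnOutput name and its exact fields are assumptions based on the items mentioned above (best path, last path, results), not the actual API:

```python
from typing import Any, NamedTuple, Optional


class _SpawnOutput(NamedTuple):
    # everything rank 0 needs to hand back to the main process, in one object
    best_model_path: Optional[str]
    last_path: Optional[str]
    trainer_results: Any


def put_results(return_queue, best_model_path, last_path, trainer_results) -> None:
    # worker side (rank 0 only): one put() instead of several
    return_queue.put(_SpawnOutput(best_model_path, last_path, trainer_results))


def get_results(return_queue) -> _SpawnOutput:
    # main-process side, after the workers have joined: one get()
    return return_queue.get()
```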
Step 3
With 1) and 2) in place, we can directly return the results from the spawned function instead of caching them in the self._results attribute; see the small sketch below.
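A minimal sketch of the intended call pattern; the helper name run_fit and the spawn signature are assumptions, mirroring the sketch in step 1:

```python
def run_fit(plugin, trainer_fn, *args, **kwargs):
    # before: plugin.start_training(trainer) filled plugin._results,
    #         which the trainer read back after post-dispatch
    # after:  the spawn call hands the collected result straight back
    return plugin.spawn(trainer_fn, *args, **kwargs)
```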
Step 4
Finally, we can get rid of dispatch and post dispatch
https://github.com/PyTorchLightning/pytorch-lightning/blob/aa1540410ff55854e050ff117c2d66f22741d182/pytorch_lightning/trainer/trainer.py#L1102-L1107
and combine them into a single plugin.run call or similar; a hedged sketch follows below. This then cleanly generalizes across all plugins, and the confusing concept of dispatch and post-dispatch is gone.
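Purely for illustration (the method names run and _run_train are assumptions, not the current API), the single entry point could look roughly like this:

```python
from typing import Any, Callable


class TrainingTypePlugin:
    def run(self, function: Callable, *args: Any, **kwargs: Any) -> Any:
        # default behavior: execute in the current process
        return function(*args, **kwargs)


class DDPSpawnPlugin(TrainingTypePlugin):
    def run(self, function: Callable, *args: Any, **kwargs: Any) -> Any:
        # spawn the worker processes, run `function` in each,
        # and return the rank-0 results (see the spawn sketch in step 1)
        return self.spawn(function, *args, **kwargs)


# In the Trainer, replacing self._dispatch() and self._post_dispatch():
#     results = self.training_type_plugin.run(self._run_train)
```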
Step 5
Proposed by @ananthsub, the next step would be to directly spawn processes with the Trainer.fit() call. This is how we do it in Lite as well:
https://github.com/PyTorchLightning/pytorch-lightning/blob/412d507a73c79f5e4f7ef14289cefe2eb2af94a7/pytorch_lightning/lite/lite.py#L387-L396
The benefits of this last step are ultimately (#10059 (comment)):
- Remove LightningDistributed and keep logic in ddp/ddpSpawn directly (#9691)
A rough sketch of how Trainer.fit() could trigger the spawn is shown below.
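As an illustration only (the _fit_impl name and the exact argument list are assumptions, mirroring the Lite _run_impl shown in the comment above), Trainer.fit() could delegate to the spawn plugin like this:

```python
class Trainer:
    def fit(self, model, train_dataloaders=None, val_dataloaders=None, datamodule=None):
        fit_args = (model, train_dataloaders, val_dataloaders, datamodule)
        if isinstance(self.training_type_plugin, DDPSpawnPlugin):
            # the spawn happens right at the fit() boundary;
            # each worker then runs the internal fit implementation
            return self.training_type_plugin.spawn(self._fit_impl, *fit_args)
        # non-spawn plugins run fit in the current process
        return self._fit_impl(*fit_args)
```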
Next steps
A quick and dirty draft is available here in the form of a PR (excluding some steps): #10034
Additional context
The add_to_queue and get_from_queue methods were recently introduced, initially on the LightningModule, and are now in a deprecation phase. We would need to incorporate them into this design as well. Suggestions welcome.