Interface for Process Creation (DDPSpawn vs. DDP) #10985
Comments
+1 for the `spawn_executor` proposal. I think having more unified behavior between spawning and non-spawning plugins will benefit us in the long run.
My opinions, but looking for more feedback.

Proposal 1

Pros:
Cons:
Proposal 2

Questions: We need to work out how strategies, cluster environments, and executors fit together. For instance, how does the executor take into account that processes are launched externally to the Trainer (e.g. SLURM, torchelastic)? In this case, we don't want the Trainer to do any subprocess launch or spawn execution. This could be deduced by using the ClusterEnvironment, which is available to all parallel plugins:

Pros:
Cons:
We still might end up with an either/or:

A. We supply a default implementation in the base training type plugin that calls the trainer function with its arguments, and call it from the Trainer like this. Plugins end up overriding this based on the executor's availability (e.g. parallel plugins), but this is very wide open: a custom strategy could do anything at this point without a very clear contract.

```python
self.training_type_plugin.run(trainer_fn, args, kwargs)
```

B. Only parallel plugins which can create processes really need an executor (just as cluster environments live only on the parallel plugins, not the base training type plugin). Then the trainer code might end up like this:

```python
if hasattr(self.training_type_plugin, "executor") and isinstance(self.training_type_plugin.executor, Executor):
    self.training_type_plugin.executor.create_processes(...)
else:
    ...  # normal control flow, which is essentially back to proposal 1 from the Trainer's POV
```
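The `Executor` contract that option B assumes could be sketched roughly as below. All names (`Executor`, `create_processes`, the `executor` attribute) are illustrative, not taken from the codebase:

```python
from typing import Any, Callable


class Executor:
    """Hypothetical base class for the process-creation component."""

    def create_processes(self, trainer_fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
        raise NotImplementedError


def run(training_type_plugin: Any, trainer_fn: Callable[..., Any], *args: Any, **kwargs: Any) -> Any:
    """Trainer-side dispatch for option B: use the executor when the plugin
    provides one, otherwise fall back to the plain call (proposal 1 behavior)."""
    executor = getattr(training_type_plugin, "executor", None)
    if isinstance(executor, Executor):
        return executor.create_processes(trainer_fn, *args, **kwargs)
    return trainer_fn(*args, **kwargs)
```

The `hasattr`/`isinstance` branch stays in one helper, so the check leaks into the Trainer in exactly one place rather than at every call site.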
For discussion's sake, mentioning other options, possibly far out:
Such that the current way of specifying strategies would change. But breaking this up, especially for the spawn variant: the intention of raising this is to discuss whether "spawn" is really an integral part of the strategy, especially if the executor is pulled out into its own component and can easily be mixed in with strategies. And if it's not, whether it's worth carrying this forward in the way that strategies are specified to the Trainer.
Personally I would vote for the executor approach, and I believe most of @ananthsub's concerns about it could be resolved by having a default no-op executor. That means we don't need to special-case anything in the trainer: it would be exposed in the strategy API, but it would be safe to call since it doesn't do anything. Regarding
I can see how this could be a problem, but on the other hand, this could also allow us to have stuff like Horovod only as a separate
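The default-executor idea above can be sketched as follows (all class and method names here are hypothetical):

```python
class DefaultExecutor:
    """Hypothetical no-op launcher: safe to call unconditionally because it
    just runs the function in the current process without creating any."""

    def create_processes(self, trainer_fn, *args, **kwargs):
        return trainer_fn(*args, **kwargs)


class BaseStrategy:
    """Every strategy would own an executor, defaulting to the no-op one,
    so the Trainer never has to special-case spawning vs. non-spawning."""

    executor = DefaultExecutor()

    def run(self, trainer_fn, *args, **kwargs):
        return self.executor.create_processes(trainer_fn, *args, **kwargs)
```

With this default in the base class, the Trainer can always call `strategy.run(...)`; only parallel strategies would swap in a launcher that actually creates processes.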
I prefer the Executor approach +
I think @ananthsub makes a valid point about the interplay between cluster environments that create processes externally and the executors (a name alternative could be "launcher").

@justusschock Who/what would determine which executor to use? Would the AcceleratorConnector be doing this check?

```python
if self.cluster_environment.creates_processes_externally:
    strategy.executor = SingleProcessExecutor()
elif strategy == "ddp_spawn":
    strategy.executor = SpawnExecutor()
elif strategy == "ddp":
    strategy.executor = ScriptExecutor()
```

One thing to observe here is that the executor is really only useful once and is then no longer needed. Making it an attribute of the strategy might give the impression that you can call it again in subprocesses, which could be a source of bugs and confusion. This tells me that the executor should probably be produced by a function/method and used locally:

```python
executor = ???.get_executor()
executor(trainer_fn, ...)
```
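The one-shot factory pattern suggested above could look roughly like this; the executor classes are stand-in stubs and `get_executor` is a hypothetical name, not an existing API:

```python
class SingleProcessExecutor:
    """No-op launcher: just run the function in the current process."""

    def __call__(self, trainer_fn, *args, **kwargs):
        return trainer_fn(*args, **kwargs)


class SpawnExecutor(SingleProcessExecutor):
    """Stub: the real one would call torch.multiprocessing.spawn."""


class ScriptExecutor(SingleProcessExecutor):
    """Stub: the real one would relaunch the training script via subprocess."""


def get_executor(strategy_name, creates_processes_externally):
    """Hypothetical one-shot factory: the launcher is picked once, used
    locally, and never stored on the strategy, so subprocesses cannot
    accidentally invoke it a second time."""
    if creates_processes_externally:
        # e.g. SLURM/torchelastic already launched our processes
        return SingleProcessExecutor()
    if strategy_name == "ddp_spawn":
        return SpawnExecutor()
    if strategy_name == "ddp":
        return ScriptExecutor()
    return SingleProcessExecutor()


# used as a local, then discarded:
executor = get_executor("ddp_spawn", creates_processes_externally=False)
result = executor(lambda: "trained")
```

Keeping the executor out of the strategy's attributes also means it never has to survive pickling into the spawned workers.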
@awaelchli For now I think the logic would live in the AcceleratorConnector. I like the idea of having that not as an instance attribute, although you may have to request it once per entry point then :)
@awaelchli @rohitgr7 Another option could be to move out the definition of
@ananthsub Definitely something we could explore.

Now I think it is time to decide on the name! The issue here names the component "executor", but that's just the name we came up with at the time. There are other suitable names we could go for:

Please leave your suggestions :)
🚀 Feature
Extract an interface that allows us to abstract and disentangle process creation/spawning from the plugins.
Motivation
The simplifications that #10059 introduced brought the DDPSpawnPlugin and DDPPlugin closer together in their function, execution order, and API. The fundamental difference between the two, however, remains in how the processes are created.
DDPSpawnPlugin
The spawning logic in DDPSpawnPlugin comprises mainly these three methods:
https://github.com/PyTorchLightning/pytorch-lightning/blob/aeb0b5595fd73d086f4ae0f99d3f1f112f6a4c29/pytorch_lightning/plugins/training_type/ddp_spawn.py#L152
https://github.com/PyTorchLightning/pytorch-lightning/blob/aeb0b5595fd73d086f4ae0f99d3f1f112f6a4c29/pytorch_lightning/plugins/training_type/ddp_spawn.py#L245
https://github.com/PyTorchLightning/pytorch-lightning/blob/aeb0b5595fd73d086f4ae0f99d3f1f112f6a4c29/pytorch_lightning/plugins/training_type/ddp_spawn.py#L271
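The spawn flow in the linked methods can be sketched roughly as below. The real plugin uses `torch.multiprocessing.spawn`; this self-contained stand-in uses the stdlib `multiprocessing` module with the `fork` start method, and the function/variable names are illustrative only:

```python
import multiprocessing as mp


def _worker(rank, world_size, queue):
    # stand-in for the per-process entry point: run the training function
    # for this rank and report the result back to the parent
    queue.put((rank, f"rank {rank}/{world_size} done"))


def spawn_and_collect(world_size):
    """Rough sketch of the spawn flow: start one worker per rank, then
    gather results in the parent process (which is also where the real
    plugin recovers the trained weights afterwards)."""
    ctx = mp.get_context("fork")  # "fork" keeps this sketch self-contained
    queue = ctx.SimpleQueue()
    procs = [ctx.Process(target=_worker, args=(r, world_size, queue))
             for r in range(world_size)]
    for p in procs:
        p.start()
    results = dict(queue.get() for _ in range(world_size))
    for p in procs:
        p.join()
    return results
```

The key property motivating this issue: process creation happens inside the plugin's entry point, wrapping the training function, rather than before it.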
DDPPlugin
As with the spawn plugin, the creation of subprocesses is well isolated in a single method of the DDPPlugin:
https://github.com/PyTorchLightning/pytorch-lightning/blob/aeb0b5595fd73d086f4ae0f99d3f1f112f6a4c29/pytorch_lightning/plugins/training_type/ddp.py#L155
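The script-launch path in the linked method can be sketched roughly as follows. Only the command/env construction is shown; the helper name is hypothetical, and the environment variable names follow the common `LOCAL_RANK` convention rather than being copied from the source:

```python
import os
import sys


def launch_command_and_env(local_rank, world_size):
    """Sketch of the script-launch path: the main process re-runs its own
    command once per extra rank and passes each child its rank through
    environment variables, instead of forking in-place."""
    command = [sys.executable] + sys.argv  # re-invoke the very same script
    env = os.environ.copy()
    env["LOCAL_RANK"] = str(local_rank)
    env["WORLD_SIZE"] = str(world_size)
    return command, env


# The actual launch would then be, for each rank >= 1:
#   subprocess.Popen(command, env=env)
```

Unlike the spawn path, the parent process itself becomes rank 0 and continues straight into training, which is why no weight recovery step is needed afterwards.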
The Trainer today (after #10896) has to differentiate between the two and call them differently:
Here, the plugin type check leaks into the Trainer. This, and the fact that the spawning logic is already quite isolated inside the respective plugins, motivates a refactor that separates them. Two designs have been proposed so far.
Pitch
Proposal 1 (@ananthsub):
In this proposal, the Trainer call reduces to:
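A minimal sketch of the shape of proposal 1, under the assumption that every training type plugin exposes a single entry point the Trainer calls uniformly (all names here are hypothetical):

```python
class TrainingTypePlugin:
    """Hypothetical base: by default, just run the function in-process."""

    def run(self, trainer_fn, *args, **kwargs):
        return trainer_fn(*args, **kwargs)


class SpawningPlugin(TrainingTypePlugin):
    def run(self, trainer_fn, *args, **kwargs):
        # the real plugin would hand trainer_fn to torch.multiprocessing.spawn;
        # simplified to a direct call so this sketch stays runnable
        return trainer_fn(*args, **kwargs)


def trainer_call(plugin, trainer_fn, *args, **kwargs):
    # the Trainer no longer needs isinstance checks on the plugin type
    return plugin.run(trainer_fn, *args, **kwargs)
```

The type check disappears from the Trainer because the difference between spawning and non-spawning lives entirely inside the plugin's `run` override.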
Proposal 2 (@awaelchli):
The plugins would then own an instance of this executor. The DDPPlugin and DDPSpawnPlugin would collapse to a single class, for the sake of demonstration call it DDPNew, and it owns either a ScriptExecutor or a SpawnExecutor:
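The merged class described above could be sketched like this; `DDPNew` is the demonstration name from the proposal, while the executor classes are simplified stubs that run the function in-process instead of actually creating workers:

```python
class SpawnExecutor:
    """Stub for a launcher that would use torch.multiprocessing.spawn."""

    def create_processes(self, fn, *args, **kwargs):
        return fn(*args, **kwargs)  # simplified: no real process creation here


class ScriptExecutor:
    """Stub for a launcher that would re-invoke the script via subprocess."""

    def create_processes(self, fn, *args, **kwargs):
        return fn(*args, **kwargs)


class DDPNew:
    """Sketch of the merged plugin: one DDP strategy whose process-creation
    behavior is entirely decided by the executor it owns."""

    def __init__(self, executor):
        self.executor = executor

    def run(self, fn, *args, **kwargs):
        return self.executor.create_processes(fn, *args, **kwargs)


ddp = DDPNew(ScriptExecutor())       # today's DDPPlugin
ddp_spawn = DDPNew(SpawnExecutor())  # today's DDPSpawnPlugin
```

Everything DDP-specific (process group setup, wrapping the model) lives once in `DDPNew`; only the launch mechanism varies by composition.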
Alternatives
Additional context
At this point, this is a very open discussion. The proposal may be updated based on feedback and discussion.
#10896 (comment)
Thanks @ananthsub for kicking off the discussion.
cc @Borda @tchaton @justusschock @awaelchli @kaushikb11 @akihironitta