Rewrite AcceleratorConnector and follow-up tasks #11449
Comments
Is this a duplicate of #10422?
@ananthsub Some of the details in #10422 are out of date. I have closed that one; let's track everything through this issue.
@four4fish Do I see correctly that the … got removed? It is very confusing that we never had tests for this, and there are two tests that call … . I'm adding this as a follow-up task. cc @krshrimali, who is working on a related PR: #11944
@awaelchli Oops!! You are totally right! I'm adding it back now.
Thanks @four4fish. The check for … still causes trouble: the BoringModel here raises an exception:

MisconfigurationException                 Traceback (most recent call last)
<ipython-input-7-ec9775ede022> in <module>()
----> 1 run()

4 frames

/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py in _lazy_init_strategy(self)
    725     if _IS_INTERACTIVE and self.strategy.strategy_name not in interactive_compatible_strategy:
    726         raise MisconfigurationException(
--> 727             f"`Trainer(strategy={self.strategy.strategy_name!r})` or"
    728             f" `Trainer(accelerator={self.strategy.strategy_name!r})` is not compatible with an interactive"
    729             " environment. Run your code as a script, or choose one of the compatible backends:"

MisconfigurationException: `Trainer(strategy='single_device')` or `Trainer(accelerator='single_device')` is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible backends: dp, ddp_spawn, ddp_sharded_spawn, tpu_spawn. In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.

(Need to replace the install with … .) Working on a fix here: #12008
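For reference, a minimal sketch of how such an interactive-environment guard can exempt single-process strategies. The detection heuristic, the function name, and the strategy-name sets below are assumptions for illustration, not the library's exact code:

```python
import sys

# Crude interactive-session heuristic, used only for this sketch.
_IS_INTERACTIVE = hasattr(sys, "ps1") or bool(getattr(sys.flags, "interactive", 0))

# Illustrative strategy-name sets; the exact contents are assumptions.
INTERACTIVE_COMPATIBLE = {"dp", "ddp_spawn", "ddp_sharded_spawn", "tpu_spawn"}
SINGLE_PROCESS = {"single_device"}

def validate_interactive_strategy(strategy_name: str) -> None:
    """Raise only for multi-process strategies that cannot run inside a notebook."""
    if _IS_INTERACTIVE and strategy_name not in INTERACTIVE_COMPATIBLE | SINGLE_PROCESS:
        raise RuntimeError(
            f"strategy={strategy_name!r} is not compatible with an interactive environment. "
            f"Run your code as a script or choose one of: {sorted(INTERACTIVE_COMPATIBLE)}"
        )

validate_interactive_strategy("single_device")  # a single-process strategy should not raise in a notebook
```

The actual fix in #12008 may of course differ; this only illustrates the intent of not raising for single-process strategies in notebooks.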
Found another bug :)) #12044
Follow-up: add AcceleratorConnector.parallel_device_ids and deprecate Trainer.data_parallel_device_ids (#12051)
I'm going to close this issue as most of the work has been done. If anybody has extra items in mind, please open separate smaller issues for them.
Proposed refactor
We have been discussing this for a while, and there are several related issues on this topic.
Motivation
- Moving towards a stable Strategy version
- The current logic is unclear and hard to maintain
- There is a lot of simplification we can do after the rewrite
Pitch
The new logic can be divided into 3 parts (details in the PR); a rough sketch of the flow is shown after this list.
Part 1: Check for misconfiguration set by the user (conflicts and duplication between flags) and set the final flags.
Part 2: Choose the Strategy, Accelerator, Precision and cluster_environment, and set up the parallel devices.
Part 3: Initialize the Strategy and set up the Strategy's Accelerator, Precision, CheckpointIO, cluster environment and parallel devices (all require lazy initialization).
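A minimal sketch of this three-part flow, assuming placeholder class, method, and attribute names (this is not the actual PyTorch Lightning implementation):

```python
# Illustrative three-phase connector flow; all names here are placeholders.
class AcceleratorConnectorSketch:
    def __init__(self, accelerator=None, strategy=None, precision=32, devices=None, plugins=None):
        # Part 1: validate user flags (conflicts/duplication) and settle on final values.
        self._check_config_and_set_final_flags(accelerator, strategy, precision, devices, plugins)
        # Part 2: choose accelerator/strategy/precision/cluster environment and resolve devices.
        self._choose_accelerator_and_devices()
        self._choose_strategy()
        # Part 3: lazily initialize the strategy and attach its components.
        self._lazy_init_strategy()

    def _check_config_and_set_final_flags(self, accelerator, strategy, precision, devices, plugins):
        # Example conflict check: accelerator passed both on the strategy object and as a flag.
        if accelerator is not None and getattr(strategy, "accelerator", None) is not None:
            raise ValueError("accelerator set through both the strategy class and the accelerator flag; choose one")
        self._accelerator_flag = accelerator
        self._strategy_flag = strategy
        self._precision_flag = precision
        self._devices_flag = devices
        self._plugins_flag = plugins or []

    def _choose_accelerator_and_devices(self):
        # Fall back to a single CPU device when nothing was requested.
        self._accelerator_flag = self._accelerator_flag or "cpu"
        self._parallel_devices = self._devices_flag or 1

    def _choose_strategy(self):
        self._strategy_flag = self._strategy_flag or "single_device"

    def _lazy_init_strategy(self):
        # In the real connector this is where the strategy gets its accelerator,
        # precision plugin, checkpoint IO and cluster environment attached.
        self.strategy = {"name": self._strategy_flag, "devices": self._parallel_devices}
```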
Follow-up items from #11448:
- …: move this check to the IPUPrecision plugin (from @carmocca).
- Move this check to the strategy (from @ananthsub).
- Add typing to the accelerator_connector. Can we do this as a separate PR after the unused-properties deprecation? (from @kaushikb11, @awaelchli, @ananthsub)
- Reduce duplicated strategy registry code: classmethod inheritance doesn't work with the current strategy registry logic, because cls is the base class, not the subclass. To reduce the duplicated register_strategies methods, we need to redo the strategy registry logic (@kaushikb11, @awaelchli, @tchaton). See the registry sketch after this list.
- Flag conflict and fallback logic revisit:
  - Different flags set to the same thing: should be an error (from @tchaton).
  - dp/ddp2 on CPU falling back to ddp: should be an error instead of a silent fallback (from @ananthsub).
  - [RFC] Handle cluster_env and checkpoint_io set in both the strategy and the plugins flag, e.g. strategy=DDPPlugin(cluster_env=LightningEnv()), plugins=[TorchelasticEnv()]; check that there is at most one instance of each type in the plugins flag (from @tchaton).
  - DDP is now the default with 1 GPU multi-node; why not fall back to ddp_spawn for all? (from @tchaton)
  - Add/revisit warnings for the fallback logic.
  - Is Apex supported with sharded methods? Should we remove self._precision_flag in (16, "bf16") from the "Sharded plugins are not supported with apex, please switch to amp_backend='native'." check? (from @tchaton)
- Move the _IS_INTERACTIVE check to the strategy.
Loss check for "The
TPUAccelerator
can only be used with aSingleTPUStrategy
orTPUSpawnStrategy
," from @ananthsub (not required, nice to have)improving error message
Trainer(strategy={strategy})
" f" but you have also passed {accelerator} in Trainer(accelerator={accelerator}) instead of "accelerator set through both strategy class and accelerator flag, choose one" (from @ananthsub)Trainer(accelerator='cpu', precision=16, amp_type='apex')
"" but apex AMP not supported on CPU." Worth to mention this works with bfloat16 and native. (from @tchaton )
- Enable the accelerator.is_available() check.
- Address all the TODOs in accelerator_connector.
- (HIGH PRIORITY) Re-introduce the _init_deterministic method on the AcceleratorConnector and set the value for deterministic.
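On the strategy-registry item above, here is a minimal sketch of one way to avoid each strategy re-implementing register_strategies, using __init_subclass__ so that registration always sees the subclass. The dict-backed registry and all class names are illustrative assumptions, not PyTorch Lightning's actual API:

```python
# Minimal sketch of a dict-backed strategy registry; names are illustrative only.
from typing import Dict, Type

STRATEGY_REGISTRY: Dict[str, Type["BaseStrategy"]] = {}

class BaseStrategy:
    strategy_name: str = "base"

    def __init_subclass__(cls, **kwargs):
        # Runs once per subclass definition with `cls` bound to the subclass,
        # so every strategy registers itself without a duplicated classmethod.
        super().__init_subclass__(**kwargs)
        STRATEGY_REGISTRY[cls.strategy_name] = cls

class DDPStrategySketch(BaseStrategy):
    strategy_name = "ddp"

class DDPSpawnStrategySketch(BaseStrategy):
    strategy_name = "ddp_spawn"

print(sorted(STRATEGY_REGISTRY))  # ['ddp', 'ddp_spawn']
```

Whether __init_subclass__ or an explicit registration decorator is the better fit depends on how aliases and per-strategy init kwargs are handled in the real registry.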
Additional context
Improvements and potential improvements:
cc @justusschock @awaelchli @akihironitta @rohitgr7 @kaushikb11 @ninginthecloud @carmocca @ananthsub @tchaton