[RFC] Simplifying the Accelerator Connector logic and flags #10422
Comments
@PyTorchLightning/core-contributors I'd like your feedback on this.
Hey @four4fish, the tracking conversation is here: #10410
I think this RFC could be split into smaller sections, as it tries to discuss too many aspects.
Hi Thomas, #10410 is part of the topics this issue is discussing.
Proposed refactoring or deprecation
Item 5 of #10416 (Accelerator and Plugin refactor) and part of #10417 (Core Trainer Connectors).
Related to #10410 ([RFC] Future of `gpus`/`ipus`/`tpu_cores` with respect to `devices`).
Motivation
The current flags and accelerator logic are confusing: multiple accelerator flags partially overlap and interfere with each other.
There are 30 `MisconfigurationException`s in the accelerator connector; half of them are caused by duplicated flags interfering with each other.
Multiple flags with the same meaning do not add much value; they cause confusion and make the `accelerator_connector` logic unnecessarily complicated.
For example (a short sketch follows this list):
- `devices` overlaps with the flags discussed in #10410 ([RFC] Future of `gpus`/`ipus`/`tpu_cores` with respect to `devices`). If a user passes `gpus=2, devices=3`, the `devices` flag is ignored:
https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L867-L869
- The `accelerator` flag accepts either a device-type string or an `Accelerator()` object, which wraps the precision plugin and the TTP (training type plugin):
https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L784-L791
- `plugins` and `strategy` are duplicated: if a user specifies both, it is a misconfiguration, and we have to keep logic to handle both the `strategy` flag and the `plugins` flag. (There is `distributed_backend` too, and it is deprecated.)
https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L317-L322
- With the increasing use of custom plugins, it is critical to have a more scalable solution. For example, the current distributed enum does not scale to customized distributed plugins:
https://github.com/PyTorchLightning/pytorch-lightning/blob/db4e7700047519ff6e6365517d7e592c8ef023cb/pytorch_lightning/utilities/enums.py
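To make the overlap concrete, here is a minimal sketch of the kinds of calls the connector currently has to disambiguate. It is written against pytorch_lightning ~1.5 as described above; the exact warnings and exception messages may differ by version, and the first call assumes a machine with at least 2 GPUs.

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin
from pytorch_lightning.utilities.exceptions import MisconfigurationException

# `gpus` and `devices` overlap: when both are given, `devices` is ignored
# (assumes a machine with at least 2 GPUs).
trainer = Trainer(gpus=2, devices=3)

# `strategy` and `plugins` can describe the same training type plugin, so the
# connector has to detect the duplication and raise a MisconfigurationException.
try:
    trainer = Trainer(strategy="ddp", plugins=DDPPlugin())
except MisconfigurationException as err:
    print(err)
```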
Pitch
Every flag should have one and only one meaning, with no overlap between flags, to reduce the possibility of user misconfiguration.
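A minimal sketch of what this could look like for the user, assuming the direction of #10410 and the `accelerator`/`devices`/`strategy` flags that already exist (not a finalized API):

```python
from pytorch_lightning import Trainer

# Today, "2 GPUs with DDP" can be spelled in several overlapping ways
# (all assume a machine with at least 2 GPUs):
Trainer(gpus=2, accelerator="ddp")   # `accelerator` used as a strategy (deprecated)
Trainer(gpus=2, strategy="ddp")      # `gpus` overlaps with `devices`

# Proposed direction: one canonical, non-overlapping spelling, where
# `accelerator` = hardware type, `devices` = how many, `strategy` = distribution.
Trainer(accelerator="gpu", devices=2, strategy="ddp")
```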
- Deprecate the `num_processes`, `tpu_cores`, `ipus`, `gpus`, and `plugins` flags; keep the remaining options.
- Stricter typing.
- Reduce unnecessary internal wrappers.
- Remove the dependency on enums; use `TrainingTypePluginsRegistry` names instead, which work for both built-in and customized plugins (sketched below).
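A minimal, self-contained sketch of the registry-over-enum idea. The class and names below are hypothetical stand-ins for `TrainingTypePluginsRegistry`, not the actual Lightning API; the point is that a string-keyed registry lets built-in and third-party strategies go through the same lookup, whereas an enum has to be edited in core for every new backend.

```python
from typing import Dict, Type


class StrategyRegistry:
    """Hypothetical string-keyed registry standing in for TrainingTypePluginsRegistry."""

    def __init__(self) -> None:
        self._plugins: Dict[str, Type] = {}

    def register(self, name: str, plugin_cls: Type) -> None:
        # Built-in and custom plugins use the same code path; no enum entry needed.
        self._plugins[name] = plugin_cls

    def get(self, name: str) -> Type:
        if name not in self._plugins:
            raise KeyError(f"Unknown strategy: {name!r}")
        return self._plugins[name]


registry = StrategyRegistry()


class MyCustomDDP:
    """A user-defined training type plugin (stand-in)."""


# A custom plugin becomes selectable by name, e.g. Trainer(strategy="my_custom_ddp")
# could resolve through the registry instead of a DistributedType enum.
registry.register("my_custom_ddp", MyCustomDDP)
plugin_cls = registry.get("my_custom_ddp")
```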
Additional context
If you enjoy Lightning, check out our other projects! ⚡
Metrics: Machine learning metrics for distributed, scalable PyTorch applications.
Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.
Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.
Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.
Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.