
[RFC] Simplifying the Accelerator Connector logic and flags #10422

Closed
four4fish opened this issue Nov 8, 2021 · 4 comments

four4fish commented Nov 8, 2021

Proposed refactoring or deprecation

Item 5 of #10416 (Accelerator and Plugin refactor),
part of #10417 (Core Trainer Connectors),
and related to #10410 ([RFC] Future of gpus/ipus/tpu_cores with respect to devices).

Motivation

The current flags and accelerator logic are confusing: multiple accelerator flags partially overlap and interfere with each other.

There are 30 MisconfigurationExceptions raised in the accelerator connector, and half of them are caused by duplicated flags interfering with each other.

Multiple flags with the same meaning don't add much value; they cause confusion and make the accelerator_connector logic unnecessarily complicated.

For example (see the sketch after this list):

  1. The devices flags mentioned in #10410 ([RFC] Future of gpus/ipus/tpu_cores with respect to devices): with gpus=2, devices=3, the devices flag is silently ignored, or

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L867-L869

  2. accelerator can have multiple meanings: it can be a device type given as a string, or an Accelerator() instance that wraps the precision plugin and the TTP.

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L784-L791

  3. plugins and strategy are duplicated: if the user specifies both, it is a misconfiguration, and we have to keep logic to handle both the strategy flag and the plugins flag (there is distributed_backend too, and it is deprecated).

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L317-L322
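To make the overlaps concrete, here is a rough sketch of the flag combinations described above, written against the ~1.5-era Trainer API (behavior paraphrased from this issue; exact messages and defaults may differ by version):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# 1. Two flags for the device count: when both are set, `devices` is ignored
#    in favor of `gpus` (requires a multi-GPU machine to actually run).
Trainer(gpus=2, devices=3)

# 2. One flag with two meanings: `accelerator` accepts a device-type string ...
Trainer(accelerator="cpu", devices=2)
# ... or a fully constructed Accelerator() object that already wraps the
# precision plugin and the training type plugin (TTP).

# 3. `strategy` and `plugins` overlap: specifying the training type through
#    both is rejected with a MisconfigurationException, and the connector also
#    has to reconcile the deprecated `distributed_backend` flag.
Trainer(strategy="ddp", plugins=[DDPPlugin()])  # raises MisconfigurationException
```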

Also, with the increasing use of custom plugins, it's critical to have a more scalable solution. For example, the current enum for distributed modes does not scale to customized distributed plugins:
https://github.com/PyTorchLightning/pytorch-lightning/blob/db4e7700047519ff6e6365517d7e592c8ef023cb/pytorch_lightning/utilities/enums.py

Pitch

Every flag should have one and only one meaning, with no overlap between flags, to reduce the chance of user misconfiguration.
Deprecate the num_processes, tpu_cores, ipus, gpus, and plugins flags.
Keep the options:

devices_numbers (devices):     # how many devices the user wants to use
devices_type (accelerator):    # cpu/gpu/tpu etc., used to choose the Accelerator
strategy:                      # which TTP plugin to use
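As a hypothetical sketch, a Trainer call under this proposal would compose those three flags and nothing else:

```python
from pytorch_lightning import Trainer

# Hypothetical usage once the proposal lands: one flag per concept.
Trainer(accelerator="gpu", devices=8, strategy="ddp")  # 8 GPUs, DDP
Trainer(accelerator="tpu", devices=8)                  # 8 TPU cores, strategy auto-selected
Trainer(accelerator=None, devices=None)                # None means auto for both
```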

Stricter typing:

devices_numbers: Optional[int]                        # None means auto
devices_type (accelerator): Optional[str]             # None means auto; remove the Accelerator() type
strategy: Optional[Union[str, TrainingTypePlugin]]    # RFC: should we support both DDPPlugin() and 'ddp', or just one?
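For the strategy typing question, the two candidate forms look like this (illustrative only; the RFC asks whether to keep both or only one):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# String form: resolved by name (registry lookup), keeps the call site short.
Trainer(accelerator="gpu", devices=2, strategy="ddp")

# Object form: the user constructs and configures the TTP themselves.
Trainer(accelerator="gpu", devices=2, strategy=DDPPlugin(find_unused_parameters=False))
```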

Reduce unnecessary internal wrappers.
Remove the dependency on Enums; use TrainingTypePluginsRegistry names instead, which work for both built-in plugins and customized plugins.
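A minimal sketch of the registry-by-name idea (not the actual TrainingTypePluginsRegistry API, just the shape of the lookup that would replace the DistributedType enum):

```python
# Minimal sketch, assuming a plain dict-backed registry keyed by strategy name.
_STRATEGY_REGISTRY = {}

def register_strategy(name, plugin_cls, **init_kwargs):
    """Register a built-in or custom TTP class under a string name."""
    _STRATEGY_REGISTRY[name] = (plugin_cls, init_kwargs)

def resolve_strategy(name):
    """Instantiate a strategy by name; custom plugins need no enum changes."""
    plugin_cls, init_kwargs = _STRATEGY_REGISTRY[name]
    return plugin_cls(**init_kwargs)

# Built-in and user-defined plugins register through the same path, e.g.:
# register_strategy("ddp", DDPPlugin)
# register_strategy("my_ddp", MyCustomDDPPlugin, find_unused_parameters=False)
```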

Additional context



@four4fish added the refactor and design labels on Nov 8, 2021
@four4fish (Author) commented:

@PyTorchLightning/core-contributors I'd like your feedback on this


tchaton commented Nov 15, 2021

Hey @four4fish

The tracking conversation is here: #10410

@carmocca commented:

I think this RFC could be split into smaller sections, as it tries to discuss too many aspects.

@four4fish (Author) commented:

> Hey @four4fish
>
> The tracking conversation is here: #10410

Hi Thomas, #10410 is one of the topics this issue discusses.
