
[RFC] Simplifying the Accelerator Connector logic and flags #10422

Closed
four4fish opened this issue Nov 8, 2021 · 4 comments

four4fish commented Nov 8, 2021

Proposed refactoring or deprecation

Item 5 of #10416 (Accelerator and Plugin refactor),
part of #10417 (Core Trainer Connectors),
and related to #10410 ([RFC] Future of gpus/ipus/tpu_cores with respect to devices).

Motivation

The current flags and accelerator logic are confusing: multiple accelerator flags partially overlap and interfere with each other.

There are 30 MisconfigurationExceptions raised in the accelerator connector, and half of them are caused by duplicated flags interfering with each other.

Multiple flags with the same meaning don't add much value; they cause confusion and make the accelerator_connector logic unnecessarily complicated.

For example (see the sketch after this list):

  1. The devices flags mentioned in #10410 ([RFC] Future of gpus/ipus/tpu_cores with respect to devices): with gpus=2, devices=3, the devices flag is silently ignored, or

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L867-L869

  2. accelerator can have multiple meanings: it can be a device type given as a string, or an Accelerator() instance that wraps the precision plugin and the TTP.

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L784-L791

  3. plugins and strategy are duplicated: if the user specifies both, it is a misconfiguration, and we have to keep logic to handle both the strategy flag and the plugins flag (there is distributed_backend too, and it is deprecated).

https://github.com/PyTorchLightning/pytorch-lightning/blob/a9bd4fbd96c1e73e251859d99b207008008de87d/pytorch_lightning/trainer/connectors/accelerator_connector.py#L317-L322
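To make the overlaps concrete, here is a rough sketch of the flag combinations described above, written against the ~1.5-era Trainer API (behavior paraphrased from this issue; exact messages and defaults may differ by version):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# 1. Two flags for the device count: when both are set, `devices` is ignored
#    in favor of `gpus` (requires a multi-GPU machine to actually run).
Trainer(gpus=2, devices=3)

# 2. One flag with two meanings: `accelerator` accepts a device-type string ...
Trainer(accelerator="cpu", devices=2)
# ... or a fully constructed Accelerator() object that already wraps the
# precision plugin and the training type plugin (TTP).

# 3. `strategy` and `plugins` overlap: specifying the training type through
#    both is rejected with a MisconfigurationException, and the connector also
#    has to reconcile the deprecated `distributed_backend` flag.
Trainer(strategy="ddp", plugins=[DDPPlugin()])  # raises MisconfigurationException
```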

Also, with the increasing use of custom plugins, it's critical to have a more scalable solution. For example, the current enum for distributed modes does not scale to customized distributed plugins:
https://github.com/PyTorchLightning/pytorch-lightning/blob/db4e7700047519ff6e6365517d7e592c8ef023cb/pytorch_lightning/utilities/enums.py

Pitch

Every flag should have one and only one meaning, with no overlap between flags, to reduce the chance of user misconfiguration.
Deprecate the num_processes, tpu_cores, ipus, gpus, and plugins flags.
Keep the options:

devices_numbers (devices):     # how many devices the user wants to use
devices_type (accelerator):    # cpu/gpu/tpu etc., used to choose the Accelerator
strategy:                      # which TTP plugin to use
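As a hypothetical sketch, a Trainer call under this proposal would compose those three flags and nothing else:

```python
from pytorch_lightning import Trainer

# Hypothetical usage once the proposal lands: one flag per concept.
Trainer(accelerator="gpu", devices=8, strategy="ddp")  # 8 GPUs, DDP
Trainer(accelerator="tpu", devices=8)                  # 8 TPU cores, strategy auto-selected
Trainer(accelerator=None, devices=None)                # None means auto for both
```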

Stricter typing:

devices_numbers: Optional[int]                        # None means auto
devices_type (accelerator): Optional[str]             # None means auto; remove the Accelerator() type
strategy: Optional[Union[str, TrainingTypePlugin]]    # RFC: should we support both DDPPlugin() and 'ddp', or just one?
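For the strategy typing question, the two candidate forms look like this (illustrative only; the RFC asks whether to keep both or only one):

```python
from pytorch_lightning import Trainer
from pytorch_lightning.plugins import DDPPlugin

# String form: resolved by name (registry lookup), keeps the call site short.
Trainer(accelerator="gpu", devices=2, strategy="ddp")

# Object form: the user constructs and configures the TTP themselves.
Trainer(accelerator="gpu", devices=2, strategy=DDPPlugin(find_unused_parameters=False))
```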

Reduce unnecessary internal wrappers.
Remove the dependency on Enums; use TrainingTypePluginsRegistry names instead, which work for both built-in plugins and customized plugins.
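A minimal sketch of the registry-by-name idea (not the actual TrainingTypePluginsRegistry API, just the shape of the lookup that would replace the DistributedType enum):

```python
# Minimal sketch, assuming a plain dict-backed registry keyed by strategy name.
_STRATEGY_REGISTRY = {}

def register_strategy(name, plugin_cls, **init_kwargs):
    """Register a built-in or custom TTP class under a string name."""
    _STRATEGY_REGISTRY[name] = (plugin_cls, init_kwargs)

def resolve_strategy(name):
    """Instantiate a strategy by name; custom plugins need no enum changes."""
    plugin_cls, init_kwargs = _STRATEGY_REGISTRY[name]
    return plugin_cls(**init_kwargs)

# Built-in and user-defined plugins register through the same path, e.g.:
# register_strategy("ddp", DDPPlugin)
# register_strategy("my_ddp", MyCustomDDPPlugin, find_unused_parameters=False)
```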

Additional context



@four4fish added the refactor and design labels on Nov 8, 2021
@four4fish (Author) commented:

@PyTorchLightning/core-contributors I'd like your feedback on this


tchaton commented Nov 15, 2021

Hey @four4fish

The tracking conversation is here: #10410

@carmocca commented:

I think this RFC could be split into smaller sections, as it tries to discuss too many aspects.

@four4fish (Author) commented:

> Hey @four4fish
>
> The tracking conversation is here: #10410

Hi Thomas, #10410 is one of the topics this issue discusses.
