[Main Issue] Accelerator and Plugin refactor #10416

Closed · 2 of 4 tasks
four4fish opened this issue Nov 8, 2021 · 6 comments · Fixed by #10570 or #10596
Labels: accelerator · design · plugin · refactor


four4fish commented Nov 8, 2021

Proposed refactoring or deprecation

Motivation

The Accelerator is not a stable API yet; we can improve the Accelerator-related logic and move towards a stable Accelerator API for 1.6.

Pitch

Steps

  1. Collective refactor: Consolidate collective functions (#7534)
  2. Move the Precision Plugin into the TTP: Precision Plugins should be part of Training Type Plugins (#7324)
  3. Move the Accelerator into the Strategy: [Accelerator refactor] Move Accelerator into Strategy (#10648)
  4. Simplify the spawning logic: Simplify multiprocessing logic in DDPSpawn plugins (#10059)
  5. [RFC] Simplify the Accelerator Connector logic and flags (can be done in parallel with the above): Rewrite Accelerator_connector and follow up tasks (#11449)
  6. [RFC] Revisit the inheritance of the TTP: Flatten the Strategy inheritance (#11863)

More details in the Accelerator Refactor Proposal [updating]; a rough sketch of the target composition follows.
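
To make the target shape of steps 2 and 3 concrete, here is a minimal sketch of the intended ownership, using simplified, hypothetical class names rather than the actual Lightning implementation:

```python
# Minimal sketch of the intended ownership after steps 2 and 3.
# Hypothetical, simplified names; not the actual Lightning classes.

class Accelerator:
    """Keeps only hardware concerns: device setup and teardown."""

    def setup_device(self, device):
        ...


class PrecisionPlugin:
    """Owns the backward/optimizer-step logic for a precision backend."""

    def backward(self, loss):
        loss.backward()


class Strategy:
    """Composes the accelerator and the precision plugin, inverting the
    old arrangement where the Accelerator owned the training type plugin."""

    def __init__(self, accelerator, precision_plugin):
        self.accelerator = accelerator
        self.precision_plugin = precision_plugin

    def backward(self, loss):
        # Precision-specific work is delegated to the owned plugin.
        self.precision_plugin.backward(loss)
```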

FAQ

  1. Will this involve a lot of breaking changes?
    Steps 1–4 involve few user-facing API changes (unless we find other existing bugs during the refactor). The only breaking changes will be for custom plugins; see the sketch after this FAQ.
    Steps 5 and 6 are still at the RFC stage and may include breaking changes that impact user-facing APIs.

  2. How does this impact lightningLite?
    It should be helpful for LightningLite too; some function refactoring/simplification may be possible there as well. (@awaelchli, any suggestions on this part?)
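
To illustrate the kind of custom-plugin breakage FAQ 1 refers to, here is a hedged before/after sketch with simplified, hypothetical names (assuming step 2 lands as proposed; not the actual Lightning API):

```python
# Hypothetical before/after for a custom plugin once step 2 lands.
# Simplified names and signatures; not the actual Lightning API.

class OldAccelerator:
    """Pre-refactor shape: the accelerator still owns backward()."""

    def backward(self, loss):
        loss.backward()


class PrecisionPlugin:
    """Post-refactor shape: backward() lives on the precision plugin."""

    def backward(self, loss):
        loss.backward()


# Before: users customized precision behavior on the accelerator.
class MyAccelerator(OldAccelerator):
    def backward(self, loss):
        super().backward(loss * 0.5)  # illustrative scaling only


# After: the same override moves to a custom precision plugin, which is
# why only custom plugins see a breaking change.
class MyPrecisionPlugin(PrecisionPlugin):
    def backward(self, loss):
        super().backward(loss * 0.5)  # illustrative scaling only
```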

Follow-up TODOs:



cc @justusschock @awaelchli @akihironitta @tchaton @Borda @kaushikb11 @ananthsub

four4fish added the refactor, design, and distributed labels on Nov 8, 2021
awaelchli commented

Great summary! Regarding "How does this impact lightningLite?", as best as I can see right now:

  • Steps 1 and 2 do not impact Lite. If changes are required in the Trainer, they will be mirrored directly in Lite. Neither the Trainer nor Lite makes strong assumptions about the internal composition of plugins.
  • Step 3: not sure.
  • Step 4: will impact the Trainer more than Lite. Lite is sort of already setting an example here of how it could/should look in the Trainer.
  • Steps 5 and 6: should not impact Lite at all (the accelerator connector is shared).

tchaton commented Nov 9, 2021

Yes, definitely really excited about this. I am quite eager to finally see the accelerators marked as a stable API, targeting v1.6.

awaelchli commented

@ananthsub

#10570 (review)

It'd be very good to list which APIs this lets us simplify or remove from the Accelerator/Strategy/precision plugin. Those could be listed as immediate follow-ups to this PR.

Here are some that I found; some are also mentioned in the doc:

After 2)

Methods:

  • Accelerator.optimizer_step (move)
  • Accelerator.backward (move)
  • Accelerator.*_dispatch (move + reduce)
  • Accelerator.*_step (move)
  • some miscellaneous ones

Properties:

  • Accelerator.amp_backend (move)
  • Accelerator.precision (move)
  • Accelerator.scaler (move)

After 3)

  • TrainingTypePlugin.on_gpu/on_tpu
  • TrainingTypePlugin.model_to_device

These will either make use of the accelerator or be moved to it.
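
For illustration, here is a sketch of where the members listed above could end up, again with simplified, hypothetical names and signatures rather than the real Lightning API:

```python
# Hypothetical mapping of the moves listed above. Simplified names and
# signatures; not the real Lightning API.

class PrecisionPlugin:
    # Were: Accelerator.precision / Accelerator.amp_backend / Accelerator.scaler
    precision = 32
    amp_backend = None
    scaler = None

    # Was: Accelerator.backward
    def backward(self, loss):
        loss.backward()

    # Was: Accelerator.optimizer_step
    def optimizer_step(self, optimizer):
        optimizer.step()


class Accelerator:
    """After step 3, the strategy owns one of these for device concerns."""

    root_device = "cpu"  # placeholder device for the sketch


class Strategy:
    def __init__(self, accelerator, precision_plugin):
        self.accelerator = accelerator
        self.precision_plugin = precision_plugin

    # Was: TrainingTypePlugin.model_to_device; it can now delegate the
    # device choice to the owned accelerator.
    def model_to_device(self, model):
        model.to(self.accelerator.root_device)
```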

four4fish linked a pull request on Nov 20, 2021 that will close this issue
awaelchli commented

Now that the precision plugin has moved, I will take a look at this TODO here:
https://github.com/PyTorchLightning/pytorch-lightning/blob/af0bb96f0ff645102680a7adc99dc131cfeb9c0b/pytorch_lightning/lite/lite.py#L433-L439

awaelchli commented

Here is another follow-up we need to do, IMO: #10686. This will also unblock #10657.

carmocca commented

Closing this issue in favor of the smaller linked issues for the pending tasks.
