
Rewrite Accelerator_connector and follow up tasks #11449

Closed
4 of 10 tasks
four4fish opened this issue Jan 12, 2022 · 7 comments · Fixed by #11448, #12083 or #12105
Comments

@four4fish
Contributor

four4fish commented Jan 12, 2022

Proposed refactor

We have been discussing this for a while, and there are several existing issues related to this topic.

Motivation

Moving towards a stable Strategy version
The current logic is unclear and hard to maintain
There are many simplifications we can make after the rewrite

Pitch

The new logic can be divided into 3 parts (details in the PR).

Part 1: Check misconfigurations set by the user (conflicts between flags, duplication between flags) and set the final flags.

Part 2: Choose the Strategy, Accelerator, Precision and cluster_environment, and set up the parallel devices.

Part 3: Initialize the Strategy and set up the Strategy's Accelerator, Precision, Checkpoint_IO, cluster environment and parallel devices (all of which require lazy initialization).
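
For illustration, a minimal sketch of how these three parts could be laid out. The class and method names below are hypothetical and simplified, not the actual implementation in #11448:

    class AcceleratorConnectorSketch:
        """Illustrative only: the three-part flow described above."""

        def __init__(self, accelerator=None, strategy=None, precision=32, plugins=None, devices=None):
            # Part 1: validate the user-facing flags (conflicts, duplicates) and settle the final values
            self._check_config_and_set_final_flags(accelerator, strategy, precision, plugins)
            # Part 2: resolve the flags into Strategy, Accelerator, Precision and ClusterEnvironment choices
            self._choose_components(devices)
            # Part 3: instantiate the Strategy and attach its accelerator, precision plugin,
            # checkpoint IO, cluster environment and parallel devices (lazy initialization)
            self._lazy_init_strategy()

        def _check_config_and_set_final_flags(self, accelerator, strategy, precision, plugins):
            self._accelerator_flag, self._strategy_flag = accelerator, strategy
            self._precision_flag, self._plugins_flag = precision, list(plugins or [])

        def _choose_components(self, devices):
            self._parallel_devices = devices  # placeholder: real logic picks concrete classes here

        def _lazy_init_strategy(self):
            self.strategy = None  # placeholder: real logic builds the Strategy and sets its collaborators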

Follow-up items from #11448

  1. Move error messages to the PrecisionPlugin, Strategy and Accelerator init methods if possible.
    E.g. move this check to the IPUPrecisionPlugin (from @carmocca); see the sketch after this list:

    if self._precision_flag not in (16, 32):
        raise MisconfigurationException(
            f"`Trainer(accelerator='ipu', precision={self._precision_flag!r})` is not supported."
        )

    Move this check to the Strategy (from @ananthsub):

    if self._precision_flag in (16, "bf16") and self._amp_type_flag == AMPType.APEX:
        if isinstance(self.strategy, (DDPShardedStrategy, DDPSpawnShardedStrategy, DDPFullyShardedStrategy)):
            raise MisconfigurationException(
                "Sharded plugins are not supported with apex, please switch to `amp_backend='native'`."
            )
  2. Add typing to accelerator_connector. Can we do this as a separate PR after the unused-properties deprecation? (from @kaushikb11 @awaelchli @ananthsub)

  3. Reduce duplicated strategy registry code: classmethod inheritance doesn't work with the current strategy registry logic, because cls is the base class and not the subclass. To remove the duplicated register_strategies methods, we need to redo the strategy registry logic. @kaushikb11 @awaelchli @tchaton

  4. Flag conflict and fallback logic revisit:
    - Different flags set to the same thing: should be an error (from @tchaton)
    - dp/ddp2 on CPU falling back to ddp: should be an error instead of a silent fallback (from @ananthsub)
    - [RFC] Handle cluster_env and checkpoint_io set in both strategy() and plugins, e.g. (strategy=DDPPlugin(cluster_env=LightningEnv()), plugin=[TorchelasticEnv()])
    - Check that there is at most one instance of each type in the plugins flag (from @tchaton)
    - DDP is now the default with 1 GPU multi-node; why not fall back to ddp_spawn for all? (from @tchaton)
    - Add/revisit warnings for the fallback logic
    - Is Apex supported with sharded methods? Should we remove self._precision_flag in (16, "bf16") from the "Sharded plugins are not supported with apex, please switch to amp_backend='native'." check? (from @tchaton)

  5. Move the _IS_INTERACTIVE check to the Strategy

  6. Loosen the check for "The TPUAccelerator can only be used with a SingleTPUStrategy or TPUSpawnStrategy" (from @ananthsub; not required, nice to have)

  7. Improve error messages:

    • Replace "You can only specify one strategy to the Trainer. You have passed Trainer(strategy={strategy}) but you have also passed {accelerator} in Trainer(accelerator={accelerator})" with something like "accelerator set through both strategy class and accelerator flag, choose one" (from @ananthsub)
    • "You passed Trainer(accelerator='cpu', precision=16, amp_type='apex') but apex AMP is not supported on CPU." It is worth mentioning that this works with bfloat16 and the native backend. (from @tchaton)
  8. Enable the accelerator.is_available() check

  9. Address all the TODOs in accelerator_connector:

    • Deprecate unused properties
  10. (HIGH PRIORITY) Re-introduce the _init_deterministic method on the AcceleratorConnector and set the value for deterministic.
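
As referenced in item 1, here is a minimal sketch of what moving the IPU precision check into the plugin constructor could look like. This is hedged and illustrative; the actual class name and signature in Lightning may differ:

    from pytorch_lightning.utilities.exceptions import MisconfigurationException


    class IPUPrecisionSketch:
        """Illustrative only: validate the precision flag at plugin construction time
        instead of inside the AcceleratorConnector."""

        def __init__(self, precision: int) -> None:
            if precision not in (16, 32):
                raise MisconfigurationException(
                    f"`Trainer(accelerator='ipu', precision={precision!r})` is not supported."
                )
            self.precision = precision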

Additional context

Improvements and potential improvements:

  • Enums could be deprecated: _StrategyType, _AcceleratorType, _distrib_type, _device_type and distributed_backend are not needed in the new version
  • Strategy registry logic revisit: currently half of the string names are registered in the registry and the other half live in _StrategyType; we could consolidate them (see the sketch after this list)
  • Further lazy initialization of the parallel Strategy classes: parallel devices need to be lazily initialized
  • Revisit flag priorities (part 1), the choosing logic (part 2) and the associated tests
  • Consolidate and revisit device-parsing logic in utilities/devices, the Trainer and the XAccelerator classes
  • Improve tests, increase coverage and remove unnecessary tests
  • Deprecate unused functions from accelerator_connector (kept for now for backward compatibility)
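
As referenced above, one illustrative way to consolidate the registry and avoid every subclass re-implementing register_strategies is to register strategies automatically at subclass-definition time, so cls is always the concrete subclass. This is a generic pattern sketch, not the actual StrategyRegistry code:

    from typing import Dict, Type

    # Illustrative registry pattern using __init_subclass__.
    STRATEGY_REGISTRY: Dict[str, Type["StrategySketch"]] = {}


    class StrategySketch:
        strategy_name = "base"

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            # `cls` here is the concrete subclass being defined, not the base class.
            STRATEGY_REGISTRY[cls.strategy_name] = cls


    class DDPStrategySketch(StrategySketch):
        strategy_name = "ddp"


    class DDPSpawnStrategySketch(StrategySketch):
        strategy_name = "ddp_spawn"


    assert STRATEGY_REGISTRY["ddp"] is DDPStrategySketch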

If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.

cc @justusschock @awaelchli @akihironitta @rohitgr7 @kaushikb11 @ninginthecloud @carmocca @ananthsub @tchaton

@ananthsub
Contributor

Is this a duplicate of #10422?
Can we consolidate the discussion on one issue to make tracking easier?

@four4fish
Contributor Author

Is this a duplicate of #10422? Can we consolidate the discussion on one issue to make tracking easier?

@ananthsub Some of the details in #10422 are out of date. I have closed that one; we will only track through this one.

@awaelchli
Contributor

@four4fish Do I see correctly that the _init_deterministic method has been completely removed from the connector? The deterministic flag does not get set at all.

It is very confusing that we never had tests for this. There are two tests that call
trainer._accelerator_connector._init_deterministic(False), but they don't fail because trainer is a MagicMock there 🤣

I'm adding this as a follow-up task. cc @krshrimali, who is working on a related PR #11944
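
For reference, a rough sketch of what such an _init_deterministic step might set. This is an assumption about the shape of the logic, and the real method (see #11944) may configure additional state:

    import os

    import torch


    def init_deterministic(deterministic: bool) -> None:
        # Rough sketch: ask PyTorch to use deterministic algorithms where available.
        torch.use_deterministic_algorithms(deterministic)
        if deterministic:
            # cuBLAS requires a workspace config for deterministic behaviour on CUDA >= 10.2.
            os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")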

@four4fish
Contributor Author

@awaelchli Oops!! You are totally right! I'm adding it back now.

@awaelchli
Contributor

Thanks @four4fish
Found another bug with the rewrite that breaks Jupyter notebooks.

The check for is_interactive_compatible is no longer working correctly.

The BoringModel here raises an exception:

MisconfigurationException                 Traceback (most recent call last)
<ipython-input-7-ec9775ede022> in <module>()
----> 1 run()

4 frames
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py in _lazy_init_strategy(self)
    725         if _IS_INTERACTIVE and self.strategy.strategy_name not in interactive_compatible_strategy:
    726             raise MisconfigurationException(
--> 727                 f"`Trainer(strategy={self.strategy.strategy_name!r})` or"
    728                 f" `Trainer(accelerator={self.strategy.strategy_name!r})` is not compatible with an interactive"
    729                 " environment. Run your code as a script, or choose one of the compatible backends:"

MisconfigurationException: `Trainer(strategy='single_device')` or `Trainer(accelerator='single_device')` is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible backends: dp, ddp_spawn, ddp_sharded_spawn, tpu_spawn. In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.

(You need to replace the install cell with ! pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master)

Working on a fix here: #12008
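
Related to follow-up item 5 (moving the _IS_INTERACTIVE check to the Strategy), one hypothetical direction is to key the check on a per-strategy attribute rather than a hardcoded list of strategy names, which would avoid cases like 'single_device' being rejected. This is a sketch under that assumption, not the fix implemented in #12008:

    import sys


    def _is_interactive() -> bool:
        # True when running inside an interactive interpreter or notebook kernel.
        return bool(getattr(sys, "ps1", None)) or bool(sys.flags.interactive)


    class StrategyStub:
        # Hypothetical flag each Strategy could expose.
        is_interactive_compatible = True


    def check_interactive_compatibility(strategy: StrategyStub) -> None:
        if _is_interactive() and not strategy.is_interactive_compatible:
            raise RuntimeError(
                f"`{type(strategy).__name__}` is not compatible with an interactive environment."
                " Run your code as a script, or choose a compatible strategy."
            )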

@awaelchli
Contributor

awaelchli commented Feb 22, 2022

Found another bug :)) #12044

@awaelchli awaelchli reopened this Mar 4, 2022
@ananthsub ananthsub reopened this Mar 11, 2022
@carmocca carmocca modified the milestones: 1.6, 1.7 Mar 21, 2022
@carmocca
Contributor

carmocca commented Jul 19, 2022

I'm going to close this issue as most of the work has been done. If anybody has extra items in mind, please open separate smaller issues for them.
