
Rewrite Accelerator_connector and follow up tasks #11449

Closed
4 of 10 tasks
four4fish opened this issue Jan 12, 2022 · 7 comments · Fixed by #11448, #12083 or #12105
Comments

@four4fish
Contributor

four4fish commented Jan 12, 2022

Proposed refactor

We have been discussing this for a while, and there are several existing issues related to this topic.

Motivation

Moving towards a stable Strategy version
The current logic is unclear and hard to maintain
There are many simplifications we can make after the rewrite

Pitch

The new logic can be divided into 3 parts (details in the PR).

Part 1: Check misconfigurations set by the user (conflicts between flags, duplication between flags) and set the final flags.

Part 2: Choose the Strategy, Accelerator, Precision and cluster_environment, and set up the parallel devices.

Part 3: Initialize the Strategy and set up the Strategy's Accelerator, Precision, Checkpoint_IO, cluster environment and parallel devices (all of which require lazy initialization).
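
For illustration, a minimal sketch of how these three parts could be laid out. The class and method names below are hypothetical and simplified, not the actual implementation in #11448:

    class AcceleratorConnectorSketch:
        """Illustrative only: the three-part flow described above."""

        def __init__(self, accelerator=None, strategy=None, precision=32, plugins=None, devices=None):
            # Part 1: validate the user-facing flags (conflicts, duplicates) and settle the final values
            self._check_config_and_set_final_flags(accelerator, strategy, precision, plugins)
            # Part 2: resolve the flags into Strategy, Accelerator, Precision and ClusterEnvironment choices
            self._choose_components(devices)
            # Part 3: instantiate the Strategy and attach its accelerator, precision plugin,
            # checkpoint IO, cluster environment and parallel devices (lazy initialization)
            self._lazy_init_strategy()

        def _check_config_and_set_final_flags(self, accelerator, strategy, precision, plugins):
            self._accelerator_flag, self._strategy_flag = accelerator, strategy
            self._precision_flag, self._plugins_flag = precision, list(plugins or [])

        def _choose_components(self, devices):
            self._parallel_devices = devices  # placeholder: real logic picks concrete classes here

        def _lazy_init_strategy(self):
            self.strategy = None  # placeholder: real logic builds the Strategy and sets its collaborators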

Follow-up items from #11448

  1. Move error messages to the PrecisionPlugin, Strategy and Accelerator init methods if possible.
    E.g. move this check to the IPUPrecisionPlugin (from @carmocca); see the sketch after this list:

    if self._precision_flag not in (16, 32):
        raise MisconfigurationException(
            f"`Trainer(accelerator='ipu', precision={self._precision_flag!r})` is not supported."
        )

    Move this check to the Strategy (from @ananthsub):

    if self._precision_flag in (16, "bf16") and self._amp_type_flag == AMPType.APEX:
        if isinstance(self.strategy, (DDPShardedStrategy, DDPSpawnShardedStrategy, DDPFullyShardedStrategy)):
            raise MisconfigurationException(
                "Sharded plugins are not supported with apex, please switch to `amp_backend='native'`."
            )
  2. Add typing to accelerator_connector. Can we do this as a separate PR after the unused-properties deprecation? (from @kaushikb11 @awaelchli @ananthsub)

  3. Reduce duplicated strategy registry code: classmethod inheritance doesn't work with the current strategy registry logic, because cls is the base class and not the subclass. To remove the duplicated register_strategies methods, we need to redo the strategy registry logic. @kaushikb11 @awaelchli @tchaton

  4. Flag conflict and fallback logic revisit:
    - Different flags set to the same thing: should be an error (from @tchaton)
    - dp/ddp2 on CPU falling back to ddp: should be an error instead of a silent fallback (from @ananthsub)
    - [RFC] Handle cluster_env and checkpoint_io set in both strategy() and plugins, e.g. (strategy=DDPPlugin(cluster_env=LightningEnv()), plugin=[TorchelasticEnv()])
    - Check that there is at most one instance of each type in the plugins flag (from @tchaton)
    - DDP is now the default with 1 GPU multi-node; why not fall back to ddp_spawn for all? (from @tchaton)
    - Add/revisit warnings for the fallback logic
    - Is Apex supported with sharded methods? Should we remove self._precision_flag in (16, "bf16") from the "Sharded plugins are not supported with apex, please switch to amp_backend='native'." check? (from @tchaton)

  5. Move the _IS_INTERACTIVE check to the Strategy

  6. Loosen the check for "The TPUAccelerator can only be used with a SingleTPUStrategy or TPUSpawnStrategy" (from @ananthsub; not required, nice to have)

  7. Improve error messages:

    • Replace "You can only specify one strategy to the Trainer. You have passed Trainer(strategy={strategy}) but you have also passed {accelerator} in Trainer(accelerator={accelerator})" with something like "accelerator set through both strategy class and accelerator flag, choose one" (from @ananthsub)
    • "You passed Trainer(accelerator='cpu', precision=16, amp_type='apex') but apex AMP is not supported on CPU." It is worth mentioning that this works with bfloat16 and the native backend. (from @tchaton)
  8. Enable the accelerator.is_available() check

  9. Address all the TODOs in accelerator_connector:

    • Deprecate unused properties
  10. (HIGH PRIORITY) Re-introduce the _init_deterministic method on the AcceleratorConnector and set the value for deterministic.
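
As referenced in item 1, here is a minimal sketch of what moving the IPU precision check into the plugin constructor could look like. This is hedged and illustrative; the actual class name and signature in Lightning may differ:

    from pytorch_lightning.utilities.exceptions import MisconfigurationException


    class IPUPrecisionSketch:
        """Illustrative only: validate the precision flag at plugin construction time
        instead of inside the AcceleratorConnector."""

        def __init__(self, precision: int) -> None:
            if precision not in (16, 32):
                raise MisconfigurationException(
                    f"`Trainer(accelerator='ipu', precision={precision!r})` is not supported."
                )
            self.precision = precision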

Additional context

Improvements and potential improvements:

  • Enums could be deprecated: _StrategyType, _AcceleratorType, _distrib_type, _device_type and distributed_backend are not needed in the new version
  • Strategy registry logic revisit: currently half of the string names are registered in the registry and the other half live in _StrategyType; we could consolidate them (see the sketch after this list)
  • Further lazy initialization of the parallel Strategy classes: parallel devices need to be lazily initialized
  • Revisit flag priorities (part 1), the choosing logic (part 2) and the associated tests
  • Consolidate and revisit device-parsing logic in utilities/devices, the Trainer and the XAccelerator classes
  • Improve tests, increase coverage and remove unnecessary tests
  • Deprecate unused functions from accelerator_connector (kept for now for backward compatibility)
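
As referenced above, one illustrative way to consolidate the registry and avoid every subclass re-implementing register_strategies is to register strategies automatically at subclass-definition time, so cls is always the concrete subclass. This is a generic pattern sketch, not the actual StrategyRegistry code:

    from typing import Dict, Type

    # Illustrative registry pattern using __init_subclass__.
    STRATEGY_REGISTRY: Dict[str, Type["StrategySketch"]] = {}


    class StrategySketch:
        strategy_name = "base"

        def __init_subclass__(cls, **kwargs):
            super().__init_subclass__(**kwargs)
            # `cls` here is the concrete subclass being defined, not the base class.
            STRATEGY_REGISTRY[cls.strategy_name] = cls


    class DDPStrategySketch(StrategySketch):
        strategy_name = "ddp"


    class DDPSpawnStrategySketch(StrategySketch):
        strategy_name = "ddp_spawn"


    assert STRATEGY_REGISTRY["ddp"] is DDPStrategySketch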

If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.

cc @justusschock @awaelchli @akihironitta @rohitgr7 @kaushikb11 @ninginthecloud @carmocca @ananthsub @tchaton

@ananthsub
Contributor

Is this a duplicate of #10422?
Can we consolidate the discussion on one issue to make tracking easier?

@four4fish
Contributor Author

Is this a duplicate of #10422? Can we consolidate the discussion on one issue to make tracking easier?

@ananthsub Some of the details in #10422 are out of date. I have closed that one; we will only track through this one.

@awaelchli
Contributor

@four4fish Do I see correctly that the _init_deterministic method has been completely removed from the connector? The deterministic flag does not get set at all.

It is very confusing that we never had tests for this. There are two tests that call
trainer._accelerator_connector._init_deterministic(False), but they don't fail because trainer is a MagicMock there 🤣

I'm adding this as a follow-up task. cc @krshrimali, who is working on a related PR #11944
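
For reference, a rough sketch of what such an _init_deterministic step might set. This is an assumption about the shape of the logic, and the real method (see #11944) may configure additional state:

    import os

    import torch


    def init_deterministic(deterministic: bool) -> None:
        # Rough sketch: ask PyTorch to use deterministic algorithms where available.
        torch.use_deterministic_algorithms(deterministic)
        if deterministic:
            # cuBLAS requires a workspace config for deterministic behaviour on CUDA >= 10.2.
            os.environ.setdefault("CUBLAS_WORKSPACE_CONFIG", ":4096:8")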

@four4fish
Contributor Author

@awaelchli Oops!! You are totally right! I'm adding it back now.

@awaelchli
Contributor

Thanks @four4fish
Found another bug with the rewrite that breaks Jupyter notebooks.

The check for is_interactive_compatible is no longer working correctly.

The BoringModel here raises an exception:

MisconfigurationException                 Traceback (most recent call last)
<ipython-input-7-ec9775ede022> in <module>()
----> 1 run()

4 frames
/usr/local/lib/python3.7/dist-packages/pytorch_lightning/trainer/connectors/accelerator_connector.py in _lazy_init_strategy(self)
    725         if _IS_INTERACTIVE and self.strategy.strategy_name not in interactive_compatible_strategy:
    726             raise MisconfigurationException(
--> 727                 f"`Trainer(strategy={self.strategy.strategy_name!r})` or"
    728                 f" `Trainer(accelerator={self.strategy.strategy_name!r})` is not compatible with an interactive"
    729                 " environment. Run your code as a script, or choose one of the compatible backends:"

MisconfigurationException: `Trainer(strategy='single_device')` or `Trainer(accelerator='single_device')` is not compatible with an interactive environment. Run your code as a script, or choose one of the compatible backends: dp, ddp_spawn, ddp_sharded_spawn, tpu_spawn. In case you are spawning processes yourself, make sure to include the Trainer creation inside the worker function.

(You need to replace the install cell with ! pip install git+https://github.com/PyTorchLightning/pytorch-lightning.git@master)

Working on a fix here: #12008
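
Related to follow-up item 5 (moving the _IS_INTERACTIVE check to the Strategy), one hypothetical direction is to key the check on a per-strategy attribute rather than a hardcoded list of strategy names, which would avoid cases like 'single_device' being rejected. This is a sketch under that assumption, not the fix implemented in #12008:

    import sys


    def _is_interactive() -> bool:
        # True when running inside an interactive interpreter or notebook kernel.
        return bool(getattr(sys, "ps1", None)) or bool(sys.flags.interactive)


    class StrategyStub:
        # Hypothetical flag each Strategy could expose.
        is_interactive_compatible = True


    def check_interactive_compatibility(strategy: StrategyStub) -> None:
        if _is_interactive() and not strategy.is_interactive_compatible:
            raise RuntimeError(
                f"`{type(strategy).__name__}` is not compatible with an interactive environment."
                " Run your code as a script, or choose a compatible strategy."
            )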

@awaelchli
Contributor

awaelchli commented Feb 22, 2022

Found another bug :)) #12044

@awaelchli awaelchli reopened this Mar 4, 2022
@ananthsub ananthsub reopened this Mar 11, 2022
@carmocca carmocca modified the milestones: 1.6, 1.7 Mar 21, 2022
@carmocca
Contributor

carmocca commented Jul 19, 2022

I'm going to close this issue as most of the work has been done. If anybody has extra items in mind, please open separate smaller issues for them.
