
Support custom process group backends #11725

Closed · tchaton opened this issue Feb 3, 2022 · 4 comments · Fixed by #11745
Assignee: ananthsub
Labels: distributed (Generic distributed-related topic), feature (Is an improvement or enhancement)
Milestone: 1.6

Comments
@tchaton (Contributor) commented Feb 3, 2022

🚀 Feature

Motivation

See the Fairring repository: https://github.com/facebookresearch/fairring

[Two screenshots attached, taken 2022-02-03 at 19:21 and 19:20]
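
For context: Fairring registers itself as a custom `torch.distributed` process group backend when imported, so raw-PyTorch usage looks roughly like the sketch below (the rendezvous values are illustrative, not from the Fairring docs). The ask here is for Lightning to expose the same backend choice instead of hard-coding it.

```python
import torch.distributed as dist

import fairring  # noqa: F401 -- importing registers the "fairring" backend

# Illustrative rendezvous; in real use init_method, rank and world_size
# come from the launcher / cluster environment.
dist.init_process_group(
    backend="fairring",
    init_method="tcp://localhost:29500",
    rank=0,
    world_size=1,
)
```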


cc @Borda @awaelchli @rohitgr7 @akihironitta

@tchaton added the feature (Is an improvement or enhancement) label Feb 3, 2022
@ananthsub (Contributor) commented Feb 3, 2022

I've thought about this as well. IMO, the process group backend should be an optional constructor argument on the relevant strategies. This would make it super simple to integrate with libraries like fairring, but still allow Lightning to set a good default depending on the device type being used.

e.g.

1. On the relevant strategies, add `process_group_backend` (or `pg_backend` as a shorter name) to the constructor:

```python
class DDPStrategy(ParallelStrategy):
    def __init__(..., pg_backend: Optional[str] = None):
        self._pg_backend: Optional[str] = pg_backend
```
2. Add a utility function to `distributed.py` that gets the default process group backend for a given device. This is purely to facilitate code reuse across strategies instead of needing to inherit or extend from `ParallelStrategy`:

```python
# Utility function that can be shared across strategies
def get_default_process_group_backend_for_device(device: torch.device) -> str:
    return "nccl" if device.type == "cuda" else "gloo"
```
3. Inside of `setup_distributed`, change

```python
init_dist_connection(self.cluster_environment, self.torch_distributed_backend)
```

to

```python
self._pg_backend = self._pg_backend or get_default_process_group_backend_for_device(self.root_device)
init_dist_connection(self.cluster_environment, self._pg_backend)
```
4. Deprecate/remove this property from `ParallelStrategy`, along with setting the process group backend via an environment variable. This is super hidden and not documented anywhere: https://github.com/PyTorchLightning/pytorch-lightning/blob/4d72110b51525296eac16454e7b34e2c1b60cb7e/pytorch_lightning/strategies/parallel.py#L101-L106
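
A soft-deprecation path for step 4 could look roughly like this; a minimal sketch assuming Lightning's `rank_zero_deprecation` helper is importable from `pytorch_lightning.utilities`, with the message and fallback logic purely illustrative:

```python
import os

import torch
from pytorch_lightning.utilities import rank_zero_deprecation


class ParallelStrategy:
    root_device = torch.device("cpu")  # placeholder so the sketch runs standalone

    @property
    def torch_distributed_backend(self) -> str:
        # Emit a rank-zero deprecation warning, then keep honoring the old
        # env var so existing workflows continue to work until removal.
        rank_zero_deprecation(
            "`ParallelStrategy.torch_distributed_backend` and the "
            "`PL_TORCH_DISTRIBUTED_BACKEND` env var are deprecated. "
            "Pass the backend to the strategy instead, e.g. `DDPStrategy(pg_backend=...)`."
        )
        default = "nccl" if self.root_device.type == "cuda" else "gloo"
        return os.getenv("PL_TORCH_DISTRIBUTED_BACKEND", default)
```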

Then the end user simply does this:

```python
import fairring

Trainer(strategy=DDPStrategy(pg_backend="fairring"), accelerator="gpu", devices=8)
```
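
And with the argument omitted, behavior stays exactly as today; a sketch of the intended fallback from step 2, continuing the example above:

```python
# No backend passed: at setup time the strategy falls back to the device
# default from get_default_process_group_backend_for_device().
Trainer(strategy=DDPStrategy(), accelerator="gpu", devices=8)  # -> "nccl"
Trainer(strategy=DDPStrategy(), accelerator="cpu", devices=2)  # -> "gloo"
```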

@daniellepintz (Contributor) commented Feb 3, 2022

I can work on this! (Making process group backend a constructor argument and then integrating with fairring)

@daniellepintz self-assigned this Feb 3, 2022
@akihironitta changed the title from "Add support for Fairing" to "Add support for Fairring" Feb 4, 2022
@akihironitta added the distributed (Generic distributed-related topic) label Feb 4, 2022
ananthsub added a commit that referenced this issue Feb 4, 2022
From going through the code in prep for #11725:
- We can fail faster by raising the runtime error first
- Remove a level of nesting and return earlier if torch distributed is already initialized
@daniellepintz removed their assignment Feb 4, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 4, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 7, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 7, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 10, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 16, 2022
@ananthsub added this to the 1.6 milestone Feb 16, 2022
@ananthsub changed the title from "Add support for Fairring" to "Support custom process group backends" Feb 16, 2022
@ananthsub self-assigned this Feb 16, 2022
@ananthsub moved this to In Progress in Frameworks Planning Feb 16, 2022
@carmocca moved this from In Progress to In Review in Frameworks Planning Feb 21, 2022
@carmocca moved this from In Review to In Progress in Frameworks Planning Feb 23, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Mar 5, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Mar 5, 2022
@awaelchli (Contributor) commented Mar 10, 2022

The previous mechanism with the env var had the advantage of making the code hardware agnostic: you could fix PL_TORCH_DISTRIBUTED_BACKEND="gloo" and freely choose the devices and hardware through, e.g., the CLI. Now you would need to change the code and instantiate the strategy yourself, or register the custom strategy in the Registry. I hope existing users will not be impacted by this inconvenience.

An alternative would be to keep the env variable.
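
For what it's worth, users who want the env-var workflow back could reconstruct it on top of the constructor argument; a sketch assuming the merged argument name is `process_group_backend`:

```python
import os

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Read the backend from the old env var if it is set; None falls back to the
# per-device default, so the same script stays hardware agnostic via the env.
backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND")
trainer = Trainer(
    strategy=DDPStrategy(process_group_backend=backend),
    accelerator="gpu",
    devices=8,
)
```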

@ananthsub (Contributor) commented:

@awaelchli - if there is pushback from users regarding the environment variable, we can always undeprecate it and continue supporting it. At the very least, this logic is now isolated to a single utility function under distributed.py.

ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Mar 18, 2022
Repository owner moved this from In Progress to Done in Frameworks Planning Mar 18, 2022
Status: Done · 5 participants