
Support custom process group backends #11725

Closed · tchaton opened this issue Feb 3, 2022 · 4 comments · Fixed by #11745
Assignee: ananthsub
Labels: distributed (Generic distributed-related topic), feature (Is an improvement or enhancement)
Milestone: 1.6

Comments
@tchaton (Contributor) commented Feb 3, 2022

🚀 Feature

Motivation

See the Fairring repository: https://github.com/facebookresearch/fairring

[Two screenshots attached, taken 2022-02-03 at 19:21 and 19:20]
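
For context: Fairring registers itself as a custom `torch.distributed` process group backend when imported, so raw-PyTorch usage looks roughly like the sketch below (the rendezvous values are illustrative, not from the Fairring docs). The ask here is for Lightning to expose the same backend choice instead of hard-coding it.

```python
import torch.distributed as dist

import fairring  # noqa: F401 -- importing registers the "fairring" backend

# Illustrative rendezvous; in real use init_method, rank and world_size
# come from the launcher / cluster environment.
dist.init_process_group(
    backend="fairring",
    init_method="tcp://localhost:29500",
    rank=0,
    world_size=1,
)
```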


cc @Borda @awaelchli @rohitgr7 @akihironitta

@tchaton added the feature (Is an improvement or enhancement) label Feb 3, 2022
@ananthsub (Contributor) commented Feb 3, 2022

I've thought about this as well. IMO, the process group backend should be an optional constructor argument on the relevant strategies. This would make it super simple to integrate with libraries like fairring, but still allow Lightning to set a good default depending on the device type being used.

e.g.

1. On the relevant strategies, add `process_group_backend` (or `pg_backend` as a shorter name) to the constructor:

```python
class DDPStrategy(ParallelStrategy):
    def __init__(..., pg_backend: Optional[str] = None):
        self._pg_backend: Optional[str] = pg_backend
```
2. Add a utility function to `distributed.py` that gets the default process group backend for a given device. This is purely to facilitate code reuse across strategies instead of needing to inherit or extend from `ParallelStrategy`:

```python
# Utility function that can be shared across strategies
def get_default_process_group_backend_for_device(device: torch.device) -> str:
    return "nccl" if device.type == "cuda" else "gloo"
```
3. Inside of `setup_distributed`, change

```python
init_dist_connection(self.cluster_environment, self.torch_distributed_backend)
```

to

```python
self._pg_backend = self._pg_backend or get_default_process_group_backend_for_device(self.root_device)
init_dist_connection(self.cluster_environment, self._pg_backend)
```
4. Deprecate/remove this property from `ParallelStrategy`, along with setting the process group backend via an environment variable. This is super hidden and not documented anywhere: https://github.com/PyTorchLightning/pytorch-lightning/blob/4d72110b51525296eac16454e7b34e2c1b60cb7e/pytorch_lightning/strategies/parallel.py#L101-L106
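
A soft-deprecation path for step 4 could look roughly like this; a minimal sketch assuming Lightning's `rank_zero_deprecation` helper is importable from `pytorch_lightning.utilities`, with the message and fallback logic purely illustrative:

```python
import os

import torch
from pytorch_lightning.utilities import rank_zero_deprecation


class ParallelStrategy:
    root_device = torch.device("cpu")  # placeholder so the sketch runs standalone

    @property
    def torch_distributed_backend(self) -> str:
        # Emit a rank-zero deprecation warning, then keep honoring the old
        # env var so existing workflows continue to work until removal.
        rank_zero_deprecation(
            "`ParallelStrategy.torch_distributed_backend` and the "
            "`PL_TORCH_DISTRIBUTED_BACKEND` env var are deprecated. "
            "Pass the backend to the strategy instead, e.g. `DDPStrategy(pg_backend=...)`."
        )
        default = "nccl" if self.root_device.type == "cuda" else "gloo"
        return os.getenv("PL_TORCH_DISTRIBUTED_BACKEND", default)
```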

Then the end user simply does this:

```python
import fairring

Trainer(strategy=DDPStrategy(pg_backend="fairring"), accelerator="gpu", devices=8)
```
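
And with the argument omitted, behavior stays exactly as today; a sketch of the intended fallback from step 2, continuing the example above:

```python
# No backend passed: at setup time the strategy falls back to the device
# default from get_default_process_group_backend_for_device().
Trainer(strategy=DDPStrategy(), accelerator="gpu", devices=8)  # -> "nccl"
Trainer(strategy=DDPStrategy(), accelerator="cpu", devices=2)  # -> "gloo"
```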

@daniellepintz (Contributor) commented Feb 3, 2022

I can work on this! (Making process group backend a constructor argument and then integrating with fairring)

@daniellepintz self-assigned this Feb 3, 2022
@akihironitta changed the title from "Add support for Fairing" to "Add support for Fairring" Feb 4, 2022
@akihironitta added the distributed (Generic distributed-related topic) label Feb 4, 2022
ananthsub added a commit that referenced this issue Feb 4, 2022
From going through the code in prep for #11725:
- We can fail faster by raising the runtime error first
- Remove a level of nesting and return earlier if torch distributed is already initialized
@daniellepintz removed their assignment Feb 4, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 4, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 7, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 7, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 10, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Feb 16, 2022
@ananthsub added this to the 1.6 milestone Feb 16, 2022
@ananthsub changed the title from "Add support for Fairring" to "Support custom process group backends" Feb 16, 2022
@ananthsub self-assigned this Feb 16, 2022
@ananthsub moved this to In Progress in Frameworks Planning Feb 16, 2022
@carmocca moved this from In Progress to In Review in Frameworks Planning Feb 21, 2022
@carmocca moved this from In Review to In Progress in Frameworks Planning Feb 23, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Mar 5, 2022
ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Mar 5, 2022
@awaelchli (Contributor) commented Mar 10, 2022

The previous mechanism with the env var had the advantage of making the code hardware agnostic: you could fix PL_TORCH_DISTRIBUTED_BACKEND="gloo" and freely choose the devices and hardware through, e.g., the CLI. Now you would need to change the code and instantiate the strategy yourself, or register the custom strategy in the Registry. I hope existing users will not be impacted by this inconvenience.

An alternative would be to keep the env variable.
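
For what it's worth, users who want the env-var workflow back could reconstruct it on top of the constructor argument; a sketch assuming the merged argument name is `process_group_backend`:

```python
import os

from pytorch_lightning import Trainer
from pytorch_lightning.strategies import DDPStrategy

# Read the backend from the old env var if it is set; None falls back to the
# per-device default, so the same script stays hardware agnostic via the env.
backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND")
trainer = Trainer(
    strategy=DDPStrategy(process_group_backend=backend),
    accelerator="gpu",
    devices=8,
)
```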

@ananthsub (Contributor) commented:

@awaelchli - if there is pushback from users regarding the environment variable, we can always undeprecate it and continue supporting it. At the very least, this logic is now isolated to a single utility function under distributed.py.

ananthsub added a commit to ananthsub/pytorch-lightning that referenced this issue Mar 18, 2022
Repository owner moved this from In Progress to Done in Frameworks Planning Mar 18, 2022
Status: Done · 5 participants