Support custom process group backends #11725
Comments
I've thought about this as well. IMO, the process group backend should be an optional constructor argument on the relevant strategies. This would make it super simple to integrate with libraries like fairring, but still allow Lightning to set a good default depending on the device type being used. e.g.
```python
class DDPStrategy(ParallelStrategy):
    def __init__(..., pg_backend: Optional[str] = None):
        self._pg_backend: Optional[str] = pg_backend
```

```python
# Utility function that can be shared across strategies
def get_default_process_group_backend_for_device(device: torch.device) -> str:
    return "nccl" if device.type == "cuda" else "gloo"
```

and change `init_dist_connection(self.cluster_environment, self.torch_distributed_backend)` to:

```python
self._pg_backend = self._pg_backend or get_default_process_group_backend_for_device(self.root_device)
init_dist_connection(self.cluster_environment, self._pg_backend)
```
Then the end user simply does this:

```python
import fairring

Trainer(strategy=DDPStrategy(pg_backend="fairring"), accelerator="gpu", devices=8)
```
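For context, here is a minimal sketch of how those fragments could fit together. This is not the final Lightning API: the `setup_distributed` method name, the `**kwargs` pass-through, and the import paths are assumptions based on the discussion above.

```python
from typing import Optional

import torch

# Import paths assumed for the Lightning version under discussion; treat as assumptions.
from pytorch_lightning.strategies import ParallelStrategy
from pytorch_lightning.utilities.distributed import init_dist_connection


def get_default_process_group_backend_for_device(device: torch.device) -> str:
    # Shared utility: pick a sensible default backend for the device type.
    return "nccl" if device.type == "cuda" else "gloo"


class DDPStrategy(ParallelStrategy):
    def __init__(self, pg_backend: Optional[str] = None, **kwargs):
        super().__init__(**kwargs)
        # None means "let Lightning pick a default based on the device later".
        self._pg_backend: Optional[str] = pg_backend

    def setup_distributed(self) -> None:  # method name assumed for this sketch
        # Resolve the backend lazily so the default can depend on self.root_device.
        self._pg_backend = self._pg_backend or get_default_process_group_backend_for_device(self.root_device)
        init_dist_connection(self.cluster_environment, self._pg_backend)
```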
I can work on this! (Making process group backend a constructor argument and then integrating with fairring)
From going through the code in prep for #11725, two cleanups stand out (see the sketch below):
- We can fail faster by raising the runtime error first.
- Remove a level of nesting and return earlier if torch distributed is already initialized.
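A minimal sketch of what that refactor could look like. The signature is approximated from this discussion and is not necessarily the exact Lightning API; address/port setup for the process group is elided.

```python
from typing import Optional

import torch


def init_dist_connection(
    cluster_environment,
    torch_distributed_backend: str,
    global_rank: Optional[int] = None,
    world_size: Optional[int] = None,
) -> None:
    # Fail fast: raise before doing any other work if torch.distributed is unavailable.
    if not torch.distributed.is_available():
        raise RuntimeError("torch.distributed is not available. Cannot initialize a distributed process group.")
    # Return early instead of nesting the whole setup under `if not initialized:`.
    if torch.distributed.is_initialized():
        return
    global_rank = global_rank if global_rank is not None else cluster_environment.global_rank()
    world_size = world_size if world_size is not None else cluster_environment.world_size()
    torch.distributed.init_process_group(torch_distributed_backend, rank=global_rank, world_size=world_size)
```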
The previous mechanism with the env var had the advantage of making the code hardware agnostic, i.e., you could set the backend without tying the code to a particular device type. An alternative would be to keep the env variable.
@awaelchli - if there is pushback from users regarding the environment variable, we can always undeprecate it and continue supporting it. At the very least, this logic is now isolated to a single utility function under distributed.py.
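If the env variable were kept alongside the new argument, the resolution order could live in that single utility. A minimal sketch, assuming the deprecated variable is `PL_TORCH_DISTRIBUTED_BACKEND` and that an explicit constructor argument takes precedence; the function name is hypothetical:

```python
import os
from typing import Optional

import torch


def _resolve_process_group_backend(pg_backend: Optional[str], device: torch.device) -> str:
    # Precedence (assumed): explicit constructor argument, then the deprecated
    # env var, then a default based on the device type.
    if pg_backend is not None:
        return pg_backend
    env_backend = os.environ.get("PL_TORCH_DISTRIBUTED_BACKEND")  # env var name assumed
    if env_backend is not None:
        # A deprecation warning could be emitted here if the env var path is kept.
        return env_backend
    return "nccl" if device.type == "cuda" else "gloo"
```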
🚀 Feature
Motivation
See the fairring repository: https://github.com/facebookresearch/fairring
Pitch
Alternatives
Additional context
cc @Borda @awaelchli @rohitgr7 @akihironitta