Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove AccleratorConnector.devices and Deprecate Trainer.devices in favor of Trainer.device_ids and Trainer.num_devices #12126

Closed
DuYicong515 opened this issue Feb 26, 2022 · 6 comments
Labels

Comments

@DuYicong515
Copy link
Contributor

DuYicong515 commented Feb 26, 2022

Proposed refactor

Remove
https://github.com/PyTorchLightning/pytorch-lightning/blob/7e2f9fbad555242b0ceb2a24e5e4c004f0701bae/pytorch_lightning/trainer/connectors/accelerator_connector.py#L788-L794

Introduce Trainer.device_ids which returns a list of device indexes, and Trainer.num_devices which returns len(Trainer.device_ids).

Deprecate Trainer.devices in favor of Trainer.device_ids and Trainer.num_devices
https://github.com/PyTorchLightning/pytorch-lightning/blob/7e2f9fbad555242b0ceb2a24e5e4c004f0701bae/pytorch_lightning/trainer/connectors/accelerator_connector.py#L788-L794

Motivation

Accelerator.devices was not used within PyTorchLightning other than Trainer.devices, also its implementation looks more like num_devices. Since Accelerator.devices is not meant to be a public property, we can remove that in favor of using Trainer.num_devices externally.

Introduce Trainer.device_ids that returns list of device indexes to deprecate Trainer.devices. It could be used to get the device indexes, and also list of devices combing indexes + Trainer.acclerator.

Currently Trainer.devices wasn't used within PytorchLightning other than tests, and its name is quite confusing -- it returns number of devices instead of list of devices/or device indexes.

Trainer.num_devices which is derived from Trainer.device_ids can also help with several Trainer properties migration since we plan to remove bunch of unused properties in AcceleratorConnector (part of #11449) including num_processes, tpu_cores , ipus and num_gpus .

Pitch

@property
def device_ids(self) -> List[int]:
    if isinstance(self.strategy, ParallelStrategy):
         return [torch._utils._get_device_index(device) for device in self.strategy.parallel_devices]
    elif isinstance(self.strategy, SingleDeviceStrategy):
          return [torch._utils._get_device_index(self.strategy.root_device)]
    return []
    
@property
def num_devices(self) -> List[int]:
    return len(device_ids)
    

cc @justusschock @awaelchli @rohitgr7 @four4fish


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Lite: enables pure PyTorch users to scale their existing code on any kind of device while retaining full control over their own loops and optimization logic.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA Deep Learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers leveraging Pytorch Lightning, Transformers, and Hydra.

@carmocca
Copy link
Contributor

carmocca commented Mar 1, 2022

I think Trainer.devices should stay undeprecated at least as an attribute that returns what was passed with Trainer(devices=...)

@DuYicong515
Copy link
Contributor Author

I'm okay with keeping Trainer.devices API, just the name was a bit confusing to decide whatever should be returned.
Not sure which ones are better for its return value:

1/ whatever gets passed in Trainer(devices=...)
2/ whatever gets passed in Trainer(devices=...) combined with whatever gets passed in num_processes, gpus, tpu_cores, ipus since they currently overwrite the _devices_flag on what are the real devices used in trainer.
3/ just keep as it is, the num_devices

I feel given it's an parameter passed to trainer, but might not be what is really used as devices in trainer, this API was not too clear on what is the expectations on its return.

If we decide to keep, and agree with what to return, I am okay to keep it.

Thoughts? @carmocca, @four4fish, @ananthsub

@DuYicong515
Copy link
Contributor Author

DuYicong515 commented Mar 1, 2022

I plan to use the newly introduced device_ids and num_devices to deprecate a couple of other APIs on Trainer. Since Trainer.devices is discussed, would also love to bring it to the table that whether we want to deprecate/keep the other properties? Some of them are also used in the constructor(ipus, tpu_cores, num_processes) but I see couple of PRs to prefer accelerator+devices instead.

https://github.com/PyTorchLightning/pytorch-lightning/blob/d4d197070fc2c6c04d460bbfb8b1b9d3a2ebc944/pytorch_lightning/trainer/trainer.py#L2029-L2031

https://github.com/PyTorchLightning/pytorch-lightning/blob/d4d197070fc2c6c04d460bbfb8b1b9d3a2ebc944/pytorch_lightning/trainer/trainer.py#L2037-L2047

There's another related issue #11624

@four4fish
Copy link
Contributor

+1, I think the trainer device properties ending stage for 1.6 release should be :

  1. num_devices
  2. devices_ids
  3. num_nodes

@DuYicong515 Do you mind create a issue to state this and implementation order? It will be easier to get feedback and bring everyone to the same page.

@DuYicong515
Copy link
Contributor Author

@DuYicong515 Do you mind create a issue to state this and implementation order? It will be easier to get feedback and bring everyone to the same page.

Filed #12171

@DuYicong515
Copy link
Contributor Author

Let me close this one since it's covered in #12171

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

3 participants