
Revisit _move_optimizer_state function for all Strategies #10820

Closed
four4fish opened this issue Nov 29, 2021 · 2 comments · Fixed by #10849
Labels: distributed, optimizer, refactor

Comments


four4fish commented Nov 29, 2021

Proposed refactor

From the comments in #10596: the device argument isn't used anymore.
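
For context, moving an optimizer's state between devices amounts to walking the per-parameter state dicts and transferring every tensor. Below is a minimal sketch of such a helper; the name, signature, and state layout are illustrative assumptions, not the actual Lightning implementation:

```python
import torch
from torch.optim import Optimizer


def _move_optimizer_state(optimizer: Optimizer, device: torch.device) -> None:
    # Illustrative sketch: the per-parameter state is a dict of dicts whose
    # values include tensors (e.g. Adam's exp_avg / exp_avg_sq buffers).
    # Moving the state means transferring each of those tensors to `device`.
    for param_state in optimizer.state.values():
        for key, value in param_state.items():
            if isinstance(value, torch.Tensor):
                param_state[key] = value.to(device)
```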

Motivation

Pitch

Additional context



cc @justusschock @awaelchli @akihironitta

four4fish (Contributor, Author) commented:

@tchaton it seems the device is still in use in GPUAccelerator's teardown function. Instead of moving the optimizer state to root_device, it moves the optimizer state to CPU to avoid leaking GPU memory.
https://github.com/PyTorchLightning/pytorch-lightning/blame/master/pytorch_lightning/accelerators/gpu.py#L78-L80
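
A rough sketch of that teardown behaviour, reusing the illustrative `_move_optimizer_state` helper from the sketch above (the class shape and the `trainer.optimizers` access are assumptions, not the exact code at the linked lines):

```python
import torch


class GPUAccelerator:
    def teardown(self, trainer) -> None:
        # On teardown, move the optimizer state to CPU rather than to the
        # root_device, so GPU memory held by the state buffers is released.
        for optimizer in trainer.optimizers:
            _move_optimizer_state(optimizer, torch.device("cpu"))
```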


tchaton commented Nov 30, 2021

Hey @four4fish, sounds good to me. I believe the code can be shared across strategies, though. We need to revisit the TPU one.
