🚀 Feature
(as discussed in #7518)
Gather GPU stats using torch.cuda.memory_stats instead of nvidia-smi for GPUStatsMonitor.
Motivation
Some machines do not have nvidia-smi installed, so they are currently unable to gather data with the GPUStatsMonitor callback, which is useful for detecting OOMs and debugging models.
Pitch
For users on PyTorch >= 1.8.0, use torch.cuda.memory_stats to gather memory data instead of invoking the nvidia-smi binary.
Some fields logged by GPUStatsMonitor (fan_speed, temperature) are not available from torch.cuda.memory_stats. We can either 1) fall back to nvidia-smi when a user requests those fields, or 2) remove those fields if they aren't used anywhere and are no longer considered useful. A rough sketch of option 1 is below.
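As a minimal sketch only (not the proposed implementation), the callback could source memory fields from torch.cuda.memory_stats and touch nvidia-smi solely for the fields that API cannot provide. The helper names and the specific memory_stats keys below are illustrative assumptions, not the actual GPUStatsMonitor internals:

```python
# Rough sketch, assuming PyTorch >= 1.8.0; helper names and the chosen
# memory_stats keys are illustrative, not the actual GPUStatsMonitor code.
import shutil
import subprocess

import torch


def gpu_memory_stats(device: int = 0) -> dict:
    """Memory fields from torch.cuda.memory_stats (no nvidia-smi needed)."""
    stats = torch.cuda.memory_stats(device)
    return {
        "memory.allocated (MB)": stats.get("allocated_bytes.all.current", 0) / 2**20,
        "memory.reserved (MB)": stats.get("reserved_bytes.all.current", 0) / 2**20,
        "memory.peak_allocated (MB)": stats.get("allocated_bytes.all.peak", 0) / 2**20,
    }


def gpu_fan_and_temperature(device: int = 0) -> dict:
    """Fields torch.cuda.memory_stats cannot provide; queried only if nvidia-smi exists."""
    if shutil.which("nvidia-smi") is None:
        # Option 2 behaviour: skip these fields on machines without nvidia-smi.
        return {}
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={device}",
            "--query-gpu=fan.speed,temperature.gpu",
            "--format=csv,nounits,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    fan_speed, temperature = (value.strip() for value in out.split(","))
    return {"fan.speed (%)": fan_speed, "temperature.gpu (C)": temperature}
```

One caveat worth documenting if we switch: the caching-allocator numbers from torch.cuda.memory_stats will not match nvidia-smi's memory.used exactly, since the CUDA context and memory held outside the allocator are not counted.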
Alternatives
Additional context
I think checking the PyTorch version, preferring torch.cuda.memory_stats, and falling back to nvidia-smi if available makes sense.
Closely related is log_gpu_memory as an argument on the Trainer constructor. If set, the Trainer instantiates a GPUStatsMonitorCallback. IMO, we ought to deprecate this parameter from the Trainer constructor, since min_max is an implementation choice tied to nvidia-smi and bears no relation to torch.cuda.memory_stats. But we can discuss this in a separate issue; it ties into the discussion here: #8478 (comment)
Thanks for catching that - I'm thinking we can keep get_memory_profile as is, with the mode argument, as long as get_gpu_memory_map returns the same output.
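To make that compatibility concern concrete, here is a hedged sketch of keeping get_gpu_memory_map's output shape stable while swapping the data source. The "gpu_id: <i>/memory.used (MB)" key format is an assumption about the current nvidia-smi based output, and the reserved-bytes figure is only an approximation of nvidia-smi's memory.used:

```python
# Hedged sketch only: the key format mirrors what the nvidia-smi based
# implementation is assumed to return, and reserved bytes from the caching
# allocator are not identical to nvidia-smi's device-level memory.used.
import torch


def get_gpu_memory_map() -> dict:
    """Per-GPU memory in MiB, sourced from torch.cuda.memory_stats."""
    memory_map = {}
    for index in range(torch.cuda.device_count()):
        stats = torch.cuda.memory_stats(index)
        used_mib = stats.get("reserved_bytes.all.current", 0) / 2**20
        memory_map[f"gpu_id: {index}/memory.used (MB)"] = used_mib
    return memory_map
```

If the keys and units stay identical, callers of get_memory_profile (including the min_max mode) should not notice the backend change.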