🚀 Feature
(as discussed in #7518)
Gather GPU stats using torch.cuda.memory_stats instead of nvidia-smi for GPUStatsMonitor.
Motivation
Some machines do not have nvidia-smi installed, so they are currently unable to gather data with the GPUStatsMonitor callback, which is useful for detecting OOMs and debugging models.
Pitch
For users on PyTorch >= 1.8.0, use torch.cuda.memory_stats to gather memory data instead of invoking the nvidia-smi binary.
Some fields logged by GPUStatsMonitor (fan_speed, temperature) are not available from torch.cuda.memory_stats. We can either 1) fall back to nvidia-smi when a user requests those fields, or 2) remove those fields if they aren't used anywhere and are no longer considered useful. A rough sketch of option 1 is below.
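As a minimal sketch only (not the proposed implementation), the callback could source memory fields from torch.cuda.memory_stats and touch nvidia-smi solely for the fields that API cannot provide. The helper names and the specific memory_stats keys below are illustrative assumptions, not the actual GPUStatsMonitor internals:

```python
# Rough sketch, assuming PyTorch >= 1.8.0; helper names and the chosen
# memory_stats keys are illustrative, not the actual GPUStatsMonitor code.
import shutil
import subprocess

import torch


def gpu_memory_stats(device: int = 0) -> dict:
    """Memory fields from torch.cuda.memory_stats (no nvidia-smi needed)."""
    stats = torch.cuda.memory_stats(device)
    return {
        "memory.allocated (MB)": stats.get("allocated_bytes.all.current", 0) / 2**20,
        "memory.reserved (MB)": stats.get("reserved_bytes.all.current", 0) / 2**20,
        "memory.peak_allocated (MB)": stats.get("allocated_bytes.all.peak", 0) / 2**20,
    }


def gpu_fan_and_temperature(device: int = 0) -> dict:
    """Fields torch.cuda.memory_stats cannot provide; queried only if nvidia-smi exists."""
    if shutil.which("nvidia-smi") is None:
        # Option 2 behaviour: skip these fields on machines without nvidia-smi.
        return {}
    out = subprocess.run(
        [
            "nvidia-smi",
            f"--id={device}",
            "--query-gpu=fan.speed,temperature.gpu",
            "--format=csv,nounits,noheader",
        ],
        capture_output=True,
        text=True,
        check=True,
    ).stdout.strip()
    fan_speed, temperature = (value.strip() for value in out.split(","))
    return {"fan.speed (%)": fan_speed, "temperature.gpu (C)": temperature}
```

One caveat worth documenting if we switch: the caching-allocator numbers from torch.cuda.memory_stats will not match nvidia-smi's memory.used exactly, since the CUDA context and memory held outside the allocator are not counted.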
Alternatives
Additional context
I think checking the PyTorch version, preferring torch.cuda.memory_stats, and falling back to nvidia-smi if available makes sense.
Closely related is log_gpu_memory as an argument on the Trainer constructor. If set, the Trainer instantiates a GPUStatsMonitorCallback. IMO, we ought to deprecate this parameter from the Trainer constructor, since min_max is an implementation choice tied to nvidia-smi and bears no relation to torch.cuda.memory_stats. But we can discuss this in a separate issue; it ties into the discussion here: #8478 (comment)
Thanks for catching that - I'm thinking we can keep get_memory_profile as is, with the mode argument, as long as get_gpu_memory_map returns the same output.
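To make that compatibility concern concrete, here is a hedged sketch of keeping get_gpu_memory_map's output shape stable while swapping the data source. The "gpu_id: <i>/memory.used (MB)" key format is an assumption about the current nvidia-smi based output, and the reserved-bytes figure is only an approximation of nvidia-smi's memory.used:

```python
# Hedged sketch only: the key format mirrors what the nvidia-smi based
# implementation is assumed to return, and reserved bytes from the caching
# allocator are not identical to nvidia-smi's device-level memory.used.
import torch


def get_gpu_memory_map() -> dict:
    """Per-GPU memory in MiB, sourced from torch.cuda.memory_stats."""
    memory_map = {}
    for index in range(torch.cuda.device_count()):
        stats = torch.cuda.memory_stats(index)
        used_mib = stats.get("reserved_bytes.all.current", 0) / 2**20
        memory_map[f"gpu_id: {index}/memory.used (MB)"] = used_mib
    return memory_map
```

If the keys and units stay identical, callers of get_memory_profile (including the min_max mode) should not notice the backend change.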