
Fetch GPU stats using torch.cuda.memory_stats #8780

Closed

edward-io opened this issue Aug 6, 2021 · 4 comments

@edward-io (Contributor)

🚀 Feature

(as discussed in #7518)

Gather GPU stats for the GPUStatsMonitor callback using torch.cuda.memory_stats instead of nvidia-smi.

Motivation

Some machines do not have nvidia-smi installed, so users on those machines currently cannot gather data with the GPUStatsMonitor callback, which is useful for detecting OOMs and for debugging models.

Pitch

For users on PyTorch >= 1.8.0, gather memory data with torch.cuda.memory_stats instead of invoking the nvidia-smi binary (see the sketch below).

Some fields logged by GPUStatsMonitor (fan_speed, temperature) are not available from torch.cuda.memory_stats. We could either 1) fall back to nvidia-smi when a user requests those fields, or 2) remove those fields if they aren't used anywhere and are no longer considered useful.
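
A minimal sketch of what the torch.cuda.memory_stats-based path could look like; the function name, the stat key choice, and the MB unit are illustrative, not a proposed Lightning API:

```python
import torch


def gpu_memory_map() -> dict:
    """Used memory in MB per visible GPU, keyed by device index."""
    return {
        # "allocated_bytes.all.current" counts bytes currently allocated by
        # PyTorch's caching allocator on this device.
        index: torch.cuda.memory_stats(index)["allocated_bytes.all.current"] / 2**20
        for index in range(torch.cuda.device_count())
    }
```

One caveat: torch.cuda.memory_stats only sees memory managed by the caching allocator in the current process, whereas nvidia-smi's memory.used is device-wide across all processes.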

Alternatives

Additional context


If you enjoy Lightning, check out our other projects! ⚡

  • Metrics: Machine learning metrics for distributed, scalable PyTorch applications.

  • Flash: The fastest way to get a Lightning baseline! A collection of tasks for fast prototyping, baselining, fine-tuning, and solving problems with deep learning.

  • Bolts: Pretrained SOTA deep learning models, callbacks, and more for research and production with PyTorch Lightning and PyTorch.

  • Lightning Transformers: Flexible interface for high-performance research using SOTA Transformers, leveraging PyTorch Lightning, Transformers, and Hydra.

@edward-io added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Aug 6, 2021
@ananthsub added this to the v1.5 milestone on Aug 6, 2021
@ananthsub (Contributor)

This will be great for expanding memory-stats availability!
@edward-io, what do you propose for the utilities here, especially the mode argument? https://github.com/PyTorchLightning/pytorch-lightning/blob/e54180363663b690367f0051484650fcd12e5de2/pytorch_lightning/utilities/memory.py#L102-L157

I think checking the PyTorch version, preferring torch.cuda.memory_stats, and falling back to nvidia-smi where available makes sense.
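
A rough sketch of that gate, assuming a packaging-based version check and a simplified nvidia-smi query (the helper name and query format are illustrative):

```python
import shutil
import subprocess

import torch
from packaging.version import Version


def gpu_memory_used_mb(device: int = 0) -> float:
    """Used GPU memory in MB, preferring torch.cuda over nvidia-smi."""
    if Version(torch.__version__.split("+")[0]) >= Version("1.8.0"):
        # No external binary needed: read the caching-allocator stats.
        stats = torch.cuda.memory_stats(device)
        return stats["allocated_bytes.all.current"] / 2**20
    if shutil.which("nvidia-smi") is not None:
        # Fall back to the binary, roughly what GPUStatsMonitor does today.
        out = subprocess.run(
            [
                "nvidia-smi",
                f"--id={device}",
                "--query-gpu=memory.used",
                "--format=csv,noheader,nounits",
            ],
            capture_output=True,
            text=True,
            check=True,
        )
        return float(out.stdout.strip())
    raise RuntimeError("Neither torch >= 1.8.0 nor nvidia-smi is available.")
```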

Closely related is the log_gpu_memory argument on the Trainer constructor: if set, the Trainer instantiates a GPUStatsMonitor callback. IMO we ought to deprecate this parameter, since its min_max mode is an implementation choice tied to nvidia-smi and bears no relation to torch.cuda.memory_stats. But we can discuss that in a separate issue; it's closely tied to the discussion here: #8478 (comment)

@edward-io (Contributor, Author)

Thanks for catching that. I'm thinking we can keep get_memory_profile as-is, with the mode argument, as long as get_gpu_memory_map returns the same output (rough sketch below).
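
Something like this, as a hypothetical sketch: get_memory_profile keeps its mode argument, but the data underneath comes from torch.cuda.memory_stats (reusing the gpu_memory_map sketch above; the min_max key names mirror the existing utility but are illustrative here):

```python
def get_memory_profile(mode: str) -> dict:
    """Same interface as today; only the data source changes."""
    memory_map = gpu_memory_map()  # per-GPU used memory in MB, sketched above
    if mode == "min_max":
        # Summarize to the least- and most-loaded GPU, as the current
        # min_max mode does.
        return {
            "min_gpu_mem": min(memory_map.values()),
            "max_gpu_mem": max(memory_map.values()),
        }
    return memory_map
```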

@daniellepintz (Contributor)

@edward-io have you already started working on this (or do you plan to), or should I go ahead and include it in my PR for #9032?

@kaushikb11 self-assigned this on Sep 30, 2021
@carmocca (Contributor) commented Nov 3, 2021

Done in #9586
