-
Notifications
You must be signed in to change notification settings - Fork 3.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improved control of device stats callbacks #11796
Comments
@cowwoc @twsl @daniellepintz @ananthsub @carmocca @mauvilsa could I please have your thoughts (I saw you were either involved with #9032 or discussed this in slack) |
LGTM |
IMO, the most important part of #9032 was deprecating I am fine with undeprecating GPUStatsMonitor/XLAStatsMonitor. Building off your proposal, the In general, think the callbacks demonstrate how easy it is to access & extend this information. |
I'm all for pitch 2. |
Hi @EricWiener, thanks for the proposal! First of all, I think we are conflating two things here - the move from Regarding the rest of your proposal, I agree that now that we are adding CPUStats and users may want both CPUStats + another accelerator stats, we need to change something in the design. I am still thinking about which option I like best. One potential downside of Pitch 2 is that technically users could add a statsmonitor for an accelerator that they aren't using, like they could add a GPUStatsMonitor if they are using only CPU, etc, so we would have to handle that. Another downside is that it replicates the list of all the accelerators, i.e. we already have CPU, GPU, TPU, IPU Accelerators, now we will also need to have CPUDeviceStats, GPUDeviceStats, TPUDeviceStats, IPUDeviceStats. Also one question, if we went with Pitch 2 would we deprecate the Another technicality, if we go with Pitch 2, IMO we shouldn't just undeprecate GPU/XLAStatsMonitors, because there was also some unification that went into #9032 to unify the interfaces, which we should keep. |
Couldnt we just add a generic statsmonitor as an additional class that always loggs cpu/sys mem and accelerator data? that way we would have an easy to use default and could still allow advanced users to configure logging the device of choice regardless of the accelerator. cause you might want to use a certain device and log the stats even if it isnt your accelerator |
Sorry for the confusion. I had meant for this to be an example of when finer user-control would be nice (specifying whether to use
We could either raise an error if the device wasn't supported (already done) or log a warning and just ignore the callback. Either way seems fine to me. Right now if the user specifies
It seems like the accelerators should be condensed vs. condensing the device stats monitors.
This would probably make sense.
Good point |
I was working on doing that in #11795, but there wasn't a very nice way to do that and still allow customization for both CPU metrics + accelerator metrics. This would be pitch 1 |
One thing I do not like about the current deprecated I might be more in favor of a single This view is more focused on the user perspective. I have not looked at the code to understand how this would fit or what complications there might be. |
Just added a new pitch 3 based on the above feedback where we keep a single |
I actually prefer code to fail-fast than failing silently. If you're going to go down this path, please log "[INFO] No GPU detected. Disabling GPUStstsMonitor" so the failure is not as silent. |
I also prefer code that fails fast and no silent failures. But just to clarify. What I am saying is that the purpose of a
Lightning already prints which devices are detected and which ones are used, e.g.
If the callback ends up working as I suggested, would there be a need for this callback to show a message that a device is not detected? If a message is shown, maybe better the other way around, like |
@EricWiener a single callback does not necessarily mean no per device type options. How about the following: class DeviceStatsMonitor(Callback):
def __init__(
self,
cpu_stats: bool = True,
gpu_stats: bool = True,
tpu_stats: bool = True,
get_device_stats_gpu_kwargs: Dict[str, Any] = None,
get_device_stats_tpu_kwargs: Dict[str, Any] = None,
):
... When instantiating the callback, one could optionally give in |
That would pretty much be pitch 1 (if I'm understanding you correctly). |
Yes, it is similar to pitch 1. But you added pitch 3 as response to my feedback, which distracts a bit from the core of what I was saying. |
I agree with @mauvilsa, I think we should keep one DeviceStatsMonitor and it should log stats for all devices used. However, I do not think we need this complicated interface:
|
I actually think the best option here is the one proposed in #11253 (comment) and the one you are adding in #11795 |
If we are no longer going to need to let the user choose whether they want to get GPU metrics from torch/
(and not have
|
after some discussion with @carmocca enabling this: DeviceStatsMonitor(
cpu_stats: Optional[Union[bool, Set[str]]] = None,
gpu_stats: Optional[Union[bool, Set[str]]] = None,
tpu_stats: Optional[Union[bool, Set[str]]] = None,
) will let users track other accelerator stats (for eg GPU and TPU), even if they are not using those accelerators for their scripts. thoughts @Lightning-AI/lai-frameworks ? my thoughts:
|
Proposed refactor
Separate device stats monitoring into separate callbacks per-device, but sub-class from
DeviceStatsMonitor
. This preserves the desired change from #9032 to consolidate the interface, but also allows for fine-grained control.Motivation
With #9032, all the accelerators were combined under a single
DeviceStatsMonitor
callback. This consolidated the API, but it also removed fine-grained control. For instance, theGPUStatsMonitor
that is now being deprecated used to provide fine-grained control over thenvidia-smi
stats that were tracked: https://github.com/PyTorchLightning/pytorch-lightning/blob/86b177ebe5427725b35fde1a8808a7b59b8a277a/pytorch_lightning/callbacks/gpu_stats_monitor.py#L87-L95However, the new interface defaults to using torch memory stats (which provide less info than nvidia-smi): https://github.com/PyTorchLightning/pytorch-lightning/blob/86b177ebe5427725b35fde1a8808a7b59b8a277a/pytorch_lightning/accelerators/gpu.py#L73-L75
Regardless of whether GPU stats are changed to default to
nvidia-smi
, the user no longer has control over what metrics are monitored. Additionally, if #11795 is merged, there will be additional CPU stats monitoring + whatever accelerator is used.Pitch 1
If the user was allowed to specify specific stats to monitor, this would require the callback to look like:
This builds on top of the suggestion in #11253 (comment) where the values allowed are:
None
: To know if the user passed a valuebool
: To easily enable/disable a default set of statsSet[str]
: To enable and show this specific set of stats.This design provides no argument validation via type checking/auto-complete.
Pitch 2
Have a common interface via a base class:
For each device, sub-class
DeviceStatsMonitor
and allow for configuration:Add a
CPUStatsMonitor
.If you want to track both CPU stats + another accelerator you can now pass:
Pitch 3
Use a single
DeviceStatsMonitor
with the option to specifycpu_stats=True
and provide sensible default metrics. This will be a friendly generic interface for quickly tracking stats.For other users, they should be able to access
get_device_stats()
from the accelerator class andget_device_stats
should take optional arguments for configuration (i.e.,get_device_stats()
with no arguments should be sufficient, but it should also allow additional optional arguments to be passed that change per-device). This allows for customization of the stats without needing to make each device callback unique and highly customizable.Currently, (in my opinion) it is a pain to make a Callback since you have to override multiple hooks even if you want the same/similar behavior per-hook. I instead propose adding a new
DecoratedCallback
class that derives from theCallback
class that allows you to specify decorators in order to specify which hooks should be called without needing to define a lot of one-line functions. I also think_prefix_metric_keys
should be made a public utility.The user could now do:
The current alternative to this would be:
Or if using a shared function it would be:
By providing utilities to get device metrics easily and making it faster/less LOC to create a Callback,
it becomes less of a pain to migrate away from
DeviceStatsMonitor
when you need to customize.cc @justusschock @awaelchli @akihironitta @rohitgr7 @tchaton @Borda @kaushikb11 @ananthsub @daniellepintz @edward-io @mauvilsa
The text was updated successfully, but these errors were encountered: