Is your feature request related to a problem? Please describe.
Currently, Triton does not publish GPU utilization and GPU memory metrics at model-level granularity.
Understandably, this may be difficult to gauge since multiple models can be loaded on a single GPU, and, due to the nature of inference, memory allocation may change dynamically.
However, I'm creating this issue to ask whether any long-term solution is possible. Perhaps a running average of a given model's GPU utilization could be maintained and reported as its average utilization?
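To make the suggestion concrete, here is a minimal sketch of the kind of running average meant here (an exponential moving average). This is an illustration only, not an existing Triton feature; the `sample_utilization` values are made up, standing in for whatever per-model signal Triton could expose:

```python
# Sketch only: Triton does not currently expose a per-model utilization signal.
# The samples fed in below are hypothetical.

class RunningUtilization:
    """Exponential moving average of per-model GPU utilization samples."""

    def __init__(self, alpha: float = 0.1):
        self.alpha = alpha    # smoothing factor: higher = more weight on recent samples
        self.average = None   # no samples seen yet

    def update(self, sample: float) -> float:
        """Fold a new utilization sample (0.0-1.0) into the running average."""
        if self.average is None:
            self.average = sample
        else:
            self.average = self.alpha * sample + (1 - self.alpha) * self.average
        return self.average


# Example usage with made-up samples for a single model:
ema = RunningUtilization(alpha=0.2)
for sample in (0.35, 0.50, 0.42, 0.61):
    print(f"avg utilization: {ema.update(sample):.2f}")
```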
What blockers currently exist to tackling this?
Thank you.
@GuanLuo added per-model GPU memory usage in this PR, which should be available from 23.06 onwards for TensorRT and ONNX Runtime models. This provides estimated memory usage at load time.
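For reference, a hedged sketch of how such a per-model metric could be read from Triton's Prometheus metrics endpoint (exposed on port 8002 by default). The metric name prefix filtered for below is a placeholder assumption; confirm the actual metric name in the Triton metrics documentation for your release:

```python
# Sketch: scrape Triton's Prometheus metrics endpoint and pull out
# per-model GPU memory entries. The metric prefix below is an assumption;
# check the Triton metrics docs for the exact name in your release.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"   # Triton's default metrics port
METRIC_PREFIX = "nv_model_"                      # assumed prefix for per-model metrics


def per_model_memory_lines():
    with urllib.request.urlopen(METRICS_URL) as resp:
        text = resp.read().decode("utf-8")
    for line in text.splitlines():
        # Skip Prometheus comment lines; keep per-model memory metric samples.
        if not line.startswith("#") and line.startswith(METRIC_PREFIX) and "memory" in line:
            yield line


if __name__ == "__main__":
    for line in per_model_memory_lines():
        print(line)
```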
I don't think GPU utilization would be possible, given it is not additive (i.e. a model using 20% GPU utilization in isolation and another using 50% in isolation will not necessarily add up to 70% when running at the same time). I suspect there would be similar issues with trying to get runtime GPU usage while multiple models are potentially running, plus the overhead of querying this information repeatedly. Guan could probably provide more context, given he implemented the per-model GPU memory metrics.