
[Metrics] | Get per-model GPU Utilization and Memory metrics #6148

Closed
nikhil-sk opened this issue Aug 4, 2023 · 2 comments
Labels
question Further information is requested

Comments

@nikhil-sk
Contributor

Is your feature request related to a problem? Please describe.

  1. Currently, Triton does not publish GPU utilization and GPU memory metrics at a model-level granularity.
  2. Understandably, this may be difficult to gauge because multiple models can be loaded on a single GPU, and because of the nature of inference, memory allocation may change dynamically.
  3. However, I'm creating this issue to check whether any long-term solution is possible. Perhaps it is possible to maintain a running average of the GPU utilization of a given model and report that as the average utilization? (See the sketch after this list.)
  4. What blockers currently exist to tackling this?
    Thank you.
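
For illustration only, here is a minimal sketch of the running-average idea, assuming the pynvml package (this is not Triton code). It samples device-level utilization and keeps an incremental mean; attributing that number to a single model is exactly the open problem described above.

```python
# Rough sketch: sample device-level GPU utilization with NVML and keep a
# running average. This only shows the averaging part; per-model attribution
# is the unsolved piece discussed in this issue.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0

samples = 0
running_avg = 0.0
try:
    while samples < 60:  # sample once per second for ~60 seconds
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu  # percent
        samples += 1
        running_avg += (util - running_avg) / samples  # incremental mean
        time.sleep(1.0)
finally:
    pynvml.nvmlShutdown()

print(f"average GPU utilization over window: {running_avg:.1f}%")
```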
@dyastremsky dyastremsky added the question Further information is requested label Aug 7, 2023
@dyastremsky
Contributor

dyastremsky commented Aug 7, 2023

@GuanLuo added per-model GPU memory usage metrics in this PR; they should be available from the 23.06 release onwards for TensorRT and ONNX Runtime models. These report estimated memory usage at load time.
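
For reference, a minimal sketch of how one might inspect whichever GPU and memory metrics a given release exposes, assuming a local server with metrics enabled on the default port 8002. The exact metric names are release-dependent, so the filter strings below are only assumptions; check your own /metrics output.

```python
# Sketch: scrape Triton's Prometheus-format metrics endpoint and print
# GPU/memory-related lines. Metric names vary by release.
import urllib.request

METRICS_URL = "http://localhost:8002/metrics"  # default Triton metrics port

with urllib.request.urlopen(METRICS_URL) as resp:
    text = resp.read().decode("utf-8")

for line in text.splitlines():
    if line.startswith("#"):
        continue  # skip HELP/TYPE comment lines
    if "nv_gpu" in line or "memory" in line:
        print(line)
```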

I don't think GPU utilization would be possible, given that it is not additive (i.e., a model using 20% GPU utilization in isolation and another using 50% in isolation will not necessarily add up to 70% when running at the same time). I suspect there would be similar issues with trying to get runtime GPU usage while multiple models are potentially running, plus the overhead of querying this information repeatedly. Guan could probably provide more context, given that he implemented the per-model GPU memory metrics.
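As a rough illustration of why runtime attribution is hard (again, not Triton code): NVML reports GPU memory per process, and all models served by one tritonserver process show up under a single PID. A minimal sketch, assuming the pynvml package:

```python
# Sketch of what NVML exposes at runtime: per-process GPU memory only.
# All models hosted by one tritonserver process share a single PID, so this
# cannot be split per model.
import pynvml

pynvml.nvmlInit()
try:
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    for proc in pynvml.nvmlDeviceGetComputeRunningProcesses(handle):
        # usedGpuMemory is the per-process total, not a per-model figure
        print(f"pid={proc.pid} used_gpu_memory_bytes={proc.usedGpuMemory}")
finally:
    pynvml.nvmlShutdown()
```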

@krishung5
Contributor

Closing due to inactivity. Please let us know if you would like to reopen the issue for follow-up.
