
Identify model metrics that will be part of the PoC. #13

Closed
Tracked by #6
devguyio opened this issue Aug 11, 2023 · 1 comment · Fixed by #31
Assignees
Labels
kind/documentation (Improvements or additions to documentation), priority/high (Important issue that needs to be resolved asap. Releases should not have too many o…)

Comments

@devguyio
Contributor

devguyio commented Aug 11, 2023

User Story

A/C (Acceptance Criteria)

  • Needed metrics for the PoC are documented.
@grdryn
Member

grdryn commented Aug 17, 2023

Apparently, the UI currently displays the following kinds of metrics, where possible:

  • Number of inference requests per unit time
  • Average inference response time
  • Two measures of model bias (SPD and DIR)

The MLFlow serving container that we're going to use for the PoC currently exposes enough metrics to cover the first two, but it seems like TrustyAI is the thing that would provide the bias metrics, and we don't currently plan to include that in the PoC, as far as I know.

So, from that MLFlow container image, the following are the metrics exposed (shown here without labels to make it more concise, but a more full output can be seen here in this gist):

# HELP parallel_request_queue counter of request queue size for workers
# TYPE parallel_request_queue histogram
parallel_request_queue_sum
parallel_request_queue_bucket
parallel_request_queue_count

# HELP rest_server_request_duration_seconds HTTP request duration, in seconds
# TYPE rest_server_request_duration_seconds histogram
rest_server_request_duration_seconds_sum
rest_server_request_duration_seconds_bucket
rest_server_request_duration_seconds_count

# HELP rest_server_requests_in_progress Total HTTP requests currently in progress
# TYPE rest_server_requests_in_progress gauge
rest_server_requests_in_progress

# HELP rest_server_requests_total Total HTTP requests
# TYPE rest_server_requests_total counter
rest_server_requests_total

Given that there are not that many, and the cardinality should remain pretty stable (there are only 3 endpoints that I know of: /ping, /version, and /invocations), I suggest we just scrape and keep all of the metrics for now, unless we have a good reason not to.
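For reference, the first two UI metrics above could be derived from these exposed metrics with PromQL queries along these lines (a sketch, not tested against the PoC setup; the `endpoint` label selector is an assumption based on the labeled output in the gist):

```promql
# Inference requests per second, averaged over a 5m window
rate(rest_server_requests_total{endpoint="/invocations"}[5m])

# Average inference response time in seconds, from the duration histogram
rate(rest_server_request_duration_seconds_sum[5m])
  / rate(rest_server_request_duration_seconds_count[5m])
```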

In the gist I linked above (here again), I also included an example of the OpenVINO metrics for comparison.

BTW, to enable the REST port and enable the metrics to be served over it, something akin to the following run command is needed:

podman run --rm -ti -p 9000:9000 -p 9001:9001 ovms_face_detection:latest --metrics_enable --rest_port 9001
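Once the REST port is up, the output of `/metrics` can be sanity-checked by eye, or with a small script. A minimal sketch (not part of the PoC code) that pulls the metric names out of Prometheus text exposition format; the sample text is inlined here rather than fetched live, and in practice you'd fetch it with something like `requests.get("http://localhost:9001/metrics").text`:

```python
# Sample exposition text, abbreviated from the metrics listed above.
SAMPLE = """\
# HELP rest_server_requests_total Total HTTP requests
# TYPE rest_server_requests_total counter
rest_server_requests_total{endpoint="/invocations"} 42
# HELP rest_server_requests_in_progress Total HTTP requests currently in progress
# TYPE rest_server_requests_in_progress gauge
rest_server_requests_in_progress 0
"""

def metric_names(exposition_text):
    """Return the set of metric names declared via '# TYPE' lines."""
    names = set()
    for line in exposition_text.splitlines():
        if line.startswith("# TYPE "):
            # Line format: "# TYPE <name> <type>"
            names.add(line.split()[2])
    return names

print(sorted(metric_names(SAMPLE)))
# ['rest_server_requests_in_progress', 'rest_server_requests_total']
```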
