feat: added performance metric grpahs config for nvidia nim #320

Merged

Contributor

@TomerFi TomerFi commented Nov 28, 2024

Description

Added metrics graphs configuration for NVIDIA NIM runtimes, including logic for identifying said runtimes:

| Graph | Query |
|---|---|
| Requests per 5 minutes | Number of successful incoming requests |
| | Number of failed incoming requests |
| Average response time (ms) | Average inference latency (not included) |
| | Average e2e latency |
| CPU utilization % | CPU usage |
| Memory utilization % | Memory usage |

Currently, NIM runtimes do not report inference latency (see here). Hence, the Average inference latency query is NOT included in this PR.

This PR includes:

  • Modifying the Template for NIM's ServingRuntimes:
    • Adding a runtime metadata annotation for identification.
    • Adding runtime spec annotations for the Istio Prometheus metrics merge.
  • Adding a metrics JSON object encapsulating the NVIDIA NIM queries.
  • Modifying the metrics JSON selection process to return the new NVIDIA NIM object for runtimes annotated accordingly.
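
The identification and selection steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the annotation key `example.opendatahub.io/nvidia-nim`, the function names, and the returned labels are all hypothetical. The `prometheus.io/*` keys are the conventional scrape annotations that Istio's metrics merging reads from the pod; the port and path values passed in are placeholders.

```go
package main

import "fmt"

// nimRuntimeAnnotation is a hypothetical key; the real annotation used to
// identify NIM runtimes is not shown in this conversation.
const nimRuntimeAnnotation = "example.opendatahub.io/nvidia-nim"

// isNIMRuntime reports whether a ServingRuntime's metadata annotations mark it
// as an NVIDIA NIM runtime. Reading from a nil map is safe in Go.
func isNIMRuntime(annotations map[string]string) bool {
	return annotations[nimRuntimeAnnotation] == "true"
}

// selectMetricsConfig returns a label for the metrics object to use. A real
// implementation would return the JSON payload itself.
func selectMetricsConfig(annotations map[string]string) string {
	if isNIMRuntime(annotations) {
		return "nim"
	}
	return "default"
}

// mergeAnnotations builds the pod-level annotations that tell the Istio
// sidecar where the application exposes its own Prometheus metrics.
func mergeAnnotations(port, path string) map[string]string {
	return map[string]string{
		"prometheus.io/scrape": "true",
		"prometheus.io/port":   port,
		"prometheus.io/path":   path,
	}
}

func main() {
	runtime := map[string]string{nimRuntimeAnnotation: "true"}
	fmt.Println(selectMetricsConfig(runtime))
	fmt.Println(selectMetricsConfig(nil))
	// "8000" and "/metrics" are placeholder values for this sketch.
	fmt.Println(mergeAnnotations("8000", "/metrics"))
}
```

Keying the selection on an annotation rather than on the runtime name keeps the check stable if the template or its display name changes.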

Jira: NVPE-30.

How Has This Been Tested?

This work was tested against an OpenShift cluster (dev04):

  • I deployed NIM runtime.
  • I executed a couple of requests against the runtime.
  • I connected to the ISTIO sidecar and verified the metrics merge.
  • I opened the related graphs page and verified them (see attached snapshot).
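
The sidecar check in the steps above was done by hand. A small helper like the one below could automate the "is this metric present in the merged output" part. It is a sketch only: it assumes the merged metrics have already been fetched as Prometheus text exposition format (the Istio agent typically serves them on port 15020 at `/stats/prometheus`, which is an assumption not stated in this PR), and the metric name in `main` is a placeholder.

```go
package main

import (
	"bufio"
	"fmt"
	"strings"
)

// hasMetric reports whether Prometheus text-format output contains at least
// one sample line for the exact metric name given.
func hasMetric(exposition, name string) bool {
	sc := bufio.NewScanner(strings.NewReader(exposition))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		// Skip blank lines and "# HELP" / "# TYPE" comment lines.
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		// A sample looks like `name{labels} value` or `name value`.
		end := strings.IndexAny(line, "{ ")
		if end < 0 {
			end = len(line)
		}
		if line[:end] == name {
			return true
		}
	}
	return false
}

func main() {
	// Placeholder payload standing in for the sidecar's merged output.
	sample := "# TYPE example_requests_total counter\n" +
		"example_requests_total{code=\"200\"} 3\n"
	fmt.Println(hasMetric(sample, "example_requests_total"))
}
```

Matching on the full name up to `{` or a space avoids false positives from metrics that merely share a prefix.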

[Image: attached snapshot of the graphs page]

Note

Graphs are currently turned off for NIM runtimes, so I enabled them locally. The snapshot was taken from a frontend running on my local machine against the remote cluster. Jira for enabling: NVPE-18.

Note

Building and testing the queries required enabling monitoring for user-defined projects (see here), which makes the runtime metrics available from the OpenShift Metrics dashboard.

Merge criteria:

  • The commits are squashed in a cohesive manner and have meaningful messages.
  • Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
  • The developer has manually tested the changes and verified that the changes work.

@openshift-ci openshift-ci bot requested review from Jooho and rnetser November 28, 2024 15:16
Contributor

openshift-ci bot commented Nov 28, 2024

Hi @TomerFi. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Jooho
Contributor

Jooho commented Nov 28, 2024

/ok-to-test

@spolti
Member

spolti commented Nov 28, 2024

Changes look good to me. Just one question: should we align this with the dashboard team?

@TomerFi
Contributor Author

TomerFi commented Nov 28, 2024

Changes look good to me. Just one question: should we align this with the dashboard team?

@spolti, we're working with them. Currently, NIM metrics are disabled in the dashboard. We have a Jira in place to eventually re-enable them.

@spolti
Member

spolti commented Nov 28, 2024

Okay, thanks.

@Jooho
Contributor

Jooho commented Nov 29, 2024

/test

Contributor

openshift-ci bot commented Nov 29, 2024

@Jooho: The /test command needs one or more targets.
The following commands are available to trigger required jobs:

/test images
/test pr-image-mirror
/test unit

Use /test all to run all jobs.

In response to this:

/test

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@Jooho
Contributor

Jooho commented Nov 29, 2024

/test all

Contributor

@israel-hdez israel-hdez left a comment

I have a small suggestion. But otherwise it is OK.

If you think that the current code is fine, let me know, and I will approve.

Review thread on controllers/utils/nim.go (outdated, resolved)
@TomerFi
Contributor Author

TomerFi commented Dec 1, 2024

I have a small suggestion. But otherwise it is OK.

If you think that the current code is fine, let me know, and I will approve.

Good idea. I accepted the change suggestion.

@TomerFi TomerFi force-pushed the nvidia-nim-metrics branch from d8a3544 to d761fd2 on December 1, 2024 14:12
Co-authored-by: Edgar Hernández <[email protected]>
Signed-off-by: Tomer Figenblat <[email protected]>
@TomerFi TomerFi force-pushed the nvidia-nim-metrics branch from d761fd2 to f0cc223 on December 1, 2024 14:13
Contributor

@israel-hdez israel-hdez left a comment

/lgtm

Contributor

openshift-ci bot commented Dec 2, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: israel-hdez, spolti, TomerFi

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-merge-bot openshift-merge-bot bot merged commit aee3e05 into opendatahub-io:incubating Dec 2, 2024
5 checks passed
@TomerFi TomerFi deleted the nvidia-nim-metrics branch December 11, 2024 23:25
openshift-merge-bot bot pushed a commit that referenced this pull request Jan 16, 2025
* update global ca bundle logic and storage-config logic to follow up odh operator pr(1339) (#308)

Signed-off-by: jooho lee <[email protected]>

* disable dashboard and fix servingruntime display name

Signed-off-by: jooho lee <[email protected]>

* Use the main branch to build stable image tags, incubating for latest image tags (#316)

Signed-off-by: Hannah DeFazio <[email protected]>

* [RHOAIENG-13638] - Do not allow isvc creation in protected isvc (#311)

* [RHOAIENG-13638] - Do not allow isvc creation in protected namespace

chore: Fixes [RHOAIENG-13638] - Kserve model is not Ready after a kserve model is created and deleted from istio-system namespace

Signed-off-by: Spolti <[email protected]>

* review suggestions

Signed-off-by: Spolti <[email protected]>

* Update controllers/webhook/isvc_validator.go

Co-authored-by: Edgar Hernández <[email protected]>
Signed-off-by: Spolti <[email protected]>

---------

Signed-off-by: Spolti <[email protected]>
Co-authored-by: Edgar Hernández <[email protected]>

* update gitaction based on branch strategy change (#322)

Signed-off-by: jooho lee <[email protected]>

* feat: added performance metric grpahs config for nvidia nim (#320)

* feat: added performance metric grpahs config for nvidia nim

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: modifyed the runtime id annotation

Co-authored-by: Edgar Hernández <[email protected]>
Signed-off-by: Tomer Figenblat <[email protected]>

---------

Signed-off-by: Tomer Figenblat <[email protected]>
Co-authored-by: Edgar Hernández <[email protected]>

* Add NIM flag logic (#312)

Signed-off-by: mtrujillo <[email protected]>

* Grab the old release tag based on creation date

Signed-off-by: Hannah DeFazio <[email protected]>

* Updated the checkout code command

Signed-off-by: Mariah Holder <[email protected]>

* Updated the checkout code command (#329)

Signed-off-by: Mariah Holder <[email protected]>
Co-authored-by: Mariah Holder <[email protected]>

* Add reconciliation for Kserve Raw (#274)

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>

* chore: added pagination support for nim catalog response (#332)

Signed-off-by: Tomer Figenblat <[email protected]>

* feat(mr): enable model registry inference reconcile (#326)

Signed-off-by: Alessio Pragliola <[email protected]>

* add upstream release metadata (#333)

Signed-off-by: heyselbi <[email protected]>

* Migration to kubebuilder v4 (#324)

* Migration to kubebuilder v4

Signed-off-by: Edgar Hernández <[email protected]>

* Restore MR E2Es

Signed-off-by: Edgar Hernández <[email protected]>

* Restore top-level files

Signed-off-by: Edgar Hernández <[email protected]>

* Cleaning

Signed-off-by: Edgar Hernández <[email protected]>

* Fixing Makefile and Containerfile

Signed-off-by: Edgar Hernández <[email protected]>

* Linter fixes

Signed-off-by: Edgar Hernández <[email protected]>

* Initial rework of manifests

Signed-off-by: Edgar Hernández <[email protected]>

* Fix manifests

Signed-off-by: Edgar Hernández <[email protected]>

* Fix lint issues

Signed-off-by: Edgar Hernández <[email protected]>

* Deactivate E2Es

Because setup is not automated, yet.

Signed-off-by: Edgar Hernández <[email protected]>

* Feedback: Filippe

Signed-off-by: Edgar Hernández <[email protected]>

* Feedback: Filippe

Test differences after `go mod tidy`

Signed-off-by: Edgar Hernández <[email protected]>

* Apply suggestions from code review: Filippe

Co-authored-by: Filippe Spolti <[email protected]>
Signed-off-by: Edgar Hernández <[email protected]>

* Feedback: Filippe

* Pin go-toolset base image in Containerfile.
* Add `gosec` linter

Signed-off-by: Edgar Hernández <[email protected]>

* Update config/prometheus/monitor.yaml

Co-authored-by: Filippe Spolti <[email protected]>
Signed-off-by: Edgar Hernández <[email protected]>

* Feedback: Filippe

* Small change to comments in Makefile, to make the text clearer.
* Remove (again) `gosec` linter

Signed-off-by: Edgar Hernández <[email protected]>

* Fix panic on controller startup

Signed-off-by: Edgar Hernández <[email protected]>

---------

Signed-off-by: Edgar Hernández <[email protected]>
Co-authored-by: Filippe Spolti <[email protected]>

* chore: use naming convention for resources created by nim (#340)

* chore: use naming convention for resources created by nim

Signed-off-by: Tomer Figenblat <[email protected]>

* test: added assertions for dyamic nim resources name

Signed-off-by: Tomer Figenblat <[email protected]>

---------

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: set nim runtime api call page size to 1000 (#344)

Signed-off-by: Tomer Figenblat <[email protected]>

* Nim enablement change default to managed and add clean up job (#342)

* initial commit for clean up of nim and managed set as default

Signed-off-by: mtrujillo <[email protected]>

* remove space

Signed-off-by: mtrujillo <[email protected]>

* fix code length for linting

Signed-off-by: mtrujillo <[email protected]>

* fixed comments / adjusted import

Signed-off-by: mtrujillo <[email protected]>

---------

Signed-off-by: mtrujillo <[email protected]>

* chore: added new graph object for nim runtimes (#334)

* chore: added new graph object for nim runtimes

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: added REQUEST_OUTCOMES nim graph

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: added fixed typo in nim query object

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: fixed typo in nim query object

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: added initial query for nim gpu cache usage

Signed-off-by: Tomer Figenblat <[email protected]>

* chore: rewrite queries for nim new graphs

Signed-off-by: Tomer Figenblat <[email protected]>

---------

Signed-off-by: Tomer Figenblat <[email protected]>

* Update ovms to current build (#343)

Signed-off-by: Steve Grubb <[email protected]>
Co-authored-by: Steve Grubb <[email protected]>

* Automatically inject expected ODH annotations to InferenceGraph and InferenceServices (#339)

* Implementation of ODH defaulters for InferenceGraph and InferenceService

On creation of InferenceGraph or InferenceService resources, the following default annotations will be added:
* `serving.knative.openshift.io/enablePassthrough: true`
* `sidecar.istio.io/inject: true`
* `sidecar.istio.io/rewriteAppHTTPProbers: true`

The annotations are added only for Serverless mode, and only if they are missing.
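
The defaulting rule described in this commit message (add the three annotations only in Serverless mode, and only when missing) could look roughly like the sketch below. It is not the actual defaulter: `applyDefaults` and its `serverless` parameter are hypothetical, and how the deployment mode is detected is left to the caller.

```go
package main

import "fmt"

// defaultAnnotations lists the annotations named in the commit message.
var defaultAnnotations = map[string]string{
	"serving.knative.openshift.io/enablePassthrough": "true",
	"sidecar.istio.io/inject":                        "true",
	"sidecar.istio.io/rewriteAppHTTPProbers":         "true",
}

// applyDefaults adds each default annotation that is missing, and only when
// the resource is deployed in Serverless mode. Existing values are kept.
func applyDefaults(annotations map[string]string, serverless bool) map[string]string {
	if !serverless {
		return annotations
	}
	if annotations == nil {
		annotations = map[string]string{}
	}
	for key, value := range defaultAnnotations {
		if _, present := annotations[key]; !present {
			annotations[key] = value
		}
	}
	return annotations
}

func main() {
	// A user-supplied value is preserved; the other two defaults are added.
	in := map[string]string{"sidecar.istio.io/inject": "false"}
	fmt.Println(applyDefaults(in, true))
}
```

Checking for key presence rather than comparing values lets users opt out of a default by setting the annotation explicitly.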

Signed-off-by: Edgar Hernández <[email protected]>

* Feedback: Filippe

Extract "ENABLE_WEBHOOKS" string to constant

Signed-off-by: Edgar Hernández <[email protected]>

---------

Signed-off-by: Edgar Hernández <[email protected]>

* Authorization for InferenceGraph (Serverless) (#345)

* Authorization for InferenceGraph (Serverless)

This adds a new controller for KServe InferenceGraph resources. This new controller will have the responsibility of creating Authorino AuthConfig resources (similarly to InferenceServices case), when authorization is available in ODH platform.

InferenceGraphs can now be annotated with `security.opendatahub.io/enable-auth: "true"` to secure InferenceGraphs and only serve requests that are authorized.

Signed-off-by: Edgar Hernández <[email protected]>

* Feedback: Filippe - Event when auth is not available

Signed-off-by: Edgar Hernández <[email protected]>

---------

Signed-off-by: Edgar Hernández <[email protected]>

* [RHOAIENG-10293] add metrics resources for rawdeployment (#347)

* [RHOAIENG-10293] add metrics resources for rawdeployment

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>

* [RHOAIENG-10293] address feedback

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>

---------

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>

* [RHOAIENG-16851] rawdeployment route bug fixes (#341)

Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>

* fix null pointer error (RHOAIENG-18228) (#349)

Signed-off-by: jooho lee <[email protected]>

* remove old file

Signed-off-by: jooho lee <[email protected]>

update go.mod

Signed-off-by: jooho lee <[email protected]>

---------

Signed-off-by: jooho lee <[email protected]>
Signed-off-by: Hannah DeFazio <[email protected]>
Signed-off-by: Spolti <[email protected]>
Signed-off-by: Tomer Figenblat <[email protected]>
Signed-off-by: mtrujillo <[email protected]>
Signed-off-by: Mariah Holder <[email protected]>
Signed-off-by: Vedant Mahabaleshwarkar <[email protected]>
Signed-off-by: Alessio Pragliola <[email protected]>
Signed-off-by: heyselbi <[email protected]>
Signed-off-by: Edgar Hernández <[email protected]>
Signed-off-by: Steve Grubb <[email protected]>
Co-authored-by: Hannah DeFazio <[email protected]>
Co-authored-by: Filippe Spolti <[email protected]>
Co-authored-by: Edgar Hernández <[email protected]>
Co-authored-by: Tomer Figenblat <[email protected]>
Co-authored-by: Marcus Trujillo <[email protected]>
Co-authored-by: Mariah Holder <[email protected]>
Co-authored-by: Mariah Holder <[email protected]>
Co-authored-by: Vedant Mahabaleshwarkar <[email protected]>
Co-authored-by: Tomer Figenblat <[email protected]>
Co-authored-by: Alessio Pragliola <[email protected]>
Co-authored-by: Selbi Nuryyeva <[email protected]>
Co-authored-by: Steven Grubb <[email protected]>
Co-authored-by: Steve Grubb <[email protected]>

4 participants