chore: added new graph object for nim runtimes #334

TomerFi · 2024-12-11T02:20:27Z

Description

Added new graph objects for NIM models.

How Has This Been Tested?

I deployed a custom image to our dev cluster. After deploying a NIM model to the cluster, I checked the relevant x-metrics-dashboard configmap for the new graph objects. I then sent invocations to the model to generate data and executed the instantiated queries. See doc for more details.

Merge criteria:

The commits are squashed in a cohesive manner and have meaningful messages.
Testing instructions have been added in the PR body (for PRs involving changes that are not immediately obvious).
The developer has manually tested the changes and verified that the changes work.

openshift-ci · 2024-12-11T02:20:38Z

Hi @TomerFi. Thanks for your PR.

I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

spolti · 2024-12-12T15:13:15Z

controllers/constants/runtime-metrics.go

+				"queries": [
+					{
+						"title": "GPU cache usage over time",
+						"query":  "TODO"


will this be added for now?

Yes, we're looking for some clarifications about this graph. But this will be included in this PR ASAP.

openshift-ci · 2024-12-18T15:55:55Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: TomerFi
Once this PR has been reviewed and has the lgtm label, please assign mwaykole for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

spolti · 2024-12-20T14:42:23Z

controllers/constants/runtime-metrics.go

@@ -277,7 +277,7 @@ const (
 				"queries": [
 					{
 						"title": "GPU cache usage over time",
-						"query":  "TODO"
+						"query":  "round(sum(increase(gpu_cache_usage_perc{namespace='${NAMESPACE}', pod=~'${MODEL_NAME}-predictor-.*'}[${RATE_INTERVAL}])))"


I guess this rate interval needs to be hardcoded; IIRC the dashboard does not set it.

We set it to 1m here: https://github.com/opendatahub-io/odh-model-controller/blob/main/controllers/utils/utils.go#L404.

That's a good catch. I wanted to use the REQUEST_RATE_INTERVAL, which is 5m, but I mistakenly used the RATE_INTERVAL.

I have more changes coming after yesterday's meeting. Pushing soon.

In 7785941, I added a constant template replacement, "KV_CACHE_SAMPLING_RATE", with the value 24h, for the KV cache sampling. I'm not sure we'll stick to that, but I don't want to block the frontend work on our development environment.
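To make the placeholder discussion concrete, here is a minimal sketch of how `${...}` replacements in a query template might be resolved. It is not the controller's actual code (that lives in the `utils.go` file linked above). The `renderQuery` helper is hypothetical, and so is the `${KV_CACHE_SAMPLING_RATE}` placeholder syntax. The interval values 1m, 5m, and 24h are the ones mentioned in this thread, and the template is the query from the diff with the placeholder swapped as described in the reply.

```go
package main

import (
	"fmt"
	"strings"
)

// Values as mentioned in the review thread; the constant names are illustrative.
const (
	rateInterval        = "1m"
	requestRateInterval = "5m"
	kvCacheSamplingRate = "24h"
)

// renderQuery is a hypothetical helper that substitutes the ${...}
// placeholders in a query template with concrete values.
func renderQuery(tmpl, namespace, modelName string) string {
	r := strings.NewReplacer(
		"${NAMESPACE}", namespace,
		"${MODEL_NAME}", modelName,
		"${RATE_INTERVAL}", rateInterval,
		"${REQUEST_RATE_INTERVAL}", requestRateInterval,
		"${KV_CACHE_SAMPLING_RATE}", kvCacheSamplingRate, // assumed placeholder form
	)
	return r.Replace(tmpl)
}

func main() {
	// The query from the diff, using REQUEST_RATE_INTERVAL instead of RATE_INTERVAL.
	tmpl := "round(sum(increase(gpu_cache_usage_perc{namespace='${NAMESPACE}', pod=~'${MODEL_NAME}-predictor-.*'}[${REQUEST_RATE_INTERVAL}])))"
	fmt.Println(renderQuery(tmpl, "demo", "my-nim"))
}
```

With the hypothetical namespace `demo` and model name `my-nim`, the range selector resolves to `[5m]`. With `${RATE_INTERVAL}` it would resolve to `[1m]`, which is the mix-up the review comment caught.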

Signed-off-by: Tomer Figenblat <[email protected]>

openshift-ci bot added the do-not-merge/work-in-progress label Dec 11, 2024

openshift-ci bot added the needs-ok-to-test label Dec 11, 2024

spolti reviewed Dec 12, 2024

spolti reviewed Dec 20, 2024

openshift-merge-robot added the needs-rebase label Dec 22, 2024

TomerFi added 6 commits December 23, 2024 16:51

chore: added new graph object for nim runtimes

89214f9

Signed-off-by: Tomer Figenblat <[email protected]>

chore: added REQUEST_OUTCOMES nim graph

ea7c973

Signed-off-by: Tomer Figenblat <[email protected]>

chore: added fixed typo in nim query object

cd5035d

Signed-off-by: Tomer Figenblat <[email protected]>

chore: fixed typo in nim query object

5e987a6

Signed-off-by: Tomer Figenblat <[email protected]>

chore: added initial query for nim gpu cache usage

2c0d933

Signed-off-by: Tomer Figenblat <[email protected]>

chore: rewrite queries for nim new graphs

86217c9

Signed-off-by: Tomer Figenblat <[email protected]>

TomerFi force-pushed the add-new-nim-graphs branch from 7785941 to 86217c9 Compare December 23, 2024 21:51

openshift-merge-robot removed the needs-rebase label Dec 23, 2024

TomerFi marked this pull request as ready for review January 2, 2025 21:56

openshift-ci bot removed the do-not-merge/work-in-progress label Jan 2, 2025

openshift-ci bot requested review from terrytangyuan and VedantMahabaleshwarkar January 2, 2025 21:56
