chore: added new graph object for nim runtimes #334
base: incubating
Hi @TomerFi. Thanks for your PR. I'm waiting for an opendatahub-io member to verify that this patch is reasonable to test. If it is, they should reply with Once the patch is verified, the new status will be reflected by the I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
"queries": [
  {
    "title": "GPU cache usage over time",
    "query": "TODO"
Will this be added for now?
Yes, we're looking for some clarifications about this graph. But this will be included in this PR ASAP.
[APPROVALNOTIFIER] This PR is NOT APPROVED
This pull-request has been approved by: TomerFi
The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing
@@ -277,7 +277,7 @@ const (
       "queries": [
         {
           "title": "GPU cache usage over time",
-          "query": "TODO"
+          "query": "round(sum(increase(gpu_cache_usage_perc{namespace='${NAMESPACE}', pod=~'${MODEL_NAME}-predictor-.*'}[${RATE_INTERVAL}])))"
I guess this rate interval needs to be hardcoded; IIRC, the dashboard does not set it.
We set it to 1m here: https://github.com/opendatahub-io/odh-model-controller/blob/main/controllers/utils/utils.go#L404.
That's a good catch. I wanted to use the REQUEST_RATE_INTERVAL, which is 5m, but I mistakenly used the RATE_INTERVAL.
I have more changes coming after yesterday's meeting. Pushing soon.
In 7785941, I added a constant template replacement, "KV_CACHE_SAMPLING_RATE", with the value 24h for the KV cache sampling. I'm not sure we'll stick with that value, but I don't want to block the frontend work on our development environment.
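For readers following the thread, here is a minimal sketch of how `${...}` placeholders in a query template could be substituted with fixed values. It is not the controller's actual code: the `renderQuery` helper, the map-based signature, and the sample namespace and model name are hypothetical, and the template below swaps in `${KV_CACHE_SAMPLING_RATE}` purely for illustration. The 1m, 5m, and 24h values are the ones mentioned in the comments above.

```go
package main

import (
	"fmt"
	"strings"
)

// renderQuery replaces every ${KEY} placeholder in tmpl with its value.
// Hypothetical helper for illustration; not taken from odh-model-controller.
func renderQuery(tmpl string, values map[string]string) string {
	pairs := make([]string, 0, len(values)*2)
	for key, val := range values {
		pairs = append(pairs, "${"+key+"}", val)
	}
	// Keys are delimited by "${" and "}", so no placeholder is a prefix of
	// another and the map iteration order does not affect the result.
	return strings.NewReplacer(pairs...).Replace(tmpl)
}

func main() {
	// Assumed template: the query from the diff, with the sampling-rate
	// placeholder used as the range interval.
	tmpl := "round(sum(increase(gpu_cache_usage_perc{namespace='${NAMESPACE}', " +
		"pod=~'${MODEL_NAME}-predictor-.*'}[${KV_CACHE_SAMPLING_RATE}])))"

	values := map[string]string{
		"NAMESPACE":              "demo",     // hypothetical namespace
		"MODEL_NAME":             "my-model", // hypothetical model name
		"RATE_INTERVAL":          "1m",       // value cited in the thread
		"REQUEST_RATE_INTERVAL":  "5m",       // value cited in the thread
		"KV_CACHE_SAMPLING_RATE": "24h",      // constant added in 7785941
	}

	fmt.Println(renderQuery(tmpl, values))
}
```

Keeping the interval as a named placeholder rather than a literal in the query string makes it cheap to change the value later if 24h turns out not to be the right choice.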
Signed-off-by: Tomer Figenblat <[email protected]>
7785941 to 86217c9
Description
Added new graph objects for NIM models.
Work doc
Jira: NVPE-51
How Has This Been Tested?
A custom image was deployed into our dev cluster. After deploying a NIM model to the cluster, I checked the relevant x-metrics-dashboard configmap for the graph objects. I proceeded to execute the instantiated queries against the model after loading it with invocations to provide data. See doc for more details.
Merge criteria: