AAW dev: general nodepools track resource usage #2002

Open · Jose-Matsuda opened this issue Dec 11, 2024 · 5 comments

@Jose-Matsuda
Contributor

EPIC
Follow-up to #1997

We want to observe whether any workloads are being evicted or otherwise behaving unexpectedly following the resource-request adjustments.
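
A hedged sketch of how that could be spot-checked from the CLI (assuming kubectl access to the dev cluster; the namespace and grep pattern are illustrative):

```sh
# Recent eviction events anywhere on the cluster
kubectl get events -A --field-selector reason=Evicted

# Containers whose last termination was an OOM kill (kubeflow namespace assumed)
kubectl -n kubeflow get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled
```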

@jacek-dudek

Here is a table of resource usage metrics comparing the discrepancy between actual usage and resource requests before and after the request adjustments were made.
filtered-resource-utilization-on-aaw-dev-general-nodes.ods
There is a problem here: some of the workloads seem to have reverted to the resource requests that were in effect before my adjustments. I checked the source code changes I made in the previous issue to update the requests, and they are still in place. Perhaps I made the changes in the wrong place? I'm posting a topic on elab about it. For the workloads where the changes persisted, actual usage tracks the updated requests better than before.
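
For anyone else comparing the numbers, a rough sketch of how the raw data can be pulled with kubectl (the kubeflow namespace is assumed; metrics-server must be available for `kubectl top`):

```sh
# Per-container CPU/memory requests currently in effect on the cluster
kubectl -n kubeflow get pods -o custom-columns='POD:.metadata.name,CONTAINERS:.spec.containers[*].name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Actual per-container usage, for comparison against the requests above
kubectl -n kubeflow top pods --containers
```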

@jacek-dudek

I got some of these outstanding workloads to update their resource requests by manually syncing them in the Argo CD web app, with help from Mathis. I noticed another issue: there is an additional container, istio-proxy, defined on these pods that bumps up the total requests, so they don't match actual usage as well as they could. I missed those containers when making the original resource request updates, so I will make another commit to fix that.
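
For reference, the same manual sync can be done from the Argo CD CLI; a sketch with placeholder names (the server address and application name below are not the real ones):

```sh
# Log in to the Argo CD API server (placeholder host), then sync the application
# that owns the affected workload and wait for it to become healthy.
argocd login argocd.example.internal --sso
argocd app sync <application-name>
argocd app wait <application-name> --health
```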

@jacek-dudek

jacek-dudek commented Jan 6, 2025

I investigated the istio-proxy issue. It turns out we don't specify that container in any of our manifests; instead, it is sidecar-injected by a mutating webhook provided by the Istio platform into any pod running in the kubeflow namespace.
The kubeflow namespace has the label istio-injection=enabled set on it.

There are two Kubernetes objects responsible for this process: one is a ConfigMap, the other is a mutating webhook, and both are called istio-sidecar-injector. They are not found in either of our configuration repositories, but I tracked the underlying manifests down in the upstream kubeflow/manifests repo: https://github.com/kubeflow/manifests/blob/master/common/istio-1-23/istio-install/base/install.yaml.
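
For anyone retracing this, the injection setup can be confirmed directly on the cluster; a sketch assuming the injector lives in the default istio-system namespace:

```sh
# The label that triggers sidecar injection for the namespace
kubectl get namespace kubeflow --show-labels

# The two objects doing the injection (both named istio-sidecar-injector on our cluster)
kubectl -n istio-system get configmap istio-sidecar-injector -o yaml | less
kubectl get mutatingwebhookconfigurations | grep -i sidecar-injector
```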

I'm not sure which version branch we track, but the ConfigMap in the master branch has entries for istio-proxy resource requests that match what I'm seeing in the container specs of istio-proxy running on our cluster: 100m CPU and 128Mi memory.
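
If we decide not to change the upstream ConfigMap, Istio also supports per-workload overrides of the injected sidecar's requests via pod-template annotations; a minimal sketch, with illustrative values and a placeholder deployment name:

```sh
# Annotations understood by the Istio sidecar injector; the values here are examples only
kubectl -n kubeflow patch deployment <workload-name> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/proxyCPU":"50m","sidecar.istio.io/proxyMemory":"64Mi"}}}}}'
```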

@Souheil-Yazji
Contributor

Souheil-Yazji commented Jan 8, 2025

@jacek-dudek create a follow-up issue for the istio-proxy-related resource usage and the possibility of updating this in the upstream reference.

Check the Grafana/Prometheus alerts to see whether any of the following have fired when pods were down-sized:
[screenshot: list of alert rules to check]
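
Once Prometheus access is sorted out, one way to run that check is to query the built-in ALERTS series over the window in which the requests were changed; a sketch with a placeholder host, alert name, and time range:

```sh
# Returns firing intervals for the named alert between the two timestamps
curl -G 'http://<prometheus-host>/api/v1/query_range' \
  --data-urlencode 'query=ALERTS{alertstate="firing",alertname="<AlertName>"}' \
  --data-urlencode 'start=2024-12-11T00:00:00Z' \
  --data-urlencode 'end=2025-01-08T00:00:00Z' \
  --data-urlencode 'step=5m'
```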

@jacek-dudek

jacek-dudek commented Jan 8, 2025

To check the alerts I will need to be granted access to the Prometheus UI, as I don't currently have access for either aaw-dev or aaw-prod.
