AAW dev: general nodepools track resource usage #2002

Open · Jose-Matsuda opened this issue Dec 11, 2024 · 5 comments

@Jose-Matsuda
Contributor

EPIC
Follow-up to #1997

We want to observe whether any workloads are being evicted or otherwise behaving unexpectedly following the resource-request adjustments.
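
A hedged sketch of how that could be spot-checked from the CLI (assuming kubectl access to the dev cluster; the namespace and grep pattern are illustrative):

```sh
# Recent eviction events anywhere on the cluster
kubectl get events -A --field-selector reason=Evicted

# Containers whose last termination was an OOM kill (kubeflow namespace assumed)
kubectl -n kubeflow get pods -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[*].lastState.terminated.reason}{"\n"}{end}' | grep -i oomkilled
```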

@jacek-dudek

Here is a table of resource usage metrics comparing the discrepancy between actual usage and resource requests before and after the request adjustments were made.
filtered-resource-utilization-on-aaw-dev-general-nodes.ods
There is a problem here: some of the workloads seem to have reverted to the resource requests that were in effect before my adjustments. I checked the source code changes I made in the previous issue to update the requests, and they are still in place. Perhaps I made the changes in the wrong place? I'm posting a topic on elab about it. For the workloads where the changes persisted, actual usage tracks the updated requests better than before.
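
For anyone else comparing the numbers, a rough sketch of how the raw data can be pulled with kubectl (the kubeflow namespace is assumed; metrics-server must be available for `kubectl top`):

```sh
# Per-container CPU/memory requests currently in effect on the cluster
kubectl -n kubeflow get pods -o custom-columns='POD:.metadata.name,CONTAINERS:.spec.containers[*].name,CPU_REQ:.spec.containers[*].resources.requests.cpu,MEM_REQ:.spec.containers[*].resources.requests.memory'

# Actual per-container usage, for comparison against the requests above
kubectl -n kubeflow top pods --containers
```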

@jacek-dudek

I got some of these outstanding workloads to update their resource requests by manually syncing them in the Argo CD web app, with help from Mathis. I noticed another issue: there is an additional container, istio-proxy, defined on these pods that bumps up the total requests, so they don't match actual usage as well as they could. I missed those containers when making the original resource request updates, so I will make another commit to fix that.
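
For reference, the same manual sync can be done from the Argo CD CLI; a sketch with placeholder names (the server address and application name below are not the real ones):

```sh
# Log in to the Argo CD API server (placeholder host), then sync the application
# that owns the affected workload and wait for it to become healthy.
argocd login argocd.example.internal --sso
argocd app sync <application-name>
argocd app wait <application-name> --health
```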

@jacek-dudek

jacek-dudek commented Jan 6, 2025

I investigated the istio-proxy issue. It turns out we don't specify that container in any of our manifests; instead, it is sidecar-injected by a mutating webhook provided by the Istio platform into any pod running in the kubeflow namespace.
The kubeflow namespace has the label istio-injection=enabled set on it.

There are two Kubernetes objects responsible for this process: one is a ConfigMap, the other is a mutating webhook, and both are called istio-sidecar-injector. They are not found in either of our configuration repositories, but I tracked the underlying manifests down in the upstream kubeflow/manifests repo: https://github.com/kubeflow/manifests/blob/master/common/istio-1-23/istio-install/base/install.yaml.
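
For anyone retracing this, the injection setup can be confirmed directly on the cluster; a sketch assuming the injector lives in the default istio-system namespace:

```sh
# The label that triggers sidecar injection for the namespace
kubectl get namespace kubeflow --show-labels

# The two objects doing the injection (both named istio-sidecar-injector on our cluster)
kubectl -n istio-system get configmap istio-sidecar-injector -o yaml | less
kubectl get mutatingwebhookconfigurations | grep -i sidecar-injector
```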

I'm not sure which version branch we track, but the ConfigMap in the master branch has entries for istio-proxy resource requests that match what I'm seeing in the container specs of istio-proxy running on our cluster: 100m CPU and 128Mi memory.
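
If we decide not to change the upstream ConfigMap, Istio also supports per-workload overrides of the injected sidecar's requests via pod-template annotations; a minimal sketch, with illustrative values and a placeholder deployment name:

```sh
# Annotations understood by the Istio sidecar injector; the values here are examples only
kubectl -n kubeflow patch deployment <workload-name> --type merge -p \
  '{"spec":{"template":{"metadata":{"annotations":{"sidecar.istio.io/proxyCPU":"50m","sidecar.istio.io/proxyMemory":"64Mi"}}}}}'
```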

@Souheil-Yazji
Contributor

Souheil-Yazji commented Jan 8, 2025

@jacek-dudek create a follow-up issue for the istio-proxy-related resource usage and the possibility of updating this in the upstream reference.

Check the Grafana/Prometheus alerts to see whether any of the following have fired when pods were down-sized:
[screenshot: list of alert rules to check]
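
Once Prometheus access is sorted out, one way to run that check is to query the built-in ALERTS series over the window in which the requests were changed; a sketch with a placeholder host, alert name, and time range:

```sh
# Returns firing intervals for the named alert between the two timestamps
curl -G 'http://<prometheus-host>/api/v1/query_range' \
  --data-urlencode 'query=ALERTS{alertstate="firing",alertname="<AlertName>"}' \
  --data-urlencode 'start=2024-12-11T00:00:00Z' \
  --data-urlencode 'end=2025-01-08T00:00:00Z' \
  --data-urlencode 'step=5m'
```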

@jacek-dudek

jacek-dudek commented Jan 8, 2025

To check the alerts I will need to be granted access to the Prometheus UI, as I don't currently have access for either aaw-dev or aaw-prod.
