AAW dev: general nodepools track resource usage #2002
Here is a table of resource usage metrics comparing the discrepancies between actual usage and resource requests, both before and after the request adjustments were made.
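For reference, here is a minimal sketch (not part of the original analysis) of how a usage-vs-request comparison like this can be pulled with the Kubernetes Python client. It assumes a working kubeconfig for the cluster and that the metrics.k8s.io API (metrics-server) is available; the kubeflow namespace is just an example.

```python
# Sketch: compare declared container resource requests with live usage from metrics-server.
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()
custom = client.CustomObjectsApi()

namespace = "kubeflow"  # example namespace; adjust to the workloads being reviewed

# Live per-container usage from the metrics.k8s.io API.
usage = custom.list_namespaced_custom_object("metrics.k8s.io", "v1beta1", namespace, "pods")
usage_by_pod = {item["metadata"]["name"]: item["containers"] for item in usage["items"]}

# Declared requests from the pod specs, printed next to actual usage.
for pod in core.list_namespaced_pod(namespace).items:
    for container in pod.spec.containers:
        requested = container.resources.requests or {}
        live = next(
            (c["usage"] for c in usage_by_pod.get(pod.metadata.name, []) if c["name"] == container.name),
            {},
        )
        print(f"{pod.metadata.name}/{container.name}: requested={requested} actual={live}")
```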
I got some of these outstanding workloads to update their resource requests by manually syncing them in the Argo CD web app, with help from Mathis. I also noticed another issue: an additional sidecar container, istio-proxy, is defined on these pods and bumps up the requests, so they don't match actual usage as well as they could. I missed those containers when making the original resource request updates, so I will make another commit to fix that.
Investigated the istio-proxy related issue. It turns out we don't specify that container in any of our manifests; instead it is sidecar-injected by a mutating webhook provided by the Istio platform into any pods running in the kubeflow namespace. Two Kubernetes objects are responsible for this process, a ConfigMap and a MutatingWebhookConfiguration, both named istio-sidecar-injector. They are not found in either of our configuration repositories, but I tracked the underlying manifests down in the upstream kubeflow manifests repo: https://github.com/kubeflow/manifests/blob/master/common/istio-1-23/istio-install/base/install.yaml. I'm not sure which version branch we track, but the ConfigMap on the master branch has entries for istio-proxy resource requests that match what I'm seeing in the container specs of istio-proxy running on our cluster: 100m CPU and 128MiB memory.
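A small sketch for verifying this on the cluster, assuming the injector ConfigMap lives in the istio-system namespace and follows the upstream layout (injector defaults stored as JSON under the ConfigMap's `values` key):

```python
# Sketch: read the injector's default proxy resources from the istio-sidecar-injector ConfigMap.
import json
from kubernetes import client, config

config.load_kube_config()
core = client.CoreV1Api()

cm = core.read_namespaced_config_map("istio-sidecar-injector", "istio-system")
values = json.loads(cm.data["values"])
proxy_resources = values.get("global", {}).get("proxy", {}).get("resources", {})
print(proxy_resources)  # expected to show 100m CPU / 128Mi memory requests if upstream defaults apply
```

If those defaults turn out to be too high for particular workloads, Istio also supports per-pod overrides through annotations such as sidecar.istio.io/proxyCPU and sidecar.istio.io/proxyMemory, which may be an alternative to changing the upstream reference.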
@jacek-dudek create a follow-up issue for the istio-proxy related resource usage and the possibility of updating this in the upstream reference. Check Grafana/Prometheus alerts to see if any of the following alerts have fired when pods are down-sized:
To check the alerts I will need to be granted access to the Prometheus UI, as I don't currently have access for either aaw-dev or aaw-prod.
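Once access is granted, something like the following sketch could check for firing alerts through the Prometheus HTTP API. The URL and alert names are placeholders, not the actual aaw-dev endpoint or the alert list referenced above.

```python
# Sketch: query the Prometheus HTTP API for currently firing alerts.
import requests

PROM_URL = "http://localhost:9090"  # placeholder; e.g. reached through a kubectl port-forward
ALERTS_OF_INTEREST = {"ExampleAlertName"}  # placeholder for the alerts listed in the comment above

resp = requests.get(
    f"{PROM_URL}/api/v1/query",
    params={"query": 'ALERTS{alertstate="firing"}'},
    timeout=10,
)
resp.raise_for_status()
for result in resp.json()["data"]["result"]:
    labels = result["metric"]
    if labels.get("alertname") in ALERTS_OF_INTEREST:
        print(labels.get("alertname"), labels.get("namespace"), labels.get("pod"))
```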
EPIC
Follow up to #1997
We want to observe whether any workloads are being evicted or otherwise behaving abnormally after the request adjustments.