Stop filtering aggregatedContainerStates without samples #5326
Conversation
Sorry for the slow reply, I'm not sure why I missed this PR.
The general approach looks good to me, but there are some things we need to do to make sure this works properly.
We use `TotalSamplesCount` to decide if `isEmpty` should return true, which we later use to decide if we should delete the container state (here directly and inside the call to `isExpired`).
I'm worried that this will result in OOMs only briefly having an effect and then being cleaned up.
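Roughly, the lifecycle check being described looks like the following simplified sketch; field and method names follow this comment, not necessarily the exact recommender source:

```go
package main

import (
	"fmt"
	"time"
)

// AggregateContainerState is a simplified stand-in for the recommender's
// aggregate state; only the fields relevant to the lifecycle check are shown.
type AggregateContainerState struct {
	TotalSamplesCount int
	CreationTime      time.Time
	LastSampleStart   time.Time
}

// isEmpty only looks at TotalSamplesCount, so a state that holds nothing but
// OOM samples (which don't bump the counter) still counts as empty.
func (a *AggregateContainerState) isEmpty() bool {
	return a.TotalSamplesCount == 0
}

// isExpired treats an empty state as expired once it has outlived the
// aggregation window, which is why an OOM-only state risks being deleted.
func (a *AggregateContainerState) isExpired(now time.Time, window time.Duration) bool {
	if a.isEmpty() {
		return now.Sub(a.CreationTime) >= window
	}
	return now.Sub(a.LastSampleStart) >= window
}

func main() {
	oomOnly := &AggregateContainerState{CreationTime: time.Now().Add(-9 * 24 * time.Hour)}
	fmt.Println(oomOnly.isEmpty(), oomOnly.isExpired(time.Now(), 8*24*time.Hour)) // true true
}
```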
Also, I wonder why you increase the timeout for the OOM test?
It's buried in the long PR description, sorry:
When I watched the VPA status changes during the test, I could see that the recommendations were increasing; it just took longer than currently configured. Without the change, the recommendations didn't move at all.
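For illustration, this is roughly the shape of the check the longer timeout affects; `getTarget` is a hypothetical stand-in for however the e2e suite reads the VPA status, not the actual test helper:

```go
package e2esketch

import (
	"time"

	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForMemoryTarget polls until the recommended memory target reaches the
// expected value or the timeout expires. Since recommendations grow only
// gradually after OOMs (~10 minutes in this test), the timeout has to be
// generous enough for the target to cross the expected range.
func waitForMemoryTarget(getTarget func() (resource.Quantity, error), want resource.Quantity, timeout time.Duration) error {
	return wait.PollImmediate(30*time.Second, timeout, func() (bool, error) {
		target, err := getTarget()
		if err != nil {
			return false, err // PollImmediate stops and returns condition errors
		}
		return target.Cmp(want) >= 0, nil
	})
}
```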
That's a good point! For the aggregate state to be garbage collected, the Pod needs to be detected as 'non-contributive' in addition to the TotalSampleCount being 0, which shouldn't be the case as long as the Deployment still exists, right? I wasn't sure how likely this would be, but it definitely could happen! Do you have concrete suggestions on how to improve the solution so we don't run into this case?
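To make that condition concrete, here is a rough sketch of the garbage-collection decision as described in this thread, building on the `AggregateContainerState` sketch above; the helper is illustrative, not the actual recommender code:

```go
// shouldGarbageCollect sketches the decision discussed here: an aggregate is
// only deleted when no live Pod contributes to it anymore AND it is expired
// (which, for an empty state, happens once it outlives the aggregation window).
// While the Deployment exists, its Pods keep contributing, so an OOM-only
// aggregate with TotalSamplesCount == 0 should normally survive.
func shouldGarbageCollect(state *AggregateContainerState, livePodsContribute bool, now time.Time, window time.Duration) bool {
	if livePodsContribute {
		return false
	}
	return state.isExpired(now, window)
}
```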
/lgtm
ping @jbartosik, is there something missing before merging this?
/approve
Thanks for the reminder.
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jbartosik, voelzmo

The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment.
Which component this PR applies to?
vertical-pod-autoscaler
What type of PR is this?
/kind bug
What this PR does / why we need it:
Stop filtering `aggregatedContainerState`s with `TotalSampleCount == 0`. These `aggregateContainerState`s can contain OOMSamples, which don't increase the `TotalSampleCount` but make it possible to give a better memory recommendation. This change also increases the timeout for the OOMKill e2e test, as it takes ~10 minutes for the recommendations to reach the expected range.

Which issue(s) this PR fixes:

Fixes #4981
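A rough sketch of the kind of filter the description above refers to; the function, types, and flag are illustrative placeholders, not the exact recommender code:

```go
package main

import "fmt"

// containerState is a simplified placeholder for the recommender's aggregate
// state; only the fields relevant to the filter are shown.
type containerState struct {
	TotalSamplesCount int
	HasOOMSamples     bool
}

// usableAggregates sketches the old filtering behaviour: aggregates whose
// TotalSamplesCount is zero were skipped, which also drops OOM-only states.
// Passing filterEmpty=false corresponds to the behaviour after this PR.
func usableAggregates(states map[string]*containerState, filterEmpty bool) map[string]*containerState {
	result := make(map[string]*containerState)
	for name, state := range states {
		if filterEmpty && state.TotalSamplesCount == 0 {
			continue
		}
		result[name] = state
	}
	return result
}

func main() {
	states := map[string]*containerState{
		"oom-only": {TotalSamplesCount: 0, HasOOMSamples: true},
		"regular":  {TotalSamplesCount: 42},
	}
	fmt.Println(len(usableAggregates(states, true)))  // 1: the OOM-only state is dropped
	fmt.Println(len(usableAggregates(states, false))) // 2: OOM samples can inform the memory recommendation
}
```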
Special notes for your reviewer:
I'm not entirely sure if there is a good reason why `aggregateContainerStates` without PodMetric samples were ignored, besides trying to optimize. Maybe someone else has context on whether there's an additional reason why we shouldn't remove this condition?

I looked at other options to solve this, but they seem much more complicated.
Alternative A: Adding separate `SampleCounts` for CPU and memory

The VPA is built with the assumption that a sample contains both CPU and memory data, therefore only a single `TotalSampleCount` exists in `aggregateContainerState`. We could move this further down and give each resource its own counter, but confidence is computed based on `TotalSampleCount`: this would mean that CPU and memory can have different confidence and therefore different upper and lower bounds. This seems way too intrusive.

Alternative B: Increase `TotalSampleCount` when an OOMSample is added

Do something similar to the quick hack I discussed in #4981, but only for OOMSamples, not all memory samples.
Downside: When many OOMSamples are recorded in a short period (e.g. when a Pod suddenly spikes in memory or has far too low initial resources defined), the confidence for CPU recommendations increases as well, which is the opposite of what's actually happening: in a situation like this we don't get PodMetrics and therefore no CPU metrics, so this also seems incorrect.
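For context on why both alternatives run into confidence: the upper and lower bounds are scaled by a single confidence value derived from `TotalSampleCount`, roughly like the following simplified sketch (not the exact estimator code):

```go
package main

import (
	"fmt"
	"math"
)

// confidence is a simplified version of how the recommender derives confidence
// from sample history: more samples (and a longer lifespan) mean higher
// confidence, and the same value feeds both the CPU and the memory bounds.
func confidence(totalSampleCount int, lifespanDays float64) float64 {
	// Normalize the sample count as if samples arrived once per minute.
	samplesAsDays := float64(totalSampleCount) / (60 * 24)
	return math.Min(lifespanDays, samplesAsDays)
}

// scaledBound shows why confidence matters: low confidence widens the bound
// (large multiplier), high confidence keeps it close to the target estimate.
func scaledBound(target, multiplier, exponent, conf float64) float64 {
	return target * math.Pow(1.0+multiplier/conf, exponent)
}

func main() {
	// Inflating TotalSampleCount via OOM samples (Alternative B) would raise
	// confidence for CPU too, tightening CPU bounds without any CPU data.
	fmt.Println(scaledBound(100, 1.0, 1.0, confidence(60, 7)))    // few samples: low confidence, wide bound
	fmt.Println(scaledBound(100, 1.0, 1.0, confidence(10080, 7))) // a week of samples: tighter bound
}
```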
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: