-
Notifications
You must be signed in to change notification settings - Fork 38
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups #152
BUG 1824215: Allow small tolerance on memory capacity when comparing nodegroups #152
Conversation
@JoelSpeed: This pull request references Bugzilla bug 1824215, which is invalid:
Comment In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/bugzilla refresh |
@JoelSpeed: This pull request references Bugzilla bug 1824215, which is valid. The bug has been moved to the POST state. The bug has been updated to refer to the pull request using the external bug tracker. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks nice to me, thanks Joel!
/approve
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: elmiko The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Thanks! This is purely autoscaler core, can we get a counter PR upstream? |
/lgtm |
/retest Please review the full test history for this PR and help us cut down flakes. |
@@ -124,8 +126,10 @@ func IsCloudProviderNodeInfoSimilar(n1, n2 *schedulernodeinfo.NodeInfo, ignoredL | |||
switch kind { | |||
case apiv1.ResourceMemory: | |||
// For memory capacity we allow a small tolerance | |||
memoryDifference := math.Abs(float64(qtyList[0].Value()) - float64(qtyList[1].Value())) | |||
if memoryDifference > MaxMemoryDifferenceInKiloBytes { | |||
difference := absSub(qtyList[0], qtyList[1]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could just use math.Abs
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wanted to keep all of the quantities as resource.Quantity
's, so this helper allows us to do that and reduce the conversions to/from integers to reduce the likelihood of a mistake being made there
|
||
var ( | ||
// MaxMemoryDifference describes how much memory capacity can differ but still be considered equal. | ||
MaxMemoryDifference = resource.MustParse("256Mi") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How big is the diff we are seeing in real nodes?
Tolerating 256Mi seems too much as to consider nodeGroups equal. The original intention was to tolerate 128Ki kubernetes@e8b3c2a.
I think this should be 256Ki.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see the test case I've added below which came from a real world test case. The values that came through the code (via much debug logging) were 16116152Ki
and 15944120Ki
, which is 168Mi
, just over a 1% difference in this case
Please also review the attached BZ which has more details from a customer who report differences in a similar magnitude.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wouldn't this MaxMemoryDifference
also apply to much more smaller instances to the point of making the check loosing its value?
i.e If the possible diff range increase with the instance size, should we may be make our tolerance window a percentage of the given total size?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep the discussion to the upstream PR for better visibility https://github.com/kubernetes/autoscaler/pull/3124/files#r422931565
/lgtm cancel |
Counter PR will be created shortly |
d52511d
to
d26853f
Compare
Upstream kubernetes#3124 |
…hen comparing nodegroups This allows developers to better interpet how the calculations are being done by leaving the values as "Quantities". For example, the max difference is now a string converted to a quantity which will be easier to reason about and update if needed in the future. Also adds tests that match real values from a real set of nodes that would be expected to be the same
b19f2ee
to
90751c4
Compare
@JoelSpeed: This pull request references Bugzilla bug 1824215, which is valid. 3 validation(s) were run on this bug
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/hold cancel This has been updated to reflect the upstream implementation which should be merging within the next week |
@JoelSpeed: The following test failed, say
Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here. |
thanks Joel! |
@JoelSpeed: All pull requests linked via external trackers have merged: openshift/kubernetes-autoscaler#152, openshift/kubernetes-autoscaler#144. Bugzilla bug 1824215 has been moved to the MODIFIED state. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
/cherry-pick release-4.5 |
@JoelSpeed: new pull request created: #157 In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. |
This allows a small tolerance in the memory capacity of nodes to allow better matching of similar node groups. There are differences in the memory values that Kubernetes interprets due to variances in the instances that a cloud provider provides.
Also adds tests that match real values from a real set of nodes that would be expected to be the same (the same instance type across multiple availability zones within a given region)
Eg. In testing I saw AWS m5.xlarge nodes with capacities such as
16116152Ki
and15944120Ki
not only across availability zones, but within the same availability zone after a few cycles through machines. This is a difference on168Mi
which is much larger than the original tolerance of 128000 Bytes which was preventingBalanceSimilarNodeGroups
from balancing across these availability zones.