-
Notifications
You must be signed in to change notification settings - Fork 4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Cluster Autoscaler: add GetInstanceID() for cloudprovider interface #1738
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good modulo one comment.
Did you grep the code for usages of ProviderId
?
b0f8c98
to
f4ef957
Compare
Yep, the providerIDs got from cloud provider are only used in clusterstate. |
@@ -56,6 +56,9 @@ type CloudProvider interface { | |||
// GetResourceLimiter returns struct containing limits (max, min) for resources (cores, memory etc.). | |||
GetResourceLimiter() (*ResourceLimiter, error) | |||
|
|||
// GetInstanceID gets the instance ID for the specified node. | |||
GetInstanceID(node *apiv1.Node) string |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are we 100% percent sure that this is the right approach? Honestly, I have doubts that a (hopefully) temporary bugfix/workaround for minor issues in NodeController/Azure node registration process should involve an api change for all cloud providers (including ones that are not in this repo). Do we see any other applications of this method other "toLower" in Azure?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't expect to change all the cloud providers, but since the providerID is handled in clusterstate and it is case sensitive, the changes are still required. (there are also other ways, e.g. only this for Azure, but that's a little hacky).
In other applications, we are also keeping the resourceID of Azure VM, but when the resourceID is used for comparison, it is converted to lower or upper cases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to understand if there is a plan to fix it in Azure (ie. make ProviderID match the name of VM)? Basically is this a temporary workaround? Are we planning to remove it later on? If so I'd like a comment stating that.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaciekPytel This is actually for fixing the issues with old releases, as providerID couldn't be changed after node initialized.
For new releases, we should fix the providerID inconsistent issues in the cloud provider, probably cast to lower cases instead of using VM name.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you talking about master or node versions? How long will we support these old releases? What should w do in, let's say half a year from now - remove this method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean cherry pick to old stable releases (for each kubernetes release). Those workarounds should be there and never get deleted. But for master branch, we may delete it after fixing the issues in cloud provider (e.g. in v1.14).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, makes sense.
registered := sets.NewString() | ||
for _, node := range allNodes { | ||
registered.Insert(node.Spec.ProviderID) | ||
registered.Insert(cloudProvider.GetInstanceID(node)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is comparable to the previous code. Now, instead of if (azure) providerId=tolower(providerId)
we have cloudProvider.GetInstanceID(node)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, it's cleaner than the previous one.
@MaciekPytel @losipiuk @aleksandra-malinowska What do you think? I understand the problem and the need to solve it but I have mixed feelings about the approach - actually every approach would be ugly to some extent. Which is the least ugly to you? |
I like this PR more than explicit With that said I will not force on this approach if over-voted. |
+1 from me. As was mentioned offline, it may also open the way for implementations that don't rely on ProviderId, which was brought up multiple times by users running into this not-so-obvious requirement. |
I still don't like it too much but we can merge it if neither @MaciekPytel or @losipiuk have strong negative feelings about it. |
/lgtm |
/hold cancel |
/approve |
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: mwielgus The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/test all |
Closing and reopening to reset stalled Travis. |
Cluster Autoscaler 1.2 : cherry pick of #1738
Cluster Autoscaler 1.12 : cherry pick of #1738
Cluster Autoscaler 1.3 : cherry pick of #1738
Cluster Autoscaler 1.13 : cherry pick of #1738
This PR adds a new
GetInstanceID()
for cloud provider interface, which is required for fixing the VMSS cases different issues.For VMSS node's providerIDs, they may be in different cases depending user's PUT requests to the resources. This would cause autoscaler to delete all newly provisioned nodes. And more seriously, CA would scale up/down again and again.
This PR adds a workaround for this issue, so that they are handled in lower cases in CA.
/assign @mwielgus @losipiuk
cc @andyzhangx @ritazh