Cluster Autoscaler: add GetInstanceID() for cloudprovider interface #1738

Merged
merged 4 commits into from
Mar 1, 2019

Conversation

feiskyer
Member

This PR adds a new GetInstanceID() method to the cloud provider interface, which is required to fix the VMSS providerID case-mismatch issue.

The providerIDs of VMSS nodes may differ in letter case depending on the user's PUT requests to the resources. This can cause the autoscaler to delete all newly provisioned nodes and, more seriously, to scale up and down again and again.

This PR adds a workaround for this issue, so that providerIDs are handled in lower case in CA.
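
To illustrate the mismatch described above, here is a minimal sketch, not the PR's code. The IDs are made up and `sameInstance` is a hypothetical helper; the only point is that an exact string comparison treats two IDs differing in case as different instances, while a case-insensitive comparison does not.

```go
package main

import (
	"fmt"
	"strings"
)

// sameInstance reports whether two provider IDs refer to the same instance
// when letter case is ignored. Hypothetical helper for illustration only.
func sameInstance(a, b string) bool {
	return strings.EqualFold(a, b)
}

func main() {
	// Made-up IDs that differ only in the case of the resource group name.
	fromNode := "azure:///subscriptions/sub-id/resourceGroups/MY-RG/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-0/virtualMachines/1"
	fromCloud := "azure:///subscriptions/sub-id/resourceGroups/my-rg/providers/Microsoft.Compute/virtualMachineScaleSets/vmss-0/virtualMachines/1"

	// Exact comparison sees two different instances.
	fmt.Println(fromNode == fromCloud) // false

	// Lowercasing both sides, or using strings.EqualFold, treats them as one.
	fmt.Println(strings.ToLower(fromNode) == strings.ToLower(fromCloud)) // true
	fmt.Println(sameInstance(fromNode, fromCloud))                       // true
}
```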

/assign @mwielgus @losipiuk

cc @andyzhangx @ritazh

@k8s-ci-robot added the "cncf-cla: yes" label (indicates the PR's author has signed the CNCF CLA) Feb 27, 2019
@k8s-ci-robot added the "size/L" label (denotes a PR that changes 100-499 lines, ignoring generated files) Feb 27, 2019
Contributor

@losipiuk losipiuk left a comment

Looks good modulo one comment.
Did you grep the code for usages of ProviderId?

@feiskyer
Member Author

Did you grep the code for usages of ProviderId?

Yep, the providerIDs obtained from the cloud provider are only used in clusterstate.

@@ -56,6 +56,9 @@ type CloudProvider interface {
     // GetResourceLimiter returns struct containing limits (max, min) for resources (cores, memory etc.).
     GetResourceLimiter() (*ResourceLimiter, error)
+
+    // GetInstanceID gets the instance ID for the specified node.
+    GetInstanceID(node *apiv1.Node) string
Contributor

Are we 100% sure that this is the right approach? Honestly, I have doubts that a (hopefully) temporary bugfix/workaround for minor issues in the NodeController/Azure node registration process should involve an API change for all cloud providers (including ones that are not in this repo). Do we see any other applications of this method other than "toLower" in Azure?

Member Author

I don't expect to change all the cloud providers, but since the providerID is handled in clusterstate, where the comparison is case-sensitive, the change is still required. (There are other ways, e.g. special-casing Azure only, but that's a little hacky.)

In other applications we also keep the Azure VM's resourceID as-is, but convert it to lower or upper case whenever it is used for comparison.
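
The idea of letting each provider decide how a node maps to an instance ID could look roughly like the sketch below. It is an illustration under stated assumptions, not the merged implementation: `Node` is a simplified stand-in for `apiv1.Node`, `InstanceIDGetter` contains only the new method, and `genericProvider` and `azureLikeProvider` are hypothetical types.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for apiv1.Node (assumption for this sketch).
type Node struct {
	Name       string
	ProviderID string
}

// InstanceIDGetter holds only the method this PR adds to CloudProvider.
type InstanceIDGetter interface {
	GetInstanceID(node *Node) string
}

// genericProvider is a hypothetical provider whose IDs are already
// consistent, so it returns the providerID unchanged.
type genericProvider struct{}

func (genericProvider) GetInstanceID(node *Node) string {
	return node.ProviderID
}

// azureLikeProvider is a hypothetical provider whose IDs may differ in
// letter case, so it normalizes them to lower case.
type azureLikeProvider struct{}

func (azureLikeProvider) GetInstanceID(node *Node) string {
	return strings.ToLower(node.ProviderID)
}

func main() {
	node := &Node{Name: "vmss-0_1", ProviderID: "azure:///subscriptions/sub-id/resourceGroups/MY-RG/vm/1"}

	providers := []InstanceIDGetter{genericProvider{}, azureLikeProvider{}}
	for _, p := range providers {
		fmt.Printf("%T -> %s\n", p, p.GetInstanceID(node))
	}
}
```

With this shape, the core logic calls one method and never needs to know which provider requires normalization.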

Contributor

I'd like to understand if there is a plan to fix it in Azure (ie. make ProviderID match the name of VM)? Basically is this a temporary workaround? Are we planning to remove it later on? If so I'd like a comment stating that.

Member Author

@MaciekPytel This is actually for fixing the issue in old releases, as the providerID can't be changed after the node is initialized.

For new releases, we should fix the providerID inconsistency in the cloud provider itself, probably by converting it to lower case rather than using the VM name.

Contributor

Are you talking about master or node versions? How long will we support these old releases? What should we do, let's say, half a year from now - remove this method?

Member Author

I mean cherry-picking to the old stable release branches (one per Kubernetes release). The workaround should stay in those branches and never be deleted. On the master branch, we may delete it after fixing the issue in the cloud provider (e.g. in v1.14).

Contributor

Ok, makes sense.

     registered := sets.NewString()
     for _, node := range allNodes {
-        registered.Insert(node.Spec.ProviderID)
+        registered.Insert(cloudProvider.GetInstanceID(node))
Contributor

This is comparable to the previous code. Now, instead of if (azure) providerId=tolower(providerId) we have cloudProvider.GetInstanceID(node)

Member Author

@feiskyer feiskyer Feb 28, 2019

yep, it's cleaner than the previous one.
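
The contrast discussed in this thread, a provider-specific branch in core logic versus one call through the interface, can be sketched as follows. This is an illustrative reduction under assumptions, not the actual clusterstate code: `Node`, `CloudProvider`, `lowercasingProvider`, and `notRegistered` are simplified or hypothetical, and a plain map stands in for `sets.String`.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for apiv1.Node (assumption for this sketch).
type Node struct{ ProviderID string }

// CloudProvider holds only the method relevant to this discussion.
type CloudProvider interface {
	GetInstanceID(node *Node) string
}

// lowercasingProvider is a hypothetical provider that normalizes IDs.
type lowercasingProvider struct{}

func (lowercasingProvider) GetInstanceID(node *Node) string {
	return strings.ToLower(node.ProviderID)
}

// notRegistered returns the cloud instance IDs that no node claims.
// cloudInstanceIDs are assumed to already be in the provider's canonical form.
func notRegistered(cp CloudProvider, nodes []*Node, cloudInstanceIDs []string) []string {
	registered := make(map[string]bool, len(nodes))
	for _, node := range nodes {
		// Core logic stays provider-agnostic: no if-azure branch here.
		registered[cp.GetInstanceID(node)] = true
	}
	var missing []string
	for _, id := range cloudInstanceIDs {
		if !registered[id] {
			missing = append(missing, id)
		}
	}
	return missing
}

func main() {
	nodes := []*Node{{ProviderID: "azure:///RG/vmss/0"}}
	cloud := []string{"azure:///rg/vmss/0", "azure:///rg/vmss/1"}
	fmt.Println(notRegistered(lowercasingProvider{}, nodes, cloud)) // [azure:///rg/vmss/1]
}
```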

@mwielgus
Contributor

@MaciekPytel @losipiuk @aleksandra-malinowska What do you think? I understand the problem and the need to solve it but I have mixed feelings about the approach - actually every approach would be ugly to some extent. Which is the least ugly to you?

@losipiuk
Contributor

@MaciekPytel Łukasz Osipiuk Aleksandra Malinowska What do you think? I understand the problem and the need to solve it but I have mixed feelings about the approach - actually every approach would be ugly to some extent. Which is the least ugly to you?

I like this PR more than an explicit if(azure) check in the core logic. That is mostly down to aesthetics, but still. (I agree the two solutions are similar.)
I am not concerned very much about API changes. As long as CPs live in this repository we can rollback the change easily if it is no longer needed.
As for CPs living elsewhere - we are not giving any guarantees about interface stability - and I think we should not give such.

That said, I will not insist on this approach if outvoted.

@feiskyer
Member Author

feiskyer commented Mar 1, 2019

@mwielgus @losipiuk Thanks for the review. If there is no better solution, could we get this in first?

@aleksandra-malinowska
Contributor

+1 from me. As was mentioned offline, it may also open the way for implementations that don't rely on ProviderId, which was brought up multiple times by users running into this not-so-obvious requirement.

@mwielgus
Contributor

mwielgus commented Mar 1, 2019

I still don't like it too much, but we can merge it if neither @MaciekPytel nor @losipiuk has strong negative feelings about it.

@MaciekPytel
Contributor

/lgtm
/hold
Feel free to cancel hold after replying to my comment.

@k8s-ci-robot added the "do-not-merge/hold" label (indicates that a PR should not merge because someone has issued a /hold command) and the "lgtm" label ("Looks good to me", indicates that a PR is ready to be merged) Mar 1, 2019
@MaciekPytel
Contributor

/hold cancel

@k8s-ci-robot removed the "do-not-merge/hold" label (indicates that a PR should not merge because someone has issued a /hold command) Mar 1, 2019
@mwielgus
Contributor

mwielgus commented Mar 1, 2019

/approve

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus

@k8s-ci-robot added the "approved" label (indicates a PR has been approved by an approver from all required OWNERS files) Mar 1, 2019
@mwielgus
Contributor

mwielgus commented Mar 1, 2019

/test all

@mwielgus mwielgus closed this Mar 1, 2019
@mwielgus mwielgus reopened this Mar 1, 2019
@mwielgus
Contributor

mwielgus commented Mar 1, 2019

Closing and reopening to reset stalled Travis.

@k8s-ci-robot k8s-ci-robot merged commit 0cd0c90 into kubernetes:master Mar 1, 2019
@feiskyer feiskyer deleted the get-instance-id branch March 2, 2019 04:47
k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019
k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019
k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019
k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
7 participants