Cluster Autoscaler: add GetInstanceID() for cloudprovider interface #1738

feiskyer · 2019-02-27T14:38:44Z

This PR adds a new GetInstanceID() method to the cloud provider interface, which is required to fix the VMSS providerID case-mismatch issues.

The providerIDs of VMSS nodes may differ in letter case depending on the user's PUT requests to the resources. This can cause the autoscaler to delete all newly provisioned nodes and, more seriously, to scale up and down again and again.

This PR adds a workaround for this issue, so that the IDs are handled in lower case within CA.

/assign @mwielgus @losipiuk

cc @andyzhangx @ritazh

losipiuk

Looks good modulo one comment.
Did you grep the code for usages of ProviderId?

cluster-autoscaler/clusterstate/clusterstate.go

feiskyer · 2019-02-27T14:59:24Z

> Did you grep the code for usages of ProviderId?

Yep, the providerIDs obtained from the cloud provider are only used in clusterstate.

cluster-autoscaler/cloudprovider/kubemark/kubemark_other.go

mwielgus · 2019-02-27T20:33:33Z

cluster-autoscaler/cloudprovider/cloud_provider.go

@@ -56,6 +56,9 @@ type CloudProvider interface {
 	// GetResourceLimiter returns struct containing limits (max, min) for resources (cores, memory etc.).
 	GetResourceLimiter() (*ResourceLimiter, error)

+	// GetInstanceID gets the instance ID for the specified node.
+	GetInstanceID(node *apiv1.Node) string


Are we 100% sure that this is the right approach? Honestly, I doubt that a (hopefully) temporary bugfix/workaround for minor issues in the NodeController/Azure node registration process should involve an API change for all cloud providers (including ones that are not in this repo). Do we see any applications of this method other than "toLower" in Azure?

feiskyer · 2019-02-28

I don't expect to change all the cloud providers, but since the providerID is handled in clusterstate and it is case sensitive, the changes are still required. (There are also other ways, e.g. doing this only for Azure, but that's a little hacky.)

In other applications, we also keep the resourceID of the Azure VM as-is, but when the resourceID is used for comparison, it is converted to lower or upper case.

MaciekPytel · 2019-03-01

I'd like to understand if there is a plan to fix it in Azure (i.e. make ProviderID match the name of the VM). Basically, is this a temporary workaround? Are we planning to remove it later on? If so, I'd like a comment stating that.

feiskyer · 2019-03-01

@MaciekPytel This is actually for fixing the issue in old releases, as the providerID can't be changed after a node is initialized.

For new releases, we should fix the providerID inconsistency in the cloud provider itself, probably by casting it to lower case instead of using the VM name.

mwielgus · 2019-03-01

Are you talking about master or node versions? How long will we support these old releases? What should we do, let's say, half a year from now: remove this method?

feiskyer · 2019-03-01

I mean cherry-picking to old stable releases (one for each Kubernetes release). Those workarounds should stay there and never get deleted. But on the master branch, we may delete it after fixing the issue in the cloud provider (e.g. in v1.14).

MaciekPytel · 2019-03-01

Ok, makes sense.
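
As an illustration of the shape discussed in this thread, here is a minimal sketch of the new method with two provider flavors. Everything other than the GetInstanceID name and its role is an assumption: Node is a simplified stand-in for apiv1.Node, the one-method interface stands in for the much larger CloudProvider interface, and both provider types are hypothetical. In particular, returning the providerID unchanged for non-Azure providers is an assumption, since the thread does not show those implementations.

```go
package main

import (
	"fmt"
	"strings"
)

// Node is a simplified stand-in for apiv1.Node, keeping only the field
// relevant here (assumption for self-containment).
type Node struct {
	ProviderID string
}

// instanceIDGetter mirrors only the method added in this PR; the real
// CloudProvider interface has many more methods.
type instanceIDGetter interface {
	GetInstanceID(node *Node) string
}

// passthroughProvider (hypothetical) returns the providerID unchanged,
// which is assumed here to be what non-Azure providers would do.
type passthroughProvider struct{}

func (passthroughProvider) GetInstanceID(node *Node) string {
	return node.ProviderID
}

// lowercasingProvider (hypothetical) follows the Azure workaround described
// in the PR: the ID is lowercased so comparisons become case-insensitive.
type lowercasingProvider struct{}

func (lowercasingProvider) GetInstanceID(node *Node) string {
	return strings.ToLower(node.ProviderID)
}

func main() {
	node := &Node{ProviderID: "azure:///Sub/VM0"}
	for _, p := range []instanceIDGetter{passthroughProvider{}, lowercasingProvider{}} {
		fmt.Println(p.GetInstanceID(node))
	}
}
```

With this shape, core logic can call GetInstanceID(node) everywhere and stay free of provider-specific branches, which is the trade-off weighed in the comments above.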

mwielgus · 2019-02-27T20:34:59Z

cluster-autoscaler/clusterstate/clusterstate.go

 	registered := sets.NewString()
 	for _, node := range allNodes {
-		registered.Insert(node.Spec.ProviderID)
+		registered.Insert(cloudProvider.GetInstanceID(node))


This is comparable to the previous code. Now, instead of if (azure) providerId=tolower(providerId), we have cloudProvider.GetInstanceID(node).

feiskyer · 2019-02-28

Yep, it's cleaner than the previous one.

mwielgus · 2019-02-28T14:20:17Z

@MaciekPytel @losipiuk @aleksandra-malinowska What do you think? I understand the problem and the need to solve it but I have mixed feelings about the approach - actually every approach would be ugly to some extent. Which is the least ugly to you?

losipiuk · 2019-02-28T21:04:44Z

> @MaciekPytel @losipiuk @aleksandra-malinowska What do you think? I understand the problem and the need to solve it but I have mixed feelings about the approach - actually every approach would be ugly to some extent. Which is the least ugly to you?

I like this PR more than an explicit if(azure) call in the core logic. That is mostly a matter of aesthetics, but still. (I agree the two solutions are similar.)
I am not very concerned about API changes. As long as CPs live in this repository, we can roll back the change easily if it is no longer needed.
As for CPs living elsewhere, we are not giving any guarantees about interface stability, and I think we should not give any.

That said, I will not insist on this approach if outvoted.

feiskyer · 2019-03-01T02:01:55Z

@mwielgus @losipiuk Thanks for the review. If we don't have a better solution, could we get this one merged first?

aleksandra-malinowska · 2019-03-01T10:11:46Z

+1 from me. As was mentioned offline, it may also open the way for implementations that don't rely on ProviderId, which was brought up multiple times by users running into this not-so-obvious requirement.

mwielgus · 2019-03-01T10:44:18Z

I still don't like it too much, but we can merge it if neither @MaciekPytel nor @losipiuk has strong negative feelings about it.

MaciekPytel · 2019-03-01T11:00:17Z

/lgtm
/hold
Feel free to cancel hold after replying to my comment.

MaciekPytel · 2019-03-01T13:35:57Z

/hold cancel

mwielgus · 2019-03-01T14:00:40Z

/approve

k8s-ci-robot · 2019-03-01T14:01:03Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: mwielgus

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~cluster-autoscaler/OWNERS~~ [mwielgus]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mwielgus · 2019-03-01T16:50:19Z

/test all

mwielgus · 2019-03-01T17:06:37Z

Closing and reopening to reset stalled Travis.

k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Feb 27, 2019

k8s-ci-robot requested review from aleksandra-malinowska and piosz February 27, 2019 14:38

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Feb 27, 2019

losipiuk reviewed Feb 27, 2019

feiskyer added 4 commits February 27, 2019 22:51

Add GetInstanceID interface for cloudprovider

a9758b2

Implement GetInstanceID for Azure and make instanceID to lower cases

2758133

Implement GetInstanceID for other cloud providers

2e2aab6

Use cloudProvider.GetInstanceID() to get unregistered nodes

f4ef957

feiskyer force-pushed the get-instance-id branch from b0f8c98 to f4ef957 Compare February 27, 2019 14:58

mwielgus suggested changes Feb 27, 2019

mwielgus approved these changes Mar 1, 2019

k8s-ci-robot assigned MaciekPytel Mar 1, 2019

k8s-ci-robot added do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. lgtm "Looks good to me", indicates that a PR is ready to be merged. labels Mar 1, 2019

k8s-ci-robot removed the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Mar 1, 2019

k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 1, 2019

mwielgus closed this Mar 1, 2019

mwielgus reopened this Mar 1, 2019

k8s-ci-robot merged commit 0cd0c90 into kubernetes:master Mar 1, 2019

feiskyer deleted the get-instance-id branch March 2, 2019 04:47

This was referenced Mar 6, 2019

Cluster Autoscaler 1.13 : cherry pick of #1738 #1755

Merged

Cluster Autoscaler 1.12 : cherry pick of #1738 #1756

Merged

Cluster Autoscaler 1.2 : cherry pick of #1738 #1757

Merged

Cluster Autoscaler 1.3 : cherry pick of #1738 #1758

Merged

k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019

Merge pull request #1757 from feiskyer/cluster-autoscaler-release-1.2

796710d

Cluster Autoscaler 1.2 : cherry pick of #1738

k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019

Merge pull request #1756 from feiskyer/cluster-autoscaler-release-1.12

841c8a1

Cluster Autoscaler 1.12 : cherry pick of #1738

k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019

Merge pull request #1758 from feiskyer/cluster-autoscaler-release-1.3

f9eb5a4

Cluster Autoscaler 1.3 : cherry pick of #1738

k8s-ci-robot added a commit that referenced this pull request Mar 6, 2019

Merge pull request #1755 from feiskyer/cluster-autoscaler-release-1.13

d516464

Cluster Autoscaler 1.13 : cherry pick of #1738

feiskyer mentioned this pull request Mar 8, 2019

Cluster Autoscaler: Cleanup GetInstanceID() interface #1769

Merged
