
[Azure AKS] 403s received from Azure when listing VMSS instances #1674

Closed

danmassie opened this issue Feb 12, 2019 · 17 comments

Labels: area/cluster-autoscaler, area/provider/azure (Issues or PRs related to azure provider)

@danmassie

aks-engine 0.27.0
kubernetes 1.13.1
cluster autoscaler v1.13.1

The cluster autoscaler periodically receives 403s from Azure, which causes the pod to restart; it then resumes normal behaviour until it receives a further 403.

E0208 08:54:43.057787 1 azure_scale_set.go:199] VirtualMachineScaleSetVMsClient.List failed for k8s-agentpool1-39472669-vmss: compute.VirtualMachineScaleSetVMsClient#List: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client '6053ad54-5340-4d95-8842-9f2ac12c4566' with object id '6053ad54-5340-4d95-8842-9f2ac12c4566' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read' over scope '/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/k8s/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool1-39472669-vmss'."
F0208 08:54:43.057854 1 azure_cloud_provider.go:139] Failed to create Azure Manager: compute.VirtualMachineScaleSetVMsClient#List: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client 'xxx-xxx-xxx-xxx' with object id 'xxx-xxx-xxx-xxx' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read' over scope '/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/k8s/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool1-39472669-vmss'."

@danmassie danmassie changed the title [Azure AKS] [Azure AKS] 403s received from Azure when listing VMSS instances Feb 12, 2019
@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider labels Feb 12, 2019
@mwielgus mwielgus assigned Jeffwan and feiskyer and unassigned Jeffwan Feb 17, 2019
@feiskyer
Member

From the logs, the client '6053ad54-5340-4d95-8842-9f2ac12c4566' hasn't been authorized to call the VMSS APIs. Could you configure AAD permissions for it?
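
For anyone hitting the same error, one way to check what that client can currently do on the scale set is to list its role assignments with the Azure CLI. This is only a sketch: the subscription ID is a placeholder, and the resource names are simply copied from the log above.

az role assignment list \
  --assignee 6053ad54-5340-4d95-8842-9f2ac12c4566 \
  --include-inherited \
  --scope "/subscriptions/<subscription-id>/resourceGroups/k8s/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool1-39472669-vmss"

Any role whose actions cover Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read (Reader, Contributor, or a custom role) should clear the 403 in the original report.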

@danmassie
Author

@feiskyer yes, that's the point. The client has permission and the autoscaler works, then it loses permission (perhaps during token refresh?) and the autoscaler restarts. This occurs very frequently.

@feiskyer
Member

> @feiskyer yes, that's the point. The client has permission and the autoscaler works, then it loses permission (perhaps during token refresh?) and the autoscaler restarts. This occurs very frequently.

@danmassie What period have you observed between occurrences? The token refresh is done automatically whenever the token expires, so I'm wondering whether there are other potential issues.

@oronboni

I have the same issue. AKS-Engine 0.35.1, Kubernetes version 1.13.5.

@feiskyer
Member

@oronboni Could you share the logs of CA? Which version of CA are you using?

@oronboni

oronboni commented May 20, 2019

@feiskyer additional data:

Autoscaler version 1.13.2.
When the cluster is created, the autoscaler pod fails to run with the following error:

Failed to get nodes from apiserver: Get https://10.0.0.1:443/api/v1/nodes: dial tcp 10.0.0.1:443: i/o timeout.

After deleting the pod I get the following error:
1 azure_cloud_provider.go:145] Failed to create Azure Manager: compute.VirtualMachineScaleSetVMsClient#List: Failure responding to request:
StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client 'xxx'
with object id 'xxx' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read'
over scope '/subscriptions/xxx/resourceGroups/RG/providers/Microsoft.Compute/virtualMachineScaleSets/yyyy-vmss'."

When I gave the user identity permissions on the VMSS, the error stopped, but the issue seems to remain.

Pods stay in Pending status with the event:
cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached

When I change the number of nodes manually, the pods start.

If additional logs are required for the investigation, please mention which pods or log files you need.

@feiskyer
Member

@oronboni thanks for the information; so this issue is actually different from Dan's. An authorized identity is required for CA to operate the VMSS.

Regarding "cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached", have you increased the max node count for the VMSS node group?
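
As a general pointer (not specific to this cluster): the configured min/max for a node group can be read from the cluster-autoscaler deployment itself. A minimal sketch, assuming the addon runs as a Deployment named cluster-autoscaler in kube-system (names can differ per cluster):

kubectl -n kube-system get deployment cluster-autoscaler -o yaml | grep -- --nodes
# For the Azure provider the argument has the form --nodes=<min>:<max>:<vmss name>,
# e.g. --nodes=1:50:k8s-agentpool1-vmss; scale-up stops once <max> nodes exist in that group.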

@oronboni

oronboni commented May 20, 2019

@feiskyer my max nodes is configured to 50 in AKS-Engine, but the current node count is 5.


Log from the autoscaler:
W0520 07:30:16.949253 1 scale_up.go:322] Node group XXX-vmss is not ready for scaleup - backoff
I0520 07:30:16.949265 1 scale_up.go:411] No expansion options

Also, I think Dan's problem is the same as mine, because I had the same error before I manually gave the user identity permissions; with the additional permissions this error disappeared.

@feiskyer
Member

@oronboni So is your cluster being scaled up by CA now? You can run kubectl -n kube-system describe configmaps cluster-autoscaler-status to verify the scaling status for each node group. You could also try deleting the CA pod and checking again.

> Also, I think Dan's problem is the same as mine, because I had the same error before I manually gave the user identity permissions; with the additional permissions this error disappeared.

Dan has claimed "The client has permission" at #1674 (comment), so it's still not clear why this is happening so frequently in his case.

@oronboni

oronboni commented May 20, 2019

@feiskyer thank you for your quick reply.

I gave the following permissions, but I still get the error below:

  • agentSubnet owner
  • VMSS owner
  • clusterVnet owner

Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="LinkedAuthorizationFailed"
Message="The client 'xxx' with object id 'xxx' has permission to perform action
'Microsoft.Compute/virtualMachineScaleSets/write' on scope '/subscriptions/yyy/resourceGroups/RG/providers/Microsoft.Compute/virtualMachineScaleSets/xxx-vmss';
however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s)
'/subscriptions/xxx/resourceGroups/RG/providers/Microsoft.Network/virtualNetworks/clusterVnet/subnets/agentSubnet'."
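
For anyone else hitting this LinkedAuthorizationFailed: the missing action is on the virtual network, so the identity also needs a role on that linked scope. A minimal Azure CLI sketch, assuming the built-in Network Contributor role is acceptable (the IDs are placeholders):

az role assignment create \
  --assignee <client-id> \
  --role "Network Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/RG/providers/Microsoft.Network/virtualNetworks/clusterVnet"

Network Contributor includes Microsoft.Network/virtualNetworks/subnets/join/action, the action named in the error.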

@feiskyer
Member

@oronboni I'm surprised the Owner role is still not authorized. Could you open a support ticket on the Azure portal?

@oronboni

oronboni commented May 20, 2019

I checked permissions in older clusters that were created with AKS-Engine 0.31.1.
There, the user identity is added to the resource group as Contributor during the cluster creation process.
In AKS-Engine 0.35.1 the user identity is not added as Contributor.

I removed the Owner role from all the individual resources, added Contributor on the resource group, and the scale set added 3 instances.
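
For reference, the equivalent Azure CLI step for the fix described here would be a single Contributor assignment at the resource-group scope; a sketch with placeholder IDs (this mirrors what worked above rather than official aks-engine guidance):

az role assignment create \
  --assignee <client-id> \
  --role Contributor \
  --scope "/subscriptions/<subscription-id>/resourceGroups/RG"

Because role assignments are inherited, this single assignment covers the VMSS, the vnet, and the subnets in that resource group.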

Second problem (seems to be a K8S issue): more instances are required (pods are in Pending status), but I get:
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached

The autoscaler log:
I0520 09:05:38.604416 1 utils.go:552] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0520 09:05:38.605961 1 scale_up.go:263] Pod yyy is unschedulable
I0520 09:05:38.605981 1 scale_up.go:263] Pod xxx is unschedulable
I0520 09:05:38.605997 1 scale_up.go:263] Pod zzz is unschedulable
I0520 09:05:38.606096 1 scale_up.go:411] No expansion options

Found the issue:
[screenshots of the cluster-autoscaler addon configuration]

@feiskyer
Member

@oronboni Does CA work now? The addon config above looks good to me.

@oronboni

@feiskyer thank you for your assistance. Yes, after the change CA works (the above configuration worked in the past, but by checking the CA yaml I saw that the nodes configuration was: - --nodes=1:5).

From my side everything works now, but there are two issues I think you should check in AKS-Engine:
Issue 1: in the AKS-Engine 0.35.1 deployment, the user identity is not added to the resource group as Contributor the way it was in version 0.31.1, which is why we saw the permission error.

Issue 2: with the configuration above, CA defined the node limits as 1:5 (it took the default values and ignored the JSON configuration file); changing to max-nodes/min-nodes solved the issue (the old configuration worked in the past).
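
To make Issue 2 concrete, a minimal sketch of where the min-nodes/max-nodes keys mentioned above sit in the aks-engine apimodel's cluster-autoscaler addon block; the surrounding apimodel fields are omitted and the values are placeholders, so treat it as illustrative rather than the exact config used here:

"addons": [
  {
    "name": "cluster-autoscaler",
    "enabled": true,
    "config": {
      "min-nodes": "1",
      "max-nodes": "50"
    }
  }
]

If these keys are not picked up, the addon falls back to its defaults, which matches the --nodes=1:5 value observed in the generated CA yaml.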

@feiskyer
Member

@oronboni Glad to see it works now, and thanks for providing the details. I think those two issues should be fixed in aks-engine; I'll get them involved.

@feiskyer
Member

/close

@k8s-ci-robot
Contributor

@feiskyer: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
