
[Azure AKS] 403s received from Azure when listing VMSS instances #1674

Closed

danmassie opened this issue Feb 12, 2019 · 17 comments

Labels: area/cluster-autoscaler, area/provider/azure (Issues or PRs related to azure provider)

@danmassie

aks-engine 0.27.0
kubernetes 1.13.1
cluster autoscaler v1.13.1

The cluster autoscaler periodically receives 403s from Azure, which causes the pod to restart; it then resumes normal behaviour until it receives a further 403.

E0208 08:54:43.057787 1 azure_scale_set.go:199] VirtualMachineScaleSetVMsClient.List failed for k8s-agentpool1-39472669-vmss: compute.VirtualMachineScaleSetVMsClient#List: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client '6053ad54-5340-4d95-8842-9f2ac12c4566' with object id '6053ad54-5340-4d95-8842-9f2ac12c4566' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read' over scope '/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/k8s/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool1-39472669-vmss'."
F0208 08:54:43.057854 1 azure_cloud_provider.go:139] Failed to create Azure Manager: compute.VirtualMachineScaleSetVMsClient#List: Failure responding to request: StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client 'xxx-xxx-xxx-xxx' with object id 'xxx-xxx-xxx-xxx' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read' over scope '/subscriptions/xxx-xxx-xxx-xxx/resourceGroups/k8s/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool1-39472669-vmss'."

@danmassie danmassie changed the title [Azure AKS] [Azure AKS] 403s received from Azure when listing VMSS instances Feb 12, 2019
@aleksandra-malinowska aleksandra-malinowska added area/cluster-autoscaler area/provider/azure Issues or PRs related to azure provider labels Feb 12, 2019
@mwielgus mwielgus assigned Jeffwan and feiskyer and unassigned Jeffwan Feb 17, 2019
@feiskyer
Member

From the logs, the client '6053ad54-5340-4d95-8842-9f2ac12c4566' hasn't been authorized to call the VMSS APIs. Could you configure AAD permissions for it?
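
For anyone hitting the same error, one way to check what that client can currently do on the scale set is to list its role assignments with the Azure CLI. This is only a sketch: the subscription ID is a placeholder, and the resource names are simply copied from the log above.

az role assignment list \
  --assignee 6053ad54-5340-4d95-8842-9f2ac12c4566 \
  --include-inherited \
  --scope "/subscriptions/<subscription-id>/resourceGroups/k8s/providers/Microsoft.Compute/virtualMachineScaleSets/k8s-agentpool1-39472669-vmss"

Any role whose actions cover Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read (Reader, Contributor, or a custom role) should clear the 403 in the original report.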

@danmassie
Author

@feiskyer yes, that's the point. The client has permission and the autoscaler works, then it loses permission (perhaps during token refresh?) and the autoscaler restarts. This occurs very frequently.

@feiskyer
Member

> @feiskyer yes, that's the point. The client has permission and the autoscaler works, then it loses permission (perhaps during token refresh?) and the autoscaler restarts. This occurs very frequently.

@danmassie What period have you observed between occurrences? The token refresh is done automatically whenever the token expires, so I'm wondering whether there are other potential issues.

@oronboni

I have the same issue. AKS-Engine 0.35.1, Kubernetes version 1.13.5.

@feiskyer
Member

@oronboni Could you share the logs of CA? Which version of CA are you using?

@oronboni

oronboni commented May 20, 2019

@feiskyer additional data:

Autoscaler version 1.13.2.
When the cluster is created, the autoscaler pod fails to run with the following error:

Failed to get nodes from apiserver: Get https://10.0.0.1:443/api/v1/nodes: dial tcp 10.0.0.1:443: i/o timeout.

After deleting the pod I get the following error:
1 azure_cloud_provider.go:145] Failed to create Azure Manager: compute.VirtualMachineScaleSetVMsClient#List: Failure responding to request:
StatusCode=403 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="AuthorizationFailed" Message="The client 'xxx'
with object id 'xxx' does not have authorization to perform action 'Microsoft.Compute/virtualMachineScaleSets/virtualMachines/read'
over scope '/subscriptions/xxx/resourceGroups/RG/providers/Microsoft.Compute/virtualMachineScaleSets/yyyy-vmss'."

When I gave the user identity permissions on the VMSS, the error stopped, but the issue seems to remain.

Pods stay in Pending status with the event:
cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached

When I change the number of nodes manually, the pods start.

If additional logs are required for the investigation, please mention which pods or log files you need.

@feiskyer
Member

@oronboni thanks for the information; so this issue is actually different from Dan's. An authorized identity is required for CA to operate the VMSS.

Regarding "cluster-autoscaler pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached", have you increased the max node count for the VMSS node group?
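
As a general pointer (not specific to this cluster): the configured min/max for a node group can be read from the cluster-autoscaler deployment itself. A minimal sketch, assuming the addon runs as a Deployment named cluster-autoscaler in kube-system (names can differ per cluster):

kubectl -n kube-system get deployment cluster-autoscaler -o yaml | grep -- --nodes
# For the Azure provider the argument has the form --nodes=<min>:<max>:<vmss name>,
# e.g. --nodes=1:50:k8s-agentpool1-vmss; scale-up stops once <max> nodes exist in that group.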

@oronboni

oronboni commented May 20, 2019

@feiskyer my max nodes is configured to 50 in AKS-Engine, but the current node count is 5.


Log from the autoscaler:
W0520 07:30:16.949253 1 scale_up.go:322] Node group XXX-vmss is not ready for scaleup - backoff
I0520 07:30:16.949265 1 scale_up.go:411] No expansion options

Also, I think Dan's problem is the same as mine, because I had the same error before I manually gave the user identity permissions; with the additional permissions this error disappeared.

@feiskyer
Member

@oronboni So is your cluster being scaled up by CA now? You can run kubectl -n kube-system describe configmaps cluster-autoscaler-status to verify the scaling status for each node group. You could also try deleting the CA pod and checking again.

> Also, I think Dan's problem is the same as mine, because I had the same error before I manually gave the user identity permissions; with the additional permissions this error disappeared.

Dan has claimed "The client has permission" at #1674 (comment), so it's still not clear why this is happening so frequently in his case.

@oronboni

oronboni commented May 20, 2019

@feiskyer thank you for your quick reply.

I gave the following permissions, but I still get the error below:

  • agentSubnet owner
  • VMSS owner
  • clusterVnet owner

Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=403 Code="LinkedAuthorizationFailed"
Message="The client 'xxx' with object id 'xxx' has permission to perform action
'Microsoft.Compute/virtualMachineScaleSets/write' on scope '/subscriptions/yyy/resourceGroups/RG/providers/Microsoft.Compute/virtualMachineScaleSets/xxx-vmss';
however, it does not have permission to perform action 'Microsoft.Network/virtualNetworks/subnets/join/action' on the linked scope(s)
'/subscriptions/xxx/resourceGroups/RG/providers/Microsoft.Network/virtualNetworks/clusterVnet/subnets/agentSubnet'."
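
For anyone else hitting this LinkedAuthorizationFailed: the missing action is on the virtual network, so the identity also needs a role on that linked scope. A minimal Azure CLI sketch, assuming the built-in Network Contributor role is acceptable (the IDs are placeholders):

az role assignment create \
  --assignee <client-id> \
  --role "Network Contributor" \
  --scope "/subscriptions/<subscription-id>/resourceGroups/RG/providers/Microsoft.Network/virtualNetworks/clusterVnet"

Network Contributor includes Microsoft.Network/virtualNetworks/subnets/join/action, the action named in the error.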

@feiskyer
Member

@oronboni I'm surprised the Owner role is still not authorized. Could you open a support ticket on the Azure portal?

@oronboni

oronboni commented May 20, 2019

I checked permissions in older clusters that were created with AKS-Engine 0.31.1.
There, the user identity is added to the resource group as Contributor during the cluster creation process.
In AKS-Engine 0.35.1 the user identity is not added as Contributor.

I removed the Owner role from all the individual resources, added Contributor on the resource group, and the scale set added 3 instances.
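
For reference, the equivalent Azure CLI step for the fix described here would be a single Contributor assignment at the resource-group scope; a sketch with placeholder IDs (this mirrors what worked above rather than official aks-engine guidance):

az role assignment create \
  --assignee <client-id> \
  --role Contributor \
  --scope "/subscriptions/<subscription-id>/resourceGroups/RG"

Because role assignments are inherited, this single assignment covers the VMSS, the vnet, and the subnets in that resource group.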

Second problem (seems to be a K8S issue): more instances are required (pods are in Pending status), but I get:
pod didn't trigger scale-up (it wouldn't fit if a new node is added): 1 max limit reached

The autoscaler log:
I0520 09:05:38.604416 1 utils.go:552] No pod using affinity / antiaffinity found in cluster, disabling affinity predicate for this loop
I0520 09:05:38.605961 1 scale_up.go:263] Pod yyy is unschedulable
I0520 09:05:38.605981 1 scale_up.go:263] Pod xxx is unschedulable
I0520 09:05:38.605997 1 scale_up.go:263] Pod zzz is unschedulable
I0520 09:05:38.606096 1 scale_up.go:411] No expansion options

Found the issue:
[screenshots of the cluster-autoscaler addon configuration]

@feiskyer
Member

@oronboni Does CA work now? The addon config above looks good to me.

@oronboni

@feiskyer thank you for your assistance. Yes, after the change CA works (the above configuration worked in the past, but by checking the CA yaml I saw that the nodes configuration was: - --nodes=1:5).

From my side everything works now, but there are two issues I think you should check in AKS-Engine:
Issue 1: in the AKS-Engine 0.35.1 deployment, the user identity is not added to the resource group as Contributor the way it was in version 0.31.1, which is why we saw the permission error.

Issue 2: with the configuration above, CA defined the node limits as 1:5 (it took the default values and ignored the JSON configuration file); changing to max-nodes/min-nodes solved the issue (the old configuration worked in the past).
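
To make Issue 2 concrete, a minimal sketch of where the min-nodes/max-nodes keys mentioned above sit in the aks-engine apimodel's cluster-autoscaler addon block; the surrounding apimodel fields are omitted and the values are placeholders, so treat it as illustrative rather than the exact config used here:

"addons": [
  {
    "name": "cluster-autoscaler",
    "enabled": true,
    "config": {
      "min-nodes": "1",
      "max-nodes": "50"
    }
  }
]

If these keys are not picked up, the addon falls back to its defaults, which matches the --nodes=1:5 value observed in the generated CA yaml.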

@feiskyer
Member

@oronboni Glad to see it works now, and thanks for providing the details. I think those two issues should be fixed in aks-engine; I'll get them involved.

@feiskyer
Member

/close

@k8s-ci-robot
Contributor

@feiskyer: Closing this issue.

In response to this:

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
