
Not able to deploy K8s Cluster with ACS-Engine 14.6 version on azure #2591

Closed
rakeshkulkarni6 opened this issue Apr 4, 2018 · 19 comments

@rakeshkulkarni6

Is this a request for help?:


Is this an ISSUE or FEATURE REQUEST? (choose one):
ISSUE

What version of acs-engine?:

acs-engine v0.14.6

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
kubernetes 1.9.0

What happened:
I have deployed a K8s cluster using acs-engine v0.14.6. After the cluster deploys, I am not able to see any nodes listed. I have checked the Docker images and containers; no containers have been created.
Docker Images:
root@k8s-master-63864159-0:/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/status# docker images
REPOSITORY                                TAG      IMAGE ID       CREATED        SIZE
k8s-gcrio.azureedge.net/hyperkube-amd64   v1.9.0   0e4e0ed658bb   3 months ago   618 MB
Docker PS:
root@k8s-master-63864159-0:/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/status# docker ps
CONTAINER ID IMAGE COMMAND CREATED STATUS PORTS NAMES
root@k8s-master-63864159-0:/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/status#
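
With no containers running at all, it is worth checking whether kubelet ever started on the node. A minimal check, assuming the standard acs-engine layout where kubelet runs as a systemd unit (the unit name here is an assumption):

# Check whether the kubelet unit is active and why it may have failed
sudo systemctl status kubelet
# Tail the most recent kubelet logs for startup errors
sudo journalctl -u kubelet --no-pager | tail -n 50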

What you expected to happen:
The Kubernetes cluster should get deployed with Kubernetes v1.9.0 using acs-engine v0.14.6.

How to reproduce it (as minimally and precisely as possible):
acs-engine generate
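
For completeness, the end-to-end flow being reproduced looks roughly like this (resource group, location, and apimodel file name are placeholders, not the exact values used; the az CLI is shown for illustration, the PowerShell equivalent being New-AzureRmResourceGroupDeployment):

acs-engine generate kubernetes.json
az group create --name <resource-group> --location <location>
az group deployment create \
    --resource-group <resource-group> \
    --template-file _output/<dnsPrefix>/azuredeploy.json \
    --parameters _output/<dnsPrefix>/azuredeploy.parameters.json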
Anything else we need to know:
Extension Status:
root@k8s-master-63864159-0:/var/lib/waagent/Microsoft.Azure.Extensions.CustomScript-2.0.6/status# cat 0.status
[
  {
    "version": 1,
    "timestampUTC": "2018-04-04T15:15:45Z",
    "status": {
      "operation": "Enable",
      "status": "error",
      "formattedMessage": {
        "lang": "en",
        "message": "Enable failed: failed to execute command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n"
      }
    }
  }
]
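
The extension status above only records the exit code; the provisioning script's own output ends up in the provisioning log on the node. Assuming the standard acs-engine layout, the last lines of that log usually show which step returned exit status 3:

# The custom-script output for the cluster bring-up
sudo tail -n 50 /var/log/azure/cluster-provision.log
# Other CSE handler logs live under the same directory
sudo ls -l /var/log/azure/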

I have tried many versions, from acs-engine v0.12.0 to v0.14.6, and none of them deploys a Kubernetes cluster.

@rakeshkulkarni6 rakeshkulkarni6 changed the title Not able K8s Cluster with ACS-Engine 14.6 version Not able to deploy K8s Cluster with ACS-Engine 14.6 version on azure Apr 4, 2018
@CecileRobertMichon
Contributor

Hi @rakeshkulkarni6 , what does your apimodel look like?

@rakeshkulkarni6
Author

rakeshkulkarni6 commented Apr 5, 2018

Hi, here is the apimodel for your reference:

{
"apiVersion": "vlabs",
"properties": {
"orchestratorProfile": {
"orchestratorType": "Kubernetes",
"orchestratorRelease": "1.9",
"orchestratorVersion": "1.9.6",
"kubernetesConfig": {
"kubernetesImageBase": "k8s-gcrio.azureedge.net/",
"clusterSubnet": "10.246.0.0/16",
"dnsServiceIP": "10.0.0.10",
"serviceCidr": "10.0.0.0/16",
"networkPolicy": "azure",
"maxPods": 30,
"dockerBridgeSubnet": "172.17.0.1/16",
"useInstanceMetadata": true,
"enableRbac": true,
"enableSecureKubelet": true,
"privateCluster": {
"enabled": false
},
"gchighthreshold": 85,
"gclowthreshold": 80,
"etcdVersion": "3.2.16",
"etcdDiskSizeGB": "130",
"addons": [
{
"name": "tiller",
"enabled": true,
"containers": [
{
"name": "tiller",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"max-history": "0"
}
},
{
"name": "aci-connector",
"enabled": false,
"containers": [
{
"name": "aci-connector",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"nodeName": "aci-connector",
"os": "Linux",
"region": "westus",
"taint": "azure.com/aci"
}
},
{
"name": "kubernetes-dashboard",
"enabled": true,
"containers": [
{
"name": "kubernetes-dashboard",
"cpuRequests": "300m",
"memoryRequests": "150Mi",
"cpuLimits": "300m",
"memoryLimits": "150Mi"
}
]
},
{
"name": "rescheduler",
"enabled": false,
"containers": [
{
"name": "rescheduler",
"cpuRequests": "10m",
"memoryRequests": "100Mi",
"cpuLimits": "10m",
"memoryLimits": "100Mi"
}
]
},
{
"name": "metrics-server",
"enabled": true,
"containers": [
{
"name": "metrics-server"
}
]
}
],
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "110",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.246.0.0/16",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
},
"controllerManagerConfig": {
"--allocate-node-cidrs": "false",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.246.0.0/16",
"--cluster-name": "somadeleteacspoc",
"--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
"--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
"--feature-gates": "ServiceNodeExclusion=true",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--node-monitor-grace-period": "40s",
"--pod-eviction-timeout": "5m0s",
"--profiling": "false",
"--root-ca-file": "/etc/kubernetes/certs/ca.crt",
"--route-reconciliation-period": "10s",
"--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--terminated-pod-gc-threshold": "5000",
"--use-service-account-credentials": "true",
"--v": "2"
},
"cloudControllerManagerConfig": {
"--allocate-node-cidrs": "false",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.246.0.0/16",
"--cluster-name": "somadeleteacspoc",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--route-reconciliation-period": "10s",
"--v": "2"
},
"apiServerConfig": {
"--admission-control": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,ResourceQuota,DenyEscalatingExec,AlwaysPullImages",
"--advertise-address": "",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--audit-log-maxage": "30",
"--audit-log-maxbackup": "10",
"--audit-log-maxsize": "100",
"--audit-log-path": "/var/log/audit.log",
"--audit-policy-file": "/etc/kubernetes/manifests/audit-policy.yaml",
"--authorization-mode": "Node,RBAC",
"--bind-address": "0.0.0.0",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
"--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
"--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
"--etcd-quorum-read": "true",
"--etcd-servers": "https://127.0.0.1:2379",
"--insecure-port": "8080",
"--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
"--kubelet-client-key": "/etc/kubernetes/certs/client.key",
"--profiling": "false",
"--proxy-client-cert-file": "/etc/kubernetes/certs/proxy.crt",
"--proxy-client-key-file": "/etc/kubernetes/certs/proxy.key",
"--repair-malformed-updates": "false",
"--requestheader-allowed-names": "",
"--requestheader-client-ca-file": "/etc/kubernetes/certs/proxy-ca.crt",
"--requestheader-extra-headers-prefix": "X-Remote-Extra-",
"--requestheader-group-headers": "X-Remote-Group",
"--requestheader-username-headers": "X-Remote-User",
"--secure-port": "443",
"--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
"--service-account-lookup": "true",
"--service-cluster-ip-range": "10.0.0.0/16",
"--storage-backend": "etcd3",
"--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
"--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--v": "4"
},
"schedulerConfig": {
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--profiling": "false",
"--v": "2"
}
}
},
"masterProfile": {
"count": 1,
"dnsPrefix": "somadeleteacspoc",
"vmSize": "Standard_E8s_v3",
"vnetSubnetID": "/subscriptions/XXXXXXXXXXXXXXXXXXXXXXXXXXXX/resourceGroups/somadelete3/providers/Microsoft.Network/virtualNetworks/somadeletevnet/subnets/default",
"firstConsecutiveStaticIP": "10.10.1.45",
"storageProfile": "ManagedDisks",
"oauthEnabled": false,
"preProvisionExtension": null,
"extensions": [],
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "110",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.246.0.0/16",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
}
},
"agentPoolProfiles": [
{
"name": "agentpool1",
"count": 1,
"vmSize": "Standard_E8s_v3",
"osType": "Linux",
"availabilityProfile": "AvailabilitySet",
"storageProfile": "ManagedDisks",
"vnetSubnetID": "/subscriptions/XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX/resourceGroups/somadelete3/providers/Microsoft.Network/virtualNetworks/somadeletevnet/subnets/default",
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "Accelerators=true",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "110",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.246.0.0/16",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
},
"fqdn": "",
"preProvisionExtension": null,
"extensions": []
}
],
"linuxProfile": {
"adminUsername": "cloudinfraadmin",
"ssh": {
"publicKeys": [
{
"keyData": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
}
]
}
},
"servicePrincipalProfile": {
"clientId": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"secret": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX"
},
"certificateProfile": {
"caCertificate": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"caPrivateKey": "XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"apiServerCertificate": "XXXXXXXXXXXXXXXXXXXXXXXXXXX",
"clientCertificate": "XXXXXXXXXXXXXXXXXXXXXXX",
"clientPrivateKey": "XXXXXXXXXXXXXXXXXXXXXXXXXXXX",
"kubeConfigCertificate": "XXXXXXXXXXXXXXXXXXXXXXXX",
"kubeConfigPrivateKey": "XXXXXXXXXXXXXXXXXXXXXX",
"etcdServerCertificate": "XXXXXXXXXXXXXXXX",
"etcdServerPrivateKey": "XXXXXXXXXXXXXXXXXXXXXXXXX",
"etcdClientCertificate": "XXXXXXXXXXXXX",
"etcdClientPrivateKey": "XXXXXXXXXXXXXXXX",
"etcdPeerCertificates": [
"XXXXXXXXXXXXXXXXXXX"
],
"etcdPeerPrivateKeys": [
""
]
}
}
}
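
As pasted, redacted apimodels like this can easily lose a quote or bracket, so a quick syntax check before running acs-engine generate can save a failed deployment (this assumes jq is installed; the file name is a placeholder):

jq . kubernetes.json > /dev/null && echo 'valid JSON'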

@rakeshkulkarni6
Author

Hi, I have tried the new acs-engine version 0.15.0 to deploy a Kubernetes cluster on Azure.
I generated the ARM templates using the acs-engine generate command and deployed them using PowerShell.

I am getting the error below:
New-AzureRmResourceGroupDeployment : 5:54:07 PM - Resource Microsoft.Compute/virtualMachines/extensions
'k8s-master-63864159-0/cse0' failed with message '{
"status": "Failed",
"error": {
"code": "ResourceDeploymentFailure",
"message": "The resource operation completed with terminal provisioning state 'Failed'.",
"details": [
{
"code": "VMExtensionProvisioningError",
"message": "VM has reported a failure when processing extension 'cse0'. Error message: "Enable failed: failed to execute
command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n"."
}
]
}
}'
At line:1 char:1

+ New-AzureRmResourceGroupDeployment -Name smatestpoc -ResourceGroupNam ...
    + CategoryInfo          : NotSpecified: (:) [New-AzureRmResourceGroupDeployment], Exception
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.NewAzureResourceGroupDeploymentCmdlet

New-AzureRmResourceGroupDeployment : 5:54:07 PM - VM has reported a failure when processing extension 'cse0'. Error message:
"Enable failed: failed to execute command: command terminated with exit status=3
[stdout]
[stderr]
".
At line:1 char:1

+ New-AzureRmResourceGroupDeployment -Name smatestpoc -ResourceGroupNam ...
    + CategoryInfo          : NotSpecified: (:) [New-AzureRmResourceGroupDeployment], Exception
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.NewAzureResourceGroupDeploymentCmdlet

@CecileRobertMichon
Contributor

@rakeshkulkarni6 please remove all secrets/keys from the apimodel you shared

@CecileRobertMichon
Contributor

@rakeshkulkarni6 can you please try deploying with k8s 1.9.6? There are known upstream bugs in 1.9.0

@rakeshkulkarni6
Author

Hi @CecileRobertMichon,

It is showing the same error again; please find the error details below.
It seems the extension is not able to run the provisioning script that deploys the Kubernetes cluster.

New-AzureRmResourceGroupDeployment : 1:24:27 PM - Resource Microsoft.Compute/virtualMachines/extensions
'k8s-master-63864159-0/cse0' failed with message '{
"status": "Failed",
"error": {
"code": "ResourceDeploymentFailure",
"message": "The resource operation completed with terminal provisioning state 'Failed'.",
"details": [
{
"code": "VMExtensionProvisioningError",
"message": "VM has reported a failure when processing extension 'cse0'. Error message: "Enable failed: failed to execute
command: command terminated with exit status=3\n[stdout]\n\n[stderr]\n"."
}
]
}
}'
At line:1 char:1

+ New-AzureRmResourceGroupDeployment -Name smatestpoc -ResourceGroupNam ...
    + CategoryInfo          : NotSpecified: (:) [New-AzureRmResourceGroupDeployment], Exception
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.NewAzureResourceGroupDeploymentCmdlet

New-AzureRmResourceGroupDeployment : 1:24:27 PM - VM has reported a failure when processing extension 'cse0'. Error message:
"Enable failed: failed to execute command: command terminated with exit status=3
[stdout]
[stderr]
".
At line:1 char:1

+ New-AzureRmResourceGroupDeployment -Name smatestpoc -ResourceGroupNam ...
    + CategoryInfo          : NotSpecified: (:) [New-AzureRmResourceGroupDeployment], Exception
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.NewAzureResourceGroupDeploymentCmdlet

New-AzureRmResourceGroupDeployment : 1:24:27 PM - Template output evaluation skipped: at least one resource deployment operation
failed. Please list deployment operations for details. Please see https://aka.ms/arm-debug for usage details.
At line:1 char:1

+ New-AzureRmResourceGroupDeployment -Name smatestpoc -ResourceGroupNam ...
    + CategoryInfo          : NotSpecified: (:) [New-AzureRmResourceGroupDeployment], Exception
    + FullyQualifiedErrorId : Microsoft.Azure.Commands.ResourceManager.Cmdlets.Implementation.NewAzureResourceGroupDeploymentCmdlet


@rakeshkulkarni6
Author

I have used vlabs as the apiVersion and kept the DNS prefix empty in the agentPoolProfile. I am still getting the above error while deploying. Can you please help? I need to set up a production cluster using acs-engine.

@rakidu

rakidu commented Apr 7, 2018

I got a similar error when deploying a cluster using acs-engine version 0.14.6.
"error": {
"code": "ResourceDeploymentFailure",
"message": "The resource operation completed with terminal provisioning state 'Failed'.",
"details": [
{
"code": "VMExtensionProvisioningError",
"message": "VM has reported a failure when processing extension 'cse0'. Error message: "Enable failed: failed to execute

I deleted the deployment and redeployed the cluster into a new resource group with Kubernetes version 1.9.5.

It got deployed successfully and is working fine without any errors.

@CecileRobertMichon
Contributor

@rakeshkulkarni6 @rakidu I am trying to repro this error. It looks like 0.14.6 might have introduced a regression causing transient deployment errors (possibly a race condition). In the meantime, if you retry you might get lucky and get a working cluster, @rakeshkulkarni6. If you still have a cluster that failed with this error, can you please share the content of /var/log/azure/cluster-provision.log on your first master?

@bobjac

bobjac commented Apr 12, 2018

I am seeing a similar issue with version 0.15.1 of acs-engine and version 1.8.9 of Kubernetes. I am trying to deploy a cluster into an existing vNet. The vNet has multiple subnets, but I get the same error whether I deploy the master & agents to the same subnet or split them into different subnets.

@rncwnd79

rncwnd79 commented Apr 13, 2018

Same here with 0.15.2, Kubernetes 1.9.6, and 3 masters, 3 agents.
In the Azure portal I get the already-documented error message for all 3 masters (cse0, cse1, cse2).

The last lines from /var/log/azure/cluster-provision.log:
++ /usr/local/bin/kubectl get nodes
++ grep Ready
++ wc -l
+ nodes=5
+ '[' 5 -eq 6 ']'
+ sleep 1
+ '[' 1 -ne 0 ']'
+ echo 'still waiting for active nodes after 1800 seconds'
still waiting for active nodes after 1800 seconds
+ exit 3
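
For context, that trace corresponds to a readiness loop of roughly this shape (a sketch reconstructed from the trace above, not the verbatim provision script; the variable names and iteration bound are assumptions):

for i in {1..1800}; do
    # Count nodes reporting Ready; succeed once all 6 (3 masters + 3 agents) are up
    nodes=$(/usr/local/bin/kubectl get nodes | grep Ready | wc -l)
    [ "$nodes" -eq 6 ] && break
    sleep 1
done
if [ "$nodes" -ne 6 ]; then
    echo 'still waiting for active nodes after 1800 seconds'
    exit 3
fi

In other words, one of the six VMs never registered as Ready within the timeout, and the whole deployment was failed with exit code 3.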

@CecileRobertMichon I have now tried many times hoping to get the cluster deployed once, but had no luck...

@htuomola

htuomola commented Apr 16, 2018

I also got this: acs-engine 0.15.2, Kubernetes 1.9.6, with 1 master and 4 agents.

cluster-provision.log

Kubernetes master is running at http://localhost:8080

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
+ '[' 1 = 0 ']'
+ sleep 1
+ for i in '{1..600}'
+ /usr/local/bin/kubectl cluster-info
Kubernetes master is running at http://localhost:8080

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.
+ '[' 1 = 0 ']'
+ sleep 1
+ '[' 1 -ne 0 ']'
+ echo 'k8s cluster is not healthy after 600 seconds'
k8s cluster is not healthy after 600 seconds
+ exit 3

Interestingly, if I run kubectl cluster-info manually, it hangs after printing the master status (which is shown with the FQDN, unlike the log above, which shows localhost).
With kubectl cluster-info dump I only get:
Unable to connect to the server: dial tcp <master public IP>:443: i/o timeout
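
One way to narrow that down is to compare the local insecure endpoint on the master with the public secure one, since the provision log talks to localhost:8080 while cluster-info dump goes out through the load balancer (the FQDN is a placeholder):

# On the master: the insecure port the provision script uses
/usr/local/bin/kubectl --server=http://localhost:8080 get nodes
# From outside: probe the secure port behind the public IP
curl -k --max-time 10 https://<master-fqdn>:443/healthz

If the first succeeds and the second times out, the apiserver is up but unreachable through the LB rule / NSG, which matches the i/o timeout above.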

Edit: in my case, it was resolved by deleting the cluster and re-creating.

@htuomola

Err, I'll take that (edit) back - it deployed the second time, but without heapster, dashboard, kube-dns, and tiller.

> kubectl cluster-info
Kubernetes master is running at https://<dns>

To further debug and diagnose cluster problems, use 'kubectl cluster-info dump'.

@CecileRobertMichon
Contributor

v0.16.0 will be released this week with a number of improvements to address the above deployment errors including #2641, #2625, #2639, #2650 and #2666. cc: @jackfrancis

@lehtiton

Is there any configuration with which it is possible to generate a Kubernetes cluster at the moment? I have tried quite a few orchestratorRelease & orchestratorVersion combinations and am getting the same error with all of them: "VM has reported a failure when processing extension 'cse0'."

Here is my latest trial:
{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.10",
      "orchestratorVersion": "1.10.0",
      "kubernetesConfig": {
        "privateCluster": {
          "enabled": true
        }
      }
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "acsengine01",
      "vmSize": "Standard_D2_v2",
      "vnetSubnetId": "",
      "firstConsecutiveStaticIP": "",
      "vnetCidr": ""
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 3,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": ""
      }
    ],
    "linuxProfile": {
      "adminUsername": "",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "",
      "secret": ""
    }
  }
}

@CecileRobertMichon
Contributor

@lehtiton this isn't a bug with a specific configuration but rather transient VM provisioning errors, which result in one or more of the nodes not being ready within a certain amount of time. The improvements I mentioned above aim to catch those errors and add retries and timeouts to better handle infrastructure flakiness. If you are seeing 100% failures, please send me the content of /var/log/azure/cluster-provision.log and the output of kubectl get nodes so I can help you debug. Also, please try to deploy a cluster without a custom vnet to make sure this isn't related to your network config.
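
If it helps, the diagnostics requested above can be pulled off the first master in one go (the username and FQDN are placeholders for your own values):

ssh <adminUsername>@<dnsPrefix>.<location>.cloudapp.azure.com \
    'sudo cat /var/log/azure/cluster-provision.log; kubectl get nodes -o wide' > cluster-debug.log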

@lehtiton

lehtiton commented Apr 18, 2018

@CecileRobertMichon thanks for your help. I finally managed to get rid of this issue by changing something in the configs (I guess). I have tried so many times with so many different configurations that I cannot keep track of them any more. However, I still faced a couple of issues that I commented on in another issue, #2476, in case you have any hints on how to work through those. I also shared my configurations there.

@ankitsingh11

@CecileRobertMichon I am trying to create a cluster using acs-engine v0.16.0 in China East 2, which is a new region, but there are a lot of errors while running the custom script extension on the master VM. I have changed a lot of configuration options and the registry URL, and I am able to fetch the images on the master VM, but the provisioning still fails with exit code 30.

+ echo 'k8s cluster is not healthy after 600 seconds'
k8s cluster is not healthy after 600 seconds
+ exit 30

Please help

@CecileRobertMichon
Contributor

CecileRobertMichon commented Sep 26, 2018

@ankitsingh11 please refer to https://github.com/Azure/acs-engine/blob/master/docs/kubernetes/troubleshooting.md#vmextensionprovisioningerror-or-vmextensionprovisioningtimeout for a guide on troubleshooting these issues and open a new issue if you still need help. I will close this one for now as it is outdated.

Any reason why you are using acs-engine v0.16.0? The latest versions contain a lot of improvements for vm extensions.
