acs-engine scaling failed with error [segmentation violation code=0x1 addr=0x8 pc=0x11aa9be] #3337

saromba · 2018-06-22T07:09:16Z

Is this a request for help?:
Yes

Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE

What version of acs-engine?: 0.18.9

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes (1.9.8)

What happened:

INFO[0000] validating...

INFO[0001] Name suffix: %s 15892004
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x11aa9be]

goroutine 1 [running]:
github.com/Azure/acs-engine/cmd.(*scaleCmd).run(0xc420386000, 0xc42037c6c0, 0xc42035f320, 0x0, 0x12, 0x0, 0x0)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/cmd/scale.go:224 +0x38e
github.com/Azure/acs-engine/cmd.newScaleCmd.func1(0xc42037c6c0, 0xc42035f320, 0x0, 0x12, 0x0, 0x0)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/cmd/scale.go:69 +0x52
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).execute(0xc42037c6c0, 0xc42035f200, 0x12, 0x12, 0xc42037c6c0, 0xc42035f200)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:647 +0x3f1
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc42032f8c0, 0xc42037c6c0, 0xc42037c480, 0xc42037c240)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:726 +0x2fe
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).Execute(0xc42032f8c0, 0xc42000c018, 0x0)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:685 +0x2b
main.main()
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/main.go:12 +0x74

What you expected to happen:
Correct scaling of the cluster

How to reproduce it (as minimally and precisely as possible):
/root/deploy/acs-engine/acs-engine scale --auth-method client_secret --client-id $AZURE_CLIENT_ID --client-secret $AZURE_CLIENT_SECRET --subscription-id $AZURE_SUBSCRIPTION_ID --resource-group $AZURE_RESOURCEGROUP --location westeurope --deployment-dir tmpDir --new-node-count $NODE_COUNT --master-FQDN $AZURE_RESOURCEGROUP.westeurope.cloudapp.azure.com

Anything else we need to know:

CecileRobertMichon · 2018-06-22T16:29:20Z

Hi @saromba, could you share your cluster configuration JSON (after removing secrets)? Did you manually change anything in the generated azuredeploy.json before deploying?

CecileRobertMichon · 2018-06-22T16:30:56Z

Here's a similar issue, #2649, that was fixed a while ago, in case that helps.

saromba · 2018-06-25T06:50:14Z

Hi @CecileRobertMichon: Which file do you need? Many files are generated during provisioning.
What is it called?

CecileRobertMichon · 2018-06-25T18:38:58Z

@saromba apimodel.json, but make sure you remove all secrets.

saromba · 2018-06-26T04:23:19Z

Hi @CecileRobertMichon: No, I didn't change any file manually. Here is the cluster config:

{
"apiVersion": "vlabs",
"properties": {
"orchestratorProfile": {
"orchestratorType": "Kubernetes",
"orchestratorRelease": "1.9",
"orchestratorVersion": "1.9.8",
"kubernetesConfig": {
"kubernetesImageBase": "k8s-gcrio.azureedge.net/",
"clusterSubnet": "10.240.0.0/12",
"dnsServiceIP": "10.0.0.10",
"serviceCidr": "10.0.0.0/16",
"networkPlugin": "azure",
"dockerBridgeSubnet": "172.17.0.1/16",
"useInstanceMetadata": true,
"enableRbac": false,
"enableSecureKubelet": true,
"privateCluster": {
"enabled": false
},
"gchighthreshold": 85,
"gclowthreshold": 80,
"etcdVersion": "3.2.16",
"etcdDiskSizeGB": "512",
"addons": [
{
"name": "tiller",
"enabled": true,
"containers": [
{
"name": "tiller",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"max-history": "0"
}
},
{
"name": "aci-connector",
"enabled": false,
"containers": [
{
"name": "aci-connector",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"nodeName": "aci-connector",
"os": "Linux",
"region": "westus",
"taint": "azure.com/aci"
}
},
{
"name": "cluster-autoscaler",
"enabled": false,
"containers": [
{
"name": "cluster-autoscaler",
"cpuRequests": "100m",
"memoryRequests": "300Mi",
"cpuLimits": "100m",
"memoryLimits": "300Mi"
}
],
"config": {
"maxNodes": "5",
"minNodes": "1"
}
},
{
"name": "kubernetes-dashboard",
"enabled": true,
"containers": [
{
"name": "kubernetes-dashboard",
"cpuRequests": "300m",
"memoryRequests": "150Mi",
"cpuLimits": "300m",
"memoryLimits": "150Mi"
}
]
},
{
"name": "rescheduler",
"enabled": false,
"containers": [
{
"name": "rescheduler",
"cpuRequests": "10m",
"memoryRequests": "100Mi",
"cpuLimits": "10m",
"memoryLimits": "100Mi"
}
]
},
{
"name": "metrics-server",
"enabled": true,
"containers": [
{
"name": "metrics-server"
}
]
},
{
"name": "nvidia-device-plugin",
"containers": [
{
"name": "nvidia-device-plugin"
}
]
}
],
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--image-pull-progress-deadline": "30m",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "30",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.240.0.0/12",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
},
"controllerManagerConfig": {
"--allocate-node-cidrs": "false",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.240.0.0/12",
"--cluster-name": "dev-59",
"--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
"--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
"--configure-cloud-routes": "false",
"--feature-gates": "ServiceNodeExclusion=true",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--node-monitor-grace-period": "40s",
"--pod-eviction-timeout": "5m0s",
"--profiling": "false",
"--root-ca-file": "/etc/kubernetes/certs/ca.crt",
"--route-reconciliation-period": "10s",
"--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--terminated-pod-gc-threshold": "5000",
"--use-service-account-credentials": "false",
"--v": "2"
},
"cloudControllerManagerConfig": {
"--allocate-node-cidrs": "false",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.240.0.0/12",
"--cluster-name": "dev-59",
"--configure-cloud-routes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--route-reconciliation-period": "10s",
"--v": "2"
},
"apiServerConfig": {
"--admission-control": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,DenyEscalatingExec,AlwaysPullImages",
"--advertise-address": "",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--audit-log-maxage": "30",
"--audit-log-maxbackup": "10",
"--audit-log-maxsize": "100",
"--audit-log-path": "/var/log/audit.log",
"--audit-policy-file": "/etc/kubernetes/manifests/audit-policy.yaml",
"--bind-address": "0.0.0.0",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
"--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
"--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
"--etcd-servers": "https://127.0.0.1:2379",
"--insecure-port": "8080",
"--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
"--kubelet-client-key": "/etc/kubernetes/certs/client.key",
"--profiling": "false",
"--proxy-client-cert-file": "/etc/kubernetes/certs/proxy.crt",
"--proxy-client-key-file": "/etc/kubernetes/certs/proxy.key",
"--repair-malformed-updates": "false",
"--requestheader-allowed-names": "",
"--requestheader-client-ca-file": "/etc/kubernetes/certs/proxy-ca.crt",
"--requestheader-extra-headers-prefix": "X-Remote-Extra-",
"--requestheader-group-headers": "X-Remote-Group",
"--requestheader-username-headers": "X-Remote-User",
"--secure-port": "443",
"--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
"--service-account-lookup": "true",
"--service-cluster-ip-range": "10.0.0.0/16",
"--storage-backend": "etcd3",
"--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
"--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--v": "4"
},
"schedulerConfig": {
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--profiling": "false",
"--v": "2"
}
}
},
"masterProfile": {
"count": 1,
"dnsPrefix": "dev-59",
"subjectAltNames": null,
"vmSize": "Standard_DS2_v2_Promo",
"firstConsecutiveStaticIP": "10.255.255.5",
"storageProfile": "ManagedDisks",
"oauthEnabled": false,
"preProvisionExtension": null,
"extensions": [],
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--image-pull-progress-deadline": "30m",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "30",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.240.0.0/12",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
}
},
"agentPoolProfiles": [
{
"name": "agent",
"count": 3,
"vmSize": "Standard_DS12_v2_Promo",
"osType": "Linux",
"availabilityProfile": "AvailabilitySet",
"storageProfile": "ManagedDisks",
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--image-pull-progress-deadline": "30m",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "30",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.240.0.0/12",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
},
"fqdn": "",
"customNodeLabels": {
"stack": "app"
},
"preProvisionExtension": null,
"extensions": []
}
],
"linuxProfile": {
"adminUsername": "k8sadmin",
"ssh": {
"publicKeys": [
{
"keyData": ""
}
]
}
},
"servicePrincipalProfile": {
"clientId": "",
"secret": ""
},
"certificateProfile": {
"caCertificate": "",
"caPrivateKey": "",
"apiServerCertificate": "",
"apiServerPrivateKey": "",
"clientCertificate": "",
"clientPrivateKey":"",
"kubeConfigCertificate":"",
"kubeConfigPrivateKey": "",
"etcdServerCertificate": "",
"etcdServerPrivateKey": "",
"etcdClientCertificate":"",
"etcdClientPrivateKey":"",
"etcdPeerCertificates":[
""
],
"etcdPeerPrivateKeys": [
""
]
}
}
}

saromba · 2018-07-23T06:53:27Z

@CecileRobertMichon
Hi, is there any update on this?

Regards,
saromba

CecileRobertMichon · 2018-07-30T20:55:42Z

@saromba no update yet, I wasn't able to repro and then went on vacation. I'll try to give it another look this week. Did you by any chance scale or upgrade your cluster previously? Or was this cluster in its original state when you attempted to scale?

saromba · 2018-08-01T07:24:02Z

@CecileRobertMichon
No, it's in its original state. I deployed an "empty" cluster and then tried to scale it (before deploying anything).

saromba · 2018-08-20T05:55:52Z

Hi @CecileRobertMichon,

We fixed that ourselves; scaling is now possible again.
See the "vmTags stuff" in cmd/scale.go:

indexToVM := make(map[int]string)
if sc.agentPool.IsAvailabilitySets() {
   for vmsListPage, err := sc.client.ListVirtualMachines(ctx, sc.resourceGroupName);     vmsListPage.NotDone(); err = vmsListPage.Next() {
  if err != nil {
     return errors.Wrap(err, "failed to get vms in the resource group")
  } else if len(vmsListPage.Values()) < 1 {
     return errors.New("The provided resource group does not contain any vms")
  }
  for _, vm := range vmsListPage.Values() {
     vmTags := vm.Tags
     poolName := *vmTags["poolName"]
     nameSuffix := *vmTags["resourceNameSuffix"]

     //Changed to string contains for the nameSuffix as the Windows Agent Pools use only a substring of the first 5 characters of the entire nameSuffix
     if err != nil || !strings.EqualFold(poolName, sc.agentPoolToScale) || !strings.Contains(sc.nameSuffix, nameSuffix) {
        continue
     }

     osPublisher := vm.StorageProfile.ImageReference.Publisher
     if osPublisher != nil && strings.EqualFold(*osPublisher, "MicrosoftWindowsServer") {



}
} else {
   for vmssListPage, err := sc.client.ListVirtualMachineScaleSets(ctx, sc.resourceGroupName); vmssListPage.NotDone(); vmssListPage.Next() {
  if err != nil {
     return errors.Wrap(err, "failed to get vmss list in the resource group")
  }
  for _, vmss := range vmssListPage.Values() {
     vmTags := vmss.Tags
     poolName := *vmTags["poolName"]
     nameSuffix := *vmTags["resourceNameSuffix"]

     //Changed to string contains for the nameSuffix as the Windows Agent Pools use only a substring of the first 5 characters of the entire nameSuffix
     if err != nil || !strings.EqualFold(poolName, sc.agentPoolToScale) || !strings.Contains(sc.nameSuffix, nameSuffix) {
        continue

Regards,
saromba

CecileRobertMichon · 2018-08-20T16:04:03Z

Hi @saromba, I'm glad to hear that you are unblocked. Was the root cause that the VM tags were missing? How can we fix this so that others don't run into the same issue?

ryanlovett · 2018-09-04T20:24:49Z

We have run into this on v0.20.6 and v0.21.1.

We did not manually change anything in our generated azuredeploy.json before deploying.

ryanlovett · 2018-09-04T20:31:58Z

@CecileRobertMichon Based on @saromba's comment, I checked our VM tags and found both poolName and resourceNameSuffix present.

CecileRobertMichon · 2018-09-04T20:35:45Z

@ryanlovett can you please share your apimodel and exact steps you took so I can try and repro?

ryanlovett · 2018-09-04T20:37:46Z

We had manually added a non-cluster VM to the resource group and this VM did not have those tags. I just manually added them and acs-engine hasn't crashed yet.

I think scale.go should check whether the poolName and resourceNameSuffix tags exist before trying to reference them. I know nothing about Go, otherwise I'd create a PR.
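For illustration, here is a minimal sketch of the kind of check suggested here. It is not the actual acs-engine fix: `vmInfo`, `tagValue`, and `belongsToPool` are hypothetical names, and the tag map is assumed to be a `map[string]*string`, as the `*vmTags["poolName"]` dereference in the snippet above suggests. Looking up a key that is absent from such a map yields a nil pointer, and dereferencing it panics with a nil pointer dereference like the one reported.

```go
package main

import "strings"

// vmInfo is a hypothetical stand-in for the Azure SDK VM type; only the
// Tags field (assumed to be map[string]*string) matters for this sketch.
type vmInfo struct {
	Name string
	Tags map[string]*string
}

// tagValue returns the value of a tag and whether it was present and non-nil.
// Indexing a nil or incomplete map is safe in Go; dereferencing the nil
// pointer that comes back is what panics.
func tagValue(tags map[string]*string, key string) (string, bool) {
	v, ok := tags[key]
	if !ok || v == nil {
		return "", false
	}
	return *v, true
}

// belongsToPool reports whether a VM carries both tags and matches the pool
// being scaled. VMs without the tags (for example, a VM added to the resource
// group by hand) are skipped instead of crashing the command.
func belongsToPool(vm vmInfo, pool, clusterSuffix string) bool {
	poolName, ok := tagValue(vm.Tags, "poolName")
	if !ok {
		return false
	}
	nameSuffix, ok := tagValue(vm.Tags, "resourceNameSuffix")
	if !ok {
		return false
	}
	// Same comparison as the existing loop: case-insensitive pool name,
	// substring match on the name suffix.
	return strings.EqualFold(poolName, pool) && strings.Contains(clusterSuffix, nameSuffix)
}
```

In the loops above, a check like `if !belongsToPool(...) { continue }` would take the place of the two unchecked dereferences, so an untagged VM in the resource group would simply be ignored.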

CecileRobertMichon · 2018-09-04T20:44:07Z

@ryanlovett agreed, we have an issue tracking this problem at #3663. I will close this one.

ryanlovett · 2018-09-04T20:49:57Z

Great, thanks!

CecileRobertMichon added the feature/scale label Jun 22, 2018

CecileRobertMichon closed this as completed Sep 4, 2018
