This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

acs-engine scaling failed with error [segmentation violation code=0x1 addr=0x8 pc=0x11aa9be] #3337

Closed
saromba opened this issue Jun 22, 2018 · 16 comments

Comments

@saromba

saromba commented Jun 22, 2018

Is this a request for help?:
Yes


Is this an ISSUE or FEATURE REQUEST? (choose one): ISSUE


What version of acs-engine?: 0.18.9

Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm)
Kubernetes (1.9.8)

What happened:

INFO[0000] validating...
INFO[0001] Name suffix: %s 15892004
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x8 pc=0x11aa9be]

goroutine 1 [running]:
github.com/Azure/acs-engine/cmd.(*scaleCmd).run(0xc420386000, 0xc42037c6c0, 0xc42035f320, 0x0, 0x12, 0x0, 0x0)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/cmd/scale.go:224 +0x38e
github.com/Azure/acs-engine/cmd.newScaleCmd.func1(0xc42037c6c0, 0xc42035f320, 0x0, 0x12, 0x0, 0x0)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/cmd/scale.go:69 +0x52
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).execute(0xc42037c6c0, 0xc42035f200, 0x12, 0x12, 0xc42037c6c0, 0xc42035f200)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:647 +0x3f1
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).ExecuteC(0xc42032f8c0, 0xc42037c6c0, 0xc42037c480, 0xc42037c240)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:726 +0x2fe
github.com/Azure/acs-engine/vendor/github.com/spf13/cobra.(*Command).Execute(0xc42032f8c0, 0xc42000c018, 0x0)
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/vendor/github.com/spf13/cobra/command.go:685 +0x2b
main.main()
/Users/jackfrancis/work/src/github.com/Azure/acs-engine/main.go:12 +0x74

What you expected to happen:
Correct scaling of the cluster

How to reproduce it (as minimally and precisely as possible):
/root/deploy/acs-engine/acs-engine scale --auth-method client_secret --client-id $AZURE_CLIENT_ID --client-secret $AZURE_CLIENT_SECRET --subscription-id $AZURE_SUBSCRIPTION_ID --resource-group $AZURE_RESOURCEGROUP --location westeurope --deployment-dir tmpDir --new-node-count $NODE_COUNT --master-FQDN $AZURE_RESOURCEGROUP.westeurope.cloudapp.azure.com

Anything else we need to know:

@CecileRobertMichon
Contributor

Hi @saromba, could you share your cluster configuration JSON (after removing secrets)? Did you manually change anything in the generated azuredeploy.json before deploying?

@CecileRobertMichon
Contributor

Here's a similar issue, #2649, that was fixed a while ago, in case that helps.

@saromba
Author

saromba commented Jun 25, 2018

Hi @CecileRobertMichon: Which file do you need? Many files are generated during provisioning.
What is it called?

@CecileRobertMichon
Contributor

@saromba apimodel.json, but make sure you remove all secrets.

@saromba
Author

saromba commented Jun 26, 2018

Hi @CecileRobertMichon: No, I didn't change any file manually. Here is the cluster config:

{
"apiVersion": "vlabs",
"properties": {
"orchestratorProfile": {
"orchestratorType": "Kubernetes",
"orchestratorRelease": "1.9",
"orchestratorVersion": "1.9.8",
"kubernetesConfig": {
"kubernetesImageBase": "k8s-gcrio.azureedge.net/",
"clusterSubnet": "10.240.0.0/12",
"dnsServiceIP": "10.0.0.10",
"serviceCidr": "10.0.0.0/16",
"networkPlugin": "azure",
"dockerBridgeSubnet": "172.17.0.1/16",
"useInstanceMetadata": true,
"enableRbac": false,
"enableSecureKubelet": true,
"privateCluster": {
"enabled": false
},
"gchighthreshold": 85,
"gclowthreshold": 80,
"etcdVersion": "3.2.16",
"etcdDiskSizeGB": "512",
"addons": [
{
"name": "tiller",
"enabled": true,
"containers": [
{
"name": "tiller",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"max-history": "0"
}
},
{
"name": "aci-connector",
"enabled": false,
"containers": [
{
"name": "aci-connector",
"cpuRequests": "50m",
"memoryRequests": "150Mi",
"cpuLimits": "50m",
"memoryLimits": "150Mi"
}
],
"config": {
"nodeName": "aci-connector",
"os": "Linux",
"region": "westus",
"taint": "azure.com/aci"
}
},
{
"name": "cluster-autoscaler",
"enabled": false,
"containers": [
{
"name": "cluster-autoscaler",
"cpuRequests": "100m",
"memoryRequests": "300Mi",
"cpuLimits": "100m",
"memoryLimits": "300Mi"
}
],
"config": {
"maxNodes": "5",
"minNodes": "1"
}
},
{
"name": "kubernetes-dashboard",
"enabled": true,
"containers": [
{
"name": "kubernetes-dashboard",
"cpuRequests": "300m",
"memoryRequests": "150Mi",
"cpuLimits": "300m",
"memoryLimits": "150Mi"
}
]
},
{
"name": "rescheduler",
"enabled": false,
"containers": [
{
"name": "rescheduler",
"cpuRequests": "10m",
"memoryRequests": "100Mi",
"cpuLimits": "10m",
"memoryLimits": "100Mi"
}
]
},
{
"name": "metrics-server",
"enabled": true,
"containers": [
{
"name": "metrics-server"
}
]
},
{
"name": "nvidia-device-plugin",
"containers": [
{
"name": "nvidia-device-plugin"
}
]
}
],
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--image-pull-progress-deadline": "30m",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "30",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.240.0.0/12",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
},
"controllerManagerConfig": {
"--allocate-node-cidrs": "false",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.240.0.0/12",
"--cluster-name": "dev-59",
"--cluster-signing-cert-file": "/etc/kubernetes/certs/ca.crt",
"--cluster-signing-key-file": "/etc/kubernetes/certs/ca.key",
"--configure-cloud-routes": "false",
"--feature-gates": "ServiceNodeExclusion=true",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--node-monitor-grace-period": "40s",
"--pod-eviction-timeout": "5m0s",
"--profiling": "false",
"--root-ca-file": "/etc/kubernetes/certs/ca.crt",
"--route-reconciliation-period": "10s",
"--service-account-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--terminated-pod-gc-threshold": "5000",
"--use-service-account-credentials": "false",
"--v": "2"
},
"cloudControllerManagerConfig": {
"--allocate-node-cidrs": "false",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-cidr": "10.240.0.0/12",
"--cluster-name": "dev-59",
"--configure-cloud-routes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--route-reconciliation-period": "10s",
"--v": "2"
},
"apiServerConfig": {
"--admission-control": "NamespaceLifecycle,LimitRanger,ServiceAccount,DefaultStorageClass,DefaultTolerationSeconds,MutatingAdmissionWebhook,ValidatingAdmissionWebhook,ResourceQuota,DenyEscalatingExec,AlwaysPullImages",
"--advertise-address": "",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--audit-log-maxage": "30",
"--audit-log-maxbackup": "10",
"--audit-log-maxsize": "100",
"--audit-log-path": "/var/log/audit.log",
"--audit-policy-file": "/etc/kubernetes/manifests/audit-policy.yaml",
"--bind-address": "0.0.0.0",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--etcd-cafile": "/etc/kubernetes/certs/ca.crt",
"--etcd-certfile": "/etc/kubernetes/certs/etcdclient.crt",
"--etcd-keyfile": "/etc/kubernetes/certs/etcdclient.key",
"--etcd-servers": "https://127.0.0.1:2379",
"--insecure-port": "8080",
"--kubelet-client-certificate": "/etc/kubernetes/certs/client.crt",
"--kubelet-client-key": "/etc/kubernetes/certs/client.key",
"--profiling": "false",
"--proxy-client-cert-file": "/etc/kubernetes/certs/proxy.crt",
"--proxy-client-key-file": "/etc/kubernetes/certs/proxy.key",
"--repair-malformed-updates": "false",
"--requestheader-allowed-names": "",
"--requestheader-client-ca-file": "/etc/kubernetes/certs/proxy-ca.crt",
"--requestheader-extra-headers-prefix": "X-Remote-Extra-",
"--requestheader-group-headers": "X-Remote-Group",
"--requestheader-username-headers": "X-Remote-User",
"--secure-port": "443",
"--service-account-key-file": "/etc/kubernetes/certs/apiserver.key",
"--service-account-lookup": "true",
"--service-cluster-ip-range": "10.0.0.0/16",
"--storage-backend": "etcd3",
"--tls-cert-file": "/etc/kubernetes/certs/apiserver.crt",
"--tls-private-key-file": "/etc/kubernetes/certs/apiserver.key",
"--v": "4"
},
"schedulerConfig": {
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--leader-elect": "true",
"--profiling": "false",
"--v": "2"
}
}
},
"masterProfile": {
"count": 1,
"dnsPrefix": "dev-59",
"subjectAltNames": null,
"vmSize": "Standard_DS2_v2_Promo",
"firstConsecutiveStaticIP": "10.255.255.5",
"storageProfile": "ManagedDisks",
"oauthEnabled": false,
"preProvisionExtension": null,
"extensions": [],
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--image-pull-progress-deadline": "30m",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "30",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.240.0.0/12",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
}
},
"agentPoolProfiles": [
{
"name": "agent",
"count": 3,
"vmSize": "Standard_DS12_v2_Promo",
"osType": "Linux",
"availabilityProfile": "AvailabilitySet",
"storageProfile": "ManagedDisks",
"distro": "ubuntu",
"kubernetesConfig": {
"kubeletConfig": {
"--address": "0.0.0.0",
"--allow-privileged": "true",
"--anonymous-auth": "false",
"--authorization-mode": "Webhook",
"--azure-container-registry-config": "/etc/kubernetes/azure.json",
"--cadvisor-port": "0",
"--cgroups-per-qos": "true",
"--client-ca-file": "/etc/kubernetes/certs/ca.crt",
"--cloud-config": "/etc/kubernetes/azure.json",
"--cloud-provider": "azure",
"--cluster-dns": "10.0.0.10",
"--cluster-domain": "cluster.local",
"--enforce-node-allocatable": "pods",
"--event-qps": "0",
"--eviction-hard": "memory.available<100Mi,nodefs.available<10%,nodefs.inodesFree<5%",
"--feature-gates": "",
"--image-gc-high-threshold": "85",
"--image-gc-low-threshold": "80",
"--image-pull-progress-deadline": "30m",
"--keep-terminated-pod-volumes": "false",
"--kubeconfig": "/var/lib/kubelet/kubeconfig",
"--max-pods": "30",
"--network-plugin": "cni",
"--node-status-update-frequency": "10s",
"--non-masquerade-cidr": "10.240.0.0/12",
"--pod-infra-container-image": "k8s-gcrio.azureedge.net/pause-amd64:3.1",
"--pod-manifest-path": "/etc/kubernetes/manifests"
}
},
"fqdn": "",
"customNodeLabels": {
"stack": "app"
},
"preProvisionExtension": null,
"extensions": []
}
],
"linuxProfile": {
"adminUsername": "k8sadmin",
"ssh": {
"publicKeys": [
{
"keyData": ""
}
]
}
},
"servicePrincipalProfile": {
"clientId": "",
"secret": ""
},
"certificateProfile": {
"caCertificate": "",
"caPrivateKey": "",
"apiServerCertificate": "",
"apiServerPrivateKey": "",
"clientCertificate": "",
"clientPrivateKey": "",
"kubeConfigCertificate": "",
"kubeConfigPrivateKey": "",
"etcdServerCertificate": "",
"etcdServerPrivateKey": "",
"etcdClientCertificate": "",
"etcdClientPrivateKey": "",
"etcdPeerCertificates": [
""
],
"etcdPeerPrivateKeys": [
""
]
}
}
}

@saromba
Author

saromba commented Jul 23, 2018

@CecileRobertMichon
Hi, is there any news on this?

Regards,
saromba

@CecileRobertMichon
Contributor

@saromba no update yet, I wasn't able to repro and then went on vacation. I'll try to give it another look this week. Did you by any chance scale or upgrade your cluster previously? Or was this cluster in its original state when you attempted to scale?

@saromba
Author

saromba commented Aug 1, 2018

@CecileRobertMichon
No, it's in its original state. I deployed an "empty" cluster and then tried to scale it, before deploying anything onto it.

@saromba
Author

saromba commented Aug 20, 2018

Hi @CecileRobertMichon,

We fixed this ourselves, and scaling works again.
See the vmTags handling in cmd/scale.go:

indexToVM := make(map[int]string)
if sc.agentPool.IsAvailabilitySets() {
   for vmsListPage, err := sc.client.ListVirtualMachines(ctx, sc.resourceGroupName); vmsListPage.NotDone(); err = vmsListPage.Next() {
  if err != nil {
     return errors.Wrap(err, "failed to get vms in the resource group")
  } else if len(vmsListPage.Values()) < 1 {
     return errors.New("The provided resource group does not contain any vms")
  }
  for _, vm := range vmsListPage.Values() {
     vmTags := vm.Tags
     poolName := *vmTags["poolName"]
     nameSuffix := *vmTags["resourceNameSuffix"]

     //Changed to string contains for the nameSuffix as the Windows Agent Pools use only a substring of the first 5 characters of the entire nameSuffix
     if err != nil || !strings.EqualFold(poolName, sc.agentPoolToScale) || !strings.Contains(sc.nameSuffix, nameSuffix) {
        continue
     }

     osPublisher := vm.StorageProfile.ImageReference.Publisher
     if osPublisher != nil && strings.EqualFold(*osPublisher, "MicrosoftWindowsServer") {



}
} else {
   for vmssListPage, err := sc.client.ListVirtualMachineScaleSets(ctx, sc.resourceGroupName); vmssListPage.NotDone(); err = vmssListPage.Next() {
  if err != nil {
     return errors.Wrap(err, "failed to get vmss list in the resource group")
  }
  for _, vmss := range vmssListPage.Values() {
     vmTags := vmss.Tags
     poolName := *vmTags["poolName"]
     nameSuffix := *vmTags["resourceNameSuffix"]

     //Changed to string contains for the nameSuffix as the Windows Agent Pools use only a substring of the first 5 characters of the entire nameSuffix
     if err != nil || !strings.EqualFold(poolName, sc.agentPoolToScale) || !strings.Contains(sc.nameSuffix, nameSuffix) {
        continue

Regards,
saromba
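
The matching rule described in the code comment above (case-insensitive pool name, suffix compared by substring because Windows pools carry only the first 5 characters) can be sketched in isolation. This is only an illustration: `belongsToPool` is a hypothetical helper name, not a function in acs-engine, and the sample values reuse the name suffix from the log output in this issue.

```go
package main

import (
	"fmt"
	"strings"
)

// belongsToPool mirrors the filter in the snippet above: the pool name must
// match case-insensitively, and the VM's suffix tag must occur somewhere
// inside the cluster's full name suffix.
// Hypothetical helper for illustration only.
func belongsToPool(poolName, vmSuffix, poolToScale, clusterSuffix string) bool {
	return strings.EqualFold(poolName, poolToScale) &&
		strings.Contains(clusterSuffix, vmSuffix)
}

func main() {
	fmt.Println(belongsToPool("agent", "15892004", "agent", "15892004")) // true
	fmt.Println(belongsToPool("Agent", "15892", "agent", "15892004"))    // true
	fmt.Println(belongsToPool("agent", "99999", "agent", "15892004"))    // false
	fmt.Println(belongsToPool("other", "15892004", "agent", "15892004")) // false
}
```

One consequence of the substring comparison: `strings.Contains` reports true for an empty substring, so a VM whose suffix tag is empty would pass the suffix check and be filtered only by its pool name.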

@CecileRobertMichon
Contributor

Hi @saromba, I'm glad to hear that you are unblocked. Was the root cause that the VM tags were missing? How can we fix this so that others don't run into the same issue?

@ryanlovett

We have run into this on v0.20.6 and v0.21.1.

We did not manually change anything in our generated azuredeploy.json before deploying.

@ryanlovett

@CecileRobertMichon Based on @saromba's comment, I checked our VM tags and found both poolName and resourceNameSuffix present.

@CecileRobertMichon
Contributor

@ryanlovett can you please share your apimodel and exact steps you took so I can try and repro?

@ryanlovett

We had manually added a non-cluster VM to the resource group and this VM did not have those tags. I just manually added them and acs-engine hasn't crashed yet.

I think scale.go should check whether the poolName and resourceNameSuffix tags exist before trying to reference them. I know nothing about Go, otherwise I'd create a PR.
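
A check along those lines might look like the sketch below. It assumes the tags arrive as a `map[string]*string`, which is the shape implied by the `*vmTags["poolName"]` dereferences in the snippet earlier in this thread; `tagValue` and `vmPoolTags` are hypothetical helper names and are not taken from acs-engine.

```go
package main

import "fmt"

// tagValue returns the string stored under key and true, or "" and false
// when the map is nil, the key is absent, or the stored pointer is nil.
// Hypothetical helper, not part of acs-engine.
func tagValue(tags map[string]*string, key string) (string, bool) {
	v, ok := tags[key] // indexing a nil map is safe and yields ok == false
	if !ok || v == nil {
		return "", false
	}
	return *v, true
}

// vmPoolTags reads both tags the scale logic relies on and reports
// whether the VM carries both of them.
func vmPoolTags(tags map[string]*string) (poolName, nameSuffix string, ok bool) {
	poolName, okPool := tagValue(tags, "poolName")
	nameSuffix, okSuffix := tagValue(tags, "resourceNameSuffix")
	return poolName, nameSuffix, okPool && okSuffix
}

func main() {
	pool, suffix := "agent", "15892004"
	clusterVM := map[string]*string{"poolName": &pool, "resourceNameSuffix": &suffix}
	var foreignVM map[string]*string // e.g. a VM added by hand, with no tags at all

	for _, tags := range []map[string]*string{clusterVM, foreignVM} {
		if p, s, ok := vmPoolTags(tags); ok {
			fmt.Println("cluster VM:", p, s)
		} else {
			fmt.Println("skipping VM without cluster tags")
		}
	}
}
```

With a guard like this, a VM that lacks either tag would be skipped with `continue` instead of triggering the nil pointer dereference reported at the top of this issue.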

@CecileRobertMichon
Contributor

@ryanlovett agreed, we have an issue tracking this problem at #3663. I will close this one.

@ryanlovett

Great, thanks!
