This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Upgrade results in node with 111 IPs #2668

Closed
EPinci opened this issue Apr 12, 2018 · 10 comments

Comments

@EPinci
Contributor

EPinci commented Apr 12, 2018

Is this a request for help?: Yes


Is this an ISSUE or FEATURE REQUEST? ISSUE


What version of acs-engine?: v15.1


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes, from v1.9.3 to v1.9.6

What happened:

I configured the cluster with AzureCNI and ipAddressCount set to 20.
During an upgrade run, nodes get torn down and rebuilt.
The original node had 20 IPs as expected; the new node has 111 IPs, resulting in quick subnet exhaustion.

What you expected to happen:

The original node has 20 IPs; the rebuilt node should have the same number.

How to reproduce it (as minimally and precisely as possible):

Deploy a cluster with AzureCNI and ipAddressCount set to 20.
Mine had 3 masters (with 20 IPs as well) and 3 nodes (with the standard 30 IPs).

Anything else we need to know:

@EPinci
Contributor Author

EPinci commented Apr 12, 2018

@jackfrancis I filed this to track the issue found testing #2650
Any idea where I can start looking?

@jackfrancis
Member

@EPinci will try to repro

@jackfrancis
Member

jackfrancis commented Apr 13, 2018

Running a deployment, then a series of upgrades against this api model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.7.0"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 2,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet",
        "storageProfile": "ManagedDisks"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "",
      "secret": ""
    }
  }
}

After initial deployment:

$ az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 | grep 'networkInterfaces' | wc -l
      94
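For reference, the same tally can be computed per subnet from the saved JSON instead of grepping for 'networkInterfaces' lines; a minimal sketch, where the subnets[].ipConfigurations shape follows the Azure network API and the sample data below is fabricated for illustration:

```python
# Sketch: tally ipConfigurations per subnet from the JSON emitted by
# `az network vnet show -n <vnet> -g <rg> -o json`.
# The field shapes mirror the Azure network API; the sample is made up.

def ipconfig_counts(vnet):
    """Return {subnet name: number of attached ipConfigurations}."""
    return {
        s["name"]: len(s.get("ipConfigurations") or [])
        for s in vnet.get("subnets", [])
    }

sample = {
    "subnets": [
        {"name": "master",
         "ipConfigurations": [{"id": f".../networkInterfaces/nic-{i}"}
                              for i in range(3)]},
        {"name": "frontend", "ipConfigurations": None},
    ]
}
print(ipconfig_counts(sample))  # {'master': 3, 'frontend': 0}
```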

@jackfrancis
Member

Holding steady after the 1st upgrade (from 1.7.0 to 1.7.12):

$ az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 | grep 'networkInterfaces' | wc -l
      94

@EPinci
Contributor Author

EPinci commented Apr 13, 2018

Weird, I tried this multiple times and always got the same result (even the same IP count!).
In my case I had 3 masters and 3 nodes, and then manually deleted two node VMs from the portal.

Can you try just deleting one of the two nodes to see if this has an impact? This is exactly my scenario, where the original node count gets changed outside of ACS-Engine (e.g. by the cluster autoscaler).

Do you want me to send you my api model?

@jackfrancis
Member

Let's let my test keep running (there are 12 more upgrades to go). I'm not saying for sure that we can't repro yet. :)

@jackfrancis
Member

I take it back, I've been unable to repro. Yeah, please paste in the api model you're seeing this behavior on post-upgrade, and we'll repro using it as exactly as possible. Thanks!

@EPinci
Contributor Author

EPinci commented Apr 13, 2018

Ok, since I don't know what is actually relevant, this is the entire process I use to replicate my production upgrade.

In an empty RG, deploy a custom VNet (nothing fancy, just three /24 subnets):

call az network vnet create -g <<RGNAME>> -n K8sVNet --address-prefix 10.24.0.0/16

call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n master --address-prefix 10.24.250.0/24
call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n frontend --address-prefix 10.24.1.0/24
call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n backend --address-prefix 10.24.2.0/24
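The subnet layout above can be sanity-checked before deploying; a short sketch using Python's ipaddress module, with the prefixes copied from the commands above:

```python
import ipaddress

vnet = ipaddress.ip_network("10.24.0.0/16")
subnets = {
    "master": ipaddress.ip_network("10.24.250.0/24"),
    "frontend": ipaddress.ip_network("10.24.1.0/24"),
    "backend": ipaddress.ip_network("10.24.2.0/24"),
}

# Every subnet must sit inside the VNet address space...
for name, net in subnets.items():
    assert net.subnet_of(vnet), f"{name} is outside the VNet"

# ...and no two subnets may overlap.
nets = list(subnets.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"

print("subnet layout OK")
```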

Compile the following apimodel with ACS-Engine 13.1 (not sure the binary version is relevant, but it is the same one I used for my current production cluster):

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "kubernetesConfig": {
        "addons": [
            {
                "name": "tiller",
                "enabled" : false
            }
        ]
      }
    },
    "aadProfile": {
      "serverAppID": "<<REMOVED>>",
      "clientAppID": "<<REMOVED>>",
      "tenantID": "<<REMOVED>>",
      "adminGroupID": "<<REMOVED>>"
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "cluster-dev",
      "vmSize": "Standard_A1_v2",
      "storageProfile" : "ManagedDisks",
      "OSDiskSizeGB": 128,
      "firstConsecutiveStaticIP": "10.24.250.230",
      "ipAddressCount": 20,
      "vnetCidr": "10.24.0.0/16",
      "vnetSubnetId": "/subscriptions/<<REMOVED>>/resourceGroups/<<REMOVED>>/providers/Microsoft.Network/virtualNetworks/K8sVNet/subnets/master"
  },
    "agentPoolProfiles": [
      {
        "name": "nodepool1",
        "count": 3,
        "vmSize": "Standard_A2_v2",
        "storageProfile" : "ManagedDisks",
        "OSDiskSizeGB": 128,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/<<REMOVED>>/resourceGroups/<<REMOVED>>/providers/Microsoft.Network/virtualNetworks/K8sVNet/subnets/frontend"       
      }
    ],
    "linuxProfile": {
      "adminUsername": "clusteradm",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "<<REMOVED>>"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "<<REMOVED>>",
      "secret": "<<REMOVED>>"
    }
  }
}

Then I deploy it:

az group deployment create -g <<RGNAME>> -n "cluster-dev" --template-file ".\_output\cluster-dev\azuredeploy.json" --parameters ".\_output\cluster-dev\azuredeploy.parameters.json"

This results in a 1.9.3 cluster with three masters and three agents.

I then delete the last two agents from the Azure portal to simulate a node count change that ACS-Engine is not aware of, such as one made by the cluster autoscaler.
I also manually clean up the OS disks and NICs, and verify that the agents are no longer listed in kubectl get nodes.

Run the upgrade with the current ACS-Engine:

acs-engine upgrade --subscription-id <<REMOVED>> ^
 --resource-group <<RGNAME>> --location westeurope ^
 --auth-method client_secret --client-id <<REMOVED>> --client-secret <<REMOVED>> ^
 --deployment-dir _output\cluster-dev --upgrade-version 1.9.6

The upgrade deletes the first master VM and redeploys it.
After that, the current build stops due to #2560 / #2061, but the redeployed node already has 111 IPs.

I can run a custom ACS-Engine build from HEAD with the small patch from #2061, and the upgrade then continues with master 2 but fails on master 3 with the subnet full (3 x 111 is more than a /24 can hold).
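The subnet math checks out; a quick sketch, assuming the usual Azure rule that 5 addresses are reserved in every subnet. (As an aside, 111 looks suspiciously like 110 max pods plus the node's primary IP, i.e. a default being applied instead of the configured ipAddressCount of 20; that interpretation is a guess, not confirmed by the thread.)

```python
# Why three masters at 111 ipConfigurations each blow through a /24.
# Azure reserves 5 addresses in every subnet (network, broadcast,
# gateway and two DNS addresses), leaving 251 usable IPs in a /24.
usable = 2 ** (32 - 24) - 5   # 251

expected = 3 * 20    # ipAddressCount: 20, as configured in the apimodel
actual = 3 * 111     # what the upgraded masters actually request

print(f"usable: {usable}, expected: {expected}, actual: {actual}")
assert expected <= usable     # the intended layout fits comfortably...
assert actual > usable        # ...the post-upgrade one exhausts the subnet
```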

Thank you.

@EPinci
Contributor Author

EPinci commented Apr 20, 2018

@jackfrancis Any chance you can give this a go? What do you think about it?

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019