This repository has been archived by the owner on Jan 11, 2023. It is now read-only.

Upgrade results in node with 111 IPs #2668

Closed
EPinci opened this issue Apr 12, 2018 · 10 comments

Comments

@EPinci
Contributor

EPinci commented Apr 12, 2018

Is this a request for help?: Yes


Is this an ISSUE or FEATURE REQUEST? ISSUE


What version of acs-engine?: v15.1


Orchestrator and version (e.g. Kubernetes, DC/OS, Swarm): Kubernetes, from v1.9.3 to v1.9.6

What happened:

I configured the cluster with AzureCNI and ipAddressCount set to 20.
During an upgrade run, nodes get torn down and rebuilt.
The original node had 20 IPs as expected; the new node has 111 IPs, resulting in quick subnet exhaustion.

What you expected to happen:

The original node has 20 IPs; the rebuilt node should have the same number.

How to reproduce it (as minimally and precisely as possible):

Deploy a cluster with AzureCNI and ipAddressCount set to 20.
Mine had 3 masters (with 20 IPs as well) and 3 nodes (with the standard 30 IPs).

Anything else we need to know:

@EPinci
Contributor Author

EPinci commented Apr 12, 2018

@jackfrancis I filed this to track the issue found testing #2650
Any idea where I can start looking?

@jackfrancis
Member

@EPinci will try to repro

@jackfrancis
Member

jackfrancis commented Apr 13, 2018

Running a deployment, then a series of upgrades against this api model:

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorVersion": "1.7.0"
    },
    "masterProfile": {
      "count": 1,
      "dnsPrefix": "",
      "vmSize": "Standard_D2_v2"
    },
    "agentPoolProfiles": [
      {
        "name": "agentpool1",
        "count": 2,
        "vmSize": "Standard_D2_v2",
        "availabilityProfile": "AvailabilitySet",
        "storageProfile": "ManagedDisks"
      }
    ],
    "linuxProfile": {
      "adminUsername": "azureuser",
      "ssh": {
        "publicKeys": [
          {
            "keyData": ""
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "",
      "secret": ""
    }
  }
}

After initial deployment:

$ az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 | grep 'networkInterfaces' | wc -l
      94
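For reference, the same tally can be computed per subnet from the saved JSON instead of grepping for 'networkInterfaces' lines; a minimal sketch, where the subnets[].ipConfigurations shape follows the Azure network API and the sample data below is fabricated for illustration:

```python
# Sketch: tally ipConfigurations per subnet from the JSON emitted by
# `az network vnet show -n <vnet> -g <rg> -o json`.
# The field shapes mirror the Azure network API; the sample is made up.

def ipconfig_counts(vnet):
    """Return {subnet name: number of attached ipConfigurations}."""
    return {
        s["name"]: len(s.get("ipConfigurations") or [])
        for s in vnet.get("subnets", [])
    }

sample = {
    "subnets": [
        {"name": "master",
         "ipConfigurations": [{"id": f".../networkInterfaces/nic-{i}"}
                              for i in range(3)]},
        {"name": "frontend", "ipConfigurations": None},
    ]
}
print(ipconfig_counts(sample))  # {'master': 3, 'frontend': 0}
```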

@jackfrancis
Member

Holding steady after the 1st upgrade (from 1.7.0 to 1.7.12):

$ az network vnet show -n k8s-vnet-24809053 -g kubernetes-ukwest-78035 | grep 'networkInterfaces' | wc -l
      94

@EPinci
Contributor Author

EPinci commented Apr 13, 2018

Weird, I tried this multiple times and always got the same result (even the same IP count!).
In my case I had 3 masters and 3 nodes, and then manually deleted two node VMs from the portal.

Can you try just deleting one of the two nodes to see if this has an impact? This is exactly my scenario, where the original node count gets changed outside of ACS-Engine (e.g. by the cluster autoscaler).

Do you want me to send you my api model?

@jackfrancis
Member

Let's let my test keep running (there are 12 more upgrades to go). I'm not saying for sure that we can't repro yet. :)

@jackfrancis
Member

I take it back, I've been unable to repro. Yeah, please paste in the api model you're seeing this behavior on post-upgrade, and we'll repro using it as exactly as possible. Thanks!

@EPinci
Contributor Author

EPinci commented Apr 13, 2018

Ok, since I don't know what is actually relevant, this is the entire process I use to replicate my production upgrade.

In an empty RG, deploy a custom VNet (nothing fancy, just three /24 subnets):

call az network vnet create -g <<RGNAME>> -n K8sVNet --address-prefix 10.24.0.0/16

call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n master --address-prefix 10.24.250.0/24
call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n frontend --address-prefix 10.24.1.0/24
call az network vnet subnet create -g <<RGNAME>> --vnet-name K8sVNet -n backend --address-prefix 10.24.2.0/24
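The subnet layout above can be sanity-checked before deploying; a short sketch using Python's ipaddress module, with the prefixes copied from the commands above:

```python
import ipaddress

vnet = ipaddress.ip_network("10.24.0.0/16")
subnets = {
    "master": ipaddress.ip_network("10.24.250.0/24"),
    "frontend": ipaddress.ip_network("10.24.1.0/24"),
    "backend": ipaddress.ip_network("10.24.2.0/24"),
}

# Every subnet must sit inside the VNet address space...
for name, net in subnets.items():
    assert net.subnet_of(vnet), f"{name} is outside the VNet"

# ...and no two subnets may overlap.
nets = list(subnets.values())
for i, a in enumerate(nets):
    for b in nets[i + 1:]:
        assert not a.overlaps(b), f"{a} overlaps {b}"

print("subnet layout OK")
```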

Compile the following apimodel with ACS-Engine 13.1 (not sure the binary version is relevant, but it is the same one I used for my current production cluster):

{
  "apiVersion": "vlabs",
  "properties": {
    "orchestratorProfile": {
      "orchestratorType": "Kubernetes",
      "orchestratorRelease": "1.9",
      "kubernetesConfig": {
        "addons": [
            {
                "name": "tiller",
                "enabled" : false
            }
        ]
      }
    },
    "aadProfile": {
      "serverAppID": "<<REMOVED>>",
      "clientAppID": "<<REMOVED>>",
      "tenantID": "<<REMOVED>>",
      "adminGroupID": "<<REMOVED>>"
    },
    "masterProfile": {
      "count": 3,
      "dnsPrefix": "cluster-dev",
      "vmSize": "Standard_A1_v2",
      "storageProfile" : "ManagedDisks",
      "OSDiskSizeGB": 128,
      "firstConsecutiveStaticIP": "10.24.250.230",
      "ipAddressCount": 20,
      "vnetCidr": "10.24.0.0/16",
      "vnetSubnetId": "/subscriptions/<<REMOVED>>/resourceGroups/<<REMOVED>>/providers/Microsoft.Network/virtualNetworks/K8sVNet/subnets/master"
  },
    "agentPoolProfiles": [
      {
        "name": "nodepool1",
        "count": 3,
        "vmSize": "Standard_A2_v2",
        "storageProfile" : "ManagedDisks",
        "OSDiskSizeGB": 128,
        "availabilityProfile": "AvailabilitySet",
        "vnetSubnetId": "/subscriptions/<<REMOVED>>/resourceGroups/<<REMOVED>>/providers/Microsoft.Network/virtualNetworks/K8sVNet/subnets/frontend"       
      }
    ],
    "linuxProfile": {
      "adminUsername": "clusteradm",
      "ssh": {
        "publicKeys": [
          {
            "keyData": "<<REMOVED>>"
          }
        ]
      }
    },
    "servicePrincipalProfile": {
      "clientId": "<<REMOVED>>",
      "secret": "<<REMOVED>>"
    }
  }
}

Then I deploy it:

az group deployment create -g <<RGNAME>> -n "cluster-dev" --template-file ".\_output\cluster-dev\azuredeploy.json" --parameters ".\_output\cluster-dev\azuredeploy.parameters.json"

This results in a 1.9.3 cluster with three masters and three agents.

I then delete the last two agents from the Azure portal to simulate a node count change that ACS-Engine is not aware of, such as one made by the cluster autoscaler.
I also manually clean up the OS disks and NICs, and verify that the agents are no longer listed in kubectl get nodes.

Run the upgrade with the current ACS-Engine:

acs-engine upgrade --subscription-id <<REMOVED>> ^
 --resource-group <<RGNAME>> --location westeurope ^
 --auth-method client_secret --client-id <<REMOVED>> --client-secret <<REMOVED>> ^
 --deployment-dir _output\cluster-dev --upgrade-version 1.9.6

The upgrade deletes the first master VM and redeploys it.
After that, the current build stops due to #2560 / #2061, but the redeployed node already has 111 IPs.

I can run a custom ACS-Engine build from HEAD with the small patch from #2061, and the upgrade then continues with master 2 but fails on master 3 with the subnet full (3 x 111 is more than a /24 can hold).
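The subnet math checks out; a quick sketch, assuming the usual Azure rule that 5 addresses are reserved in every subnet. (As an aside, 111 looks suspiciously like 110 max pods plus the node's primary IP, i.e. a default being applied instead of the configured ipAddressCount of 20; that interpretation is a guess, not confirmed by the thread.)

```python
# Why three masters at 111 ipConfigurations each blow through a /24.
# Azure reserves 5 addresses in every subnet (network, broadcast,
# gateway and two DNS addresses), leaving 251 usable IPs in a /24.
usable = 2 ** (32 - 24) - 5   # 251

expected = 3 * 20    # ipAddressCount: 20, as configured in the apimodel
actual = 3 * 111     # what the upgraded masters actually request

print(f"usable: {usable}, expected: {expected}, actual: {actual}")
assert expected <= usable     # the intended layout fits comfortably...
assert actual > usable        # ...the post-upgrade one exhausts the subnet
```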

Thank you.

@EPinci
Contributor Author

EPinci commented Apr 20, 2018

@jackfrancis Any chance you can give this a go? What do you think about it?

@stale

stale bot commented Mar 9, 2019

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contribution. Note that acs-engine is deprecated--see https://github.com/Azure/aks-engine instead.

@stale stale bot added the stale label Mar 9, 2019
@stale stale bot closed this as completed Mar 16, 2019