azurerm_kubernetes_cluster with multiple agent pools keeps getting re-created #4560

Closed
mikkoc opened this issue Oct 8, 2019 · 7 comments · Fixed by #4898

Comments

mikkoc commented Oct 8, 2019

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment

Terraform (and AzureRM Provider) Version

Terraform 0.12.6

AzureRM provider 1.35.0

Affected Resource(s)

  • azurerm_kubernetes_cluster

Terraform Configuration Files

resource "azurerm_kubernetes_cluster" "aks" {
  name                = "${var.aks_cluster_name}"
  location            = "${var.location}"
  dns_prefix          = "${var.aks_cluster_name}"
  resource_group_name = "${var.rg_name}"
  node_resource_group = "${var.nodes_rg_name}"
  kubernetes_version  = "1.14.6"

  linux_profile {
    admin_username = "aks_admin"

    ssh_key {
      key_data = "ssh-rsa XYZ"
    }
  }

  agent_pool_profile {
    name                = "workers"
    count               = 1
    max_count           = 20
    min_count           = 1
    max_pods            = 50
    vm_size             = "Standard_B4ms"
    os_type             = "Linux"
    os_disk_size_gb     = 100
    enable_auto_scaling = true
    type                = "VirtualMachineScaleSets"
    vnet_subnet_id      = "${data.azurerm_subnet.aks_sub.id}"
    node_taints         = ["sku=workers:PreferNoSchedule"]
  }

  agent_pool_profile {
    name                = "gpularge"
    count               = 1
    max_count           = 20
    min_count           = 1
    max_pods            = 50
    vm_size             = "Standard_NC6_Promo"
    os_type             = "Linux"
    os_disk_size_gb     = 30
    enable_auto_scaling = true
    type                = "VirtualMachineScaleSets"
    vnet_subnet_id      = "${data.azurerm_subnet.aks_sub.id}"
    node_taints         = ["sku=gpularge:NoSchedule"]
  }

  agent_pool_profile {
    name                = "gpuxlarge"
    count               = 1
    max_count           = 10
    min_count           = 1
    max_pods            = 50
    vm_size             = "Standard_NC12_Promo"
    os_type             = "Linux"
    os_disk_size_gb     = 50
    enable_auto_scaling = true
    type                = "VirtualMachineScaleSets"
    vnet_subnet_id      = "${data.azurerm_subnet.aks_sub.id}"
    node_taints         = ["sku=gpuxlarge:NoSchedule"]
  } 

  agent_pool_profile {
    name                = "cpularge"
    count               = 1
    max_count           = 20
    min_count           = 1
    max_pods            = 50
    vm_size             = "Standard_F8s_v2"
    os_type             = "Linux"
    os_disk_size_gb     = 200
    enable_auto_scaling = true
    type                = "VirtualMachineScaleSets"
    vnet_subnet_id      = "${data.azurerm_subnet.aks_sub.id}"
    node_taints         = ["sku=cpularge:NoSchedule"]
  }    

  service_principal {
    client_id     = "${data.azuread_application.aks_sp.application_id}"
    client_secret = "${data.azurerm_key_vault_secret.aks_sp_secret.value}"
  }

  network_profile {
    network_plugin     = "azure"
    network_policy     = "calico"
    service_cidr       = "10.0.0.0/16"
    dns_service_ip     = "10.0.0.2"
    docker_bridge_cidr = "172.17.0.1/16"
  }

  role_based_access_control {
    enabled = true
    azure_active_directory {
      server_app_id     = "${var.aks_server_app_id}"
      server_app_secret = "${data.azurerm_key_vault_secret.aks_app_secret.value}"
      client_app_id     = "${var.aks_client_app_id}"
      tenant_id         = "${data.azurerm_client_config.current.tenant_id}"
    }
  }
}

Debug Output

The first terraform apply is fine: the cluster is created with no issues whatsoever.

Plan Output

On a second terraform run, without any code changes, Terraform wants to replace the whole cluster because it thinks some of the agent_pool_profile blocks have changed, which is not the case. It appears that the ordering of the agent_pool_profile blocks in the state file does not match the order in the configuration.

  # module.aks.azurerm_kubernetes_cluster.aks must be replaced
-/+ resource "azurerm_kubernetes_cluster" "aks" {
      - api_server_authorized_ip_ranges = [] -> null
        dns_prefix                      = "[MASKED]"
      ~ enable_pod_security_policy      = false -> (known after apply)
      ~ fqdn                            = "[MASKED]" -> (known after apply)
      ~ id                              = "/subscriptions/[MASKED]/resourcegroups/[MASKED]/providers/Microsoft.ContainerService/managedClusters/[MASKED]" -> (known after apply)
      ~ kube_admin_config               = [
          - {
              - client_certificate     = "[MASKED]"
              - client_key             = "[MASKED]"
              - host                   = "[MASKED]:443"
              - password               = "[MASKED]"
              - username               = "[MASKED]"
            },
        ] -> (known after apply)
      ~ kube_admin_config_raw           = (sensitive value)
      ~ kube_config                     = [
          - {
              - client_certificate     = ""
              - client_key             = ""
              - cluster_ca_certificate = "[MASKED]"
              - host                   = "[MASKED]:443"
              - password               = ""
              - username               = "[MASKED]"
            },
        ] -> (known after apply)
      ~ kube_config_raw                 = (sensitive value)
        kubernetes_version              = "1.14.6"
        location                        = "northeurope"
        name                            = "[MASKED]"
        node_resource_group             = "[MASKED]"
        resource_group_name             = "[MASKED]"


      ~ addon_profile {
          + aci_connector_linux {
              + enabled     = (known after apply)
              + subnet_name = (known after apply)
            }

          + http_application_routing {
              + enabled                            = (known after apply)
              + http_application_routing_zone_name = (known after apply)
            }

          + kube_dashboard {
              + enabled = (known after apply)
            }

          + oms_agent {
              + enabled                    = (known after apply)
              + log_analytics_workspace_id = (known after apply)
            }
        }

      ~ agent_pool_profile {
          - availability_zones  = [] -> null
            count               = 1
          + dns_prefix          = (known after apply)
            enable_auto_scaling = true
          ~ fqdn                = "[MASKED]" -> (known after apply)
            max_count           = 20
            max_pods            = 50
            min_count           = 1
          ~ name                = "cpularge" -> "workers" # forces replacement
          ~ node_taints         = [
              - "sku=cpularge:NoSchedule",
              + "sku=workers:PreferNoSchedule",
            ]
          ~ os_disk_size_gb     = 200 -> 100 # forces replacement
            os_type             = "Linux"
            type                = "VirtualMachineScaleSets"
          ~ vm_size             = "Standard_F8s_v2" -> "Standard_B4ms" # forces replacement
            vnet_subnet_id      = "/subscriptions/[MASKED]/resourceGroups/[MASKED]/providers/Microsoft.Network/virtualNetworks/[MASKED]/subnets/tf-aks-sub-northeu-01"
        }
      ~ agent_pool_profile {
          - availability_zones  = [] -> null
            count               = 1
          + dns_prefix          = (known after apply)
            enable_auto_scaling = true
          ~ fqdn                = "[MASKED]" -> (known after apply)
            max_count           = 20
            max_pods            = 50
            min_count           = 1
            name                = "gpularge"
            node_taints         = [
                "sku=gpularge:NoSchedule",
            ]
            os_disk_size_gb     = 30
            os_type             = "Linux"
            type                = "VirtualMachineScaleSets"
            vm_size             = "Standard_NC6_Promo"
            vnet_subnet_id      = "/subscriptions/[MASKED]/resourceGroups/[MASKED]/providers/Microsoft.Network/virtualNetworks/[MASKED]/subnets/tf-aks-sub-northeu-01"
        }
      ~ agent_pool_profile {
          - availability_zones  = [] -> null
            count               = 1
          + dns_prefix          = (known after apply)
            enable_auto_scaling = true
          ~ fqdn                = "[MASKED]" -> (known after apply)
            max_count           = 10
            max_pods            = 50
            min_count           = 1
            name                = "gpuxlarge"
            node_taints         = [
                "sku=gpuxlarge:NoSchedule",
            ]
            os_disk_size_gb     = 50
            os_type             = "Linux"
            type                = "VirtualMachineScaleSets"
            vm_size             = "Standard_NC12_Promo"
            vnet_subnet_id      = "/subscriptions/[MASKED]/resourceGroups/[MASKED]/providers/Microsoft.Network/virtualNetworks/[MASKED]/subnets/tf-aks-sub-northeu-01"
        }
      ~ agent_pool_profile {
          - availability_zones  = [] -> null
            count               = 1
          + dns_prefix          = (known after apply)
            enable_auto_scaling = true
          ~ fqdn                = "[MASKED]" -> (known after apply)
            max_count           = 20
            max_pods            = 50
            min_count           = 1
          ~ name                = "workers" -> "cpularge" # forces replacement
          ~ node_taints         = [
              - "sku=workers:PreferNoSchedule",
              + "sku=cpularge:NoSchedule",
            ]
          ~ os_disk_size_gb     = 100 -> 200 # forces replacement
            os_type             = "Linux"
            type                = "VirtualMachineScaleSets"
          ~ vm_size             = "Standard_B4ms" -> "Standard_F8s_v2" # forces replacement
            vnet_subnet_id      = "/subscriptions/[MASKED]/resourceGroups/[MASKED]/providers/Microsoft.Network/virtualNetworks/[MASKED]/subnets/tf-aks-sub-northeu-01"
        }

        linux_profile {
            admin_username = "aks_admin"

            ssh_key {
                key_data = "ssh-rsa XYZ"
            }
        }

      ~ network_profile {
            dns_service_ip     = "10.0.0.2"
            docker_bridge_cidr = "172.17.0.1/16"
          ~ load_balancer_sku  = "Basic" -> "basic"
            network_plugin     = "azure"
            network_policy     = "calico"
          + pod_cidr           = (known after apply)
            service_cidr       = "10.0.0.0/16"
        }

        role_based_access_control {
            enabled = true

            azure_active_directory {
                client_app_id     = (sensitive value)
                server_app_id     = (sensitive value)
                server_app_secret = (sensitive value)
                tenant_id         = "[MASKED]"
            }
        }

        service_principal {
            client_id     = (sensitive value)
            client_secret = (sensitive value)
        }
    }

Expected Behavior

Terraform should not detect any changes.

Actual Behavior

Terraform detects changes in the agent_pools and attempts to re-create the AKS cluster.

Steps to Reproduce

  1. terraform apply

timio73 commented Oct 8, 2019

I have a theory: the state file stores the pools in alphabetical order by name. In my template, which was hitting the same issue, I swapped the two pools so they were listed alphabetically, and the subsequent terraform plan showed the expected behaviour of no changes instead of a redeploy.

The example provided by mikkoc above aligns with this theory.
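
Applied to the configuration in the original report, the workaround amounts to declaring the agent_pool_profile blocks in alphabetical order by name, so the configuration order matches the order the provider stores in state. A minimal sketch (each pool block keeps exactly the settings shown above; only the order changes):

```hcl
resource "azurerm_kubernetes_cluster" "aks" {
  # ... all other arguments exactly as in the original configuration ...

  agent_pool_profile {
    name = "cpularge"
    # ... settings as in the original block ...
  }

  agent_pool_profile {
    name = "gpularge"
    # ... settings as in the original block ...
  }

  agent_pool_profile {
    name = "gpuxlarge"
    # ... settings as in the original block ...
  }

  agent_pool_profile {
    name = "workers"
    # ... settings as in the original block ...
  }
}
```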


mikkoc commented Oct 9, 2019

That was exactly the issue: I changed the order to alphabetical and the problem disappeared. Thanks @timio73 !!


timio73 commented Oct 9, 2019

No problem. I would still consider this a bug, as the pools should be indexed by name rather than depend on their order in the template.


iameli commented Oct 9, 2019

Also encountered this, thanks for figuring out the problem!


artburkart commented Oct 18, 2019

It appears something similarly problematic occurs when you use a dynamic block:

  dynamic "agent_pool_profile" {
    for_each = var.agent_pools
    content {
      count           = agent_pool_profile.value.count
      name            = agent_pool_profile.value.name
      vm_size         = agent_pool_profile.value.vm_size # az vm list-sizes --location centralus
      os_type         = agent_pool_profile.value.os_type
      os_disk_size_gb = agent_pool_profile.value.os_disk_size_gb
    }
  }

If you add a new agent_pool_profile to the var.agent_pools variable, it forces deletion and re-creation of the cluster, even if the list is in alphabetical order.

~ agent_pool_profile {
    - availability_zones  = [] -> null
      count               = 1
    + dns_prefix          = (known after apply)
    - enable_auto_scaling = false -> null
    - max_count           = 0 -> null
    ~ max_pods            = 110 -> (known after apply)
    - min_count           = 0 -> null
      name                = "default"
    - node_taints         = [] -> null
      os_disk_size_gb     = 30
      os_type             = "Linux"
      type                = "AvailabilitySet"
      vm_size             = "Standard_B2s"
  }
~ agent_pool_profile {
    - availability_zones  = [] -> null
      count               = 2
    + dns_prefix          = (known after apply)
    - enable_auto_scaling = false -> null
    - max_count           = 0 -> null
    ~ max_pods            = 110 -> (known after apply)
    - min_count           = 0 -> null
      name                = "nodepool"
    - node_taints         = [] -> null
      os_disk_size_gb     = 50
      os_type             = "Linux"
      type                = "AvailabilitySet"
      vm_size             = "Standard_F2"
  }
+ agent_pool_profile {
    + count           = 2
    + dns_prefix      = (known after apply)
    + fqdn            = (known after apply)
    + max_pods        = (known after apply)
    + name            = "nodepool2" # forces replacement
    + os_disk_size_gb = 50 # forces replacement
    + os_type         = "Linux" # forces replacement
    + type            = "AvailabilitySet" # forces replacement
    + vm_size         = "Standard_F2" # forces replacement
  }

EDIT: I now realize my specific issue was already reported in #3971
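
One way to keep dynamic-block pools in a stable, alphabetical order regardless of how the input list is written is to key the for_each collection by pool name, since Terraform emits the blocks for a map in lexical key order. This is only a sketch against the variable shape used above and has not been verified against this provider version; it only prevents ordering drift and does not avoid the ForceNew behaviour when a pool is added:

```hcl
variable "agent_pools" {
  type = list(object({
    count           = number,
    name            = string,
    os_disk_size_gb = number,
    os_type         = string,
    vm_size         = string,
  }))
}

resource "azurerm_kubernetes_cluster" "cluster" {

  # Other required fields skipped for brevity

  dynamic "agent_pool_profile" {
    # Keying the collection by pool name makes Terraform generate the blocks
    # in lexical (alphabetical) order, independent of the list order.
    for_each = { for pool in var.agent_pools : pool.name => pool }
    content {
      count           = agent_pool_profile.value.count
      name            = agent_pool_profile.value.name
      vm_size         = agent_pool_profile.value.vm_size
      os_type         = agent_pool_profile.value.os_type
      os_disk_size_gb = agent_pool_profile.value.os_disk_size_gb
    }
  }
}
```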

artburkart added a commit to bisontrails/terraform-provider-azurerm that referenced this issue Oct 21, 2019
With this change, if you add a new "agent_pool_profile" block, it
successfully creates the new "agent_pool_profile" without force creating
a new cluster.

It's important to note, by removing the `ForceNew` elements from the
"agent_pool_profile" schema, a new behavior is introduced.

Given a configuration that looks like this:
```hcl
variable "agent_pools" {
  type = list(object({
    count           = number,
    name            = string,
    os_disk_size_gb = number,
    os_type         = string,
    vm_size         = string,
  }))
  default = [
    {
      count           = 1,
      name            = "nodepool1",
      os_disk_size_gb = 30,
      os_type         = "Linux",
      vm_size         = "Standard_B2s",
    },
    {
        count           = 1
        name            = "nodepool2"
        os_disk_size_gb = 30
        os_type         = "Linux"
        vm_size         = "Standard_F2",
    },
  ]
}

resource "azurerm_kubernetes_cluster" "cluster" {

  // Other required fields skipped for brevity

  dynamic "agent_pool_profile" {
    for_each = var.agent_pools
    content {
      count           = agent_pool_profile.value.count
      name            = agent_pool_profile.value.name
      vm_size         = agent_pool_profile.value.vm_size # az vm list-sizes --location centralus
      os_type         = agent_pool_profile.value.os_type
      os_disk_size_gb = agent_pool_profile.value.os_disk_size_gb
    }
  }
}
```

The `terraform apply` is successful, but if you choose to swap your list
around such that `nodepool2` comes before `nodepool1`:
```hcl
variable "agent_pools" {
  type = list(object({
    count           = number,
    name            = string,
    os_disk_size_gb = number,
    os_type         = string,
    vm_size         = string,
  }))
  default = [
    {
        count           = 1
        name            = "nodepool2"
        os_disk_size_gb = 30
        os_type         = "Linux"
        vm_size         = "Standard_F2",
    },
    {
      count           = 1,
      name            = "nodepool1",
      os_disk_size_gb = 30,
      os_type         = "Linux",
      vm_size         = "Standard_B2s",
    },
  ]
}
```

Subsequent `terraform plan`s work as expected, but you can't
`terraform apply`, because the request sent to Azure Resource Manager
tries to update an agent pool profile to a name that already exists.
All agent pool profile names *must* be unique.

```hcl
      ~ agent_pool_profile {
            availability_zones  = []
            count               = 1
            enable_auto_scaling = false
            fqdn                = "dev-data-koreacentral-519d3aa3.hcp.koreacentral.azmk8s.io"
            max_count           = 0
            max_pods            = 110
            min_count           = 0
          ~ name                = "nodepool1" -> "nodepool2"
            node_taints         = []
            os_disk_size_gb     = 30
            os_type             = "Linux"
            type                = "AvailabilitySet"
            vm_size             = "Standard_B2s"
        }
      ~ agent_pool_profile {
            availability_zones  = []
            count               = 1
            enable_auto_scaling = false
            fqdn                = "dev-data-koreacentral-519d3aa3.hcp.koreacentral.azmk8s.io"
            max_count           = 0
            max_pods            = 110
            min_count           = 0
          ~ name                = "nodepool2" -> "nodepool1"
            node_taints         = []
            os_disk_size_gb     = 30
            os_type             = "Linux"
            type                = "AvailabilitySet"
            vm_size             = "Standard_B2s"
        }
```

For my use case, it's far more practical to deal with this behavior than
it is to create a new cluster every time a new agent pool profile is
added to my config. That said, if someone knows an obvious fix, I'm happy
to implement it.

Thank you for your review 🙏
tombuildsstuff added this to the v1.37.0 milestone Nov 20, 2019

ghost commented Nov 26, 2019

This has been released in version 1.37.0 of the provider. Please see the Terraform documentation on provider versioning or reach out if you need any assistance upgrading. As an example:

provider "azurerm" {
    version = "~> 1.37.0"
}
# ... other configuration ...


ghost commented Mar 29, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 [email protected]. Thanks!

ghost locked and limited conversation to collaborators Mar 29, 2020