Terraform wants to replace my cluster #557

Closed
1 task done
Israphel opened this issue Jun 4, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@Israphel

Israphel commented Jun 4, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

1.5.5

Module Version

8.0.0

AzureRM Provider Version

3.106.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

# AKS clusters (EU North)
module "aks-eu-north" {
  source   = "Azure/aks/azurerm"
  version  = "8.0.0"
  for_each = local.config[local.environment]["aks"]["eu-north"]

  prefix                            = each.value.name
  resource_group_name               = module.resource-group-eu-north["default"].name
  node_resource_group               = "${each.value.name}-nodes"
  kubernetes_version                = each.value.kubernetes_version.control_plane
  orchestrator_version              = each.value.kubernetes_version.node_pool
  oidc_issuer_enabled               = true
  workload_identity_enabled         = true
  agents_pool_name                  = "default"
  agents_availability_zones         = ["1", "2", "3"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = try(each.value.agents_size, "Standard_D2s_v3")
  temporary_name_for_rotation       = "tmp"
  enable_auto_scaling               = true
  agents_count                      = null
  agents_min_count                  = try(each.value.agents_min_count, 1)
  agents_max_count                  = try(each.value.agents_max_count, 3)
  azure_policy_enabled              = true
  log_analytics_workspace_enabled   = try(each.value.log_analytics_workspace_enabled, true)
  log_retention_in_days             = try(each.value.log_retention_in_days, 30)
  network_plugin                    = "azure"
  load_balancer_sku                 = "standard"
  ebpf_data_plane                   = "cilium"
  os_disk_size_gb                   = try(each.value.os_disk_size_gb, 30)
  rbac_aad                          = true
  rbac_aad_managed                  = true
  rbac_aad_azure_rbac_enabled       = true
  role_based_access_control_enabled = true
  rbac_aad_admin_group_object_ids   = [local.inputs["groups"]["infra"]]
  sku_tier                          = "Standard"
  vnet_subnet_id                    = module.virtual-network-eu-north["default"].vnet_subnets_name_id["nodes"]
  pod_subnet_id                     = module.virtual-network-eu-north["default"].vnet_subnets_name_id["pods"]
  agents_labels                     = try(each.value.agents_labels, {})
  agents_tags                       = try(each.value.agents_tags, {})

  tags = {
    environment = local.environment
    region      = module.resource-group-eu-north["default"].location
    managed_by  = "terraform"
  }

  providers = {
    azurerm = azurerm.eu-north
  }
}

tfvars variables values

    eun-1:
      name: eun-prod-1
      kubernetes_version:
        control_plane: 1.29.2
        node_pool: 1.29.2
      log_analytics_workspace_enabled: false
      agents_size: Standard_B4s_v2
      agents_min_count: 1
      agents_max_count: 8
      os_disk_size_gb: 60
      agents_labels:
        node.kubernetes.io/node-type: default

Debug Output/Panic Output

# module.aks-eu-north["eun-1"].azurerm_kubernetes_cluster.main must be replaced
+/- resource "azurerm_kubernetes_cluster" "main" {
      ~ api_server_authorized_ip_ranges     = [] -> (known after apply)
      - cost_analysis_enabled               = false -> null
      ~ current_kubernetes_version          = "1.29.2" -> (known after apply)
      - custom_ca_trust_certificates_base64 = [] -> null
      - enable_pod_security_policy          = false -> null
      ~ fqdn                                = "eun-prod-1-k2q3x6en.hcp.northeurope.azmk8s.io" -> (known after apply)
      - http_application_routing_enabled    = false -> null
      + http_application_routing_zone_name  = (known after apply)
      ~ id                                  = "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/prod-eu-north/providers/Microsoft.ContainerService/managedClusters/eun-prod-1-aks" -> (known after apply)
      ~ kube_admin_config                   = (sensitive value)
      ~ kube_admin_config_raw               = (sensitive value)
      ~ kube_config                         = (sensitive value)
      ~ kube_config_raw                     = (sensitive value)
      - local_account_disabled              = false -> null
        name                                = "eun-prod-1-aks"
      ~ node_resource_group_id              = "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/eun-prod-1-nodes" -> (known after apply)
      ~ oidc_issuer_url                     = "https://northeurope.oic.prod-aks.azure.com/e46dcf00-9155-4b3f-aabc-61af2e446cd1/b79048c1-3760-41d1-95c5-9000ee47978c/" -> (known after apply)
      - open_service_mesh_enabled           = false -> null
      ~ portal_fqdn                         = "eun-prod-1-k2q3x6en.portal.hcp.northeurope.azmk8s.io" -> (known after apply)
      + private_dns_zone_id                 = (known after apply)
      + private_fqdn                        = (known after apply)
        tags                                = {
            "environment" = "prod"
            "managed_by"  = "terraform"
            "region"      = "northeurope"
        }
        # (17 unchanged attributes hidden)

      - auto_scaler_profile {
          - balance_similar_node_groups      = false -> null
          - empty_bulk_delete_max            = "10" -> null
          - expander                         = "random" -> null
          - max_graceful_termination_sec     = "600" -> null
          - max_node_provisioning_time       = "15m" -> null
          - max_unready_nodes                = 3 -> null
          - max_unready_percentage           = 45 -> null
          - new_pod_scale_up_delay           = "0s" -> null
          - scale_down_delay_after_add       = "10m" -> null
          - scale_down_delay_after_delete    = "10s" -> null
          - scale_down_delay_after_failure   = "3m" -> null
          - scale_down_unneeded              = "10m" -> null
          - scale_down_unready               = "20m" -> null
          - scale_down_utilization_threshold = "0.5" -> null
          - scan_interval                    = "10s" -> null
          - skip_nodes_with_local_storage    = false -> null
          - skip_nodes_with_system_pods      = true -> null
        }

      ~ azure_active_directory_role_based_access_control {
          ~ tenant_id              = "e46dcf00-9155-4b3f-aabc-61af2e446cd1" -> (known after apply)
            # (3 unchanged attributes hidden)
        }

      ~ default_node_pool {
          - custom_ca_trust_enabled      = false -> null
          - fips_enabled                 = false -> null
          ~ kubelet_disk_type            = "OS" -> (known after apply)
          ~ max_pods                     = 250 -> (known after apply)
            name                         = "default"
          ~ node_count                   = 5 -> (known after apply)
          - node_taints                  = [] -> null
          - only_critical_addons_enabled = false -> null
          ~ os_sku                       = "Ubuntu" -> (known after apply)
            tags                         = {
                "environment" = "prod"
                "managed_by"  = "terraform"
                "region"      = "northeurope"
            }
          + workload_runtime             = (known after apply)
            # (17 unchanged attributes hidden)

          - upgrade_settings {
              - drain_timeout_in_minutes      = 30 -> null # forces replacement
              - max_surge                     = "10%" -> null
              - node_soak_duration_in_minutes = 10 -> null
            }
        }

      ~ identity {
          - identity_ids = [] -> null
          ~ principal_id = "c2a8e310-0f33-4c13-b2a7-f5003149e590" -> (known after apply)
          ~ tenant_id    = "e46dcf00-9155-4b3f-aabc-61af2e446cd1" -> (known after apply)
            # (1 unchanged attribute hidden)
        }

      - kubelet_identity {
          - client_id                 = "cffcdc3a-2c74-4f8f-9edc-6646572bb1d2" -> null
          - object_id                 = "89acc530-d5b8-405e-9e63-c791fb0ada3d" -> null
          - user_assigned_identity_id = "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/eun-prod-1-nodes/providers/Microsoft.ManagedIdentity/userAssignedIdentities/eun-prod-1-aks-agentpool" -> null
        }

      ~ network_profile {
          ~ dns_service_ip          = "10.0.0.10" -> (known after apply)
          + docker_bridge_cidr      = (known after apply)
          ~ ip_versions             = [
              - "IPv4",
            ] -> (known after apply)
          + network_mode            = (known after apply)
          ~ network_policy          = "cilium" -> (known after apply)
          ~ outbound_ip_address_ids = [] -> (known after apply)
          ~ outbound_ip_prefix_ids  = [] -> (known after apply)
          + pod_cidr                = (known after apply)
          ~ pod_cidrs               = [] -> (known after apply)
          ~ service_cidr            = "10.0.0.0/16" -> (known after apply)
          ~ service_cidrs           = [
              - "10.0.0.0/16",
            ] -> (known after apply)
            # (4 unchanged attributes hidden)

          - load_balancer_profile {
              - effective_outbound_ips      = [
                  - "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/eun-prod-1-nodes/providers/Microsoft.Network/publicIPAddresses/92c8f5cf-86d1-493f-8911-4d5bd3eb7205",
                ] -> null
              - idle_timeout_in_minutes     = 0 -> null
              - managed_outbound_ip_count   = 1 -> null
              - managed_outbound_ipv6_count = 0 -> null
              - outbound_ip_address_ids     = [] -> null
              - outbound_ip_prefix_ids      = [] -> null
              - outbound_ports_allocated    = 0 -> null
            }
        }

      - windows_profile {
          - admin_username = "azureuser" -> null
        }
    }

More info

The only thing I did was apply the soak time via the command line, since the module doesn't support it yet, but I wouldn't expect the whole cluster to be destroyed just for that.

The same issue doesn't occur with provider 3.105.0.

Israphel added the bug label on Jun 4, 2024
@zioproto
Collaborator

zioproto commented Jun 5, 2024

This is triggered in provider v3.106.0 because of this PR:
hashicorp/terraform-provider-azurerm#26137

tagging @ms-henglu and @stephybun

Is it correct that changing drain_timeout_in_minutes forces a replacement of the cluster?

          - upgrade_settings {
              - drain_timeout_in_minutes      = 30 -> null # forces replacement
              - max_surge                     = "10%" -> null
              - node_soak_duration_in_minutes = 10 -> null
            }

I see in the docs that for node pools, --drain-timeout can be used with both the add and update commands, so the same should apply to the default node pool.

Is it only the "unsetting" to null that forces the resource replacement? Would any other value be accepted?

https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-value
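For context, a minimal sketch of how the drain timeout is set per node pool with the CLI (placeholder cluster and resource group names; flags as shown in the linked docs and later in this thread):

az aks nodepool update \
  --cluster-name <cluster-name> \
  --resource-group <resource-group> \
  --name default \
  --drain-timeout 30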

@Israphel the module does not support this feature yet, as tracked in #530
You should not have changed settings in AKS outside of Terraform; that is what caused the state drift you are facing now.

I am not sure whether you can revert the change via the CLI so that the ARM API returns drain_timeout_in_minutes = null again, which would resolve the state drift.

As a temporary workaround, I suggest pinning the Terraform provider to v3.105.0 until this module supports the drain_timeout_in_minutes option.
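For example, a minimal sketch of pinning the provider in your required_providers block (adjust to fit your existing configuration):

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.105.0"
    }
  }
}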

@Israphel
Author

Israphel commented Jun 5, 2024

Thanks, went back to 3.105.0.

We're changing the default node pool instance type, and the rotation behaviour is extremely aggressive. Is there any way to make it better without soak time support, as of today?

@zioproto
Collaborator

zioproto commented Jul 1, 2024

@Israphel this PR is now merged: #564

Is it possible for you to pin the module at commit 5858b26?

for example:

  module "aks" {
    source = git::https://github.com/Azure/terraform-azurerm-aks.git?ref=5858b260a1d6a9d2ee3687a08690e8932ca86af1
    [..CUT..]

and then set your configuration for drain_timeout_in_minutes.
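A minimal sketch of what that could look like, assuming the variable introduced by #564 is named agents_pool_drain_timeout_in_minutes (verify the exact name in variables.tf at that commit):

  module "aks" {
    source = "git::https://github.com/Azure/terraform-azurerm-aks.git?ref=5858b260a1d6a9d2ee3687a08690e8932ca86af1"

    # assumed variable name from PR #564 -- check variables.tf at this commit
    agents_pool_drain_timeout_in_minutes = 30

    # ... rest of the existing module configuration ...
  }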

This should unblock you until there is a new release that includes the feature.

Please let us know if this works for you. Thanks

@Israphel
Author

Israphel commented Jul 1, 2024

Hello. I actually got unblocked by going back to 3.105.0, then applying/refreshing; after that I could continue upgrading normally and the state drift was fixed.
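For anyone hitting the same drift, that recovery path looks roughly like this (assuming azurerm is pinned back to 3.105.0 first):

# pin azurerm to 3.105.0 in required_providers, then:
terraform init -upgrade        # re-select the pinned provider version
terraform apply -refresh-only  # reconcile state with the live cluster
terraform plan                 # should no longer propose a replacement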

@lonegunmanb
Member

I've tried to reproduce this issue with the following config, @Israphel @zioproto, but I can't reproduce it:

resource "random_id" "prefix" {
  byte_length = 8
}

resource "random_id" "name" {
  byte_length = 8
}

resource "azurerm_resource_group" "main" {
  count = var.create_resource_group ? 1 : 0

  location = var.location
  name     = coalesce(var.resource_group_name, "${random_id.prefix.hex}-rg")
}

locals {
  resource_group = {
    name     = var.create_resource_group ? azurerm_resource_group.main[0].name : var.resource_group_name
    location = var.location
  }
}

resource "azurerm_virtual_network" "test" {
  address_space       = ["10.52.0.0/16"]
  location            = local.resource_group.location
  name                = "${random_id.prefix.hex}-vn"
  resource_group_name = local.resource_group.name
}

resource "azurerm_subnet" "test" {
  address_prefixes                               = ["10.52.0.0/24"]
  name                                           = "${random_id.prefix.hex}-sn"
  resource_group_name                            = local.resource_group.name
  virtual_network_name                           = azurerm_virtual_network.test.name
  enforce_private_link_endpoint_network_policies = true
}

resource "azurerm_subnet" "pod" {
  address_prefixes = ["10.52.1.0/24"]
  name                 = "${random_id.prefix.hex}-pod"
  resource_group_name  = local.resource_group.name
  virtual_network_name = azurerm_virtual_network.test.name
  enforce_private_link_endpoint_network_policies = true
}

# resource "azurerm_resource_group" "nodepool" {
#   location = local.resource_group.location
#   name     = "f557-nodepool"
# }

module "aks-eu-north" {
  source   = "Azure/aks/azurerm"
  version  = "8.0.0"

  prefix                            = "f557"
  resource_group_name               = local.resource_group.name
  node_resource_group               = "f557-nodepool${random_id.name.hex}"
  kubernetes_version                = "1.29.2"
  orchestrator_version              = "1.29.2"
  oidc_issuer_enabled               = true
  workload_identity_enabled         = true
  agents_pool_name                  = "default"
  agents_availability_zones         = ["1", "2", "3"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = try("Standard_B4s_v2", "Standard_D2s_v3")
  temporary_name_for_rotation       = "tmp"
  enable_auto_scaling               = true
  agents_count                      = null
  agents_min_count                  = 1
  agents_max_count                  = 8
  azure_policy_enabled              = true
  log_analytics_workspace_enabled   = false
  log_retention_in_days             = 30
  network_plugin                    = "azure"
  load_balancer_sku                 = "standard"
  ebpf_data_plane                   = "cilium"
  os_disk_size_gb                   = 60
  rbac_aad                          = true
  rbac_aad_managed                  = true
  rbac_aad_azure_rbac_enabled       = true
  role_based_access_control_enabled = true
#   rbac_aad_admin_group_object_ids   = [local.inputs["groups"]["infra"]]
  sku_tier                          = "Standard"
  vnet_subnet_id                    = azurerm_subnet.test.id
  pod_subnet_id                     = azurerm_subnet.pod.id
  agents_labels                     = {}
  agents_tags                       = {}
}

After apply, I updated the node soak duration via the Azure CLI:

az aks nodepool update --cluster-name f557-aks --resource-group ba4d95fcea318222-rg --name default --node-soak-duration 5

Then I ran terraform plan

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # azurerm_subnet.pod will be updated in-place
  ~ resource "azurerm_subnet" "pod" {
        id                                             = "/subscriptions/xxxxxxxxxxxx/resourceGroups/ba4d95fcea318222-rg/providers/Microsoft.Network/virtualNetworks/ba4d95fcea318222-vn/subnets/ba4d95fcea318222-pod"
        name                                           = "ba4d95fcea318222-pod"
        # (11 unchanged attributes hidden)

      - delegation {
          - name = "aks-delegation" -> null

          - service_delegation {
              - actions = [
                  - "Microsoft.Network/virtualNetworks/subnets/join/action",
                ] -> null
              - name    = "Microsoft.ContainerService/managedClusters" -> null
            }
        }
    }

  # module.aks-eu-north.azurerm_kubernetes_cluster.main will be updated in-place
  ~ resource "azurerm_kubernetes_cluster" "main" {
        id                                  = "/subscriptions/xxxxxxxxxxxx/resourceGroups/ba4d95fcea318222-rg/providers/Microsoft.ContainerService/managedClusters/f557-aks"
        name                                = "f557-aks"
        tags                                = {}
        # (39 unchanged attributes hidden)

      ~ default_node_pool {
            name                          = "default"
            tags                          = {}
            # (33 unchanged attributes hidden)

          - upgrade_settings {
              - drain_timeout_in_minutes      = 0 -> null
              - max_surge                     = "10%" -> null
              - node_soak_duration_in_minutes = 0 -> null
            }
        }

        # (6 unchanged blocks hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

We were not able to reproduce this issue on our side.

We've also consulted the service team, but we have no idea where this 30 came from. @Israphel, could you please give us a minimal example that reproduces this issue?

@Israphel
Author

Israphel commented Jul 2, 2024

Try with:

az aks nodepool update --cluster-name f557-aks --resource-group ba4d95fcea318222-rg --name default --max-surge 10% --node-soak-duration 10 --drain-timeout 30
