Terraform wants to replace my cluster #557

Closed
1 task done
Israphel opened this issue Jun 4, 2024 · 6 comments
Labels
bug Something isn't working

Comments

@Israphel

Israphel commented Jun 4, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

1.5.5

Module Version

8.0.0

AzureRM Provider Version

3.106.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

# AKS clusters (EU North)
module "aks-eu-north" {
  source   = "Azure/aks/azurerm"
  version  = "8.0.0"
  for_each = local.config[local.environment]["aks"]["eu-north"]

  prefix                            = each.value.name
  resource_group_name               = module.resource-group-eu-north["default"].name
  node_resource_group               = "${each.value.name}-nodes"
  kubernetes_version                = each.value.kubernetes_version.control_plane
  orchestrator_version              = each.value.kubernetes_version.node_pool
  oidc_issuer_enabled               = true
  workload_identity_enabled         = true
  agents_pool_name                  = "default"
  agents_availability_zones         = ["1", "2", "3"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = try(each.value.agents_size, "Standard_D2s_v3")
  temporary_name_for_rotation       = "tmp"
  enable_auto_scaling               = true
  agents_count                      = null
  agents_min_count                  = try(each.value.agents_min_count, 1)
  agents_max_count                  = try(each.value.agents_max_count, 3)
  azure_policy_enabled              = true
  log_analytics_workspace_enabled   = try(each.value.log_analytics_workspace_enabled, true)
  log_retention_in_days             = try(each.value.log_retention_in_days, 30)
  network_plugin                    = "azure"
  load_balancer_sku                 = "standard"
  ebpf_data_plane                   = "cilium"
  os_disk_size_gb                   = try(each.value.os_disk_size_gb, 30)
  rbac_aad                          = true
  rbac_aad_managed                  = true
  rbac_aad_azure_rbac_enabled       = true
  role_based_access_control_enabled = true
  rbac_aad_admin_group_object_ids   = [local.inputs["groups"]["infra"]]
  sku_tier                          = "Standard"
  vnet_subnet_id                    = module.virtual-network-eu-north["default"].vnet_subnets_name_id["nodes"]
  pod_subnet_id                     = module.virtual-network-eu-north["default"].vnet_subnets_name_id["pods"]
  agents_labels                     = try(each.value.agents_labels, {})
  agents_tags                       = try(each.value.agents_tags, {})

  tags = {
    environment = local.environment
    region      = module.resource-group-eu-north["default"].location
    managed_by  = "terraform"
  }

  providers = {
    azurerm = azurerm.eu-north
  }
}

tfvars variables values

    eun-1:
      name: eun-prod-1
      kubernetes_version:
        control_plane: 1.29.2
        node_pool: 1.29.2
      log_analytics_workspace_enabled: false
      agents_size: Standard_B4s_v2
      agents_min_count: 1
      agents_max_count: 8
      os_disk_size_gb: 60
      agents_labels:
        node.kubernetes.io/node-type: default

Debug Output/Panic Output

# module.aks-eu-north["eun-1"].azurerm_kubernetes_cluster.main must be replaced
+/- resource "azurerm_kubernetes_cluster" "main" {
      ~ api_server_authorized_ip_ranges     = [] -> (known after apply)
      - cost_analysis_enabled               = false -> null
      ~ current_kubernetes_version          = "1.29.2" -> (known after apply)
      - custom_ca_trust_certificates_base64 = [] -> null
      - enable_pod_security_policy          = false -> null
      ~ fqdn                                = "eun-prod-1-k2q3x6en.hcp.northeurope.azmk8s.io" -> (known after apply)
      - http_application_routing_enabled    = false -> null
      + http_application_routing_zone_name  = (known after apply)
      ~ id                                  = "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/prod-eu-north/providers/Microsoft.ContainerService/managedClusters/eun-prod-1-aks" -> (known after apply)
      ~ kube_admin_config                   = (sensitive value)
      ~ kube_admin_config_raw               = (sensitive value)
      ~ kube_config                         = (sensitive value)
      ~ kube_config_raw                     = (sensitive value)
      - local_account_disabled              = false -> null
        name                                = "eun-prod-1-aks"
      ~ node_resource_group_id              = "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/eun-prod-1-nodes" -> (known after apply)
      ~ oidc_issuer_url                     = "https://northeurope.oic.prod-aks.azure.com/e46dcf00-9155-4b3f-aabc-61af2e446cd1/b79048c1-3760-41d1-95c5-9000ee47978c/" -> (known after apply)
      - open_service_mesh_enabled           = false -> null
      ~ portal_fqdn                         = "eun-prod-1-k2q3x6en.portal.hcp.northeurope.azmk8s.io" -> (known after apply)
      + private_dns_zone_id                 = (known after apply)
      + private_fqdn                        = (known after apply)
        tags                                = {
            "environment" = "prod"
            "managed_by"  = "terraform"
            "region"      = "northeurope"
        }
        # (17 unchanged attributes hidden)

      - auto_scaler_profile {
          - balance_similar_node_groups      = false -> null
          - empty_bulk_delete_max            = "10" -> null
          - expander                         = "random" -> null
          - max_graceful_termination_sec     = "600" -> null
          - max_node_provisioning_time       = "15m" -> null
          - max_unready_nodes                = 3 -> null
          - max_unready_percentage           = 45 -> null
          - new_pod_scale_up_delay           = "0s" -> null
          - scale_down_delay_after_add       = "10m" -> null
          - scale_down_delay_after_delete    = "10s" -> null
          - scale_down_delay_after_failure   = "3m" -> null
          - scale_down_unneeded              = "10m" -> null
          - scale_down_unready               = "20m" -> null
          - scale_down_utilization_threshold = "0.5" -> null
          - scan_interval                    = "10s" -> null
          - skip_nodes_with_local_storage    = false -> null
          - skip_nodes_with_system_pods      = true -> null
        }

      ~ azure_active_directory_role_based_access_control {
          ~ tenant_id              = "e46dcf00-9155-4b3f-aabc-61af2e446cd1" -> (known after apply)
            # (3 unchanged attributes hidden)
        }

      ~ default_node_pool {
          - custom_ca_trust_enabled      = false -> null
          - fips_enabled                 = false -> null
          ~ kubelet_disk_type            = "OS" -> (known after apply)
          ~ max_pods                     = 250 -> (known after apply)
            name                         = "default"
          ~ node_count                   = 5 -> (known after apply)
          - node_taints                  = [] -> null
          - only_critical_addons_enabled = false -> null
          ~ os_sku                       = "Ubuntu" -> (known after apply)
            tags                         = {
                "environment" = "prod"
                "managed_by"  = "terraform"
                "region"      = "northeurope"
            }
          + workload_runtime             = (known after apply)
            # (17 unchanged attributes hidden)

          - upgrade_settings {
              - drain_timeout_in_minutes      = 30 -> null # forces replacement
              - max_surge                     = "10%" -> null
              - node_soak_duration_in_minutes = 10 -> null
            }
        }

      ~ identity {
          - identity_ids = [] -> null
          ~ principal_id = "c2a8e310-0f33-4c13-b2a7-f5003149e590" -> (known after apply)
          ~ tenant_id    = "e46dcf00-9155-4b3f-aabc-61af2e446cd1" -> (known after apply)
            # (1 unchanged attribute hidden)
        }

      - kubelet_identity {
          - client_id                 = "cffcdc3a-2c74-4f8f-9edc-6646572bb1d2" -> null
          - object_id                 = "89acc530-d5b8-405e-9e63-c791fb0ada3d" -> null
          - user_assigned_identity_id = "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/eun-prod-1-nodes/providers/Microsoft.ManagedIdentity/userAssignedIdentities/eun-prod-1-aks-agentpool" -> null
        }

      ~ network_profile {
          ~ dns_service_ip          = "10.0.0.10" -> (known after apply)
          + docker_bridge_cidr      = (known after apply)
          ~ ip_versions             = [
              - "IPv4",
            ] -> (known after apply)
          + network_mode            = (known after apply)
          ~ network_policy          = "cilium" -> (known after apply)
          ~ outbound_ip_address_ids = [] -> (known after apply)
          ~ outbound_ip_prefix_ids  = [] -> (known after apply)
          + pod_cidr                = (known after apply)
          ~ pod_cidrs               = [] -> (known after apply)
          ~ service_cidr            = "10.0.0.0/16" -> (known after apply)
          ~ service_cidrs           = [
              - "10.0.0.0/16",
            ] -> (known after apply)
            # (4 unchanged attributes hidden)

          - load_balancer_profile {
              - effective_outbound_ips      = [
                  - "/subscriptions/5181fe1e-1064-432e-8d21-5ad0d3f86e9b/resourceGroups/eun-prod-1-nodes/providers/Microsoft.Network/publicIPAddresses/92c8f5cf-86d1-493f-8911-4d5bd3eb7205",
                ] -> null
              - idle_timeout_in_minutes     = 0 -> null
              - managed_outbound_ip_count   = 1 -> null
              - managed_outbound_ipv6_count = 0 -> null
              - outbound_ip_address_ids     = [] -> null
              - outbound_ip_prefix_ids      = [] -> null
              - outbound_ports_allocated    = 0 -> null
            }
        }

      - windows_profile {
          - admin_username = "azureuser" -> null
        }
    }

More info

The only thing I did was apply the soak time via the command line, since the module doesn't support it yet, but I wouldn't expect the whole cluster to be destroyed just for that.

The same issue doesn't occur with provider 3.105.0.

Israphel added the bug label on Jun 4, 2024
@zioproto
Collaborator

zioproto commented Jun 5, 2024

This is triggered in provider v3.106.0 because of this PR:
hashicorp/terraform-provider-azurerm#26137

tagging @ms-henglu and @stephybun

Is it correct that changing drain_timeout_in_minutes forces a replacement of the cluster?

          - upgrade_settings {
              - drain_timeout_in_minutes      = 30 -> null # forces replacement
              - max_surge                     = "10%" -> null
              - node_soak_duration_in_minutes = 10 -> null
            }

I see in the docs that for node pools, --drain-timeout can be used with both the add and update commands, so the same should apply to the default node pool.

Is it only the "unsetting" to null that forces the resource replacement? Would any other value be accepted?

https://learn.microsoft.com/en-us/azure/aks/upgrade-aks-cluster?tabs=azure-cli#set-node-drain-timeout-value
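For context, a minimal sketch of how the drain timeout is set per node pool with the CLI (placeholder cluster and resource group names; flags as shown in the linked docs and later in this thread):

az aks nodepool update \
  --cluster-name <cluster-name> \
  --resource-group <resource-group> \
  --name default \
  --drain-timeout 30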

@Israphel the module does not support this feature yet, as tracked in #530
You should not have changed settings in AKS outside of Terraform; that is what caused the state drift you are facing now.

I am not sure whether you can revert the change via the CLI so that the ARM API returns drain_timeout_in_minutes = null again, which would resolve the state drift.

As a temporary workaround, I suggest pinning the Terraform provider to v3.105.0 until this module supports the drain_timeout_in_minutes option.
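For example, a minimal sketch of pinning the provider in your required_providers block (adjust to fit your existing configuration):

terraform {
  required_providers {
    azurerm = {
      source  = "hashicorp/azurerm"
      version = "3.105.0"
    }
  }
}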

@Israphel
Author

Israphel commented Jun 5, 2024

Thanks, went back to 3.105.0.

We're changing the default node pool instance type, and the rotation behaviour is extremely aggressive. Is there any way to make it better without soak time support, as of today?

@zioproto
Collaborator

zioproto commented Jul 1, 2024

@Israphel this PR is now merged: #564

Is it possible for you to pin the module at commit 5858b26?

for example:

  module "aks" {
    source = git::https://github.com/Azure/terraform-azurerm-aks.git?ref=5858b260a1d6a9d2ee3687a08690e8932ca86af1
    [..CUT..]

and then set your configuration for drain_timeout_in_minutes.
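A minimal sketch of what that could look like, assuming the variable introduced by #564 is named agents_pool_drain_timeout_in_minutes (verify the exact name in variables.tf at that commit):

  module "aks" {
    source = "git::https://github.com/Azure/terraform-azurerm-aks.git?ref=5858b260a1d6a9d2ee3687a08690e8932ca86af1"

    # assumed variable name from PR #564 -- check variables.tf at this commit
    agents_pool_drain_timeout_in_minutes = 30

    # ... rest of the existing module configuration ...
  }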

This should unblock you until there is a new release that includes the feature.

Please let us know if this works for you. Thanks

@Israphel
Author

Israphel commented Jul 1, 2024

Hello. I actually got unblocked by going back to 3.105.0, then applying/refreshing; after that I could continue upgrading normally and the state drift was fixed.
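For anyone hitting the same drift, that recovery path looks roughly like this (assuming azurerm is pinned back to 3.105.0 first):

# pin azurerm to 3.105.0 in required_providers, then:
terraform init -upgrade        # re-select the pinned provider version
terraform apply -refresh-only  # reconcile state with the live cluster
terraform plan                 # should no longer propose a replacement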

@lonegunmanb
Member

I've tried to reproduce this issue with the following config, @Israphel @zioproto, but I can't reproduce it:

resource "random_id" "prefix" {
  byte_length = 8
}

resource "random_id" "name" {
  byte_length = 8
}

resource "azurerm_resource_group" "main" {
  count = var.create_resource_group ? 1 : 0

  location = var.location
  name     = coalesce(var.resource_group_name, "${random_id.prefix.hex}-rg")
}

locals {
  resource_group = {
    name     = var.create_resource_group ? azurerm_resource_group.main[0].name : var.resource_group_name
    location = var.location
  }
}

resource "azurerm_virtual_network" "test" {
  address_space       = ["10.52.0.0/16"]
  location            = local.resource_group.location
  name                = "${random_id.prefix.hex}-vn"
  resource_group_name = local.resource_group.name
}

resource "azurerm_subnet" "test" {
  address_prefixes                               = ["10.52.0.0/24"]
  name                                           = "${random_id.prefix.hex}-sn"
  resource_group_name                            = local.resource_group.name
  virtual_network_name                           = azurerm_virtual_network.test.name
  enforce_private_link_endpoint_network_policies = true
}

resource "azurerm_subnet" "pod" {
  address_prefixes = ["10.52.1.0/24"]
  name                 = "${random_id.prefix.hex}-pod"
  resource_group_name  = local.resource_group.name
  virtual_network_name = azurerm_virtual_network.test.name
  enforce_private_link_endpoint_network_policies = true
}

# resource "azurerm_resource_group" "nodepool" {
#   location = local.resource_group.location
#   name     = "f557-nodepool"
# }

module "aks-eu-north" {
  source   = "Azure/aks/azurerm"
  version  = "8.0.0"

  prefix                            = "f557"
  resource_group_name               = local.resource_group.name
  node_resource_group               = "f557-nodepool${random_id.name.hex}"
  kubernetes_version                = "1.29.2"
  orchestrator_version              = "1.29.2"
  oidc_issuer_enabled               = true
  workload_identity_enabled         = true
  agents_pool_name                  = "default"
  agents_availability_zones         = ["1", "2", "3"]
  agents_type                       = "VirtualMachineScaleSets"
  agents_size                       = try("Standard_B4s_v2", "Standard_D2s_v3")
  temporary_name_for_rotation       = "tmp"
  enable_auto_scaling               = true
  agents_count                      = null
  agents_min_count                  = 1
  agents_max_count                  = 8
  azure_policy_enabled              = true
  log_analytics_workspace_enabled   = false
  log_retention_in_days             = 30
  network_plugin                    = "azure"
  load_balancer_sku                 = "standard"
  ebpf_data_plane                   = "cilium"
  os_disk_size_gb                   = 60
  rbac_aad                          = true
  rbac_aad_managed                  = true
  rbac_aad_azure_rbac_enabled       = true
  role_based_access_control_enabled = true
#   rbac_aad_admin_group_object_ids   = [local.inputs["groups"]["infra"]]
  sku_tier                          = "Standard"
  vnet_subnet_id                    = azurerm_subnet.test.id
  pod_subnet_id                     = azurerm_subnet.pod.id
  agents_labels                     = {}
  agents_tags                       = {}
}

After apply, I updated the node soak duration via the Azure CLI:

az aks nodepool update --cluster-name f557-aks --resource-group ba4d95fcea318222-rg --name default --node-soak-duration 5

Then I ran terraform plan

Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
  ~ update in-place

Terraform will perform the following actions:

  # azurerm_subnet.pod will be updated in-place
  ~ resource "azurerm_subnet" "pod" {
        id                                             = "/subscriptions/xxxxxxxxxxxx/resourceGroups/ba4d95fcea318222-rg/providers/Microsoft.Network/virtualNetworks/ba4d95fcea318222-vn/subnets/ba4d95fcea318222-pod"
        name                                           = "ba4d95fcea318222-pod"
        # (11 unchanged attributes hidden)

      - delegation {
          - name = "aks-delegation" -> null

          - service_delegation {
              - actions = [
                  - "Microsoft.Network/virtualNetworks/subnets/join/action",
                ] -> null
              - name    = "Microsoft.ContainerService/managedClusters" -> null
            }
        }
    }

  # module.aks-eu-north.azurerm_kubernetes_cluster.main will be updated in-place
  ~ resource "azurerm_kubernetes_cluster" "main" {
        id                                  = "/subscriptions/xxxxxxxxxxxx/resourceGroups/ba4d95fcea318222-rg/providers/Microsoft.ContainerService/managedClusters/f557-aks"
        name                                = "f557-aks"
        tags                                = {}
        # (39 unchanged attributes hidden)

      ~ default_node_pool {
            name                          = "default"
            tags                          = {}
            # (33 unchanged attributes hidden)

          - upgrade_settings {
              - drain_timeout_in_minutes      = 0 -> null
              - max_surge                     = "10%" -> null
              - node_soak_duration_in_minutes = 0 -> null
            }
        }

        # (6 unchanged blocks hidden)
    }

Plan: 0 to add, 2 to change, 0 to destroy.

We were not able to reproduce this issue on our side.

We've also consulted the service team, but we have no idea where this 30 came from. @Israphel, could you please give us a minimal example that reproduces this issue?

@Israphel
Author

Israphel commented Jul 2, 2024

Try with:

az aks nodepool update --cluster-name f557-aks --resource-group ba4d95fcea318222-rg --name default --max-surge 10% --node-soak-duration 10 --drain-timeout 30
