Terraform unexpectedly replaced all nodes in AKS #604

Open
YichenTFlexciton opened this issue Nov 11, 2024 · 3 comments
Labels
bug Something isn't working


@YichenTFlexciton

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

Terraform v1.4.4

Module Version

9.2.0

AzureRM Provider Version

4.9.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

I am managing an AKS cluster via Terraform. The cluster has four node pools: a default pool and three workload pools (one of which has a node count of 0).

Sample config as follows:

resource "azurerm_kubernetes_cluster" "aks" {
  lifecycle {
    ignore_changes = [
      default_node_pool[0].node_count
    ]
  }
  sku_tier                            = "Standard"
  name                                = "xxx"
  location                            = "xxx"
  resource_group_name                 = "xxx"
  dns_prefix                          = "xxx"
  kubernetes_version                  = var.kubernetes_version
  private_cluster_enabled             = true
  private_cluster_public_fqdn_enabled = true
  azure_policy_enabled                = false
  http_application_routing_enabled    = false
  role_based_access_control_enabled   = true
  workload_identity_enabled           = true
  oidc_issuer_enabled                 = true
  image_cleaner_interval_hours        = 48
  node_os_upgrade_channel             = "Unmanaged"

  # ... default_node_pool block and remaining arguments omitted ...
}


I do not want the default "NodeImage" value for node_os_upgrade_channel, since I would like to control when my node images get updated.


During the upgrade of the azurerm provider from 3.95.0 to 4.9.0, I saw a few default value changes, which are expected from the changelog.

 ~ resource "azurerm_kubernetes_cluster" "aks" {
        id                                  = "xxx"
        name                                = "xxx"
      + node_os_upgrade_channel             = "Unmanaged"
        # (31 unchanged attributes hidden)
        # (6 unchanged blocks hidden)
    }
  # module.xxx.azurerm_kubernetes_cluster_node_pool.auto_scaling_node_pool["zeronode"] will be updated in-place
  ~ resource "azurerm_kubernetes_cluster_node_pool" "auto_scaling_node_pool" {
        id                      = "zeronode"
        name                    = "zeronode"
        # (25 unchanged attributes hidden)
      + upgrade_settings {
          + drain_timeout_in_minutes      = 60
          + max_surge                     = "1"
          + node_soak_duration_in_minutes = 0
        }
        # (1 unchanged block hidden)
    }

I applied this plan, but it failed with:

╷
│ Error: updating Kubernetes Cluster 
│ Kubernetes Cluster Name: "xxx"): performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {
│   "code": "EtagMismatch",
│   "details": [
│    {
│     "code": "Unspecified",
│     "message": "rpc error: code = FailedPrecondition desc = Etag mismatched"
│    }
│   ],
│   "message": "Operation is not allowed: Another operation is in progress",
│   "subcode": "PutManagedClusterAndComponents_FailedPrecondition"
│  }

I am certain that no one else was running Terraform at that point in time.

I re-ran plan and found that the cluster was showing "Updating" in the Azure portal, despite there being no "forces replacement" indicator in the Terraform plan.

I therefore let the update finish. Afterwards, I found that my node image is "kubernetes.azure.com/node-image-version=AKSUbuntu-2204containerd-202410.27.0", which is different from my other clusters that have the same setup.
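As a possible workaround (a sketch only; the values below simply mirror the provider's new defaults shown in the plan above), the upgrade_settings block could be declared explicitly on the node pool resource so that configuration and remote state agree, or upgrade_settings could be added to ignore_changes in the same way default_node_pool[0].node_count already is:

resource "azurerm_kubernetes_cluster_node_pool" "auto_scaling_node_pool" {
  # ... existing arguments unchanged ...

  # Mirror the provider's new defaults so the plan no longer shows an in-place update.
  upgrade_settings {
    drain_timeout_in_minutes      = 60
    max_surge                     = "1"
    node_soak_duration_in_minutes = 0
  }
}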

tfvars variables values

node_os_upgrade_channel = "Unmanaged"
kubernetes_version      = "1.30.3"

Debug Output/Panic Output

performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {
│   "code": "EtagMismatch",
│   "details": [
│    {
│     "code": "Unspecified",
│     "message": "rpc error: code = FailedPrecondition desc = Etag mismatched"
│    }
│   ],
│   "message": "Operation is not allowed: Another operation is in progress",
│   "subcode": "PutManagedClusterAndComponents_FailedPrecondition"
│  }

Expected Behaviour

Terraform should not replace all nodes

Actual Behaviour

All nodes were replaced

Steps to Reproduce

No response

Important Factoids

No response

References

No response

@djmcgreal-cc

Hi, I'm also getting a similar error, but on module 9.2.0 with provider 3.117.0:

Error: Failed to update resource

with module.target.module.service[0].module.kubernetes_cluster.module.aks.azapi_update_resource.aks_cluster_post_create,
on .terraform/modules/target.service.kubernetes_cluster.aks/main.tf line 653, in resource "azapi_update_resource" "aks_cluster_post_create":
653: resource "azapi_update_resource" "aks_cluster_post_create" {

updating "Resource: (ResourceId
"/subscriptions/[redacted]/resourceGroups/McGreal/providers/Microsoft.ContainerService/managedClusters/aks-McGreal-service" / Api Version
"2024-02-01")": PUT
https://management.azure.com/subscriptions/[redacted]/resourceGroups/McGreal/providers/Microsoft.ContainerService/managedClusters/aks-McGreal-service


RESPONSE 409: 409 Conflict
ERROR CODE: EtagMismatch

{
  "code": "EtagMismatch",
  "details": [
    {
      "code": "Unspecified",
      "message": "rpc error: code = FailedPrecondition desc = Etag mismatched"
    }
  ],
  "message": "Operation is not allowed: Another operation is in progress",
  "subcode": "PutManagedClusterAndComponents_FailedPrecondition"
}

There don't seem to be any other operations running on the cluster, based on the activity log: only Succeeded ones or these Failed ones. It was a brand-new cluster yesterday; we have made several clusters with the current approach and have not encountered this issue before.

@zioproto
Collaborator

@YichenTFlexciton in your code example you are using the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources directly.

Can you please clarify whether you are using the Terraform module https://registry.terraform.io/modules/Azure/aks/azurerm/latest? You opened this GitHub issue in the repository of that module.
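(For reference, consuming the module rather than the raw resources looks roughly like the sketch below; the inputs are omitted because they depend on the module version, see the registry page linked above.)

module "aks" {
  source  = "Azure/aks/azurerm"
  version = "9.2.0"

  # ... module inputs as documented on the Terraform Registry page ...
}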

If you are using the azurerm_kubernetes_cluster resource directly, you should open an issue at https://github.com/hashicorp/terraform-provider-azurerm/issues instead.

Cc: @ms-henglu for a possible race condition in the provider when performing update operations on azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool.

@Tiana125

Tiana125 commented Dec 3, 2024

Hi @zioproto

Thanks for your reply! I am using the azurerm provider and the azurerm_kubernetes_cluster resource directly, so I will open an issue in the provider repository. Thanks!
