Terraform unexpectedly replaced all nodes in AKS #604

Open
YichenTFlexciton opened this issue Nov 11, 2024 · 3 comments
Labels
bug Something isn't working


@YichenTFlexciton

Is there an existing issue for this?

  • I have searched the existing issues

Greenfield/Brownfield provisioning

brownfield

Terraform Version

Terraform v1.4.4

Module Version

9.2.0

AzureRM Provider Version

4.9.0

Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

Terraform Configuration Files

I am managing an AKS cluster via Terraform. The cluster has four node pools: a default pool and three workload pools (one of which has a node count of 0).

Sample config as follows:

resource "azurerm_kubernetes_cluster" "aks" {
  lifecycle {
    ignore_changes = [
      default_node_pool[0].node_count
    ]
  }
  sku_tier                            = "Standard"
  name                                = "xxx"
  location                            = "xxx"
  resource_group_name                 = "xxx"
  dns_prefix                          = "xxx"
  kubernetes_version                  = var.kubernetes_version
  private_cluster_enabled             = true
  private_cluster_public_fqdn_enabled = true
  azure_policy_enabled                = false
  http_application_routing_enabled    = false
  role_based_access_control_enabled   = true
  workload_identity_enabled           = true
  oidc_issuer_enabled                 = true
  image_cleaner_interval_hours        = 48
  node_os_upgrade_channel             = "Unmanaged"

  # ... default_node_pool block and remaining arguments omitted ...
}


I do not want the default "NodeImage" value for node_os_upgrade_channel, since I would like to control when my node images get updated.


During the upgrade of the azurerm provider from 3.95.0 to 4.9.0, I saw a few default value changes, which are expected from the changelog.

 ~ resource "azurerm_kubernetes_cluster" "aks" {
        id                                  = "xxx"
        name                                = "xxx"
      + node_os_upgrade_channel             = "Unmanaged"
        # (31 unchanged attributes hidden)
        # (6 unchanged blocks hidden)
    }
  # module.xxx.azurerm_kubernetes_cluster_node_pool.auto_scaling_node_pool["zeronode"] will be updated in-place
  ~ resource "azurerm_kubernetes_cluster_node_pool" "auto_scaling_node_pool" {
        id                      = "zeronode"
        name                    = "zeronode"
        # (25 unchanged attributes hidden)
      + upgrade_settings {
          + drain_timeout_in_minutes      = 60
          + max_surge                     = "1"
          + node_soak_duration_in_minutes = 0
        }
        # (1 unchanged block hidden)
    }

I applied this plan, but it failed with:

╷
│ Error: updating Kubernetes Cluster 
│ Kubernetes Cluster Name: "xxx"): performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {
│   "code": "EtagMismatch",
│   "details": [
│    {
│     "code": "Unspecified",
│     "message": "rpc error: code = FailedPrecondition desc = Etag mismatched"
│    }
│   ],
│   "message": "Operation is not allowed: Another operation is in progress",
│   "subcode": "PutManagedClusterAndComponents_FailedPrecondition"
│  }

I am certain that no one else was running Terraform at that point in time.

I re-ran plan and found that the cluster was showing "Updating" in the Azure portal, despite there being no "forces replacement" indicator in the Terraform plan.

I therefore let the update finish. Afterwards, I found that my node image is "kubernetes.azure.com/node-image-version=AKSUbuntu-2204containerd-202410.27.0", which is different from my other clusters that have the same setup.
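As a possible workaround (a sketch only; the values below simply mirror the provider's new defaults shown in the plan above), the upgrade_settings block could be declared explicitly on the node pool resource so that configuration and remote state agree, or upgrade_settings could be added to ignore_changes in the same way default_node_pool[0].node_count already is:

resource "azurerm_kubernetes_cluster_node_pool" "auto_scaling_node_pool" {
  # ... existing arguments unchanged ...

  # Mirror the provider's new defaults so the plan no longer shows an in-place update.
  upgrade_settings {
    drain_timeout_in_minutes      = 60
    max_surge                     = "1"
    node_soak_duration_in_minutes = 0
  }
}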

tfvars variables values

node_os_upgrade_channel = "Unmanaged"
kubernetes_version      = "1.30.3"

Debug Output/Panic Output

performing CreateOrUpdate: unexpected status 409 (409 Conflict) with response: {
│   "code": "EtagMismatch",
│   "details": [
│    {
│     "code": "Unspecified",
│     "message": "rpc error: code = FailedPrecondition desc = Etag mismatched"
│    }
│   ],
│   "message": "Operation is not allowed: Another operation is in progress",
│   "subcode": "PutManagedClusterAndComponents_FailedPrecondition"
│  }

Expected Behaviour

Terraform should not replace all nodes

Actual Behaviour

All nodes were replaced

Steps to Reproduce

No response

Important Factoids

No response

References

No response

@djmcgreal-cc

Hi, I'm also getting a similar error, but on module 9.2.0 with provider 3.117.0:

Error: Failed to update resource

with module.target.module.service[0].module.kubernetes_cluster.module.aks.azapi_update_resource.aks_cluster_post_create,
on .terraform/modules/target.service.kubernetes_cluster.aks/main.tf line 653, in resource "azapi_update_resource" "aks_cluster_post_create":
653: resource "azapi_update_resource" "aks_cluster_post_create" {

updating "Resource: (ResourceId
"/subscriptions/[redacted]/resourceGroups/McGreal/providers/Microsoft.ContainerService/managedClusters/aks-McGreal-service" / Api Version
"2024-02-01")": PUT
https://management.azure.com/subscriptions/[redacted]/resourceGroups/McGreal/providers/Microsoft.ContainerService/managedClusters/aks-McGreal-service


RESPONSE 409: 409 Conflict
ERROR CODE: EtagMismatch

{
  "code": "EtagMismatch",
  "details": [
    {
      "code": "Unspecified",
      "message": "rpc error: code = FailedPrecondition desc = Etag mismatched"
    }
  ],
  "message": "Operation is not allowed: Another operation is in progress",
  "subcode": "PutManagedClusterAndComponents_FailedPrecondition"
}

There don't seem to be any other operations running on the cluster, based on the activity log: only Succeeded ones or these Failed ones. It was a brand-new cluster yesterday; we have made several clusters with the current approach and have not encountered this issue before.

@zioproto
Collaborator

@YichenTFlexciton in your code example you are using the azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool resources directly.

Can you please clarify whether you are using the Terraform module https://registry.terraform.io/modules/Azure/aks/azurerm/latest? You opened this GitHub issue in the repository of that module.
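(For reference, consuming the module rather than the raw resources looks roughly like the sketch below; the inputs are omitted because they depend on the module version, see the registry page linked above.)

module "aks" {
  source  = "Azure/aks/azurerm"
  version = "9.2.0"

  # ... module inputs as documented on the Terraform Registry page ...
}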

If you are using the azurerm_kubernetes_cluster resource directly, you should open an issue at https://github.com/hashicorp/terraform-provider-azurerm/issues instead.

Cc: @ms-henglu for a possible race condition in the provider when performing update operations on azurerm_kubernetes_cluster and azurerm_kubernetes_cluster_node_pool.

@Tiana125

Tiana125 commented Dec 3, 2024

Hi @zioproto

Thanks for your reply! I am using the azurerm provider and the azurerm_kubernetes_cluster resource directly, so I will open an issue in the provider repository. Thanks!
