
Recommended way to change agents_size without downtime? #559

Open
Israphel opened this issue Jun 5, 2024 · 3 comments

Israphel commented Jun 5, 2024

Is there an existing issue for this?

  • I have searched the existing issues

Description

We deploy our clusters with a default node_pool, using:

agents_pool_name            = "default"                                      # system/default node pool
agents_pool_max_surge       = try(each.value.max_surge, "10%")               # max surge during upgrades
agents_availability_zones   = ["1", "2", "3"]
agents_type                 = "VirtualMachineScaleSets"
agents_size                 = try(each.value.agents_size, "Standard_D2s_v3") # VM SKU for the pool
temporary_name_for_rotation = "tmp"                                          # temp pool name used while rotating the default pool

We're replacing agents_size with the ARM equivalent, and we can see the "tmp" node pool being created, but then all the default nodes are drained at once, without respecting PDBs, essentially taking down every service:

1s          Normal   Drain             node/aks-default-15731243-vmss000009      Draining node: aks-default-15731243-vmss000009
2s          Normal   Drain             node/aks-default-15731243-vmss00000x      Draining node: aks-default-15731243-vmss00000x
2s          Normal   Drain             node/aks-default-15731243-vmss00000e      Draining node: aks-default-15731243-vmss00000e
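
(For reference, the change itself is just the SKU swap in the module inputs. A minimal sketch; Standard_D2pds_v5 below is a hypothetical ARM64 SKU, not necessarily the one we used:)

agents_size = try(each.value.agents_size, "Standard_D2pds_v5") # hypothetical ARM64 SKU, was Standard_D2s_v3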

Are we doing this the wrong way? How can we change agents_size without such drastic draining?

New or Affected Resource(s)/Data Source(s)

azurerm_kubernetes_cluster

zioproto (Collaborator) commented Jun 7, 2024

@Israphel could you please confirm which version of the module you are using?

zioproto (Collaborator) commented Jun 7, 2024

@Israphel I understand you are trying to change the agents_size of the system node pool. If you look at the provider documentation, this changes the default_node_pool block of the azurerm_kubernetes_cluster resource.

Please check this documentation:

https://registry.terraform.io/providers/hashicorp/azurerm/latest/docs/resources/kubernetes_cluster

[Screenshot: azurerm provider documentation for the default_node_pool block, 2024-06-07]

The behaviour you see is expected, and I don't think this is something we can work around in the module.
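
In raw provider terms, the module's agents_size maps to vm_size inside the default_node_pool block, so the resource being changed looks roughly like this (a minimal sketch; attribute names are from the provider docs, values are hypothetical):

resource "azurerm_kubernetes_cluster" "example" {
  # ...other required arguments omitted...
  default_node_pool {
    name                        = "default"
    vm_size                     = "Standard_D2s_v3" # changing this forces the whole pool to be rotated
    temporary_name_for_rotation = "tmp"             # pool name used during the rotation
  }
}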

I found this related provider issue:

Feel free to open a new issue upstream at https://github.com/hashicorp/terraform-provider-azurerm/issues if you would like this behaviour to change.

I will keep this issue open in case you have additional questions.

Thanks

Israphel (Author) commented Jun 7, 2024

I use 8.0.0

The only way we found was to create a new node pool, drain all the default nodes, change agents_size, and then drain the temporary node pool once more. Is this what everyone is doing to prevent downtime?
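
Roughly, the extra pool from that workaround expressed in Terraform (a sketch; the resource name is hypothetical, the aks_id output is assumed, and the drains themselves happen outside Terraform with kubectl):

# temporary user pool to absorb workloads while the default pool is resized
resource "azurerm_kubernetes_cluster_node_pool" "migration" {
  name                  = "migration"         # hypothetical pool name
  kubernetes_cluster_id = module.aks.aks_id   # assumes the module exposes an aks_id output
  vm_size               = "Standard_D2pds_v5" # hypothetical target SKU
  node_count            = 3
}
# then: kubectl cordon/drain the default nodes (PDBs are respected),
# change agents_size and apply, and finally drain and remove this pool.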

The problem we see is that this doesn't happen when you upgrade Kubernetes: everything goes smoothly and the PDBs are respected. But changing the instance type just drains everything at once, which is too aggressive.
