CA Issues invalid scale downs due to scale-down processing delay in the MCM #342

elankath · 2024-12-22T08:55:45Z

What happened:

CA requested an undesired additional reduction of the MCM MachineDeployment replicas while the first request to remove machines was still being processed. This occurred due to api-server timeouts.

Example log

As shown below, the machine set was scaled down at 21:40:04, but the change was not processed until 21:40:38 because of client-side throttling:

2024-12-14 21:40:03 | {"log":"Processing the machinedeployment \"shoot--hc-eu10--prod-haas-edge-g-z3\" (with replicas 3)","pid":"1","severity":"INFO","source":"deployment.go:433"}
2024-12-14 21:40:03 | {"log":"Desired replicas annotation value: 6 on machineSet shoot--hc-eu10--prod-haas-edge-g-z3-85b5b, Spec Desired Replicas value: 3 on corresponding machineDeployment, so scaling has happened.","pid":"1","severity":"INFO","source":"deployment_sync.go:693"}
2024-12-14 21:40:03 | {"log":"Scaling detected for machineDeployment shoot--hc-eu10--prod-haas-edge-g-z3","pid":"1","severity":"INFO","source":"deployment.go:533"}
2024-12-14 21:40:03 | {"log":"Scaling latest/theOnlyActive machineSet shoot--hc-eu10--prod-haas-edge-g-z3-85b5b","pid":"1","severity":"INFO","source":"deployment_sync.go:412"}
2024-12-14 21:40:04 | {"log":"Waited for 980.323975ms due to client-side throttling, not priority and fairness, request: PUT:https://api.aws-eu2.garden.internal.live.k8s.ondemand.com:443/apis/machine.sapcloud.io/v1alpha1/namespaces/shoot--hc-eu10--prod-haas/machinesets/shoot--hc-eu10--prod-haas-edge-g-z3-85b5b?timeout=1m0s","pid":"1","severity":"INFO","source":"request.go:632"}
2024-12-14 21:40:04 | {"log":"shoot--hc-eu10--prod-haas-edge-g-z3-85b5b updated. Desired machine count change: 6-\u003e3","pid":"1","severity":"INFO","source":"machineset.go:155"}
2024-12-14 21:40:04 | {"log":"Event(v1.ObjectReference{Kind:\"MachineDeployment\", Namespace:\"shoot--hc-eu10--prod-haas\", Name:\"shoot--hc-eu10--prod-haas-edge-g-z3\", UID:\"e537417f-4659-4f95-80fe-5bc448a5992e\", APIVersion:\"machine.sapcloud.io/v1alpha1\", ResourceVersion:\"54387612340\", FieldPath:\"\"}): type: 'Normal' reason: 'ScalingMachineSet' Scaled down machine set shoot--hc-eu10--prod-haas-edge-g-z3-85b5b to 3","pid":"1","severity":"INFO","source":"event.go:377"}

During this window, CA again requested deletion of the nodes, which caused a second scale-down:

2024-12-14 21:40:38 | {"log":"Processing the machineset \"shoot--hc-eu10--prod-haas-edge-g-z3-85b5b\" with replicas 0 associated with machine class: \"\"","pid":"1","severity":"INFO","source":"machineset.go:501"}

What you expected to happen:

CA scale-down should be idempotent and should occur only once, regardless of any timeouts or throttling. The CA should not reduce the replicas any further just because MCM is delayed or has problems executing the scale-down.
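
To illustrate the expected behaviour, here is a minimal Go sketch of an idempotent replica calculation. It is not the CA or MCM implementation: `ScaleDownTarget` and the `pendingDeletion` set are hypothetical names introduced for this example, and the sketch assumes the caller can remember which machines an earlier, still unprocessed request has already accounted for.

```go
package main

// ScaleDownTarget is a hypothetical helper, not part of CA or MCM.
// It returns the replica count a MachineDeployment should have after the
// given machines are removed, counting each machine at most once.
//
// specReplicas:    replicas currently in the MachineDeployment spec
// pendingDeletion: machines already accounted for by an earlier request
//                  that MCM has not finished processing (assumed bookkeeping)
// toDelete:        machines the autoscaler wants removed in this cycle
func ScaleDownTarget(specReplicas int, pendingDeletion map[string]bool, toDelete []string) int {
	target := specReplicas
	for _, name := range toDelete {
		if pendingDeletion[name] {
			// Already reflected in specReplicas by a previous request,
			// so decrementing again would double-count the machine.
			continue
		}
		pendingDeletion[name] = true
		target--
	}
	if target < 0 {
		target = 0
	}
	return target
}
```

With the numbers from the log above, a first call with 6 replicas and three machines to delete yields 3. Repeating the same request while those machines are still pending yields 3 again, instead of the drop to 0 that was observed.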

How to reproduce it (as minimally and precisely as possible):

Stall the scale-down in MCM with an artificially extended delay so that the CA issues erroneous scale-downs in subsequent processing cycles.
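
One possible way to inject such a delay is to wrap the HTTP transport that MCM uses to talk to the API server, so that MachineSet updates are held back. The sketch below uses only the Go standard library. `delayingTransport` and `shouldDelay` are hypothetical names for this example, and the path match is an assumption based on the `PUT .../machinesets/...` request seen in the log above. Wiring it into MCM (for instance through the `WrapTransport` field of client-go's `rest.Config`) is also an assumption about a local test setup, not something MCM provides.

```go
package main

import (
	"net/http"
	"strings"
	"time"
)

// delayingTransport is a hypothetical test helper, not part of MCM or
// client-go. It stalls matching requests before forwarding them.
type delayingTransport struct {
	next  http.RoundTripper // the real transport to forward to
	delay time.Duration     // artificial stall applied to matching requests
}

// shouldDelay reports whether a request looks like a MachineSet update,
// modelled on the PUT .../machinesets/<name> request in the log.
func shouldDelay(method, path string) bool {
	return method == http.MethodPut && strings.Contains(path, "/machinesets/")
}

// RoundTrip implements http.RoundTripper.
func (d *delayingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	if shouldDelay(req.Method, req.URL.Path) {
		select {
		case <-time.After(d.delay):
			// Delay elapsed; fall through and send the request.
		case <-req.Context().Done():
			// Give up if the caller cancels or times out first.
			return nil, req.Context().Err()
		}
	}
	return d.next.RoundTrip(req)
}
```

Choosing a `delay` longer than one CA processing cycle should keep the MachineSet update pending long enough for the CA to issue a second scale-down.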

elankath added the kind/bug Bug label Dec 22, 2024

elankath assigned rishabh-11 and elankath and unassigned elankath Dec 22, 2024

elankath added the area/auto-scaling Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related label Dec 22, 2024
