
Upgrading a cluster with deployed workloads may fail due to disallowed disruptions #2530

Closed
gdemonet opened this issue May 6, 2020 · 3 comments

Comments


gdemonet commented May 6, 2020

Component: salt, kubernetes, lifecycle

What happened:

When running the upgrade.sh script on a multi-node cluster with a workload (here, Keycloak) deployed on it, the rolling upgrade failed with:

[ERROR   ] An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/state.py", line 1981, in call
    **cdata['kwargs'])
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1977, in wrapper
    return f(*args, **kwargs)
  File "/var/cache/salt/master/extmods/states/metalk8s_drain.py", line 25, in node_drained
    res = __salt__['metalk8s_kubernetes.node_drain'](name, **kwargs)
  File "/var/cache/salt/master/extmods/modules/metalk8s_drain.py", line 532, in node_drain
    return drainer.run_drain(dry_run=dry_run)
  File "/var/cache/salt/master/extmods/modules/metalk8s_drain.py", line 381, in run_drain
    [pod['metadata']['name'] for pod in remaining_pods]
CommandExecutionError: Failed to evict pod "keycloak-0" in namespace "keycloak": (429)
Reason: Too Many Requests
HTTP response headers: HTTPHeaderDict({'date': 'Wed, 06 May 2020 10:14:20 GMT', 'content-length': '322', 'content-type': 'application/json', 'cache-control': 'no-cache, private'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget keycloak needs 1 healthy pods and has 1 currently"}]},"code":429}

The custom drain fails with an uncaught error because a PodDisruptionBudget doesn't allow the eviction: the previous Node to upgrade had only just been uncordoned when we tried to drain the next one, so the workloads from that previous Node weren't back up yet.

What was expected:

The upgrade script should not fail because of the instability it itself induces by draining Nodes in a rolling fashion.

Steps to reproduce:

Run a slow-starting workload on multiple Nodes with a PodDisruptionBudget set to maxUnavailable: 1, then run the upgrade script. If the workload takes too long to come back up after the uncordon, the next drain should fail with this error.

Resolution proposal (optional):

The upgrade orchestration should wait and retry the eviction for some time when faced with such errors, to let things stabilize.
A timeout must be implemented to ensure we don't block the orchestration indefinitely.
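
For illustration, here is a minimal sketch of such a retry-with-deadline loop in Python. The create_eviction callable and the TooManyRequestsError exception are hypothetical placeholders, not the actual MetalK8s helpers:

import time

EVICTION_RETRY_DELAY = 5   # seconds to wait before retrying after a 429
EVICTION_TIMEOUT = 300     # overall deadline so the orchestration never blocks forever


class TooManyRequestsError(Exception):
    """Placeholder: the Eviction API answered 429 (disruption budget not satisfied)."""


def evict_with_retry(pod_name, create_eviction):
    """Retry an eviction rejected by a PodDisruptionBudget until it succeeds or times out.

    create_eviction is a hypothetical callable that posts the Eviction and
    raises TooManyRequestsError on a 429 response.
    """
    deadline = time.time() + EVICTION_TIMEOUT
    while True:
        try:
            return create_eviction(pod_name)
        except TooManyRequestsError:
            if time.time() >= deadline:
                raise  # give up, so the orchestration fails explicitly instead of hanging
            time.sleep(EVICTION_RETRY_DELAY)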

gdemonet added the kind:bug Something isn't working and topic:lifecycle Issues related to upgrade or downgrade of MetalK8s labels May 6, 2020
gdemonet added this to the MetalK8s 2.5.1 milestone May 6, 2020
gdemonet self-assigned this May 6, 2020

gdemonet commented May 7, 2020

Seems like we're missing this piece of logic (excerpt from the kubectl implementation):

err := d.EvictPod(pod, policyGroupVersion)
[...]
} else if apierrors.IsTooManyRequests(err) {
    fmt.Fprintf(d.ErrOut, "error when evicting pod %q (will retry after 5s): %v\n", pod.Name, err)
    time.Sleep(5 * time.Second)
}

Edit: we definitely are missing it

# salt/_modules/metalk8s_drain.py L.394-395
        for pod in pods:
            # TODO: "too many requests" error not handled
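
For reference, a rough Python equivalent of the kubectl branch above, as it could slot into that loop. The evict_pod and is_too_many_requests callables are illustrative stand-ins, not the module's real helpers:

import logging
import time

log = logging.getLogger(__name__)


def try_evict(pod_name, evict_pod, is_too_many_requests):
    """Attempt one eviction; on a 429, log it and wait 5s before the caller retries."""
    try:
        evict_pod(pod_name)
        return True
    except Exception as exc:
        if is_too_many_requests(exc):
            log.warning(
                "error when evicting pod %r (will retry after 5s): %s", pod_name, exc
            )
            time.sleep(5)
            return False  # caller keeps the pod in the remaining list and retries
        raise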

gdemonet added a commit that referenced this issue May 7, 2020
The Eviction API sends '429 Too Many Requests' when a requested eviction
can't be applied due to some disruption budget. This way, the client can
wait and retry later.
We didn't handle this error in our implementation, hence rolling
upgrades were failing as soon as we hit this situation.
We add support for this, and also move the "timeout" scope to the whole
eviction process (since we now may be stuck at the eviction creation).

Fixes: #2530
bert-e closed this as completed in 0c7ee85 May 7, 2020
gdemonet added a commit that referenced this issue May 7, 2020
gdemonet added a commit that referenced this issue May 7, 2020
Mostly debug information, plus one more visible log about the Eviction creation being retried when we receive a 429 from k-a.

See: #2530

gdemonet commented May 8, 2020

The current implementation is missing a time.sleep(5) after a 429 error, and is also missing some logs.

gdemonet reopened this May 8, 2020
gdemonet added a commit that referenced this issue May 8, 2020
Mostly debug information, plus one more visible log about the Eviction creation being retried when we receive a 429 from k-a.

See: #2530
gdemonet added a commit that referenced this issue May 11, 2020
Mostly debug information, plus one more visible log about the Eviction creation being retried when we receive a 429 from k-a.

See: #2530
gdemonet commented

Closing since #2536 is merged.
