
Upgrading a cluster with deployed workloads may fail due to disallowed disruptions #2530

Closed
gdemonet opened this issue May 6, 2020 · 3 comments

Comments


gdemonet commented May 6, 2020

Component: salt, kubernetes, lifecycle

What happened:

When running the upgrade.sh script on a multi-node cluster with a workload (here, Keycloak) deployed on it, the rolling upgrade failed with:

[ERROR   ] An exception occurred in this state: Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/salt/state.py", line 1981, in call
    **cdata['kwargs'])
  File "/usr/lib/python2.7/site-packages/salt/loader.py", line 1977, in wrapper
    return f(*args, **kwargs)
  File "/var/cache/salt/master/extmods/states/metalk8s_drain.py", line 25, in node_drained
    res = __salt__['metalk8s_kubernetes.node_drain'](name, **kwargs)
  File "/var/cache/salt/master/extmods/modules/metalk8s_drain.py", line 532, in node_drain
    return drainer.run_drain(dry_run=dry_run)
  File "/var/cache/salt/master/extmods/modules/metalk8s_drain.py", line 381, in run_drain
    [pod['metadata']['name'] for pod in remaining_pods]
CommandExecutionError: Failed to evict pod "keycloak-0" in namespace "keycloak": (429)
Reason: Too Many Requests
HTTP response headers: HTTPHeaderDict({'date': 'Wed, 06 May 2020 10:14:20 GMT', 'content-length': '322', 'content-type': 'application/json', 'cache-control': 'no-cache, private'})
HTTP response body: {"kind":"Status","apiVersion":"v1","metadata":{},"status":"Failure","message":"Cannot evict pod as it would violate the pod's disruption budget.","reason":"TooManyRequests","details":{"causes":[{"reason":"DisruptionBudget","message":"The disruption budget keycloak needs 1 healthy pods and has 1 currently"}]},"code":429}

The custom drain fails with an uncaught error because a PodDisruptionBudget doesn't allow the eviction: the previous Node to upgrade had only just been uncordoned when we tried to drain the next one, so the workloads from that previous Node weren't back up yet.

What was expected:

The upgrade script should not fail because of the instability it itself induces by draining Nodes in a rolling fashion.

Steps to reproduce:

Run a slow-starting workload on multiple Nodes with a PodDisruptionBudget set to maxUnavailable: 1, then run the upgrade script. If the workload takes too long to come back up after the uncordon, the next drain should fail with this error.

Resolution proposal (optional):

The upgrade orchestration should wait and retry the eviction for some time when faced with such errors, to let things stabilize.
A timeout must be implemented to ensure we don't block the orchestration indefinitely.
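
For illustration, here is a minimal sketch of such a retry-with-deadline loop in Python. The create_eviction callable and the TooManyRequestsError exception are hypothetical placeholders, not the actual MetalK8s helpers:

import time

EVICTION_RETRY_DELAY = 5   # seconds to wait before retrying after a 429
EVICTION_TIMEOUT = 300     # overall deadline so the orchestration never blocks forever


class TooManyRequestsError(Exception):
    """Placeholder: the Eviction API answered 429 (disruption budget not satisfied)."""


def evict_with_retry(pod_name, create_eviction):
    """Retry an eviction rejected by a PodDisruptionBudget until it succeeds or times out.

    create_eviction is a hypothetical callable that posts the Eviction and
    raises TooManyRequestsError on a 429 response.
    """
    deadline = time.time() + EVICTION_TIMEOUT
    while True:
        try:
            return create_eviction(pod_name)
        except TooManyRequestsError:
            if time.time() >= deadline:
                raise  # give up, so the orchestration fails explicitly instead of hanging
            time.sleep(EVICTION_RETRY_DELAY)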

gdemonet added the kind:bug Something isn't working and topic:lifecycle Issues related to upgrade or downgrade of MetalK8s labels May 6, 2020
gdemonet added this to the MetalK8s 2.5.1 milestone May 6, 2020
gdemonet self-assigned this May 6, 2020

gdemonet commented May 7, 2020

Seems like we're missing this piece of logic (excerpt from the kubectl implementation):

err := d.EvictPod(pod, policyGroupVersion)
[...]
} else if apierrors.IsTooManyRequests(err) {
    fmt.Fprintf(d.ErrOut, "error when evicting pod %q (will retry after 5s): %v\n", pod.Name, err)
    time.Sleep(5 * time.Second)
}

Edit: we definitely are missing it

# salt/_modules/metalk8s_drain.py L.394-395
        for pod in pods:
            # TODO: "too many requests" error not handled
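
For reference, a rough Python equivalent of the kubectl branch above, as it could slot into that loop. The evict_pod and is_too_many_requests callables are illustrative stand-ins, not the module's real helpers:

import logging
import time

log = logging.getLogger(__name__)


def try_evict(pod_name, evict_pod, is_too_many_requests):
    """Attempt one eviction; on a 429, log it and wait 5s before the caller retries."""
    try:
        evict_pod(pod_name)
        return True
    except Exception as exc:
        if is_too_many_requests(exc):
            log.warning(
                "error when evicting pod %r (will retry after 5s): %s", pod_name, exc
            )
            time.sleep(5)
            return False  # caller keeps the pod in the remaining list and retries
        raise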

gdemonet added a commit that referenced this issue May 7, 2020
The Eviction API sends '429 Too Many Requests' when a requested eviction
can't be applied due to some disruption budget. This way, the client can
wait and retry later.
We didn't handle this error in our implementation, hence rolling
upgrades were failing as soon as we hit this situation.
We add support for this, and also move the "timeout" scope to the whole
eviction process (since we now may be stuck at the eviction creation).

Fixes: #2530
bert-e closed this as completed in 0c7ee85 May 7, 2020
gdemonet added a commit that referenced this issue May 7, 2020
gdemonet added a commit that referenced this issue May 7, 2020
Mostly debug information, plus one more visible log about the Eviction creation being retried when we receive a 429 from k-a.

See: #2530

gdemonet commented May 8, 2020

The current implementation is missing a time.sleep(5) after a 429 error, and is also missing some logs.

gdemonet reopened this May 8, 2020
gdemonet added a commit that referenced this issue May 8, 2020
Mostly debug information, plus one more visible log about the Eviction creation being retried when we receive a 429 from k-a.

See: #2530
gdemonet added a commit that referenced this issue May 11, 2020
Mostly debug information, plus one more visible log about the Eviction creation being retried when we receive a 429 from k-a.

See: #2530
gdemonet commented

Closing since #2536 is merged.
