Upgrading a cluster with deployed workloads may fail due to disallowed disruptions #2530
gdemonet added the kind:bug (Something isn't working) and topic:lifecycle (Issues related to upgrade or downgrade of MetalK8s) labels on May 6, 2020
Seems like we're missing this piece of logic (excerpt from the upstream `kubectl drain` implementation):

```go
err := d.EvictPod(pod, policyGroupVersion)
[...]
} else if apierrors.IsTooManyRequests(err) {
    fmt.Fprintf(d.ErrOut, "error when evicting pod %q (will retry after 5s): %v\n", pod.Name, err)
    time.Sleep(5 * time.Second)
}
```

Edit: we definitely are missing it:

```python
# salt/_modules/metalk8s_drain.py L.394-395
for pod in pods:
    # TODO: "too many requests" error not handled
```
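For comparison, a minimal Python sketch of that missing piece, assuming a hypothetical `evict_pod` callable that wraps the Eviction API and raises `kubernetes.client.rest.ApiException` on failure (this is not the actual metalk8s_drain.py code):

```python
import time

from kubernetes.client.rest import ApiException


def evict_with_retry(evict_pod, pod, retry_interval=5):
    """Retry an eviction for as long as the API answers 429 Too Many Requests.

    `evict_pod` is a hypothetical helper; this mirrors what `kubectl drain`
    does upstream rather than the actual MetalK8s implementation, and has
    no overall timeout.
    """
    while True:
        try:
            return evict_pod(pod)
        except ApiException as exc:
            if exc.status != 429:
                raise
            # The eviction is currently blocked by a PodDisruptionBudget:
            # wait a bit and try again.
            time.sleep(retry_interval)
```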
gdemonet added a commit that referenced this issue on May 7, 2020:

The Eviction API sends '429 Too Many Requests' when a requested eviction can't be applied due to some disruption budget. This way, the client can wait and retry later. We didn't handle this error in our implementation, hence rolling upgrades were failing as soon as we hit this situation. We add support for this, and also move the "timeout" scope to the whole eviction process (since we now may be stuck at the eviction creation). Fixes: #2530
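In other words, a rough sketch of what that change amounts to (hypothetical helper names; not the actual MetalK8s code): a single deadline covers eviction creation for all pods, and a 429 response triggers a wait-and-retry instead of an immediate failure.

```python
import time

from kubernetes.client.rest import ApiException


def evict_pods(evict_pod, pods, timeout=300, retry_interval=5):
    """Evict every pod, retrying on 429, under a single overall deadline.

    Sketch only: `evict_pod` is a hypothetical callable wrapping the
    Eviction API; the real logic lives in salt/_modules/metalk8s_drain.py.
    """
    deadline = time.monotonic() + timeout
    for pod in pods:
        while True:
            try:
                evict_pod(pod)
                break
            except ApiException as exc:
                if exc.status != 429:
                    raise
                # Eviction rejected because a PodDisruptionBudget would be
                # violated: wait and retry until the global deadline expires.
                if time.monotonic() + retry_interval > deadline:
                    raise RuntimeError(
                        "timed out waiting for disruptions to be allowed"
                    ) from exc
                time.sleep(retry_interval)
```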
gdemonet added a commit that referenced this issue on May 7, 2020.

gdemonet added a commit that referenced this issue on May 7, 2020:
Mostly debug information, one more visible log about Eviction creation retried in case we receive a 429 from k-a. See: #2530
gdemonet added a commit that referenced this issue on May 8, 2020:
Mostly debug information, one more visible log about Eviction creation retried in case we receive a 429 from k-a. See: #2530
gdemonet added a commit that referenced this issue on May 11, 2020:
Mostly debug information, one more visible log about Eviction creation retried in case we receive a 429 from k-a. See: #2530
Closing since #2536 is merged.
Component: salt, kubernetes, lifecycle
What happened:

When running the upgrade.sh script on a multi-node cluster with some workload (here, Keycloak) deployed on it, the rolling upgrade failed: the custom drain fails with an uncaught error because some PodDisruptionBudget doesn't allow the eviction. The previous Node we had to upgrade was only just uncordoned when we tried to drain the next Node, so workloads from the previous Node weren't back up yet.
What was expected:

For the upgrade script not to fail due to the instability it induces by draining Nodes in a rolling fashion.
Steps to reproduce:

Run a slow-starting workload on multiple Nodes with a PodDisruptionBudget set with maxUnavailable: 1, and run the upgrade script. If the workload takes too long to come back up after the uncordon, the next drain should fail with this error (see the sketch below for an example PodDisruptionBudget).
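For illustration, such a PodDisruptionBudget could be created with the Python Kubernetes client; the workload name, labels, and namespace below are made up, and policy/v1beta1 was the PDB API version current at the time.

```python
from kubernetes import client, config

config.load_kube_config()

# Hypothetical example: tolerate at most one unavailable replica of a
# slow-starting workload selected by the "app: slow-workload" label.
pdb = client.V1beta1PodDisruptionBudget(
    metadata=client.V1ObjectMeta(name="slow-workload-pdb"),
    spec=client.V1beta1PodDisruptionBudgetSpec(
        max_unavailable=1,
        selector=client.V1LabelSelector(
            match_labels={"app": "slow-workload"},
        ),
    ),
)

client.PolicyV1beta1Api().create_namespaced_pod_disruption_budget(
    namespace="default", body=pdb
)
```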
Resolution proposal (optional):

The upgrade orchestration should wait and retry the eviction for some time when faced with such errors, to let things stabilize.
A timeout must be implemented to ensure we don't block the orchestration indefinitely.