Issue 493 Allow operator to recover from FailedUpgrade #496
Conversation
… taking more than the timeout to get back in the cluster) the operator cannot recover even if the cluster is healthy. Adding steps to recover once the cluster is completely upgraded (potentially manual work) Signed-off-by: Frank Vissing <[email protected]>
Signed-off-by: Frank Vissing <[email protected]>
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
+ Coverage   85.12%   85.91%   +0.79%
==========================================
  Files          12       12
  Lines        1613     1633      +20
==========================================
+ Hits         1373     1403      +30
+ Misses        155      145      -10
  Partials       85       85

View full report in Codecov by Sentry.
Signed-off-by: Frank Vissing <[email protected]>
So I applied this to a cluster having the same issue as #493 and the cluster successfully healed.
But there can be issues when the upgrade is failing for a valid reason, right?
Not sure I understand. If the StatefulSet is n/n and the revision is correct, what could not be OK?
Suppose the user has given a wrong image name or repository during the upgrade. In that case the upgrade will never succeed and we would like to mark the upgrade as failed. As the upgrade strategy is …
But I guess in that case the StatefulSet will not reach n/n but rather (n-1)/n, as the pod that has gotten the invalid image applied will never become ready.
Or is there something I am missing here?
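For reference, the distinction discussed above is visible directly in the standard apps/v1 StatefulSet status fields. Below is a minimal sketch of such a check, not the operator's actual code; the function name is made up for illustration.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// statefulSetFullyUpgraded reports whether every replica is ready and running
// the updated revision. With a bad image the replaced pod never becomes Ready,
// so ReadyReplicas stays at n-1 and CurrentRevision != UpdateRevision, and this
// stays false; only a slow-but-successful upgrade ever satisfies it.
func statefulSetFullyUpgraded(sts *appsv1.StatefulSet) bool {
	replicas := int32(1) // Kubernetes defaults Spec.Replicas to 1 when unset
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	return sts.Status.ReadyReplicas == replicas &&
		sts.Status.UpdatedReplicas == replicas &&
		sts.Status.CurrentRevision == sts.Status.UpdateRevision
}

Comparing CurrentRevision to UpdateRevision also guards against the case where all pods are ready but some are still on the old revision, so a genuinely failed upgrade would not be cleared by a check of this shape.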
LGTM
If the upgrade takes longer than the timeout (e.g. a node takes more than the timeout to get back into the cluster), the operator cannot recover even if the cluster is healthy. This change adds steps to recover once the cluster is completely upgraded (potentially after manual work).
Change log description
Fix operator stuck in FailedUpgrade, even after the cluster is upgraded and healthy.
Purpose of the change
fixes #493
What the code does
Check if the StatefulSet is fully upgraded when the operator is in UpgradeFailed mode; if the cluster is fully upgraded and healthy, remove the failed state and complete the upgrade.
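A rough sketch of where such a recovery step could sit, assuming a hypothetical clusterStatus type with upgradeFailed and upgrading flags; the real operator's status type and condition handling will differ, and only the apps/v1 StatefulSet fields below are the standard Kubernetes API.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// clusterStatus is a stand-in for the operator's real status type; the fields
// are illustrative only.
type clusterStatus struct {
	upgradeFailed bool
	upgrading     bool
}

// recoverFromFailedUpgrade clears the failed state once the StatefulSet shows
// every replica ready and on the updated revision, i.e. the upgrade has in
// fact completed despite the earlier timeout.
func recoverFromFailedUpgrade(status *clusterStatus, sts *appsv1.StatefulSet) {
	if !status.upgradeFailed {
		return // nothing to recover from
	}
	replicas := int32(1)
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	fullyUpgraded := sts.Status.ReadyReplicas == replicas &&
		sts.Status.UpdatedReplicas == replicas &&
		sts.Status.CurrentRevision == sts.Status.UpdateRevision
	if fullyUpgraded {
		// The earlier timeout was transient: drop the failed state and mark
		// the upgrade as complete.
		status.upgradeFailed = false
		status.upgrading = false
	}
}

Run as part of each reconcile, a step like this lets the operator pick up a cluster that healed after the timeout, rather than staying stuck in UpgradeFailed until someone intervenes.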
How to verify it
Make sure that a node will not come online within the 10 min timeout when doing an upgrade.
Verify that the cluster is stuck in UpgradeFailed with e.g. 2 out of 3 nodes ready.
Upgrade the operator to a build containing this fix, and verify that the upgrade completes.
Obviously, if you are already running this version, a failed upgrade will recover once the cluster is in a good and upgraded state.