Issue 493 Allow operator to recover from FailedUpgrade #496
Conversation
… taking more than the timeout to get back in the cluster) the operator cannot recover even if the cluster is healthy. Adding steps to recover once the cluster is completely upgraded (potentially manual work) Signed-off-by: Frank Vissing <[email protected]>
Signed-off-by: Frank Vissing <[email protected]>
Codecov Report
Additional details and impacted files

@@            Coverage Diff             @@
##           master     #496      +/-   ##
==========================================
+ Coverage   85.12%   85.91%   +0.79%
==========================================
  Files          12       12
  Lines        1613     1633      +20
==========================================
+ Hits         1373     1403      +30
+ Misses        155      145      -10
  Partials       85       85

View full report in Codecov by Sentry.
Signed-off-by: Frank Vissing <[email protected]>
So I applied this to a cluster having the same issue as #493 and the cluster successfully healed.
But there can be issues when the upgrade is failing for a valid reason, right?
Not sure I understand. If the StatefulSet is n/n and the revision is correct, what could not be OK?
Suppose the user has given a wrong image name or repository during the upgrade. In that case the upgrade will never succeed and we would like to mark the upgrade as failed. As the upgrade strategy is …
But I guess in that case the StatefulSet will not reach n/n but rather (n-1)/n, as the pod that has gotten the invalid image applied will never become ready.
Or is there something I am missing here?
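For reference, the distinction discussed above is visible directly in the standard apps/v1 StatefulSet status fields. Below is a minimal sketch of such a check, not the operator's actual code; the function name is made up for illustration.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// statefulSetFullyUpgraded reports whether every replica is ready and running
// the updated revision. With a bad image the replaced pod never becomes Ready,
// so ReadyReplicas stays at n-1 and CurrentRevision != UpdateRevision, and this
// stays false; only a slow-but-successful upgrade ever satisfies it.
func statefulSetFullyUpgraded(sts *appsv1.StatefulSet) bool {
	replicas := int32(1) // Kubernetes defaults Spec.Replicas to 1 when unset
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	return sts.Status.ReadyReplicas == replicas &&
		sts.Status.UpdatedReplicas == replicas &&
		sts.Status.CurrentRevision == sts.Status.UpdateRevision
}

Comparing CurrentRevision to UpdateRevision also guards against the case where all pods are ready but some are still on the old revision, so a genuinely failed upgrade would not be cleared by a check of this shape.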
LGTM
If the upgrade takes longer than the timeout (e.g. a node takes more than the timeout to get back into the cluster), the operator cannot recover even if the cluster is healthy. This change adds steps to recover once the cluster is completely upgraded (potentially after manual work).
Change log description
Fix operator stuck in FailedUpgrade, even after the cluster is upgraded and healthy.
Purpose of the change
fixes #493
What the code does
Check if the StatefulSet is fully upgraded when the operator is in UpgradeFailed mode; if the cluster is fully upgraded and healthy, remove the failed state and complete the upgrade.
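A rough sketch of where such a recovery step could sit, assuming a hypothetical clusterStatus type with upgradeFailed and upgrading flags; the real operator's status type and condition handling will differ, and only the apps/v1 StatefulSet fields below are the standard Kubernetes API.

package sketch

import (
	appsv1 "k8s.io/api/apps/v1"
)

// clusterStatus is a stand-in for the operator's real status type; the fields
// are illustrative only.
type clusterStatus struct {
	upgradeFailed bool
	upgrading     bool
}

// recoverFromFailedUpgrade clears the failed state once the StatefulSet shows
// every replica ready and on the updated revision, i.e. the upgrade has in
// fact completed despite the earlier timeout.
func recoverFromFailedUpgrade(status *clusterStatus, sts *appsv1.StatefulSet) {
	if !status.upgradeFailed {
		return // nothing to recover from
	}
	replicas := int32(1)
	if sts.Spec.Replicas != nil {
		replicas = *sts.Spec.Replicas
	}
	fullyUpgraded := sts.Status.ReadyReplicas == replicas &&
		sts.Status.UpdatedReplicas == replicas &&
		sts.Status.CurrentRevision == sts.Status.UpdateRevision
	if fullyUpgraded {
		// The earlier timeout was transient: drop the failed state and mark
		// the upgrade as complete.
		status.upgradeFailed = false
		status.upgrading = false
	}
}

Run as part of each reconcile, a step like this lets the operator pick up a cluster that healed after the timeout, rather than staying stuck in UpgradeFailed until someone intervenes.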
How to verify it
Make sure that a node will not come online within the 10 min timeout when doing an upgrade.
Verify that the cluster is stuck in UpgradeFailed with e.g. 2 out of 3 nodes ready.
Upgrade the operator to a build containing this fix, and verify that the upgrade completes.
Obviously, if you are already running this version, a failed upgrade will recover once the cluster is in a good and upgraded state.