Recover cluster after multiple pod failures #366
Conversation
Signed-off-by: Sebastian Woehrl <[email protected]>
Any plans to merge this PR? We're stuck with a failed cluster for the second time in a month. This time it was caused by a node upgrade on AKS.
Hey @swoehrl-mw, just went over the implementation; I didn't know the
Signed-off-by: Sebastian Woehrl <[email protected]>
Signed-off-by: Sebastian Woehrl <[email protected]>
@idanl21 @prudhvigodithi I tried to add some unit tests for this PR but in the end had to give up, as I could not get envtest to behave the way I wanted. We'll have to cover this with functional tests in the future. Tests I did:
@idanl21 we had to step back from using the operator and return to a manually provisioned cluster. But we're ready to go back to testing.
Signed-off-by: Sebastian Woehrl <[email protected]>
Signed-off-by: Sebastian Woehrl <[email protected]>
Hey @grzeg1, I'm glad you guys want to keep working and progressing with the Operator. I believe the fix will be merged to main in a few days :)
Ready to merge from my point of view :)
Just to let you know: we've been using the merged fix for a few days and tried to make the cluster break. We did not succeed - the cluster always recovered. Thumbs up!
This PR implements a potential solution for cluster recovery (issue #289).

The idea is to detect that more than one pod is missing (and thus that quorum is potentially broken) by checking whether there are more PVCs than pods. If so, the operator temporarily recreates the affected STS with `podManagementPolicy = Parallel` so that all pods are started again.

@idanl21 @prudhvigodithi I'm not yet sure if this is a good solution, so please share your opinion on it (also whether there are cases it does not cover) or any alternative ideas you have. If you are OK with this approach I will add some tests.
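For illustration, here is a minimal Go sketch of that mechanism using the controller-runtime client. This is not the PR's actual code: the helper names (`needsParallelRecovery`, `recreateWithParallelPolicy`), the exact detection threshold, and the assumption that the PVCs carry the same labels as the pods are all mine.

```go
// Minimal sketch of the recovery idea, NOT the PR's implementation.
// Assumes a controller-runtime client and hypothetical helper names.
package recovery

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// needsParallelRecovery reports whether more than one pod of the StatefulSet
// is missing, detected by comparing the number of PVCs (one per ever-created
// pod) against the number of currently existing pods. It assumes the PVCs
// are labeled with the same selector labels as the pods.
func needsParallelRecovery(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) (bool, error) {
	opts := []client.ListOption{
		client.InNamespace(sts.Namespace),
		client.MatchingLabels(sts.Spec.Selector.MatchLabels),
	}

	var pvcs corev1.PersistentVolumeClaimList
	if err := c.List(ctx, &pvcs, opts...); err != nil {
		return false, err
	}
	var pods corev1.PodList
	if err := c.List(ctx, &pods, opts...); err != nil {
		return false, err
	}
	// More than one pod missing suggests quorum may be lost.
	return len(pvcs.Items)-len(pods.Items) > 1, nil
}

// recreateWithParallelPolicy temporarily replaces the StatefulSet with a copy
// that uses podManagementPolicy: Parallel, so the STS controller starts all
// missing pods at once instead of waiting for each one in order.
func recreateWithParallelPolicy(ctx context.Context, c client.Client, sts *appsv1.StatefulSet) error {
	// Orphan the pods and PVCs so the delete does not cascade to them.
	if err := c.Delete(ctx, sts, client.PropagationPolicy(metav1.DeletePropagationOrphan)); err != nil {
		return err
	}
	recreated := sts.DeepCopy()
	recreated.ResourceVersion = "" // must be cleared before Create
	recreated.UID = ""
	recreated.Spec.PodManagementPolicy = appsv1.ParallelPodManagement
	return c.Create(ctx, recreated)
}
```

Presumably the orphan delete is what makes this safe: the surviving healthy pods keep running while the replacement STS, freed from the default OrderedReady policy, can start all missing pods in parallel and restore quorum.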
I've tested two scenarios: