Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Recover cluster after multiple pod failures #366

Merged

Conversation

swoehrl-mw
Copy link
Collaborator

This PR implements a potential solution for cluster recovery (issue #289).
The idea is to detect that more than one pod is missing (thus quorum potentially being broken) by checking if there are more PVCs than pods. If so the operator temporarily recreates the affected STS with podManagementPolicy = Parallel so that all pods are started again.

@idanl21 @prudhvigodithi I'm not yet sure if this is a good solution, please provide your opinion on this (also if there are cases not covered) or alternative ideas you have. If you are ok with this approach I will add some tests.

I've tested two scenarios:

  • All master pods down -> pods get recreated in parallel and can form a quorum
  • OpenSearchCluster object is freshly created but PVCs exist (from old run) -> pods get created in parallel and can form a quorum, bootstrap pod is ignored and deleted after recovery

@grzeg1
Copy link

grzeg1 commented Jan 2, 2023

Any plans to merge this PR? We're stuck with failed cluster for the second time in a month. This time it was caused by node upgrade on AKS.

@idanl21
Copy link
Collaborator

idanl21 commented Jan 5, 2023

Hey @swoehrl-mw, Just passed on the implementation, I didnt know the podManagementPolicy = Parallel but looks like it can help us to solve the problem.
did we also tested the the scenario that 2 nodes are failed an only 1 left ?
@grzeg1, Thanks for using the operator, Please - contact me in person, i can help you with your problem (and also you can help us to test that fix).
you are welcome to ping me in mail, [email protected]
Thank !

@swoehrl-mw
Copy link
Collaborator Author

@idanl21 @prudhvigodithi I tried to add some unittests for this PR but in the end had to give up as I could not get envtest to behave like I wanted to. We'll have to cover this with functional tests in the future.
I added an option to disable the recovery so if we didn't cover an edge case the users can prevent it from doing stupid stuff.
Please review and also test in your environments.

Tests I did:

  • Delete all pods -> pods get recreated in parallel
  • Delete 3 of 4 pods -> pods get recreated in parallel
  • Delete 2 of 3 pods -> normal one-by-one recovery works (2 can form a quorum)
  • Delete cluster, keep PVCs, recreate cluster -> pods get created in parallel

docs/userguide/main.md Show resolved Hide resolved
opensearch-operator/controllers/tls_test.go Show resolved Hide resolved
opensearch-operator/pkg/helpers/constants.go Show resolved Hide resolved
opensearch-operator/pkg/helpers/helpers.go Show resolved Hide resolved
opensearch-operator/pkg/helpers/helpers.go Outdated Show resolved Hide resolved
@grzeg1
Copy link

grzeg1 commented Feb 2, 2023

Hey @swoehrl-mw, Just passed on the implementation, I didnt know the podManagementPolicy = Parallel but looks like it can help us to solve the problem. did we also tested the the scenario that 2 nodes are failed an only 1 left ? @grzeg1, Thanks for using the operator, Please - contact me in person, i can help you with your problem (and also you can help us to test that fix). you are welcome to ping me in mail, [email protected] Thank !

@idanl21 we had to back from using the operator and return to manually-provisioned cluster. But we're ready to go back to testing.

Signed-off-by: Sebastian Woehrl <[email protected]>
Signed-off-by: Sebastian Woehrl <[email protected]>
@idanl21
Copy link
Collaborator

idanl21 commented Feb 5, 2023

Hey @grzeg1, Im glad that you guys wanna keep work and progress with the Operator, i believe that fix will be merged to main in few days :)
If you have additional questions, please let me know send me an email ([email protected])

Copy link
Collaborator

@idanl21 idanl21 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ready to merge form my point of view :)

@swoehrl-mw swoehrl-mw merged commit fe9da68 into opensearch-project:main Feb 8, 2023
@swoehrl-mw swoehrl-mw deleted the feature/parallel-recovery branch February 8, 2023 09:53
@grzeg1
Copy link

grzeg1 commented Mar 3, 2023

Hey @grzeg1, Im glad that you guys wanna keep work and progress with the Operator, i believe that fix will be merged to main in few days :) If you have additional questions, please let me know send me an email ([email protected])

Just to let you know: we've been using the merged fix for a few days and tried to make the cluster break. We did not succeed - the cluster always recovered. Thumbs up!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Cluster fails to recover after quorum of master/cluster_manager pods being deleted at the same time
3 participants