Recover cluster after multiple pod failures #366

swoehrl-mw · 2022-11-15T10:02:56Z

This PR implements a potential solution for cluster recovery (issue #289).
The idea is to detect that more than one pod is missing (so quorum is potentially broken) by checking whether there are more PVCs than pods. If so, the operator temporarily recreates the affected STS with podManagementPolicy = Parallel so that all pods are started again at the same time.
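
The heuristic described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: `needsParallelRecovery` and `policyFor` are hypothetical helpers, the counts are passed in as plain integers instead of being read from the Kubernetes API, and the "more than one pod missing" threshold is taken literally from the description. The two policy values are the ones a StatefulSet accepts for `podManagementPolicy`.

```go
package main

import "fmt"

// PodManagementPolicy mirrors the two values a StatefulSet accepts
// for spec.podManagementPolicy.
type PodManagementPolicy string

const (
	OrderedReady PodManagementPolicy = "OrderedReady"
	Parallel     PodManagementPolicy = "Parallel"
)

// needsParallelRecovery is a hypothetical helper (not from the PR).
// It reports whether more than one pod is missing, judged by comparing
// the number of PVCs that exist with the number of pods that exist.
func needsParallelRecovery(pvcCount, podCount int) bool {
	return pvcCount-podCount > 1
}

// policyFor is a hypothetical helper that picks the policy the
// recreated StatefulSet would get under this heuristic.
func policyFor(pvcCount, podCount int) PodManagementPolicy {
	if needsParallelRecovery(pvcCount, podCount) {
		return Parallel
	}
	return OrderedReady
}

func main() {
	// All pods gone while three PVCs remain.
	fmt.Println(policyFor(3, 0))
	// A single missing pod, e.g. during a rolling restart.
	fmt.Println(policyFor(3, 2))
}
```

Because `podManagementPolicy` cannot be changed on an existing StatefulSet, switching policy means deleting and recreating the STS, which matches the "temporarily recreates" wording above.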

@idanl21 @prudhvigodithi I'm not yet sure if this is a good solution. Please give your opinion on it (including any cases it does not cover) or share alternative ideas. If you are ok with this approach I will add some tests.

I've tested two scenarios:

- All master pods down -> pods get recreated in parallel and can form a quorum
- OpenSearchCluster object is freshly created but PVCs exist (from an old run) -> pods get created in parallel and can form a quorum; the bootstrap pod is ignored and deleted after recovery

Signed-off-by: Sebastian Woehrl <[email protected]>

grzeg1 · 2023-01-02T16:03:52Z

Any plans to merge this PR? We're stuck with a failed cluster for the second time in a month. This time it was caused by a node upgrade on AKS.

idanl21 · 2023-01-05T11:43:31Z

Hey @swoehrl-mw, I just went over the implementation. I didn't know about podManagementPolicy = Parallel, but it looks like it can help us solve the problem.
Did we also test the scenario where 2 nodes have failed and only 1 is left?
@grzeg1, thanks for using the operator. Please contact me directly; I can help you with your problem (and you can also help us test this fix).
You are welcome to reach me by email at [email protected]
Thanks!

swoehrl-mw · 2023-02-01T14:03:01Z

@idanl21 @prudhvigodithi I tried to add some unit tests for this PR but in the end had to give up, as I could not get envtest to behave the way I wanted. We'll have to cover this with functional tests in the future.
I added an option to disable the recovery, so if we missed an edge case users can prevent it from doing stupid stuff.
Please review and also test in your environments.
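
As an illustration of what such an opt-out could look like, here is a minimal sketch of a default-on toggle read from the environment. The variable name `PARALLEL_RECOVERY_ENABLED` and the helper `parallelRecoveryEnabled` are assumptions made for this example; the comment above does not state how the option is actually exposed.

```go
package main

import (
	"fmt"
	"os"
	"strconv"
)

// recoveryEnvVar is an assumed name used only for this sketch.
const recoveryEnvVar = "PARALLEL_RECOVERY_ENABLED"

// parallelRecoveryEnabled is a hypothetical helper. It treats the
// feature as enabled unless the variable is set to a value that
// strconv.ParseBool reads as false. Unset or unparsable values
// fall back to the default (enabled).
func parallelRecoveryEnabled(getenv func(string) string) bool {
	raw := getenv(recoveryEnvVar)
	if raw == "" {
		return true
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return true
	}
	return enabled
}

func main() {
	// Reads the real process environment.
	fmt.Println(parallelRecoveryEnabled(os.Getenv))
}
```

Passing the lookup function in as a parameter keeps the helper testable without touching the process environment.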

Tests I did:

- Delete all pods -> pods get recreated in parallel
- Delete 3 of 4 pods -> pods get recreated in parallel
- Delete 2 of 3 pods -> normal one-by-one recovery works (2 can form a quorum)
- Delete cluster, keep PVCs, recreate cluster -> pods get created in parallel
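
The quorum remark in the third test is plain majority arithmetic. The sketch below assumes every pod is master-eligible and that the voting configuration still counts all of them, which is a simplification. `quorum` and `canFormQuorum` are hypothetical helpers for illustration, not functions from the operator.

```go
package main

import "fmt"

// quorum returns the majority needed among n voting nodes.
func quorum(n int) int {
	return n/2 + 1
}

// canFormQuorum reports whether the available nodes reach a majority
// of the total voting nodes.
func canFormQuorum(total, available int) bool {
	return available >= quorum(total)
}

func main() {
	// 3 nodes, 2 deleted: one survivor plus the first recreated pod.
	fmt.Println(canFormQuorum(3, 1+1))
	// 4 nodes, 3 deleted: one survivor plus the first recreated pod.
	fmt.Println(canFormQuorum(4, 1+1))
}
```

One plausible reading of the results: with ordered startup the StatefulSet waits for each pod to become ready before creating the next. In the 3-node case the survivor and the first recreated pod already form a majority, so startup can proceed one by one. In the 4-node case they do not, so ordered startup would stall and the parallel start is needed.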

docs/userguide/main.md

opensearch-operator/controllers/tls_test.go

opensearch-operator/pkg/helpers/constants.go

opensearch-operator/pkg/helpers/helpers.go

opensearch-operator/pkg/reconcilers/cluster.go

grzeg1 · 2023-02-02T11:34:58Z

@idanl21 we had to back off from using the operator and return to a manually provisioned cluster. But we're ready to go back to testing.

idanl21 · 2023-02-05T12:12:54Z

Hey @grzeg1, I'm glad that you want to keep working with the operator. I believe the fix will be merged to main in a few days :)
If you have additional questions, please let me know or send me an email ([email protected])

idanl21

Ready to merge from my point of view :)

grzeg1 · 2023-03-03T16:43:16Z

Just to let you know: we've been using the merged fix for a few days and tried to make the cluster break. We did not succeed - the cluster always recovered. Thumbs up!

Detect cluster failure and recover with parallel pod start

fe46fd7

Signed-off-by: Sebastian Woehrl <[email protected]>

swoehrl-mw mentioned this pull request Dec 8, 2022

scaling down a cluster (by merging nodepool components) doesn't delete remove nodes #386

Closed

idanl21 marked this pull request as ready for review December 25, 2022 17:59

swoehrl-mw marked this pull request as draft January 9, 2023 13:26

swoehrl-mw mentioned this pull request Jan 30, 2023

Cluster fail after deleting and recreating it. #419

Closed

swoehrl-mw added 2 commits January 31, 2023 15:52

Merge branch 'main' into feature/parallel-recovery

e63770d

Signed-off-by: Sebastian Woehrl <[email protected]>

Make parallel recovery configurable

c8738e1

Signed-off-by: Sebastian Woehrl <[email protected]>

swoehrl-mw marked this pull request as ready for review February 1, 2023 14:03

swoehrl-mw mentioned this pull request Feb 1, 2023

Recreate cluster with existing data #261

Closed

swoehrl-mw linked an issue Feb 1, 2023 that may be closed by this pull request

Cluster fails to recover after quorum of master/cluster_manager pods being deleted at the same time #289

Closed

idanl21 reviewed Feb 2, 2023

swoehrl-mw added 2 commits February 3, 2023 13:37

Review changes

f851a70

Signed-off-by: Sebastian Woehrl <[email protected]>

Fix quote in helm chart

487a6cb

Signed-off-by: Sebastian Woehrl <[email protected]>

idanl21 approved these changes Feb 5, 2023

swoehrl-mw merged commit fe9da68 into opensearch-project:main Feb 8, 2023

swoehrl-mw deleted the feature/parallel-recovery branch February 8, 2023 09:53

prudhvigodithi mentioned this pull request Mar 26, 2024

[BUG] Parallel Cluster Recovery didn't work #730

Closed
