[Resiliency] All Redis Node pods stuck in 1/2 readiness state after sequential deletion of all pods #82

Open
4n4nd opened this issue Jan 5, 2023 · 4 comments

4n4nd (Contributor) commented Jan 5, 2023

The Redis cluster is not able to recover after the Redis node pods are deleted sequentially. Example command:

for i in `kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'`; do kubectl delete pods -n redis-cluster-ns $i; sleep 10; done;

After new pods are spawned, they fail the readiness probe:

E0104 18:03:42.732541       1 redisnode.go:247] readiness check failed, err:Readiness failed, cluster slots response empty
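
That readiness error corresponds to an empty CLUSTER SLOTS reply. A minimal way to confirm it from one of the stuck pods (the pod and container names below are placeholders, not taken from the report):

# Prints the assigned slot ranges on a healthy cluster; on the stuck pods
# the reply is empty, which is what the readiness probe is rejecting.
kubectl exec -n redis-cluster-ns <pod-name> -c <redis-container> -- redis-cli cluster slots
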
cin (Contributor) commented Jan 5, 2023

Oof, that sounds like a bug. Seems easy enough to reproduce. I should get some time later this afternoon to test.

cin (Contributor) commented Jan 11, 2023

@4n4nd, I can reproduce this exactly as you outlined above. This seems like a bug and something the operator should be able to recover from. Unfortunately, I don't have any free cycles to try to dig deeper into the issue at the moment. I'll try to make some time next week.

cin (Contributor) commented Jan 11, 2023

For further context, if you remove the sleep or bump it up higher (I tried 30s), things come back as you'd expect. So there's probably a race condition coming into play here.
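
For comparison, here's a sketch of the same loop that waits for the pods to report Ready again instead of sleeping a fixed amount (namespace taken from the command above; the 5s pause and 300s timeout are arbitrary):

for i in $(kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'); do
  kubectl delete pods -n redis-cluster-ns "$i"
  # Give the operator a moment to create the replacement pod...
  sleep 5
  # ...then wait for every pod in the namespace to report Ready before
  # deleting the next one.
  kubectl wait --for=condition=Ready pods --all -n redis-cluster-ns --timeout=300s
done
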

4n4nd (Contributor, Author) commented Jan 12, 2023

I believe what's happening is that when the operator starts to bring up new pods, some of the old pods are still terminating. The operator makes the new pods join the old cluster, but by the time the new pods are ready, the old pods have been deleted. Essentially, instead of initializing a whole new cluster, it tries to join the old cluster and fails. This leaves no hash slots assigned to the new pods, and hence they get stuck in a not-ready state.
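
If that's what's happening, the stuck pods should report zero assigned slots and a failed cluster state, and their node table should still reference the old, now-deleted peers. Something like this should confirm it (pod and container names are placeholders):

# cluster_slots_assigned is 16384 on a healthy cluster; on the stuck pods it
# should read 0, with cluster_state:fail.
kubectl exec -n redis-cluster-ns <pod-name> -c <redis-container> -- redis-cli cluster info

# The node table should still list the peers from the old cluster that the new
# pods tried to join, now flagged as failed/unreachable.
kubectl exec -n redis-cluster-ns <pod-name> -c <redis-container> -- redis-cli cluster nodes
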
