[Resiliency] All Redis Node pods stuck in 1/2 readiness state after sequential deletion of all pods #82

Open
4n4nd opened this issue Jan 5, 2023 · 4 comments

4n4nd (Contributor) commented Jan 5, 2023

The Redis cluster is not able to recover after the Redis node pods are deleted sequentially. Example command:

for i in `kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'`; do kubectl delete pods -n redis-cluster-ns $i; sleep 10; done;

After new pods are spawned, they fail the readiness probe:

E0104 18:03:42.732541       1 redisnode.go:247] readiness check failed, err:Readiness failed, cluster slots response empty
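
That readiness error corresponds to an empty CLUSTER SLOTS reply. A minimal way to confirm it from one of the stuck pods (the pod and container names below are placeholders, not taken from the report):

# Prints the assigned slot ranges on a healthy cluster; on the stuck pods
# the reply is empty, which is what the readiness probe is rejecting.
kubectl exec -n redis-cluster-ns <pod-name> -c <redis-container> -- redis-cli cluster slots
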
cin (Contributor) commented Jan 5, 2023

Oof, that sounds like a bug. Seems easy enough to reproduce. I should get some time later this afternoon to test.

cin (Contributor) commented Jan 11, 2023

@4n4nd, I can reproduce this exactly as you outlined above. This seems like a bug and something the operator should be able to recover from. Unfortunately, I don't have any free cycles to try to dig deeper into the issue at the moment. I'll try to make some time next week.

cin (Contributor) commented Jan 11, 2023

For further context, if you remove the sleep or bump it up higher (I tried 30s), things come back as you'd expect. So there's probably a race condition coming into play here.
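
For comparison, here's a sketch of the same loop that waits for the pods to report Ready again instead of sleeping a fixed amount (namespace taken from the command above; the 5s pause and 300s timeout are arbitrary):

for i in $(kubectl get pods -n redis-cluster-ns --no-headers | awk '{print $1}'); do
  kubectl delete pods -n redis-cluster-ns "$i"
  # Give the operator a moment to create the replacement pod...
  sleep 5
  # ...then wait for every pod in the namespace to report Ready before
  # deleting the next one.
  kubectl wait --for=condition=Ready pods --all -n redis-cluster-ns --timeout=300s
done
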

4n4nd (Contributor, Author) commented Jan 12, 2023

I believe what's happening is that when the operator starts to bring up new pods, some of the old pods are still terminating. The operator makes the new pods join the old cluster, but by the time the new pods are ready, the old pods have been deleted. Essentially, instead of initializing a whole new cluster, it tries to join the old cluster and fails. This leaves no hash slots assigned to the new pods, and hence they get stuck in a not-ready state.
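
If that's what's happening, the stuck pods should report zero assigned slots and a failed cluster state, and their node table should still reference the old, now-deleted peers. Something like this should confirm it (pod and container names are placeholders):

# cluster_slots_assigned is 16384 on a healthy cluster; on the stuck pods it
# should read 0, with cluster_state:fail.
kubectl exec -n redis-cluster-ns <pod-name> -c <redis-container> -- redis-cli cluster info

# The node table should still list the peers from the old cluster that the new
# pods tried to join, now flagged as failed/unreachable.
kubectl exec -n redis-cluster-ns <pod-name> -c <redis-container> -- redis-cli cluster nodes
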
