Second ZK pod doesn't start due to 'java.lang.RuntimeException: My id 2 not in the peer list' exception #315
Comments
@BassinD I have been hitting this issue for quite some time, with no resolution so far. It is critical for me as well, and it happens intermittently even in our production environment.

From what I understand, this failure can happen when promotion of the node from observer to participant fails in ZookeeperReady.sh, which runs as part of the readiness probe. Sometimes I get the error below:

++ echo ruok

Could this mean the ZK server is not yet ready to accept connections? I was wondering whether adding an initial delay to the readiness probe would help, so that "ruok" requests are sent only after a delay (hopefully the ZK server will be up and running by then).

I had requested a fix to make these probes configurable in #275. When is the next release of the ZK operator planned, so we can get these fixes? |
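For readers who want to try the "delay, then ask ruok" idea outside the probe script, here is a minimal Python sketch. It is not the operator's ZookeeperReady.sh; the function names, retry counts, and delays are made up for illustration. It assumes the server answers the `ruok` four-letter-word command with `imok` and then closes the connection, and that `ruok` is allowed by `4lw.commands.whitelist`.

```python
import socket
import time


def send_four_letter_word(host, port, word=b"ruok", timeout=2.0):
    """Send a ZooKeeper four-letter-word command and return the raw reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(word)
        chunks = []
        while True:
            # The server closes the connection after replying, so read to EOF.
            data = sock.recv(1024)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks)


def wait_until_ok(host, port, initial_delay=0.0, retries=5, interval=1.0):
    """Wait initial_delay seconds, then poll with 'ruok' until 'imok' comes back.

    Returns True on the first 'imok' reply, False if every attempt fails.
    The parameter names are hypothetical and only mirror the probe settings
    discussed in this thread.
    """
    time.sleep(initial_delay)
    for attempt in range(retries):
        try:
            if send_four_letter_word(host, port) == b"imok":
                return True
        except OSError:
            # Connection refused or timed out: the server is not listening yet.
            pass
        if attempt < retries - 1:
            time.sleep(interval)
    return False
```

Calling `wait_until_ok("localhost", 2181, initial_delay=60)` from inside the pod would roughly mimic a readiness probe with a 60-second initial delay. A refused connection is treated the same as a missing `imok`, which is the situation the probe output above seems to show.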
hi, when will this fix be available? thanks |
@priyavj08 The fix is available in master, will be part of the next release. If you want to use the build now, you can |
thanks @anishakj any ETA on next release so I can plan? thanks |
@priyavj08 We are planning to do a release sometime this week. |
Thanks for your continued support @anishakj @amuraru.

I pulled the latest code from master, built the ZK image, and ran repeated install/uninstall tests. The first 20-odd iterations were fine, but it failed at the 25th iteration. The issue still exists, and unfortunately it happens most of the time in production (Murphy's law coming into play).

When I was able to exec into the failing pod's container, I ran:

cat /data/myid
cat /data/conf/zoo.cfg
cat /data/conf/zoo.cfg.dynamic.1000000b4

This seems to be a timing issue. Please make this part of the code in zookeeperStart.sh more reliable. Thanks |
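The "My id N not in the peer list" exception means the id in the `myid` file has no matching `server.N=` entry in the configuration the server loaded. A small Python sketch of that comparison follows. The helper names and the sample hostname are hypothetical, and it assumes the usual dynamic config line format `server.<id>=<host>:<port>:<port>:<role>;<client address>`.

```python
import re

# Matches lines such as: server.1=host:2888:3888:participant;0.0.0.0:2181
SERVER_LINE = re.compile(r"^\s*server\.(\d+)\s*=\s*(\S+)")


def parse_peer_ids(dynamic_config_text):
    """Return a dict mapping server id -> address spec from a dynamic config."""
    peers = {}
    for line in dynamic_config_text.splitlines():
        match = SERVER_LINE.match(line)
        if match:
            peers[int(match.group(1))] = match.group(2)
    return peers


def myid_in_peer_list(myid_text, dynamic_config_text):
    """True if the id from the myid file has a server.N entry in the config."""
    return int(myid_text.strip()) in parse_peer_ids(dynamic_config_text)


# Made-up sample: only server 1 is registered, so id 2 is missing.
sample = """\
server.1=zk-0.zk-headless.default.svc.cluster.local:2888:3888:participant;0.0.0.0:2181
"""
print(myid_in_peer_list("2\n", sample))  # False
```

To use it against a real pod, feed it the contents of `/data/myid` and of the `zoo.cfg.dynamic.*` file referenced from `/data/conf/zoo.cfg`. A `False` result matches the state in which the exception is thrown.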
@priyavj08 please provide the logs from |
Reopening the issue since the problem is still not fixed. |
I am trying to reproduce this to capture the first log from the zk-1 pod when it crashes. In the meantime, here is the log from the ZK-1 pod. Also, here is the describe output for pod ZK-0; see the connection refused from the probes:

kc describe pod -n fed-kafka fed-kafka-affirmedzk-0
Normal Scheduled 5m44s default-scheduler Successfully assigned fed-kafka/fed-kafka-affirmedzk-0 to priya-vijaya-tx-k8-node-2-cxkrlz-0zdmawqhtomva4wa
|
In my recent repeated install/uninstall tests, the cluster got into a bad state after the 18th iteration. For some reason ZK-1 failed to join the ensemble, and ZK-2 is in CrashLoopBackOff state (but not with the "my id 2 is missing" error). Attaching all the logs: failure-zk0.log. Also attaching the pod describe output. |
@priyavj08 , Could you please confirm base zookeeper image version used is |
@priyavj08 From the
|
I ran this inside the ZK pod; I am using build 3.6.1:
Zookeeper version: 3.6.1--104dcb3e3fb464b30c5186d229e00af9f332524b, built on 04/21/2020 15:01 GMT |
With the latest ZK image and initialDelaySeconds set to 60 for the readiness probe, I was able to run install/uninstall tests continuously. I tried 2 sets of 30 iterations and haven't seen pod1/pod2 crash with the error "My id 2 not in the peer list".

Question: will setting a 60-second initial delay for the readiness probe have any impact on the functionality of the ZK ensemble?

Another concern: during the first set of tests I saw the other issue, where the ZK server in pod0 wasn't running, and this caused problems in ZK pod 1. |
In a ZooKeeper cluster deployment, the second pod starts only after the first pod has started and is running. Is that not happening in your case? Also, |
It looks like the ZK-0 pod was in Running state even though the ZK server wasn't running, and the logs show that error. Similarly, the ZK-1 pod was in Running state with errors. You can see the pod statuses in the describe output I attached. |
Are you seeing these errors with the increased initial delay? |
@anishakj Yes, it happened in the set of tests I ran after adding the initial delay, but it is not always seen. |
By default |
The ZK-1 pod went into CrashLoopBackOff a couple of times but eventually worked; here is the log. This is the new fix, right?
** server can't find fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local: NXDOMAIN
fed-kafka-affirmedzk-0 1/1 Running 0 122m |
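The NXDOMAIN line above suggests the headless service name was not resolvable yet when the pod first started. A hedged Python sketch of a "retry until DNS resolves" check is below. It is not the operator's start script; the function name and defaults are assumptions for illustration.

```python
import socket
import time


def wait_for_dns(name, retries=5, interval=1.0, resolver=socket.getaddrinfo):
    """Retry name resolution and return the sorted list of resolved addresses.

    Returns an empty list if the name still does not resolve after all
    attempts. The resolver argument exists only so the lookup can be swapped
    out, for example in tests.
    """
    for attempt in range(retries):
        try:
            infos = resolver(name, None)
            # Each entry is (family, type, proto, canonname, sockaddr).
            return sorted({info[4][0] for info in infos})
        except socket.gaierror:
            # Name not known yet (for example NXDOMAIN); wait and try again.
            if attempt < retries - 1:
                time.sleep(interval)
    return []
```

Inside the pod, `wait_for_dns("fed-kafka-affirmedzk-headless.fed-kafka.svc.cluster.local")` would return the member addresses once the headless service has endpoints, and an empty list while it still answers NXDOMAIN.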
Yes, with this fix the pod will become ready after a restart. If it is working for you after a couple of pod restarts, could you please close the issue? |
so far it is working fine. please close this bug |
@priyavj08 Thanks for the confirmation. Please feel free to reopen if you see the same issue again. |
Did anyone try setting |
@iampranabroy Does this work for you:
|
I'm facing the same issue. @anishakj even with long initialDelaySeconds (20, 30, 60) for the liveness and readiness probes, I'm unable to successfully deploy a zookeeper cluster with replicas > 1. Any further hints on what could be going wrong? The issue stays the same:
Zookeeper-Operator is indirectly deployed using the Solr-Operator, version 0.2.14 |
Hey @mmoscher - Can you please try setting only @priyavj08 @anishakj - Did you try any additional changes? |
Issue solved - at least for me, I think. There were two unique issues (on my side) which kept me running into the error mentioned above.

Issue 1 - wrong network policies
Issue 2 - old state/wrong data on pvc

TL;DR: after setting correct network policies and cleaning up old (corrupt) data, i.e. all zookeeper-related pvcs, I was able to successfully bootstrap a zookeeper cluster (4 times in a row, same namespace and same k8s cluster). No need to adjust the liveness or readiness probes.

@iampranabroy hopefully these tips help you! |
Thanks much @mmoscher |
@iampranabroy sure, here we go: We have two policies in place related to solr/zookeeper: one (a) to allow traffic between the zookeeper members themselves (z<->z) and another (b) to allow traffic from solr to the zookeeper pods (s->z).

a)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-zookeeper-access-zookeeper
spec:
  egress:
  - ports:
    - port: 2181
      protocol: TCP
    - port: 2888
      protocol: TCP
    - port: 3888
      protocol: TCP
    - port: 7000
      protocol: TCP
    - port: 8080
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          kind: ZookeeperMember
          technology: zookeeper
  podSelector:
    matchLabels:
      kind: ZookeeperMember
      technology: zookeeper
  policyTypes:
  - Egress

b)
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-solr-access-zookeeper
spec:
  egress:
  - ports:
    - port: 2181
      protocol: TCP
    - port: 7000
      protocol: TCP
    - port: 8080
      protocol: TCP
    to:
    - podSelector:
        matchLabels:
          kind: ZookeeperMember
          technology: zookeeper
  podSelector:
    matchLabels:
      allow-zookeeper-access: "true"
  policyTypes:
  - Egress

Good luck 🤞 |
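To check whether policies like the ones above actually let traffic through, one option is a plain TCP connect test from one pod to another. The sketch below is only an illustration: the function name is made up, and the port list is simply copied from policy (a).

```python
import socket

# Ports taken from the NetworkPolicy (a) above.
ZK_PORTS = (2181, 2888, 3888, 7000, 8080)


def check_ports(host, ports=ZK_PORTS, timeout=2.0):
    """Try a TCP connect to each port and return {port: reachable}."""
    results = {}
    for port in ports:
        try:
            with socket.create_connection((host, port), timeout=timeout):
                results[port] = True
        except OSError:
            # Refused, filtered, or timed out.
            results[port] = False
    return results
```

Run from inside a zookeeper pod against a peer's address, a `False` entry for 2181 or 3888 would point at the network policy or at a server that is not up. Note that the quorum port 2888 is normally only open on the current leader, so a `False` there on a follower is expected and not necessarily a policy problem.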
Thanks, @mmoscher for sharing the details. A good point to keep in mind about In my case, the |
Description
We are using zookeeper v0.2.9.
Sometimes (not in all environments) the zookeeper-1 pod is unable to start due to a RuntimeException.
Pod's log:
Zookeeper cluster CRD description
Previously we used ZK v0.2.7 and did not have this issue.
I also tried the fix described in issue #259, but it didn't help.
Importance
Blocker issue. We need some fixes related to 0.2.9 version (#257), so upgrade is required.
Location
(Where is the piece of code, package, or document affected by this issue?)
Suggestions for an improvement
(How do you suggest to fix or proceed with this issue?)