Observing "BKException$BKLedgerRecoveryException: Error while recovering ledger" and Unhealthy Readiness probe failed with error "ERROR Closing ledger due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper #210
Comments
Thanks for the report @sumit-bm @vedanthh. It would be useful to know whether the BK is failing or throwing exceptions at the time the readiness probe fails. |
I am not able to understand the exact problem being reported here. It would also help to have steps to reproduce the issue. |
I didn't observe any bookie pod restart or failure during this, but I'm observing the reported errors in
Steps:
|
Could you please quantify the frequency of these failures over a period of time? |
I'm not observing any restarts or pods in a not-ready state in Bookkeeper during these events. Regarding frequency, we are observing this repeatedly in
|
From the logs, I see 18 readiness probe failures in 30 minutes. Each pod should run 180 probes in 30 minutes (given an interval of 10 seconds), which means roughly 10% of the probes are failing. Though this is not enough to cause disruption, it needs to be investigated.
Reducing priority of the issue to P2 as it is not causing any disruption... |
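The 10% estimate above can be checked with a short calculation. This is only a sketch: the function names are made up for illustration, and it assumes all 18 failures are counted against the 180 probes of a single pod, as the comment does.

```python
def expected_probes(window_seconds: int, period_seconds: int) -> int:
    """Number of probes one pod runs in the window at a fixed period."""
    return window_seconds // period_seconds


def failure_rate(failures: int, window_seconds: int, period_seconds: int) -> float:
    """Fraction of failed probes, assuming every failure belongs to one pod."""
    return failures / expected_probes(window_seconds, period_seconds)


# Figures quoted in the comment: 30-minute window, 10-second interval, 18 failures.
probes = expected_probes(30 * 60, 10)
rate = failure_rate(18, 30 * 60, 10)
print(f"{probes} probes expected, {rate:.0%} failed")
```

If the 18 failures were spread over several bookie pods, the per-pod rate would be correspondingly lower.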
Need to check on the following:
|
@vedanthh Thanks for the report. I have tried to reproduce this today using the |
In all of the above probe failures, I see this exception: "org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available". @vedanthh, around the time you see this probe failure, could you please check whether any bookies are down using this command on the bookie pod: |
I see this issue with "Bookkeeper on IPV6", where the sanity check fails in spite of healthy bookies: |
@pbelgundi The For the pulsar issue, I also checked that before, it seems to be a different problem since their issue is |
I have reproduced the error today using the longevity mid-scale workload with the default configuration of Bookkeeper and Pravega. The probe failure happens quickly, within 5 minutes. Here is the bookie server-side log (DEBUG level) when a readiness probe fails
Notice that there is a 5-second gap at time It looks to me that this stall is caused by the high workload since I cannot reproduce the error using the My questions are: 1) is this stall expected under any circumstances? 2) would YourKit help to provide more information if enabled for this issue? It would be great to have some insights @fpj @pbelgundi. Much appreciated. |
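Gaps like the one described above can be located automatically in a timestamped log. The following is a minimal sketch, not the method used in this thread: it assumes each log line starts with a 23-character timestamp of the form `YYYY-MM-DD HH:MM:SS,mmm`, which is a guess at the bookie log layout, and the sample lines are invented for illustration.

```python
from datetime import datetime, timedelta

# Assumed timestamp layout; adjust to match the actual bookie log pattern.
TS_FORMAT = "%Y-%m-%d %H:%M:%S,%f"
TS_LENGTH = 23


def find_stalls(lines, threshold=timedelta(seconds=5)):
    """Return (previous, current, gap) for consecutive timestamps at least `threshold` apart."""
    stalls = []
    prev = None
    for line in lines:
        try:
            ts = datetime.strptime(line[:TS_LENGTH], TS_FORMAT)
        except ValueError:
            continue  # no leading timestamp, e.g. a stack trace line
        if prev is not None and ts - prev >= threshold:
            stalls.append((prev, ts, ts - prev))
        prev = ts
    return stalls


# Hypothetical sample lines, not taken from the attached logs.
sample = [
    "2020-06-01 10:00:00,100 - DEBUG - a",
    "2020-06-01 10:00:00,900 - DEBUG - b",
    "    at some.stack.Frame",
    "2020-06-01 10:00:06,000 - DEBUG - c",
]
stalls = find_stalls(sample)
for prev, cur, gap in stalls:
    print(f"stall of {gap.total_seconds():.1f}s between {prev} and {cur}")
```

Running this over the DEBUG log from a failing probe would show whether every failure coincides with a stall of similar length.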
While I understand that
What we need to answer to be able to solve this is for how long one or more bookies are unavailable, and why.
|
@pbelgundi would this My suggestion would be to reproduce the issue and then let us know how to go forward. |
@sumit-bm Wenqi has reproduced this issue, as stated above. |
@sumit-bm @pbelgundi, thanks for the comments, and sorry for the late update. I have changed the timeout to 5 seconds and the problem seems to be gone. I will do some more experiments, build an image for the test team to test, and raise a PR. Thanks! |
@vedanthh could you please run the longevity test again using this image |
@Tristan1900 I will run the longevity test after retest of |
@vedanthh Thank you! |
Closing this for now. If seen again please open an issue on Bookkeeper Operator: |
Observing "BKException$BKLedgerRecoveryException: Error while recovering ledger" and Unhealthy Readiness probe failed with error "ERROR Closing ledger 33175 due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper during Longevity run with moderate workload (Total - 8 readers, 4 writers, 4000 events/sec, ~ 5 MB/s IO)
Environment details: PKS / K8 with medium cluster:
Bookie_logs_describe_events.zip