Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Observing "BKException$BKLedgerRecoveryException: Error while recovering ledger" and Unhealthy Readiness probe failed with error "ERROR Closing ledger due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper #210

Closed
vedanthh opened this issue Jun 7, 2019 · 21 comments
Assignees
Labels
kind/bug Something isn't working priority/P2 Slight inconvenience or annoyance to applications, system continues to function status/needs-info Extra information is needed; add a comment with the exact information requested version 0.3.2 This is an issue with operator version 0.3.2

Comments

@vedanthh
Copy link

vedanthh commented Jun 7, 2019

Observing "BKException$BKLedgerRecoveryException: Error while recovering ledger" and Unhealthy Readiness probe failed with error "ERROR Closing ledger 33175 due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper during Longevity run with moderate workload (Total - 8 readers, 4 writers, 4000 events/sec, ~ 5 MB/s IO)

Environment details: PKS / K8 with medium cluster:

3 master: xlarge: 4 CPU, 16 GB Ram, 32 GB Disk
5 worker: 2xlarge: 8 CPU, 32 GB Ram, 64 GB Disk
Tier-1 storage is from VSAN datastore
Tier-2 storage curved on NFS Client provisioner using Isilon as backend
Pravega version : 0.5.0-2269.6f8a820
Zookeeper Operator : pravega/zookeeper-operator:0.2.2
Pravega Operator: pravega/pravega-operator:0.3.2

Bookie_logs_describe_events.zip

]# kubectl get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                KIND      SUBOBJECT                 TYPE      REASON      SOURCE                                          MESSAGE
24m         24m          1         pravega-bookie-0.15a5d0ae811f3943   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 9ff19990-e26c-49bf-8db5-4116ec011efd   Readiness probe failed: 04:38:08,968 INFO  Created ledger 32742
04:38:09,487 INFO  Written 10 entries in ledger 32742
04:38:12,727 ERROR Failure Bookie operation timeout while reading entries: (10 - 10), ledger: 32742 while recovering ledger
04:38:12,732 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKLedgerRecoveryException: Error while recovering ledger
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncOpenCallback.openComplete(SyncCallbackUtils.java:157)
          at org.apache.bookkeeper.client.LedgerOpenOp.openComplete(LedgerOpenOp.java:225)
          at org.apache.bookkeeper.client.LedgerOpenOp$1.safeOperationComplete(LedgerOpenOp.java:194)
          at org.apache.bookkeeper.client.LedgerOpenOp$1.safeOperationComplete(LedgerOpenOp.java:186)
          at org.apache.bookkeeper.util.OrderedGenericCallback.operationComplete(OrderedGenericCallback.java:55)
          at org.apache.bookkeeper.proto.BookkeeperInternalCallbacks$TimedGenericCallback.operationComplete(BookkeeperInternalCallbacks.java:131)
          at org.apache.bookkeeper.client.LedgerRecoveryOp.submitCallback(LedgerRecoveryOp.java:145)
          at org.apache.bookkeeper.client.LedgerRecoveryOp.onEntryComplete(LedgerRecoveryOp.java:230)
          at org.apache.bookkeeper.client.ListenerBasedPendingReadOp.submitCallback(ListenerBasedPendingReadOp.java:82)
          at org.apache.bookkeeper.client.PendingReadOp$LedgerEntryRequest.fail(PendingReadOp.java:170)
          at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.sendNextRead(PendingReadOp.java:390)
          at org.apache.bookkeeper.client.PendingReadOp$SequenceReadRequest.logErrorAndReattemptRead(PendingReadOp.java:431)
          at org.apache.bookkeeper.client.PendingReadOp.readEntryComplete(PendingReadOp.java:588)
          at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion$1.readEntryComplete(PerChannelBookieClient.java:1556)
          at org.apache.bookkeeper.proto.PerChannelBookieClient$ReadCompletion.lambda$errorOut$0(PerChannelBookieClient.java:1572)
          at org.apache.bookkeeper.proto.PerChannelBookieClient$CompletionValue$1.safeRun(PerChannelBookieClient.java:1414)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
04:38:12,932 ERROR Could not get additional bookie to remake ensemble, closing ledger: 32742
04:38:12,932 ERROR Write of ledger entry to quorum failed: L32742 E9
04:38:12,933 ERROR Failure NotEnoughBookiesException: Not enough non-faulty bookies available while writing entry: 10 while recovering ledger: 32742
04:38:12,933 INFO  Deleted ledger 32742

14m       14m       1         pravega-bookie-0.15a5d1333dc8ac03   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 9ff19990-e26c-49bf-8db5-4116ec011efd   Readiness probe failed: 04:47:38,945 INFO  Created ledger 32983
04:47:42,741 ERROR Could not get additional bookie to remake ensemble, closing ledger: 32983
04:47:42,741 ERROR Closing ledger 32983 due to NotEnoughBookiesException: Not enough non-faulty bookies available
04:47:42,742 ERROR Write of ledger entry to quorum failed: L32983 E0
04:47:42,743 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
04:47:42,827 INFO  Deleted ledger 32983

1m        1m        1         pravega-bookie-0.15a5d1e9096a8d76   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 9ff19990-e26c-49bf-8db5-4116ec011efd   Readiness probe failed: 05:00:41,744 INFO  Created ledger 33307
05:00:44,575 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33307
05:00:44,622 ERROR Closing ledger 33307 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:00:44,624 ERROR Write of ledger entry to quorum failed: L33307 E0
05:00:44,625 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:00:44,660 INFO  Deleted ledger 33307

<invalid>   <invalid>   1         pravega-bookie-0.15a5d247fefb96bb   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 9ff19990-e26c-49bf-8db5-4116ec011efd   Readiness probe failed: 05:07:29,094 INFO  Created ledger 33476
05:07:31,848 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33476
05:07:31,848 ERROR Closing ledger 33476 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:07:31,849 ERROR Write of ledger entry to quorum failed: L33476 E4
05:07:31,850 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:07:31,860 INFO  Deleted ledger 33476

15m       15m       1         pravega-bookie-1.15a5d12c11baa881   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 576cae52-f8b5-4d18-855f-4fb76efa519e   Readiness probe failed: 04:47:06,984 INFO  Created ledger 32969
04:47:11,734 ERROR Could not get additional bookie to remake ensemble, closing ledger: 32969
04:47:11,734 ERROR Closing ledger 32969 due to NotEnoughBookiesException: Not enough non-faulty bookies available
04:47:11,737 ERROR Write of ledger entry to quorum failed: L32969 E0
04:47:11,762 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
04:47:11,828 INFO  Deleted ledger 32969

<invalid>   <invalid>   1         pravega-bookie-1.15a5d22b6efdf467   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 576cae52-f8b5-4d18-855f-4fb76efa519e   Readiness probe failed: 05:05:26,359 INFO  Created ledger 33427
05:05:29,243 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33427
05:05:29,244 ERROR Closing ledger 33427 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:05:29,246 ERROR Write of ledger entry to quorum failed: L33427 E0
05:05:29,324 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:05:29,430 INFO  Deleted ledger 33427

7m        7m        1         pravega-bookie-2.15a5d19f1a88edb4   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 2be62db7-72cf-460d-bf69-2c6d6ef13760   Readiness probe failed: 04:55:23,576 INFO  Created ledger 33175
04:55:26,478 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33175
04:55:26,478 ERROR Closing ledger 33175 due to NotEnoughBookiesException: Not enough non-faulty bookies available
04:55:26,479 ERROR Write of ledger entry to quorum failed: L33175 E1
04:55:26,480 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
04:55:26,506 INFO  Deleted ledger 33175

<invalid>   <invalid>   1         pravega-bookie-2.15a5d239fd4fef58   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 2be62db7-72cf-460d-bf69-2c6d6ef13760   Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "process_linux.go:90: adding pid 18075 to cgroups caused \"failed to write 18075 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/podc796e2ec-8825-11e9-af33-005056946a8b/c6d306921263651160d9dcc0a7861b7676fe2c270021824797b01ff5eea3f0c2/cgroup.procs: invalid argument\"": unknown

<invalid>   <invalid>   1         pravega-bookie-2.15a5d24975be9342   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 2be62db7-72cf-460d-bf69-2c6d6ef13760   Readiness probe failed: 05:07:33,528 INFO  Created ledger 33478
05:07:36,785 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33478
05:07:36,785 ERROR Closing ledger 33478 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:07:36,785 ERROR Write of ledger entry to quorum failed: L33478 E6
05:07:36,786 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:07:36,874 INFO  Deleted ledger 33478

<invalid>   <invalid>   1         pravega-bookie-2.15a5d2777e2191cc   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 2be62db7-72cf-460d-bf69-2c6d6ef13760   Readiness probe failed: 05:10:53,631 INFO  Created ledger 33565
05:10:55,478 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33565
05:10:55,478 ERROR Closing ledger 33565 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:10:55,481 ERROR Write of ledger entry to quorum failed: L33565 E0
05:10:55,482 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:10:55,576 ERROR Error update ledger metadata for ledger 33565 : NoSuchLedgerExistsException: No such ledger exists
05:10:55,576 INFO  Deleted ledger 33565

<invalid>   <invalid>   1         pravega-bookie-2.15a5d27e4e78bfbc   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 2be62db7-72cf-460d-bf69-2c6d6ef13760   Readiness probe failed: 05:11:24,122 INFO  Created ledger 33577
05:11:25,723 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33577
05:11:25,723 ERROR Closing ledger 33577 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:11:25,725 ERROR Write of ledger entry to quorum failed: L33577 E0
05:11:25,726 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:11:25,787 INFO  Deleted ledger 33577

11m       11m       1         pravega-bookie-3.15a5d16726804133   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, e2e8965a-39db-4d68-9218-6dcc66d9d4a4   Readiness probe failed: 04:51:22,121 INFO  Created ledger 33076
04:51:25,909 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33076
04:51:25,910 ERROR Closing ledger 33076 due to NotEnoughBookiesException: Not enough non-faulty bookies available
04:51:25,910 ERROR Write of ledger entry to quorum failed: L33076 E0
04:51:25,911 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
04:51:25,998 INFO  Deleted ledger 33076

1m        1m        1         pravega-bookie-3.15a5d1f0b789cf68   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, e2e8965a-39db-4d68-9218-6dcc66d9d4a4   Readiness probe failed: 05:01:14,813 INFO  Created ledger 33321
05:01:16,813 ERROR Could not get additional bookie to remake ensemble, closing ledger: 33321
05:01:16,813 ERROR Closing ledger 33321 due to NotEnoughBookiesException: Not enough non-faulty bookies available
05:01:16,814 ERROR Write of ledger entry to quorum failed: L33321 E0
05:01:16,815 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
05:01:16,897 INFO  Deleted ledger 33321

25m       25m       1         pravega-zk-1.15a5d0a15c12624f   Pod       spec.containers{zookeeper}   Warning   Unhealthy   kubelet, e2e8965a-39db-4d68-9218-6dcc66d9d4a4   Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "process_linux.go:90: adding pid 13417 to cgroups caused \"failed to write 13417 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod6f8f52f1-8823-11e9-af33-005056946a8b/94219856b44f73a058c05e3c1b5a99eb8e47669fbfdb954a3e1102d7e95cfc4c/cgroup.procs: invalid argument\"": unknown


@vedanthh vedanthh changed the title Observing "BKException$BKLedgerRecoveryException: Error while recovering ledger" and Unhealthy Readiness probe failed with error "ERROR Closing ledger 33175 due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper Observing "BKException$BKLedgerRecoveryException: Error while recovering ledger" and Unhealthy Readiness probe failed with error "ERROR Closing ledger due to NotEnoughBookiesException: Not enough non-faulty bookies available" events in Bookkeeper Jun 7, 2019
@sumit-bm sumit-bm added kind/bug Something isn't working priority/P1 Recoverable error, functionality/performance impaired but not lost, no permanent damage labels Jun 7, 2019
@Tristan1900
Copy link
Member

Thanks for the report @sumit-bm @vedanthh, it would be useful to know if the BK is failing or have exceptions while the readiness probe failed.
The sporadic readiness probe failure is rarely related to the operator and it is more like an environment issue. For example this message
25m 25m 1 pravega-zk-1.15a5d0a15c12624f Pod spec.containers{zookeeper} Warning Unhealthy kubelet, e2e8965a-39db-4d68-9218-6dcc66d9d4a4 Readiness probe failed: OCI runtime exec failed: exec failed: container_linux.go:348: starting container process caused "process_linux.go:90: adding pid 13417 to cgroups caused \"failed to write 13417 to cgroup.procs: write /sys/fs/cgroup/cpu,cpuacct/kubepods/burstable/pod6f8f52f1-8823-11e9-af33-005056946a8b/94219856b44f73a058c05e3c1b5a99eb8e47669fbfdb954a3e1102d7e95cfc4c/cgroup.procs: invalid argument\"": unknown
is an issue in the runc and the fix is included in here. We are currently using docker 18.6.3 and that fix might be included in the future.

@pbelgundi pbelgundi added the version 0.3.2 This is an issue with operator version 0.3.2 label Jul 2, 2019
@pbelgundi pbelgundi assigned pbelgundi and unassigned Tristan1900 Jul 2, 2019
@pbelgundi
Copy link
Contributor

Not able to understand what is the exact problem being reported here?
Is it the readiness probe failure or the bookie going down?

Also it would help to have steps to reproduce the issue.

@vedanthh
Copy link
Author

vedanthh commented Jul 3, 2019

I didn't observed any bookie POD restart/failure during this but i'm observing reported errors in kubectl get events, I reported this based on the comment from adrian made on issue #146 as we are observing this frequently.

adrianmo commented on Apr 25 -- Issue https://github.com/pravega/pravega-operator/issues/146
Right, @Tristan1900, it looks like this is a very sporadic issue caused by an external factor and it's not causing any interruption of the service.

I'll close the issue. Please open a new one if the issue is manifested frequently or is causing any disruption.

Steps:

  1. Deploy Pravega
  2. Run IO workload using Longevity or Pravega-Benchmark
  3. Monitor kubectl get events

@pbelgundi
Copy link
Contributor

Could you please quantify the frequency of these failures over a period of time?
Also has this caused any disruption (restarts or pods being in not ready state) in bookeeper ?

@pbelgundi pbelgundi assigned Tristan1900 and unassigned pbelgundi Jul 4, 2019
@vedanthh
Copy link
Author

vedanthh commented Jul 4, 2019

I'm not observing any restart or pods being in not ready state in bookkeeper during these events, regarding frequency we are observing this repeatedly in kubectl get events in all the cluster while running IO.

[root@manager1 ~]# kubectl get events
LAST SEEN   FIRST SEEN   COUNT     NAME                                KIND      SUBOBJECT                 TYPE      REASON      SOURCE                                          MESSAGE
24m         7d           4493      pravega-bookie-1.15abb5c21f6065e9   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   (combined from similar events): Readiness probe failed: 03:34:53,983 INFO  Created ledger 421125
03:34:56,792 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421125
03:34:56,792 ERROR Closing ledger 421125 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:34:56,793 ERROR Write of ledger entry to quorum failed: L421125 E9
03:34:56,794 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:34:57,175 INFO  Deleted ledger 421125

34m       34m       1         pravega-bookie-1.15ae1658342959ae   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:24:44,355 INFO  Created ledger 420806
03:24:47,990 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420806
03:24:47,990 ERROR Closing ledger 420806 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:24:47,991 ERROR Write of ledger entry to quorum failed: L420806 E3
03:24:47,991 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:24:48,326 INFO  Deleted ledger 420806

33m       33m       1         pravega-bookie-1.15ae166880d0e808   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:25:53,996 INFO  Created ledger 420843
03:25:57,779 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420843
03:25:57,779 ERROR Closing ledger 420843 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:25:57,780 ERROR Write of ledger entry to quorum failed: L420843 E2
03:25:57,781 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:25:58,206 INFO  Deleted ledger 420843

32m       32m       1         pravega-bookie-1.15ae167463e3ca8e   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:26:43,993 INFO  Created ledger 420869
03:26:48,766 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420869
03:26:48,766 ERROR Closing ledger 420869 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:26:48,767 ERROR Write of ledger entry to quorum failed: L420869 E3
03:26:48,768 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:26:49,383 INFO  Deleted ledger 420869

31m       31m       1         pravega-bookie-1.15ae167d5ed23b6a   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:27:24,571 INFO  Created ledger 420889
03:27:27,319 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420889
03:27:27,319 ERROR Closing ledger 420889 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:27:27,404 ERROR Write of ledger entry to quorum failed: L420889 E2
03:27:27,414 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:27:27,815 INFO  Deleted ledger 420889

31m       31m       1         pravega-bookie-1.15ae1681e32282bb   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:27:44,299 INFO  Created ledger 420898
03:27:46,790 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420898
03:27:46,791 ERROR Closing ledger 420898 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:27:46,808 ERROR Write of ledger entry to quorum failed: L420898 E1
03:27:46,810 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:27:47,072 ERROR Error update ledger metadata for ledger 420898 : NoSuchLedgerExistsException: No such ledger exists
03:27:47,232 INFO  Deleted ledger 420898

30m       30m       1         pravega-bookie-1.15ae1696ceb1e3d5   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:29:14,090 INFO  Created ledger 420945
03:29:17,205 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420945
03:29:17,205 ERROR Closing ledger 420945 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:29:17,206 ERROR Write of ledger entry to quorum failed: L420945 E7
03:29:17,207 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:29:17,225 INFO  Deleted ledger 420945

29m       29m       1         pravega-bookie-1.15ae169b7438e543   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:29:34,255 INFO  Created ledger 420956
03:29:36,817 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420956
03:29:36,817 ERROR Closing ledger 420956 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:29:36,818 ERROR Write of ledger entry to quorum failed: L420956 E1
03:29:36,822 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:29:37,153 INFO  Deleted ledger 420956

29m       29m       1         pravega-bookie-1.15ae16a525f2e476   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:30:14,127 INFO  Created ledger 420979
03:30:17,736 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420979
03:30:17,736 ERROR Closing ledger 420979 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:30:17,737 ERROR Write of ledger entry to quorum failed: L420979 E3
03:30:17,738 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:30:18,648 INFO  Deleted ledger 420979

27m       27m       1         pravega-bookie-1.15ae16c156575d47   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:32:15,515 INFO  Created ledger 421042
03:32:19,335 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421042
03:32:19,335 ERROR Closing ledger 421042 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:32:19,336 ERROR Write of ledger entry to quorum failed: L421042 E2
03:32:19,337 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:32:19,677 ERROR Error update ledger metadata for ledger 421042 : NoSuchLedgerExistsException: No such ledger exists
03:32:19,824 INFO  Deleted ledger 421042

11m       11m       1         pravega-bookie-1.15ae1799057fb228   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:47:44,560 INFO  Created ledger 421524
03:47:45,865 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421524
03:47:45,865 ERROR Closing ledger 421524 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:47:45,867 ERROR Write of ledger entry to quorum failed: L421524 E0
03:47:45,867 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:47:46,240 INFO  Deleted ledger 421524

7m        7m        1         pravega-bookie-1.15ae17cec4d65ec1   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:51:34,312 INFO  Created ledger 421641
03:51:36,767 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421641
03:51:36,767 ERROR Closing ledger 421641 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:51:36,768 ERROR Write of ledger entry to quorum failed: L421641 E1
03:51:36,769 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:51:37,107 INFO  Deleted ledger 421641

4m        4m        1         pravega-bookie-1.15ae17fd38aada3a   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:54:54,967 INFO  Created ledger 421746
03:54:56,389 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421746
03:54:56,390 ERROR Closing ledger 421746 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:54:56,391 ERROR Write of ledger entry to quorum failed: L421746 E0
03:54:56,392 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:54:56,579 INFO  Deleted ledger 421746

3m        3m        1         pravega-bookie-1.15ae180d7dee2fdd   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:56:04,412 INFO  Created ledger 421781
03:56:06,506 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421781
03:56:06,506 ERROR Closing ledger 421781 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:56:06,507 ERROR Write of ledger entry to quorum failed: L421781 E0
03:56:06,509 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:56:06,704 ERROR Error update ledger metadata for ledger 421781 : NoSuchLedgerExistsException: No such ledger exists
03:56:06,804 INFO  Deleted ledger 421781

1m        1m        1         pravega-bookie-1.15ae182017ea90bb   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1   Readiness probe failed: 03:57:23,990 INFO  Created ledger 421823
03:57:26,217 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421823
03:57:26,217 ERROR Closing ledger 421823 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:57:26,221 ERROR Write of ledger entry to quorum failed: L421823 E6
03:57:26,222 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:57:26,311 ERROR Error update ledger metadata for ledger 421823 : NoSuchLedgerExistsException: No such ledger exists
03:57:26,358 INFO  Deleted ledger 421823

30m       30m       1         pravega-bookie-2.15ae168c92f47597   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 449610c4-0d4e-4eaa-be02-70d0eac3d455   Readiness probe failed: 03:28:31,998 INFO  Created ledger 420924
03:28:33,618 ERROR Could not get additional bookie to remake ensemble, closing ledger: 420924
03:28:33,619 ERROR Closing ledger 420924 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:28:33,620 ERROR Write of ledger entry to quorum failed: L420924 E1
03:28:33,621 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:28:33,696 ERROR Error update ledger metadata for ledger 420924 : NoSuchLedgerExistsException: No such ledger exists
03:28:33,697 INFO  Deleted ledger 420924

28m       28m       1         pravega-bookie-2.15ae16b15ba0c097   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 449610c4-0d4e-4eaa-be02-70d0eac3d455   Readiness probe failed: 03:31:04,866 INFO  Created ledger 421006
03:31:11,216 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421006
03:31:11,216 ERROR Closing ledger 421006 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:31:11,223 ERROR Write of ledger entry to quorum failed: L421006 E0
03:31:11,224 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:31:11,618 INFO  Deleted ledger 421006
03:31:11,619 ERROR Error update ledger metadata for ledger 421006 : NoSuchLedgerExistsException: No such ledger exists

5m        5m        1         pravega-bookie-2.15ae17f13791fe53   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 449610c4-0d4e-4eaa-be02-70d0eac3d455   Readiness probe failed: 03:54:01,667 INFO  Created ledger 421720
03:54:05,113 ERROR Could not get additional bookie to remake ensemble, closing ledger: 421720
03:54:05,113 ERROR Closing ledger 421720 due to NotEnoughBookiesException: Not enough non-faulty bookies available
03:54:05,118 ERROR Write of ledger entry to quorum failed: L421720 E0
03:54:05,119 WARN  Error in bookie sanity test
org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available
          at org.apache.bookkeeper.client.SyncCallbackUtils.finish(SyncCallbackUtils.java:83)
          at org.apache.bookkeeper.client.SyncCallbackUtils$SyncAddCallback.addComplete(SyncCallbackUtils.java:251)
          at org.apache.bookkeeper.client.AsyncCallback$AddCallback.addCompleteWithLatency(AsyncCallback.java:91)
          at org.apache.bookkeeper.client.PendingAddOp.submitCallback(PendingAddOp.java:361)
          at org.apache.bookkeeper.client.LedgerHandle.errorOutPendingAdds(LedgerHandle.java:1546)
          at org.apache.bookkeeper.client.LedgerHandle$5.safeRun(LedgerHandle.java:523)
          at org.apache.bookkeeper.common.util.SafeRunnable.run(SafeRunnable.java:36)
          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
          at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
          at java.lang.Thread.run(Thread.java:748)
03:54:05,213 INFO  Deleted ledger 421720
03:54:05,217 ERROR Error update ledger metadata for ledger 421720 : NoSuchLedgerExistsException: No such ledger exists

@pbelgundi pbelgundi added priority/P2 Slight inconvenience or annoyance to applications, system continues to function and removed priority/P1 Recoverable error, functionality/performance impaired but not lost, no permanent damage labels Jul 4, 2019
@pbelgundi
Copy link
Contributor

pbelgundi commented Jul 4, 2019

From logs, I see 18 Readiness probe failures in 30 mins. Each pod should have 180 probes in 30 mins ( given interval of 10 secs). Which means roughly 10% of the probes are failing. Though this is not enough to cause disruption, it needs to be investigated.
Also I see this one aggregated event showing 4497 failures in last 7 days for the same pod:

LAST SEEN   FIRST SEEN   COUNT     NAME                                KIND      SUBOBJECT                 TYPE      REASON      SOURCE                                          MESSAGE
24m         7d           4493      pravega-bookie-1.15abb5c21f6065e9   Pod       spec.containers{bookie}   Warning   Unhealthy   kubelet, 4bf9a49e-dc94-4a01-99c7-06300c575ac1

Reducing priority of the issue to P2 as it is not causing any disruption...

@pbelgundi
Copy link
Contributor

Need to check on the following:

  1. Are some/all of these readiness probe failures, false positives?
  2. Root cause above failures and find out if root cause is a pravega problem (bk issue) or operator (readiness probe configuration) problem.

@kevinhan88 kevinhan88 added pravega version 0.5.0 Issues with Pravega release 0.5.0 and removed pravega version 0.5.0 Issues with Pravega release 0.5.0 labels Jul 8, 2019
@Tristan1900 Tristan1900 added the status/needs-info Extra information is needed; add a comment with the exact information requested label Jul 9, 2019
@Tristan1900
Copy link
Member

@vedanthh Thanks for the report. I have tried to reproduce this today using the pravega-benchmark as you suggested. After for like 10 runs, the bug still cannot manifest itself. Could you please provide detailed instructions on how to reproduce this bug? It would definitely be helpful if you could show the parameters of when you run the benchmark. Thanks a lot.

@pbelgundi
Copy link
Contributor

In all above probe failures, I see this exception "org.apache.bookkeeper.client.BKException$BKNotEnoughBookiesException: Not enough non-faulty bookies available"
which seems to suggest that some bookies are either down or unreachable by the readiness probe.

@vedanthh, Around time when you see this probe failure, could you please check if any bookies are down using this command on the bookie pod:
./bin/bookkeeper shell listbookies -rw

@pbelgundi
Copy link
Contributor

I see this issue with "Bookkeeper on IPV6" where sanity check fails inspite of healthy bookies:
apache/pulsar#1953

@Tristan1900
Copy link
Member

Tristan1900 commented Jul 9, 2019

@pbelgundi The BKNotEnoughBookiesException means the local bookie is not available, it doesn't have anything to do with the whole bookkeeper cluster. As you know the readiness probe is to check if the individual pod is ready, so BKNotEnoughBookiesException just means that the bookie pod we kubectl describe is not ready. Check bookiesanity in this page.

For the pulsar issue, I also checked that before, it seems to be a different problem since their issue is connectionException. Our probe actually connects to the bookie but failed to write to ledger. It fails in this line.

@Tristan1900
Copy link
Member

I have reproduced the error today using the longevity mid-scale workload and I am using the default configuration of bookkeeper and pravega. This probe failure happens quickly within 5 minutes.

Here is the bookie server side log(DEBUG level) when a readiness probe fails

2019-07-09 21:03:33,337 - INFO  - [bookie-io-1-2:AuthHandler$ServerSideHandler$AuthHandshakeCompleteCallback@211] - Authentication success on server side
2019-07-09 21:03:33,337 - INFO  - [bookie-io-1-2:BookieRequestHandler@50] - Channel connected  [id: 0xf0b4f6d4, L:/172.24.183.10:3181 - R:/172.24.183.10:37138]
2019-07-09 21:03:33,450 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 62  Entry: 2096
2019-07-09 21:03:33,450 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:33,454 - DEBUG - [bookie-journal-callback-6-1:Journal$QueueEntry@313] - Acknowledge Ledger: 62, Entry: 2096
2019-07-09 21:03:33,464 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 98  Entry: -4096
2019-07-09 21:03:33,464 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:33,465 - DEBUG - [bookie-journal-callback-6-1:Journal$QueueEntry@313] - Acknowledge Ledger: 98, Entry: -4096
2019-07-09 21:03:33,465 - DEBUG - [bookie-journal-callback-6-1:Bookie$NopWriteCallback@197] - Finished writing entry -4096 @ ledger 98 for null : 0
2019-07-09 21:03:33,466 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 98  Entry: 0
2019-07-09 21:03:33,466 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:33,466 - DEBUG - [bookie-journal-callback-6-1:Journal$QueueEntry@313] - Acknowledge Ledger: 98, Entry: 0
2019-07-09 21:03:33,469 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 63  Entry: 2008
2019-07-09 21:03:33,469 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:33,473 - DEBUG - [bookie-journal-callback-6-1:Journal$QueueEntry@313] - Acknowledge Ledger: 63, Entry: 2008
2019-07-09 21:03:33,475 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 64  Entry: 1956
2019-07-09 21:03:33,475 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:33,485 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 65  Entry: 1654
2019-07-09 21:03:33,486 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:33,486 - DEBUG - [bookie-journal-callback-6-1:Journal$QueueEntry@313] - Acknowledge Ledger: 64, Entry: 1956
2019-07-09 21:03:38,716 - INFO  - [bookie-io-1-2:BookieRequestHandler@61] - Channels disconnected: [id: 0xf0b4f6d4, L:/172.24.183.10:3181 ! R:/172.24.183.10:37138]
2019-07-09 21:03:38,717 - DEBUG - [bookie-journal-callback-6-1:Journal$QueueEntry@313] - Acknowledge Ledger: 65, Entry: 1654
2019-07-09 21:03:38,717 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 98  Entry: 1
2019-07-09 21:03:38,717 - DEBUG - [ForceWriteThread:JournalChannel@254] - Journal ForceWrite
2019-07-09 21:03:38,718 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 64  Entry: 1957
2019-07-09 21:03:38,719 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 62  Entry: 2097
2019-07-09 21:03:38,720 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 63  Entry: 2009
2019-07-09 21:03:38,720 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 64  Entry: 1958
2019-07-09 21:03:38,722 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 62  Entry: 2098
2019-07-09 21:03:38,722 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 64  Entry: 1959
2019-07-09 21:03:38,722 - DEBUG - [BookieJournal-3181:Journal@1023] - Written and queuing for flush Ledger: 62  Entry: 2099

Notice that there is a 5 seconds gap at time 2019-07-09 21:03:38,716 which means that bookie is not doing anything before Channels disconnected. The workload is high in the experiment and we expect at least a hundred lines of log for every second, so this gap is definitly not normal. This stall in bookkeeper causes the readiness probe failure because readiness probe has a timeout.

It looks to me that this stall is caused by the high workload since I cannot reproduce the error using the pravega-benchmark, which I think the load is small.

My questions are, 1) is this stall expected under any circumstances? 2) will yourkit help to provide some more info if enabled for this issue?

It would be great to have some insights @fpj @pbelgundi. Much appreciated.

@pbelgundi
Copy link
Contributor

pbelgundi commented Jul 15, 2019

@pbelgundi The BKNotEnoughBookiesException means the local bookie is not available, it doesn't have anything to do with the whole bookkeeper cluster. As you know the readiness probe is to check if the individual pod is ready, so BKNotEnoughBookiesException just means that the bookie pod we kubectl describe is not ready. Check bookiesanity in [this page]
(https://bookkeeper.apache.org/docs/4.7.2/reference/cli/#the-bookkeeper-shell).

While I understand that BKNotEnoughBookiesException occurs when trying to write to the local bookie, since this is what the readiness check attempts to do, please note every write to a ledger can translate to writes to multiple bookies, based on write quorum and ack quorum:
https://bookkeeper.apache.org/archives/docs/r4.4.0/bookkeeperProtocol.html
Given this, it would help to know if one or more bookies are actually down (not ready) as per k8s, when this error is seen.....

BKNotEnoughBookiesException is just the outer exception, the inner cause is Error update ledger metadata for ledger 421720 : NoSuchLedgerExistsException: No such ledger exists
Need to investigate what causes this.

For how long are one or more bookie(s) unavailable and why, is what we need to answer to be able to solve this....

@pbelgundi
Copy link
Contributor

pbelgundi commented Jul 15, 2019

This stall in bookkeeper causes the readiness probe failure because readiness probe has a timeout.

The timeout you point to here is inside bookeeper shell code.
However, there seems to be an option to specify the timeout duration when invoking the sanity check :
https://github.com/apache/bookkeeper/blob/release-4.7.3/bookkeeper-server/src/main/java/org/apache/bookkeeper/bookie/BookieShell.java#L1149
@Tristan1900, Can you try using a longer timeout and see if the probe failures disappear?

@sumit-bm
Copy link

@pbelgundi would this timeout be passed as an option in pravega manifest for operator based deployment? How would you like us to validate this? Have you tested it in actual deployment and got a fruitful result? Have you reproduced the issue as I see the above comments are very generic in nature and we need more concrete points as we are experiencing these issues?

My suggestion would be please reproduce the issue and then let us know how to go forward.

@pbelgundi
Copy link
Contributor

@sumit-bm Wenqi has reproduced this issue, as stated above.
In case that was not clear, the recommendation to add timeout and check the result of probes was for Wenqi.

@Tristan1900
Copy link
Member

@sumit-bm @pbelgundi, thanks for the comments. Sorry for the late updates. I have modified the timeout to 5 seconds and it seems the problem has gone. Will do some more experiments and build an image for the test team to test, also will raise a PR. Thanks!

@Tristan1900
Copy link
Member

@vedanthh could you please run the longevity test again using this image tristan1900/pravega-operator:probe built from the above PR? Thanks!

@vedanthh
Copy link
Author

@Tristan1900 I will run the longevity test after retest of #issue 3900 and #issue 4011 complete and upon cluster availability, will update you shortly and validate probe failure issue with tristan1900/pravega-operator:probe build.

@Tristan1900
Copy link
Member

@vedanthh Thank you!

@pbelgundi
Copy link
Contributor

Closing this for now. If seen again please open an issue on Bookkeeper Operator:
https://github.com/pravega/bookkeeper-operator

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/bug Something isn't working priority/P2 Slight inconvenience or annoyance to applications, system continues to function status/needs-info Extra information is needed; add a comment with the exact information requested version 0.3.2 This is an issue with operator version 0.3.2
Projects
None yet
Development

No branches or pull requests

6 participants