Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

[BUG] Failed BrokerService#getTopic cause metadata request always fail #324

Closed
BewareMyPower opened this issue Jan 15, 2021 · 2 comments
Closed

Comments

@BewareMyPower
Copy link
Collaborator

Describe the bug
Sometimes the Kafka Client failed with

Topic not present in metadata after 60000 ms.

Obviously, it's because METADATA request failed. After checking the logs, I found that it's caused by the first failed BrokerService#getTopic.

Here's the logs:

12:14:28.472 [zk-metadata-store-callback-20-1] ERROR io.streamnative.pulsar.handlers.kop.KafkaTopicManager - [[id: 0x65c5c076, L:/172.21.0.5:9092 - R:/172.21.0.2:58652]] Failed to getTopic persistent://public/fexfiusuynjxuxrs/my-kafka-topic-partition-0. exception:
org.apache.pulsar.broker.service.BrokerServiceException$PersistenceException: org.apache.bookkeeper.mledger.ManagedLedgerException: org.apache.bookkeeper.mledger.ManagedLedgerException$BadVersionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /managed-ledgers/public/fexfiusuynjxuxrs/persistent/my-kafka-topic-partition-0
        at org.apache.pulsar.broker.service.BrokerService$3.openLedgerFailed(BrokerService.java:1122) ~[org.apache.pulsar-pulsar-broker-2.7.0.3.jar:2.7.0.3]
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl.lambda$asyncOpen$9(ManagedLedgerFactoryImpl.java:426) ~[org.apache.pulsar-managed-ledger-2.7.0.3.jar:2.7.0.3]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_275]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_275]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_275]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_275]
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerFactoryImpl$2.initializeFailed(ManagedLedgerFactoryImpl.java:421) ~[org.apache.pulsar-managed-ledger-2.7.0.3.jar:2.7.0.3]
        at org.apache.bookkeeper.mledger.impl.ManagedLedgerImpl$1.operationFailed(ManagedLedgerImpl.java:364) ~[org.apache.pulsar-managed-ledger-2.7.0.3.jar:2.7.0.3]
        at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.lambda$null$1(MetaStoreImpl.java:81) ~[org.apache.pulsar-managed-ledger-2.7.0.3.jar:2.7.0.3]
        at java.util.concurrent.CompletableFuture.uniExceptionally(CompletableFuture.java:884) ~[?:1.8.0_275]
        at java.util.concurrent.CompletableFuture$UniExceptionally.tryFire(CompletableFuture.java:866) ~[?:1.8.0_275]
        at java.util.concurrent.CompletableFuture.postComplete(CompletableFuture.java:488) ~[?:1.8.0_275]
        at java.util.concurrent.CompletableFuture.completeExceptionally(CompletableFuture.java:1990) ~[?:1.8.0_275]
        at org.apache.pulsar.metadata.impl.zookeeper.ZKMetadataStore.lambda$null$10(ZKMetadataStore.java:186) ~[org.apache.pulsar-pulsar-metadata-2.7.0.3.jar:2.7.0.3]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_275]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_275]
        at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30) [io.netty-netty-common-4.1.51.Final.jar:4.1.51.Final]
        at java.lang.Thread.run(Thread.java:748) [?:1.8.0_275]
Caused by: org.apache.bookkeeper.mledger.ManagedLedgerException: org.apache.bookkeeper.mledger.ManagedLedgerException$BadVersionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /managed-ledgers/public/fexfiusuynjxuxrs/persistent/my-kafka-topic-partition-0
Caused by: org.apache.bookkeeper.mledger.ManagedLedgerException$BadVersionException: org.apache.pulsar.metadata.api.MetadataStoreException$BadVersionException: org.apache.zookeeper.KeeperException$BadVersionException: KeeperErrorCode = BadVersion for /managed-ledgers/public/fexfiusuynjxuxrs/persistent/my-kafka-topic-partition-0
12:14:28.474 [zk-metadata-store-callback-20-1] WARN  io.streamnative.pulsar.handlers.kop.KafkaRequestHandler - [[id: 0x65c5c076, L:/172.21.0.5:9092 - R:/172.21.0.2:58652]] findBroker: Failed to getOrCreateTopic persistent://public/fexfiusuynjxuxrs/my-kafka-topic-partition-0. broker:PLAINTEXT://0.0.0.0:19092, exception:
12:14:28.474 [zk-metadata-store-callback-20-1] WARN  io.streamnative.pulsar.handlers.kop.KafkaRequestHandler - [[id: 0x65c5c076, L:/172.21.0.5:9092 - R:/172.21.0.2:58652]] Request RequestHeader(apiKey=METADATA, apiVersion=6, clientId=producer-1, correlationId=1): Exception while find Broker metadata
12:14:28.474 [zk-metadata-store-callback-20-1] WARN  io.streamnative.pulsar.handlers.kop.KafkaRequestHandler - Failed find Broker metadata, create PartitionMetadata with NOT_LEADER_FOR_PARTITION
12:14:28.588 [pulsar-web-42-4] INFO  org.eclipse.jetty.server.RequestLog - 172.21.0.5 - - [15/Jan/2021:12:14:28 +0000] "GET /admin/v2/persistent/public/fexfiusuynjxuxrs/my-kafka-topic/partitions HTTP/1.1" 200 16 "-" "Pulsar-Java-v2.7.0.3" 4
# Above 3 lines repeated until client disconnected

The first error is caused by

brokerService.getTopic(t, brokerService.isAllowAutoTopicCreation(t)).whenComplete((t2, throwable) -> {
if (throwable != null) {
log.error("[{}] Failed to getTopic {}. exception:",
requestHandler.ctx.channel(), t, throwable);
// failed to getTopic from current broker, remove cache, which added in getTopicBroker.
removeTopicManagerCache(t);
topicCompletableFuture.complete(null);

They we can see the repeated error is caused by KafkaTopicManager#getTopic:

topicManager.getTopic(topic.toString()).whenComplete((persistentTopic, exception) -> {
if (exception != null || persistentTopic == null) {
log.warn("[{}] findBroker: Failed to getOrCreateTopic {}. broker:{}, exception:",
ctx.channel(), topic.toString(), endPoint.getOriginalListener(), exception);
// remove cache when topic is null
topicManager.removeTopicManagerCache(topic.toString());
returnFuture.complete(null);

To Reproduce
The KoP version is v2.7.0.4-rc-202101150854 (branch-2.7.0). It's not easy to reproduce. It's my private integration test environment, which will set 1 ZK, 1 BK, 1 Broker and KoP for simple produce test.

Expected behavior
A failed BrokerService#getTopic invocation should not affect followed the METADATA request attempts.

@BewareMyPower
Copy link
Collaborator Author

I'm not sure if #267 fixed the bug, I'll cherry-pick this commit and test again. Currently the test has about 50% possibility to fail.

@BewareMyPower
Copy link
Collaborator Author

Confirmed that #267 works

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

1 participant