Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Avoid creating offset consumer before updating consumer metrics #426

Conversation

BewareMyPower
Copy link
Collaborator

@BewareMyPower BewareMyPower commented Apr 5, 2021

Motivation

Sometimes multiple brokers could subscribe the same topic partition and fail with ConsumerBusyException:

WARN  io.streamnative.pulsar.handlers.kop.coordinator.group.OffsetAcker - Error when get consumer for offset ack:
  java.util.concurrent.CompletionException: org.apache.pulsar.client.api.PulsarClientException$ConsumerBusyException: Exclusive consumer is already connected

This problem was mentioned in #263 before. However, switching the subscription type to shared or failover is not correct because a partition should have only one consumer at the same time. The problem is before updating consumer metrics, it uses OffsetAcker#getConsumer to check if the consumer has been created. This check is wrong.

Firstly, OffsetAcker#getConsumer may create an offset consumer, while the actual offset consumer could have been created by another broker.
Secondly, it only checks if the future of consumer is done and returns immediately if not. It could cause a situation that this metrics update was skipped.

Modifications

  • Modify the condition check in KafkaTopicConsumer#getGroupConsumers. Because the offset consumer could be created by GroupCoordinator#handleSyncGroup before, here we just check if the consumer existed. Then update the consumer metrics when the future is completed.
  • Remove the failed future of offset consumers.
  • Add consumer stats tests to guarantee the modification works.

@BewareMyPower BewareMyPower force-pushed the bewaremypower/fix-consumer-metrics-check branch from 4336570 to 1f1dc01 Compare April 6, 2021 04:11
Copy link
Contributor

@dockerzhang dockerzhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BewareMyPower
Copy link
Collaborator Author

I've also changed the log level of empty group check to debug.

    public CompletableFuture<Consumer> getGroupConsumers(String groupId, TopicPartition kafkaPartition) {
        if (StringUtils.isEmpty(groupId)) {
            if (log.isDebugEnabled()) {
                log.debug("Try to get group consumers with an empty group id");
            }
            return CompletableFuture.completedFuture(null);
        }

Because in my test environment, there're a lot of logs like

13:27:09.864 [bookkeeper-ml-workers-OrderedExecutor-6-0] INFO io.streamnative.pulsar.handlers.kop.MessageFetchContext - get group name from zk for current connection:/10.0.2.6:49286 groupId:null
13:27:09.864 [bookkeeper-ml-workers-OrderedExecutor-6-0] WARN io.streamnative.pulsar.handlers.kop.KafkaTopicManager - Try to get group consumers with an empty group id
...
13:31:07.859 [bookkeeper-ml-workers-OrderedExecutor-6-0] INFO io.streamnative.pulsar.handlers.kop.MessageFetchContext - get group name from zk for current connection:/10.0.2.6:49286 groupId:null
13:31:07.859 [bookkeeper-ml-workers-OrderedExecutor-6-0] WARN io.streamnative.pulsar.handlers.kop.KafkaTopicManager - Try to get group consumers with an empty group id

It looks like there's something wrong with

String groupName = requestHandler
.getCurrentConnectedGroup().computeIfAbsent(clientHost, ignored -> {
String zkSubPath = ZooKeeperUtils.groupIdPathFormat(clientHost,
fetch.getHeader().clientId());
String groupId = ZooKeeperUtils.getData(requestHandler.getPulsarService().getZkClient(),
requestHandler.getGroupIdStoredPath(), zkSubPath);
log.info("get group name from zk for current connection:{} groupId:{}",
clientHost, groupId);
return groupId;
});
.

Could you help take a look? @dockerzhang

@dockerzhang
Copy link
Contributor

@BewareMyPower I think there is no zk for UT to store group id, so the getData returns null.

@BewareMyPower
Copy link
Collaborator Author

@dockerzhang It's the problem in the real environment not UT. The ZK node exists like:

ls /client_group_id
[10.0.2.10-consumer-sub-000-G0triB0-1, 10.0.2.10-consumer-sub-000-t3lUJiE-2]

The logs occurred in another broker whose IP is 10.0.2.4.

@BewareMyPower
Copy link
Collaborator Author

BTW, the testDeleteClosedTopics is very flaky, I'll open a new PR to fix it if it still fails.

@BewareMyPower
Copy link
Collaborator Author

Wait for #427 merged.

@BewareMyPower BewareMyPower merged commit d0053b1 into streamnative:master Apr 8, 2021
@BewareMyPower BewareMyPower deleted the bewaremypower/fix-consumer-metrics-check branch April 12, 2021 14:41
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants