Skip to content
This repository has been archived by the owner on Jan 24, 2024. It is now read-only.

Use Subscription#resetCursor for cumulative ACK #515

Conversation

BewareMyPower
Copy link
Collaborator

Motivation

Currently KoP uses OffsetAcker to manage a list of Pulsar consumers for cumulative ACK. However, it wastes a lot of resources. For example, a topic with N partitions will create N consumers across all brokers. Besides, for each client side consumer, broker will create an associated broker side consumer (in package org.apache.pulsar.broker.service). In addition, the broker side consumer also updates two fields.

See #513, it could easily makes a network channel broken when OFFSET_COMMIT request are frequent.

What's worse is, when a bundle ownership changes, the consumers in OffsetAcker cannot detect the change. Because it needs to call BrokerService#getTopic to get the PersistentTopic, then find the offset's position by looking up in the ManagedLedger. However, even it failed, no error code will be sent to Kafka client.

Modifications

  • Use Subscription#resetCursor for cumulative ACK, for the first time when the cursor doesn't exist, create the PersistentSubscription.
  • Only for those partitions that succeed in OffsetAcker#ackOffset will we write the data to __consumer_offsets.
  • Remove CommitOffsetBacklogTest because it relies on the update of Subscription.
  • Revert kop using pulsar consumer metrics #263, because it relies on the OffsetAcker's internal consumers and the associated broker side consumers.

@BewareMyPower BewareMyPower requested a review from jiazhai as a code owner May 18, 2021 16:40
@BewareMyPower
Copy link
Collaborator Author

@hangc0276 The CPU usage problem may be related to the openmessaging-benchmark's bug, the OFFSET_COMMIT requests are very often.

17:07:26.295 [bookkeeper-ml-workers-OrderedExecutor-3-0] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [public/default/persistent/test-topic-4jMpA3M-0000-partition-14] Initiate reset position to 685:307 on cursor sub-000-keXE2c4
17:07:26.312 [bookkeeper-ml-workers-OrderedExecutor-3-0] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [public/default/persistent/test-topic-4jMpA3M-0000-partition-14] Initiate reset position to 685:308 on cursor sub-000-keXE2c4
17:07:26.354 [bookkeeper-ml-workers-OrderedExecutor-3-0] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [public/default/persistent/test-topic-4jMpA3M-0000-partition-14] Initiate reset position to 685:309 on cursor sub-000-keXE2c4
17:07:26.372 [bookkeeper-ml-workers-OrderedExecutor-3-0] INFO org.apache.bookkeeper.mledger.impl.ManagedCursorImpl - [public/default/persistent/test-topic-4jMpA3M-0000-partition-14] Initiate reset position to 685:310 on cursor sub-000-keXE2c4

KoP has reset the cursor for about 33 to 47 times in every second. There may be something wrong with my benchmark client. (openmessaging-benchmark)

However, for the same workloads, after this PR, KoP has a significant better performance. See the following compare.

image

image

Before this PR, broker could be easily broken when lots of OFFSET_COMMIT requests arrived.

@BewareMyPower
Copy link
Collaborator Author

@dockerzhang I removed the consumer side stats in this PR. Reusing broker side consumer's stats is convenient, however, KoP will manage N broker consumers and N client consumers for N topics/partitions. When N is too large, the performance could be impacted significantly.

I think we can add these metrics to prometheus metrics later.

@wuzhanpeng
Copy link
Contributor

I have a question, have you tried adjusting different commit frequencies to observe performance changes? Or in other words, how to determine that the performance loss is caused by changing the way of commit(ackOffset -> resetCursor) ?

@BewareMyPower
Copy link
Collaborator Author

BewareMyPower commented May 19, 2021

@wuzhanpeng I'm going to do this today because I just found the commit frequency is too last after this change (lots of INFO logs from pulsar side). BTW, for the previous test (#513), you can see all errors in writeAndFlush's callback were related to OFFSET_COMMIT. Besides, the FETCH request latency is very low.

@BewareMyPower
Copy link
Collaborator Author

There're 2 tests that may fail.

Error:    BasicEndToEndKafkaTest.testDeleteClosedTopics:98 null
Error:    KafkaRequestHandlerTest.testOffsetCommitRequestRetentionMs:603 » NoSuchElement

I'll try to fix them or remove them.

Copy link
Collaborator

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@BewareMyPower
Copy link
Collaborator Author

We're going to remove OffsetAcker, so close this PR.

@BewareMyPower BewareMyPower deleted the bewaremypower/rest-api-ack branch May 20, 2021 10:31
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants