
Consumer: reduce ticker allocations #1028

Merged (1 commit merged into IBM:master on Feb 9, 2018)
Conversation

@faceair (Contributor) commented Jan 24, 2018

A ticker is created after each response, even when there are no messages in the response, so the CPU stays high even when the consumer is idle. Ticker creation and destruction is expensive, and as the number of fetch requests grows, the tickers consume both CPU and memory.

So I reuse the ticker instead, without changing the existing behavior: the timeout is still low precision (it fires somewhere between MaxProcessingTime and 2 * MaxProcessingTime).
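
For context, a rough sketch of the per-response ticker pattern described above (illustrative only, not the actual Sarama source; the channel and type names are invented):

```go
package sketch

import "time"

type message struct{ value string }

// feedOld mimics the old shape: a fresh ticker is allocated and torn down for
// every fetch response, even for responses that carry no messages at all.
func feedOld(responses <-chan []message, out chan<- message, maxProcessingTime time.Duration) {
	for msgs := range responses {
		ticker := time.NewTicker(maxProcessingTime) // one allocation per response
		for _, msg := range msgs {
			select {
			case out <- msg:
				// message delivered in time
			case <-ticker.C:
				// slow-consumer handling omitted in this sketch
			}
		}
		ticker.Stop() // and one teardown per response
	}
}
```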

Before: [screenshots]

After: [screenshot]

@eapache (Contributor) commented Feb 2, 2018

> A ticker is created after each response, even when there are no messages in the response, so the CPU stays high even when the consumer is idle.

When the consumer is idle you should only be processing maybe one or two responses per second, so there should be no noticeable cost here. What is your Consumer.MaxWaitTime config?

The change is still OK, but I'm curious if it might make things worse for high-volume consumers? My understanding is that time.Now() can also be expensive, and we're now calling that once per message (not once per response).

@faceair (Contributor, Author) commented Feb 2, 2018

I use sarama-cluster and did not change the default config, so MaxWaitTime is 250 * time.Millisecond. Every process has 8 cluster.Consumer instances, and I see about 500 fetch requests per second per process on the cluster monitor when idle. I don't know why; it's very strange...
Or can I use some other method to verify this? Maybe I can add some event tracking in the process. 🤔

The cost of time.Now() is clear, but the cost of tickers once Go's timer heap grows is not. If the only worry is the cost of time.Now(), we can keep a millisecond-precision cached clock (like a coarseTime), so reading the time becomes a constant, cheap operation.
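
A minimal sketch of that coarse-time idea (illustrative only, not part of this PR; the package and function names are hypothetical):

```go
package coarsetime

import (
	"sync/atomic"
	"time"
)

var cached atomic.Value // always holds a time.Time

func init() {
	cached.Store(time.Now())
	go func() {
		// Refresh the cached clock once per millisecond in the background.
		for t := range time.Tick(time.Millisecond) {
			cached.Store(t)
		}
	}()
}

// Now returns the cached wall clock, accurate to roughly one millisecond,
// so hot paths pay a cheap atomic load instead of a full time.Now() call.
func Now() time.Time {
	return cached.Load().(time.Time)
}
```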

@eapache (Contributor) commented Feb 2, 2018

> Every process has 8 cluster.Consumer instances

I don't know a lot about sarama-cluster, but I would think you'd only need one per process?

> I see about 500 fetch requests per second per process on the cluster monitor when idle

That is very high for idle. How many brokers are in your Kafka cluster?

@faceair (Contributor, Author) commented Feb 3, 2018

In my system topics have priorities, so I use multiple instances and control when each consumer processes messages.
There are 20 brokers in this cluster.

I'll verify the count of the fetch requests next Monday.

@faceair (Contributor, Author) commented Feb 3, 2018

Is MaxProcessingTime the time to deal with each response? Should I record startSend once per response instead of once per message?

@eapache (Contributor) commented Feb 3, 2018

No, MaxProcessingTime is the time to deal with each message; what you've got here is correct.
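
For reference, both settings discussed here live on the consumer section of the Sarama config. A minimal sketch, using the current import path and values that I believe match the library defaults (shown only for illustration):

```go
package main

import (
	"fmt"
	"time"

	"github.com/IBM/sarama"
)

func main() {
	config := sarama.NewConfig()
	// How long a fetch request may block on the broker waiting for new data.
	config.Consumer.MaxWaitTime = 250 * time.Millisecond
	// Per-message budget before the partition is detached to unblock the others.
	config.Consumer.MaxProcessingTime = 100 * time.Millisecond
	fmt.Println(config.Consumer.MaxWaitTime, config.Consumer.MaxProcessingTime)
}
```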

@eapache (Contributor) commented Feb 3, 2018

One approach that might be more efficient (since it wouldn't require tracking time.Now() on every message) would be to keep the msgSent boolean but simply set it to true inside the loop before the messageSelect label. It would no longer represent a sent message exactly (maybe it could be renamed to firstAttempt or something), but it would still let you move the ticker outside the main loop.
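
A rough sketch of that shape (illustrative only, not the actual Sarama code; the types and channels are invented, and the slow-consumer branch here just skips the message instead of detaching the partition the way Sarama does):

```go
package main

import (
	"fmt"
	"time"
)

type message struct{ value string }

func feed(batches <-chan []message, out chan<- message, maxProcessingTime time.Duration) {
	// One ticker for the life of the feeder, instead of one per response.
	expiryTicker := time.NewTicker(maxProcessingTime)
	defer expiryTicker.Stop()

	firstAttempt := true // must start true so the first tick is only a warning
	for batch := range batches {
		for _, msg := range batch {
		messageSelect:
			select {
			case out <- msg:
				firstAttempt = true // progress was made; reset the grace period
			case <-expiryTicker.C:
				if firstAttempt {
					// First tick while this message is pending: allow one
					// more full tick before declaring the consumer too slow.
					firstAttempt = false
					goto messageSelect
				}
				// Second consecutive tick: give up on this message in this
				// sketch (Sarama instead detaches the whole partition here).
				fmt.Println("consumer too slow, skipping:", msg.value)
				firstAttempt = true
			}
		}
	}
	close(out)
}

func main() {
	batches := make(chan []message, 1)
	out := make(chan message)
	batches <- []message{{"a"}, {"b"}}
	close(batches)
	go feed(batches, out, 50*time.Millisecond)
	for m := range out {
		fmt.Println("received", m.value)
	}
}
```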

@faceair force-pushed the master branch 2 times, most recently from d7428ad to 733cd80, on February 4, 2018 at 05:12
@faceair (Contributor, Author) commented Feb 4, 2018

yeah, it's a good solution. 😁

@faceair (Contributor, Author) commented Feb 5, 2018

I added a counter in the responseFeeder loop, and it shows that the partitionConsumer.feeder loop runs about 500 times per second per process, which is about 500/8 ≈ 62 times per second for each sarama instance.

And I found consumer.go#L727: when a broker fetches a response, it dispatches the response to every partitionConsumer.feeder, so the partitionConsumer.feeder loop runs more times than there are fetch requests. Every topic has 50 partitions and I have 3 processes, so every process has about 16 partitions; 62/16 ≈ 4 times per second per broker.

So it looks like there's no problem here.

@eapache (Contributor) left a comment

@daedric wondering if you have thoughts on this PR, since you recently submitted #1036 in a slightly different direction. There's no "leak" in this PR, but it would involve a ticker hanging around all the time for slow consumers.

consumer.go (Outdated)

```diff
@@ -441,20 +441,20 @@ func (child *partitionConsumer) HighWaterMarkOffset() int64 {
 
 func (child *partitionConsumer) responseFeeder() {
 	var msgs []*ConsumerMessage
-	msgSent := false
+	var firstAttempt bool
```
@eapache (Contributor) commented on var firstAttempt bool:

I believe this needs to be initialized to true now.

@daedric (Contributor) commented Feb 7, 2018

Actually, while the leak has been fixed, I meant to ask what the real purpose of this ticker is. When it fires, it does nothing apart from a continue that will eventually process the remaining messages.
If you were to ask me, without more input, I would simply remove this ticker :)

@eapache (Contributor) commented Feb 7, 2018

The first time it ticks, it simply sets a flag and repeats the select statement. The second time it ticks, it fully detaches the partition from the consumer in order to unblock the remaining partitions. This is important so that a slow message consumer doesn't drag the entire process down.

@eapache merged commit 5910255 into IBM:master on Feb 9, 2018
@faceair (Contributor, Author) commented Feb 25, 2018

After this PR is merged, should the line at consumer.go#L464 be deleted?

@eapache (Contributor) commented Feb 26, 2018

Yes! I'm not sure how we missed that.

eapache added a commit that referenced this pull request Feb 26, 2018
Now that we reuse a single ticker the whole time, this would probably lead to
the consumer just kind of hanging and never resuming.

Thanks to #1028 (comment) for
catching this.
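
As a general illustration of why a leftover line could cause the hang the commit message describes once the ticker became shared (a hypothetical, minimal example, not the actual Sarama code): a stopped ticker's channel never fires again, so any later select that still waits on it can block forever.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	ticker := time.NewTicker(100 * time.Millisecond)
	ticker.Stop() // an extra Stop() permanently silences a shared ticker

	select {
	case <-ticker.C:
		fmt.Println("tick") // never reached: a stopped ticker delivers no more ticks
	case <-time.After(time.Second):
		// Without this escape hatch the select would block forever, which is
		// the "hanging and never resuming" behavior described above.
		fmt.Println("timed out waiting on a stopped ticker")
	}
}
```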