Telegraf stops consuming from partition on GetOffset error, does not try again (affects entire consumer group) #3553
Comments
This is what my config looks like |
@seuf Have you seen this error before? |
@danielnelson Sorry, nope. Maybe we should switch to the Confluent Kafka client: https://github.com/confluentinc/confluent-kafka-go |
I tried restarting my entire Kafka cluster, to no avail. |
Could the high number of partitions somehow play a role? I have 140 partitions in the topic. |
I think this might be bsm/sarama-cluster#121, which has been fixed upstream. I will update to the latest upstream version v2.1.10. I'm a little nervous to put this into 1.5.0 since I'm hoping to release next week. Do you think you could test it if I add it to 1.5.0-rc2? |
Yep, no problemo =) |
Here are the builds of 1.5.0-rc2. |
I will test now and report back |
After running for an hour, everything is OK so far. |
Right, I still get this error, albeit on a much smaller scale. Over the last 24 hours, this has happened to only 4 partitions, with a total loss of ~20k messages (I produce around 120 0000 messages per minute from my producers). |
I notice something interesting. |
So now it is recovering without a restart, but you are still losing messages? |
I am not losing messages so much as they are "late". Moment A1: Now some time elapses and we get to: It is not a major bug now that they are getting consumed (while before 1.5 rc2 they were just stuck forever), just a tad annoying :) |
Closing as there hasn't been any activity in this bug report for a long time. If you are still facing this issue, I recommend using the latest version of Telegraf and posting your configuration and any debug information. Thanks! |
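For readers following along, the difference between "lost" and "late" messages can be checked by looking at per-partition consumer lag, that is, the broker's log-end offset minus the group's committed offset. The following is only a minimal sketch of that arithmetic, not anything Telegraf does: the function names, the threshold, and the offsets in the usage lines are illustrative assumptions.

```python
from typing import Dict, List


def partition_lag(log_end: Dict[int, int], committed: Dict[int, int]) -> Dict[int, int]:
    """Lag per partition = log-end offset minus committed offset (never negative)."""
    lag = {}
    for partition, end in log_end.items():
        # Skip partitions with no committed offset: their lag is unknown here.
        if partition in committed:
            lag[partition] = max(end - committed[partition], 0)
    return lag


def late_partitions(lag: Dict[int, int], threshold: int) -> List[int]:
    """Partitions whose backlog exceeds the (hypothetical) threshold, sorted."""
    return sorted(p for p, n in lag.items() if n > threshold)


# Illustrative offsets only; not taken from the reporter's cluster.
log_end = {26: 1500, 27: 900}
committed = {26: 1000, 27: 900}
lag = partition_lag(log_end, committed)
print(lag, late_partitions(lag, threshold=100))
```

A partition whose lag grows and then drains back to zero is "late"; one whose committed offset never advances again is stuck, which is what was seen before 1.5.0-rc2.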
Telegraf version telegraf-1.4.5-1.x86_64 running on 3.10.0-327.36.3.el7.x86_64, Kafka is 1.0.0, Scala for Kafka is 2.12.
Everything runs fine, until:
Dec 07 18:15:53 telecons02 telegraf[2820]: 2017-12-07T17:15:53Z E! Error in plugin [inputs.kafka_consumer]: Consumer Error: kafka: error while consuming telegrafcommon/26: kafka server: The requested offset is outside the range of offsets maintained by the server for the given topic/partition.
After this, the specified partition is not consumed anymore and messages pile up.
This is fixable by restarting telegraf.
An interesting fact is that I have a few telegraf instances consuming kafka messages, and all of them hit this issue on a few partitions (random partitions, I cannot localize it to one partition). When I restart one telegraf instance, the entire consumer group goes back to normal and messages are flowing (even on partitions served by the other instances that were stuck).
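To illustrate the behaviour one would expect instead of a permanent stall, here is a minimal sketch of recovering from an "offset out of range" error by resetting to one end of the retained log. It is a pure in-memory model under stated assumptions, not Telegraf or sarama code: `Partition`, `OffsetOutOfRange`, `resume_offset`, and the `reset` policy names are all hypothetical.

```python
from dataclasses import dataclass


class OffsetOutOfRange(Exception):
    """Stand-in for the broker's 'requested offset is outside the range' error."""


@dataclass
class Partition:
    # Hypothetical model of the offsets the broker still retains.
    oldest: int  # first offset still available
    newest: int  # offset the next produced message will receive

    def fetch(self, offset: int) -> int:
        # Offsets outside [oldest, newest] are rejected.
        if offset < self.oldest or offset > self.newest:
            raise OffsetOutOfRange(offset)
        return offset


def resume_offset(partition: Partition, committed: int, reset: str = "oldest") -> int:
    """Return the offset to consume from, resetting instead of giving up."""
    try:
        return partition.fetch(committed)
    except OffsetOutOfRange:
        # Assumed policy: fall back to one end of the retained range.
        return partition.oldest if reset == "oldest" else partition.newest


p = Partition(oldest=100, newest=500)
print(resume_offset(p, 42))  # committed offset already deleted by retention
```

Resetting to the oldest offset re-reads whatever is still retained, while resetting to the newest skips the backlog; either way the partition keeps being consumed rather than piling up.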
Pls help ;(