Continue to calculate lag for inactive groups for a configurable timespan #66
Comments
Another potential direction here might be a flag that enables collection of (earliest and) latest offset metrics for all topics, regardless of consumer state. This would make it possible to alert on unconsumed topics/partitions, which helps prevent data loss.
That would be useful, but based on observations from #63 it seems that inactive groups aren't available when using |
So I think something else is actually going on here. The
QA [email protected]:~$ /usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
rkrage_test
QA [email protected]:~$ /usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe
Consumer group 'rkrage_test' has no active members.
TOPIC        PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG  CONSUMER-ID  HOST  CLIENT-ID
rkrage_test  0          4090            4090            0    -            -     -
QA [email protected]:~$ /usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe --state
Consumer group 'rkrage_test' has no active members.
COORDINATOR (ID)         ASSIGNMENT-STRATEGY  STATE  #MEMBERS
log-kafka02.qa:9092 (2)                       Empty  0
I believe this is the source code it's using to list all groups: https://github.com/apache/kafka/blob/6dc6f6a60ddf7a70c394c147fbed579749d2abcc/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L181-L185
I think this is the same issue as #126, where a group with no active members had its information inadvertently filtered out. @lilyevsky resolved this with #128, which was released in 0.6.2.
This makes me think it might be the same issue.
We make the same call in KafkaClient. Can anyone confirm with the latest version of Kafka Lag Exporter? (@rkrage)
@seglo, just upgraded to 0.6.3 today and this appears to be solved for us!
Inspired by discussion in #63
Add a feature that continues to calculate consumer group lag for a group after it's no longer active. Today, we immediately evict metrics for groups that no longer exist. We detect that a group has been removed by comparing the list of groups returned in the current poll to the list returned in the last poll. Instead of removing metrics immediately when we discover that a group no longer exists (it's no longer returned when we retrieve group metadata), we will continue to calculate lag for its last reported partition subscription. When a group is detected as removed, it will be added with a timestamp to a removal list that is cleaned up after each poll. If a group has been in the removal list longer than a configured time span, it will be removed. If the group becomes active again, it is taken off the removal list. A default of 30 minutes would be a good value to start with.
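To make the proposal concrete, here is a rough sketch of the removal-list bookkeeping in Python. It is not Kafka Lag Exporter code (the project itself is Scala), and the names `GroupRetention`, `poll`, and `retention_seconds` are hypothetical, chosen only for illustration. It assumes each poll yields the list of group names currently returned by the cluster and that the caller supplies the current time in seconds.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class GroupRetention:
    """Hypothetical tracker for groups that have dropped out of the poll result."""

    retention_seconds: float = 30 * 60  # the 30 minute default suggested above
    known: Set[str] = field(default_factory=set)
    removed_at: Dict[str, float] = field(default_factory=dict)

    def poll(self, listed_groups: Iterable[str], now: float) -> Set[str]:
        """Return the groups whose lag should still be calculated after this poll."""
        listed = set(listed_groups)

        # A group that shows up again is taken off the removal list.
        for group in listed:
            self.removed_at.pop(group, None)

        # Groups tracked before but missing now get a removal timestamp.
        # setdefault keeps the timestamp from the first poll that missed them.
        for group in self.known - listed:
            self.removed_at.setdefault(group, now)

        # Clean up: evict groups missing for longer than the retention window.
        expired = {
            group
            for group, since in self.removed_at.items()
            if now - since > self.retention_seconds
        }
        for group in expired:
            del self.removed_at[group]

        # Keep tracking listed groups plus the not-yet-expired removed ones.
        self.known = listed | set(self.removed_at)
        return set(self.known)
```

A caller would invoke `poll` once per polling interval, for example with `time.monotonic()` as `now`, and compute lag for every returned group using its last reported partition subscription. Passing the clock in explicitly keeps the eviction logic easy to test.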