Continue to calculate lag for inactive groups for a configurable timespan #66

seglo · 2019-09-18T12:40:04Z

Inspired by discussion in #63

Add a feature that continues to calculate consumer group lag for a group after it is no longer active. Today, we immediately evict metrics for groups that no longer exist. We detect that a group has been removed by comparing the list of groups returned in the current poll to the list returned in the last poll. Instead of removing metrics immediately when we discover that a group no longer exists (it is no longer returned when we retrieve group metadata), we will continue to calculate lag for its last reported partition subscription. When a group is detected as removed, it is added with a timestamp to a removal list that is cleaned up after each poll. If a group has been in the removal list for longer than a configured time span, it is removed. If the group becomes active again, it is taken off the removal list. A default of 30 minutes would be a good value to start with.
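
A minimal sketch of the removal-list idea, written in Python for brevity rather than in the project's Scala. The names `RemovalTracker`, `on_poll`, `pending`, and `known` are hypothetical and do not come from Kafka Lag Exporter; the sketch only assumes that each poll yields the list of group ids currently returned by the cluster.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Set


@dataclass
class RemovalTracker:
    """Hypothetical helper: evicts a missing group only after a grace period."""

    grace_seconds: float = 30 * 60  # the 30 minute default proposed in this issue
    pending: Dict[str, float] = field(default_factory=dict)  # group -> time first seen missing
    known: Set[str] = field(default_factory=set)  # groups we still report lag for

    def on_poll(self, listed: Iterable[str], now: float) -> Set[str]:
        """Update state from one poll and return the groups to evict now."""
        current = set(listed)
        # A group that is listed again is active, so take it off the removal list.
        for group in [g for g in current if g in self.pending]:
            del self.pending[group]
        # A known group that is missing starts its timer (kept if already running).
        for group in self.known - current:
            self.pending.setdefault(group, now)
        self.known |= current
        # Evict groups that have been missing for longer than the grace period.
        expired = {g for g, since in self.pending.items()
                   if now - since > self.grace_seconds}
        for group in expired:
            del self.pending[group]
        self.known -= expired
        return expired
```

Between the first missed poll and eviction, a group stays in `known`, so the caller would keep computing lag from its last reported partition subscription.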

graphex · 2019-09-21T14:48:48Z

Another potential direction here might be to have a flag for the collection of (earliest and) latest metrics for all topics, regardless of consumer state. This would make alerting for unconsumed topics/partitions possible, which is a good thing to do to prevent data loss.
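
One way such an alert could be derived, as a rough sketch only: the function `unconsumed_partitions` and its three inputs are hypothetical, and it assumes the earliest and latest offsets per partition, plus the set of partitions with a committed offset from any group, have already been collected.

```python
from typing import Dict, Set, Tuple

TopicPartition = Tuple[str, int]  # (topic, partition)


def unconsumed_partitions(
    earliest: Dict[TopicPartition, int],
    latest: Dict[TopicPartition, int],
    committed: Set[TopicPartition],
) -> Set[TopicPartition]:
    """Return partitions that hold data but have no committed offset from any group."""
    return {
        tp
        for tp, end in latest.items()
        # A partition holds data when its latest offset is past its earliest offset.
        # A missing earliest offset is treated as 0 (an assumption of this sketch).
        if end > earliest.get(tp, 0) and tp not in committed
    }
```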

seglo · 2019-09-21T22:46:01Z

That would be useful, but based on observations from #63 it seems that inactive groups aren't available when using AdminClient. This may be worth investigating further: it's possible the group metadata is still accessible and is just not returned when getting a list of consumer groups. We use the list of consumer groups to determine which groups to fetch metadata for.

rkrage · 2020-01-31T17:23:37Z

So I think something else is actually going on here. The kafka-consumer-groups.sh script uses AdminClient and absolutely displays inactive consumer groups:

$ /usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --list
rkrage_test

$ /usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe
Consumer group 'rkrage_test' has no active members.

TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
rkrage_test     0          4090            4090            0               -               -               -

$ /usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe --state
Consumer group 'rkrage_test' has no active members.

COORDINATOR (ID)          ASSIGNMENT-STRATEGY       STATE                #MEMBERS
log-kafka02.qa:9092 (2)                             Empty                0

I believe this is the source code it's using to list all groups: https://github.com/apache/kafka/blob/6dc6f6a60ddf7a70c394c147fbed579749d2abcc/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L181-L185
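
For reference, the LAG column in the `--describe` output above is the log end offset minus the group's committed offset, so it can be computed even when the group has no active members. A small sketch using the row shown above; `partition_lag` is a hypothetical helper, and clamping negative values to zero is an assumption of this sketch, not documented tool behavior.

```python
def partition_lag(log_end_offset: int, committed_offset: int) -> int:
    """Lag for one partition: messages written but not yet committed by the group."""
    # Clamp at zero (an assumption) in case the committed offset is read
    # slightly after the log end offset and appears to be ahead of it.
    return max(log_end_offset - committed_offset, 0)


# Row from the output above: rkrage_test, partition 0.
lag = partition_lag(log_end_offset=4090, committed_offset=4090)
print(lag)
```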

seglo · 2020-09-02T13:19:17Z

I think this is the same issue as #126, where a group's information was inadvertently filtered out if it had no active members. @lilyevsky resolved this with #128, which was released in 0.6.2.

The kafka-consumer-groups.sh script uses AdminClient and absolutely displays inactive consumer groups:

/usr/lib/kafka/bin/kafka-consumer-groups.sh --bootstrap-server localhost:9092 --group rkrage_test --describe
Consumer group 'rkrage_test' has no active members.

TOPIC           PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG             CONSUMER-ID     HOST            CLIENT-ID
rkrage_test     0          4090            4090            0               -               -               -

This makes me think it might be the same issue.

I believe this is the source code it's using to list all groups: apache/kafka@6dc6f6a/core/src/main/scala/kafka/admin/ConsumerGroupCommand.scala#L181-L185

We make the same call in KafkaClient.

https://github.com/lightbend/kafka-lag-exporter/blob/v0.6.3/src/main/scala/com/lightbend/kafkalagexporter/KafkaClient.scala#L113

Can anyone confirm with the latest version of Kafka Lag Exporter? (@rkrage)

rkrage · 2020-09-14T23:25:56Z

@seglo, we just upgraded to 0.6.3 today and this appears to be solved for us!

seglo · 2020-09-15T15:28:12Z

@rkrage Excellent! I'll close this ticket.

Fixed with #128

seglo added the enhancement label (New feature or request) Sep 18, 2019

graphex mentioned this issue Sep 21, 2019

Add option to always collect metrics for all topics #71

Closed

seglo linked a pull request Sep 15, 2020 that will close this issue

Support consumer groups for which member information is unavailable. #128

Merged

seglo closed this as completed Sep 15, 2020

andypp mentioned this issue Dec 1, 2021

Lags not reported for topics with no active member #290

Open
