This repository has been archived by the owner on Mar 17, 2024. It is now read-only.

Metrics not updated when a consumer group is not active #63

Closed
efrikin opened this issue Sep 16, 2019 · 8 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@efrikin

efrikin commented Sep 16, 2019

Hi @seglo.
I have the same problem as #36.
For example, I have a consumer group reading a topic.
[screenshot: consumer group lag chart]
When I stop reading the topic, the chart shows no data.
When I start the consumer group again, the lag data reappears on the chart.
This is very critical, because during that time consumer lag is not monitored.
Any ideas?
Exporter version: 0.5.1

Poll interval: 10 seconds
Lookup table size: 8192
Prometheus metrics endpoint port: 8000
Admin client consumer group id: kafkalagexporter
Kafka client timeout: 10 seconds
Statically defined Clusters:

  Cluster name: kafka_general
  Cluster Kafka bootstrap brokers: broker-01:9092
     
Watchers:
  Strimzi: false
      
2019-09-16 05:14:11,329 INFO  c.l.k.KafkaClusterManager$ akka://kafka-lag-exporter/user - Cluster Added: 
  Cluster name: kafka_general
  Cluster Kafka bootstrap brokers: broker-01:9092
      
2019-09-16 05:14:11,344 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Spawned ConsumerGroupCollector for cluster: kafka_general 
2019-09-16 05:14:11,355 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Collecting offsets 
2019-09-16 05:14:11,384 INFO  o.a.k.c.admin.AdminClientConfig  - AdminClientConfig values: 
	bootstrap.servers = [broker-01:9092]
	client.dns.lookup = default
	client.id = 
	connections.max.idle.ms = 300000
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 10000
	retries = 0
	retry.backoff.ms = 1000
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = https
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
 
2019-09-16 05:14:11,537 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka version: 2.2.1 
2019-09-16 05:14:11,537 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka commitId: 55783d3133a5a49a 
2019-09-16 05:14:12,105 INFO  o.a.k.c.consumer.ConsumerConfig  - ConsumerConfig values: 
	auto.commit.interval.ms = 5000
	auto.offset.reset = latest
	bootstrap.servers = [broker-01:9092]
	check.crcs = true
	client.dns.lookup = default
	client.id = 
	connections.max.idle.ms = 540000
	default.api.timeout.ms = 60000
	enable.auto.commit = false
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = kafkalagexporter
	heartbeat.interval.ms = 3000
	interceptor.classes = []
	internal.leave.group.on.close = true
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 10000
	retry.backoff.ms = 1000
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = https
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer

@seglo
Owner

seglo commented Sep 16, 2019

@efrikin When you say you stop reading do you mean that as soon as the consumer group shuts down the metrics are no longer reported?

As I described in #36, we only report data for groups that are returned by the Kafka AdminClient. Every poll interval we compare the groups returned from the last interval and this interval, and then unregister the metrics for the groups that no longer exist. This is done so we don't accumulate groups in an unbounded manner.
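The diff-and-unregister step described above can be sketched roughly as follows (a minimal illustration, not the exporter's actual code; the object and method names are invented):

```scala
// Minimal sketch of the per-poll eviction described above: any group that
// was returned by the AdminClient last interval but not this interval has
// its metrics unregistered, so stale groups don't accumulate unboundedly.
object GroupDiff {
  // Groups whose metrics should be unregistered this poll.
  def groupsToUnregister(lastPoll: Set[String], thisPoll: Set[String]): Set[String] =
    lastPoll.diff(thisPoll)
}
```

This also shows why the metrics vanish immediately: the moment a group drops out of the AdminClient response, it lands in the unregister set.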

I can see how this could be a problem in some cases: even though a consumer group is no longer active, you may still want to know how far behind it is. In your use case, are the consumer groups shut down intentionally, or have they encountered an error? How long would you consider long enough for a group to be inactive before we stop reporting its lag?

@efrikin
Author

efrikin commented Sep 17, 2019

@seglo Thanks a lot for the answer.

@efrikin When you say you stop reading do you mean that as soon as the consumer group shuts down the metrics are no longer reported?

Yes, exactly. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console outputs no data).

I can see how this could be a problem in some cases, even though a consumer group is not active any more, you may still want to know how far behind it is regardless. In your use case are the consumer groups shut down intentionally or have they encountered an error?

My case is related to a production incident: a consumer group suddenly got stuck, and we could not spot the ever-growing lag for that consumer group for several hours.

How long would you consider is long enough for a group to be inactive before we shouldn't report lag any more?

I think this behavior should be a user-configurable setting, defaulting to 5 minutes.
This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.
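If this were made configurable, it might look like the following in the exporter's HOCON configuration (the key name `inactive-group-retention` is invented for illustration; it is not an actual setting):

```hocon
kafka-lag-exporter {
  poll-interval = 10 seconds
  # Hypothetical setting: how long to keep reporting lag for a group
  # after it disappears from the group coordinator.
  inactive-group-retention = 5 minutes
}
```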

@seglo
Owner

seglo commented Sep 17, 2019

@efrikin Thanks for the reply.

@efrikin When you say you stop reading do you mean that as soon as the consumer group shuts down the metrics are no longer reported?

Yes, exactly. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console outputs no data).

I see. I thought there was a longer grace period for the consumer group to stay active after the last member has left.

How long would you consider is long enough for a group to be inactive before we shouldn't report lag any more?

I think this behavior should be a user-configurable setting, defaulting to 5 minutes.
This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.

Yes. We could add a feature that retains group metadata for a configured interval of time after it's left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after the configured interval we remove it from the metrics endpoint. Perhaps a default of 30 minutes would be a good value to start with. The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, it won't see the group until it's active again.
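The proposed grace-period behavior might be sketched like this (an illustration only; the class and method names are invented, and a real implementation would live inside the exporter's collector):

```scala
import java.time.{Duration, Instant}
import scala.collection.mutable

// Sketch of the proposal: remember when each group was last returned by the
// coordinator, keep calculating lag from its last consumed offsets during a
// grace period, and evict its metrics only once the retention window passes.
class GroupRetention(retention: Duration) {
  private val lastSeen = mutable.Map.empty[String, Instant]

  // Called once per poll with the groups the AdminClient returned.
  // Returns the groups whose metrics should now be evicted.
  def poll(activeGroups: Set[String], now: Instant): Set[String] = {
    activeGroups.foreach(g => lastSeen(g) = now)
    val expired = lastSeen.collect {
      case (g, seen) if !activeGroups(g) &&
          Duration.between(seen, now).compareTo(retention) > 0 => g
    }.toSet
    expired.foreach(lastSeen.remove)
    expired
  }
}
```

A group that reappears within the window simply refreshes its `lastSeen` timestamp, matching the "continue as normal" case above.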

@seglo seglo added the enhancement New feature or request label Sep 17, 2019
@efrikin
Author

efrikin commented Sep 18, 2019

@seglo Thanks for the reply.

Yes. We could add a feature that retains group metadata for a configured interval of time after it's left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after the configured interval we remove it from the metrics endpoint.

This is good news.

Perhaps a default of 30 minutes would be a good value to start with.

A default of 30 minutes is a great starting value.

The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, then it won't see it until it's active again.
That's no problem. If the exporter is restarted, we will notice the group via other metrics.

Would it be possible to include it in the next release? I'd really appreciate that! Also, could you please let me know the date of the next release?

Thanks a lot!

@seglo
Owner

seglo commented Sep 18, 2019

@efrikin Thanks for clarifying. I will issue a release soon. There are several PRs in progress. I'll create a new issue for this one and work on it soon, unless someone else volunteers to do it first.

@efrikin
Author

efrikin commented Sep 19, 2019

@seglo Thanks a lot. I really appreciate that!

@ryan-dyer-sp
Contributor

Just to chime in here. I agree that lag calculations may not be applicable beyond a certain time if there are no active consumer groups. However, the kafka_partition_*_offset metrics should be reported regardless of whether any consumer group is active. These metrics are not related to a consumer group but rather to the producer side; we use them to ensure that new messages are arriving in the topic.

@seglo
Owner

seglo commented Aug 24, 2020

This metric is not related to a consumer group but more a producer and we use it to ensure that we are getting new messages into the topic.

The _offset metrics were originally exported because the data was already available while calculating group lag. That's why only partitions belonging to [active] groups are reported.

I understand the value you get from monitoring the latest offset of arbitrary partitions. It would require a poll of all topic partitions in a cluster. That could be many more partitions than would be desired, but if it were enabled through a feature flag I think it would be a fine addition.
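The flag-gated behavior could be as simple as widening the set of partitions whose end offsets are fetched (a sketch with invented names; `pollAllPartitions` stands in for the suggested feature flag, and the actual offset lookup would use something like `KafkaConsumer#endOffsets`):

```scala
// Sketch: when the hypothetical feature flag is on, report latest-offset
// metrics for every partition in the cluster; otherwise keep the current
// behavior of reporting only partitions owned by active groups.
object OffsetReporting {
  case class TopicPartition(topic: String, partition: Int)

  def partitionsToReport(
      allPartitions: Set[TopicPartition],
      activeGroupPartitions: Set[TopicPartition],
      pollAllPartitions: Boolean): Set[TopicPartition] =
    if (pollAllPartitions) allPartitions else activeGroupPartitions
}
```

The trade-off named above is visible here: with the flag on, the reported set grows to every partition in the cluster, which can be far larger than the group-owned set.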
