This repository has been archived by the owner on Mar 17, 2024. It is now read-only.

Metrics not updated when a consumer group is not active #63

Closed
efrikin opened this issue Sep 16, 2019 · 8 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@efrikin

efrikin commented Sep 16, 2019

Hi @seglo.
I have the same problem as #36.
For example, I have a consumer group reading a topic.
[screenshot: consumer group lag chart]
When I stop reading the topic, the chart shows no data.
When I start the consumer group again, the lag data reappears on the chart.
This is very critical, because during that time consumer lag is not monitored.
Any ideas?
Exporter version: 0.5.1

Poll interval: 10 seconds
Lookup table size: 8192
Prometheus metrics endpoint port: 8000
Admin client consumer group id: kafkalagexporter
Kafka client timeout: 10 seconds
Statically defined Clusters:

  Cluster name: kafka_general
  Cluster Kafka bootstrap brokers: broker-01:9092
     
Watchers:
  Strimzi: false
      
2019-09-16 05:14:11,329 INFO  c.l.k.KafkaClusterManager$ akka://kafka-lag-exporter/user - Cluster Added: 
  Cluster name: kafka_general
  Cluster Kafka bootstrap brokers: broker-01:9092
      
2019-09-16 05:14:11,344 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Spawned ConsumerGroupCollector for cluster: kafka_general 
2019-09-16 05:14:11,355 INFO  c.l.k.ConsumerGroupCollector$ akka://kafka-lag-exporter/user/consumer-group-collector-kafka_general - Collecting offsets 
2019-09-16 05:14:11,384 INFO  o.a.k.c.admin.AdminClientConfig  - AdminClientConfig values: 
	bootstrap.servers = [broker-01:9092]
	client.dns.lookup = default
	client.id = 
	connections.max.idle.ms = 300000
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 10000
	retries = 0
	retry.backoff.ms = 1000
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = https
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
 
2019-09-16 05:14:11,537 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka version: 2.2.1 
2019-09-16 05:14:11,537 INFO  o.a.kafka.common.utils.AppInfoParser  - Kafka commitId: 55783d3133a5a49a 
2019-09-16 05:14:12,105 INFO  o.a.k.c.consumer.ConsumerConfig  - ConsumerConfig values: 
	auto.commit.interval.ms = 5000
	auto.offset.reset = latest
	bootstrap.servers = [broker-01:9092]
	check.crcs = true
	client.dns.lookup = default
	client.id = 
	connections.max.idle.ms = 540000
	default.api.timeout.ms = 60000
	enable.auto.commit = false
	exclude.internal.topics = true
	fetch.max.bytes = 52428800
	fetch.max.wait.ms = 500
	fetch.min.bytes = 1
	group.id = kafkalagexporter
	heartbeat.interval.ms = 3000
	interceptor.classes = []
	internal.leave.group.on.close = true
	isolation.level = read_uncommitted
	key.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer
	max.partition.fetch.bytes = 1048576
	max.poll.interval.ms = 300000
	max.poll.records = 500
	metadata.max.age.ms = 300000
	metric.reporters = []
	metrics.num.samples = 2
	metrics.recording.level = INFO
	metrics.sample.window.ms = 30000
	partition.assignment.strategy = [class org.apache.kafka.clients.consumer.RangeAssignor]
	receive.buffer.bytes = 65536
	reconnect.backoff.max.ms = 1000
	reconnect.backoff.ms = 50
	request.timeout.ms = 10000
	retry.backoff.ms = 1000
	sasl.client.callback.handler.class = null
	sasl.jaas.config = null
	sasl.kerberos.kinit.cmd = /usr/bin/kinit
	sasl.kerberos.min.time.before.relogin = 60000
	sasl.kerberos.service.name = null
	sasl.kerberos.ticket.renew.jitter = 0.05
	sasl.kerberos.ticket.renew.window.factor = 0.8
	sasl.login.callback.handler.class = null
	sasl.login.class = null
	sasl.login.refresh.buffer.seconds = 300
	sasl.login.refresh.min.period.seconds = 60
	sasl.login.refresh.window.factor = 0.8
	sasl.login.refresh.window.jitter = 0.05
	sasl.mechanism = GSSAPI
	security.protocol = PLAINTEXT
	send.buffer.bytes = 131072
	session.timeout.ms = 10000
	ssl.cipher.suites = null
	ssl.enabled.protocols = [TLSv1.2, TLSv1.1, TLSv1]
	ssl.endpoint.identification.algorithm = https
	ssl.key.password = null
	ssl.keymanager.algorithm = SunX509
	ssl.keystore.location = null
	ssl.keystore.password = null
	ssl.keystore.type = JKS
	ssl.protocol = TLS
	ssl.provider = null
	ssl.secure.random.implementation = null
	ssl.trustmanager.algorithm = PKIX
	ssl.truststore.location = null
	ssl.truststore.password = null
	ssl.truststore.type = JKS
	value.deserializer = class org.apache.kafka.common.serialization.ByteArrayDeserializer

@seglo
Owner

seglo commented Sep 16, 2019

@efrikin When you say you stop reading do you mean that as soon as the consumer group shuts down the metrics are no longer reported?

As I described in #36, we only report data for groups that are returned by the Kafka AdminClient. Every poll interval we compare the groups returned from the last interval and this interval, and then unregister the metrics for the groups that no longer exist. This is done so we don't accumulate groups in an unbounded manner.
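The diff-and-unregister step described above can be sketched roughly as follows (a minimal illustration, not the exporter's actual code; the object and method names are invented):

```scala
// Minimal sketch of the per-poll eviction described above: any group that
// was returned by the AdminClient last interval but not this interval has
// its metrics unregistered, so stale groups don't accumulate unboundedly.
object GroupDiff {
  // Groups whose metrics should be unregistered this poll.
  def groupsToUnregister(lastPoll: Set[String], thisPoll: Set[String]): Set[String] =
    lastPoll.diff(thisPoll)
}
```

This also shows why the metrics vanish immediately: the moment a group drops out of the AdminClient response, it lands in the unregister set.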

I can see how this could be a problem in some cases: even though a consumer group is no longer active, you may still want to know how far behind it is. In your use case, are the consumer groups shut down intentionally, or have they encountered an error? How long would you consider long enough for a group to be inactive before we stop reporting its lag?

@efrikin
Author

efrikin commented Sep 17, 2019

@seglo Thanks a lot for the answer.

@efrikin When you say you stop reading do you mean that as soon as the consumer group shuts down the metrics are no longer reported?

Yes, exactly. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console outputs no data).

I can see how this could be a problem in some cases, even though a consumer group is not active any more, you may still want to know how far behind it is regardless. In your use case are the consumer groups shut down intentionally or have they encountered an error?

My case is related to a production incident: a consumer group suddenly got stuck, and we could not spot the ever-growing lag for that consumer group for several hours.

How long would you consider is long enough for a group to be inactive before we shouldn't report lag any more?

I think this behavior should be a user-configurable setting, defaulting to 5 minutes.
This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.
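If this were made configurable, it might look like the following in the exporter's HOCON configuration (the key name `inactive-group-retention` is invented for illustration; it is not an actual setting):

```hocon
kafka-lag-exporter {
  poll-interval = 10 seconds
  # Hypothetical setting: how long to keep reporting lag for a group
  # after it disappears from the group coordinator.
  inactive-group-retention = 5 minutes
}
```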

@seglo
Owner

seglo commented Sep 17, 2019

@efrikin Thanks for the reply.

@efrikin When you say you stop reading do you mean that as soon as the consumer group shuts down the metrics are no longer reported?

Yes, exactly. When the last consumer leaves the consumer group, the metrics are no longer reported and I see an empty chart (the Prometheus console outputs no data).

I see. I thought there was a longer grace period for the consumer group to stay active after the last member has left.

How long would you consider is long enough for a group to be inactive before we shouldn't report lag any more?

I think this behavior should be a user-configurable setting, defaulting to 5 minutes.
This is similar to the Kafka broker property log.retention.hours, where users can tune the behavior to their needs.

Yes. We could add a feature that retains group metadata for a configured interval of time after it's left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after the configured interval we remove it from the metrics endpoint. Perhaps a default of 30 minutes would be a good value to start with. The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, it won't see the group until it's active again.
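The proposed grace-period behavior might be sketched like this (an illustration only; the class and method names are invented, and a real implementation would live inside the exporter's collector):

```scala
import java.time.{Duration, Instant}
import scala.collection.mutable

// Sketch of the proposal: remember when each group was last returned by the
// coordinator, keep calculating lag from its last consumed offsets during a
// grace period, and evict its metrics only once the retention window passes.
class GroupRetention(retention: Duration) {
  private val lastSeen = mutable.Map.empty[String, Instant]

  // Called once per poll with the groups the AdminClient returned.
  // Returns the groups whose metrics should now be evicted.
  def poll(activeGroups: Set[String], now: Instant): Set[String] = {
    activeGroups.foreach(g => lastSeen(g) = now)
    val expired = lastSeen.collect {
      case (g, seen) if !activeGroups(g) &&
          Duration.between(seen, now).compareTo(retention) > 0 => g
    }.toSet
    expired.foreach(lastSeen.remove)
    expired
  }
}
```

A group that reappears within the window simply refreshes its `lastSeen` timestamp, matching the "continue as normal" case above.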

@seglo seglo added the enhancement New feature or request label Sep 17, 2019
@efrikin
Author

efrikin commented Sep 18, 2019

@seglo Thanks for the reply.

Yes. We could add a feature that retains group metadata for a configured interval of time after it's left the cache of the Consumer Group coordinator. If the group no longer exists in the next poll, we can continue calculating the lag based on the last consumed offsets. If the group becomes active again, we continue as normal; if it doesn't, then after the configured interval we remove it from the metrics endpoint.

This is good news.

Perhaps a default of 30 minutes would be a good value to start with.

A default of 30 minutes is a great starting value.

The one caveat is that if Kafka Lag Exporter is started after a group is no longer active, then it won't see it until it's active again.
That's no problem. If the exporter is restarted, we will notice the group via other metrics.

Would it be possible to include it in the next release? I'd really appreciate that! Also, could you please let me know the date of the next release?

Thanks a lot!

@seglo
Owner

seglo commented Sep 18, 2019

@efrikin Thanks for clarifying. I will issue a release soon. There are several PRs in progress. I'll create a new issue for this one and work on it soon, unless someone else volunteers to do it first.

@efrikin
Author

efrikin commented Sep 19, 2019

@seglo Thanks a lot. I really appreciate that!

@ryan-dyer-sp
Contributor

Just to chime in here. I agree that lag calculations may not be applicable beyond a certain time if there are no active consumer groups. However, the kafka_partition_*_offset metrics should be reported regardless of whether any consumer group is active. These metrics are not related to a consumer group but rather to the producer side; we use them to ensure that new messages are arriving in the topic.

@seglo
Owner

seglo commented Aug 24, 2020

This metric is not related to a consumer group but more a producer and we use it to ensure that we are getting new messages into the topic.

The _offset metrics were originally exported because the data was already available while calculating group lag. That's why only partitions belonging to [active] groups are reported.

I understand the value you get from monitoring the latest offset of arbitrary partitions. It would require a poll of all topic partitions in a cluster. That could be many more partitions than would be desired, but if it were enabled through a feature flag I think it would be a fine addition.
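The flag-gated behavior could be as simple as widening the set of partitions whose end offsets are fetched (a sketch with invented names; `pollAllPartitions` stands in for the suggested feature flag, and the actual offset lookup would use something like `KafkaConsumer#endOffsets`):

```scala
// Sketch: when the hypothetical feature flag is on, report latest-offset
// metrics for every partition in the cluster; otherwise keep the current
// behavior of reporting only partitions owned by active groups.
object OffsetReporting {
  case class TopicPartition(topic: String, partition: Int)

  def partitionsToReport(
      allPartitions: Set[TopicPartition],
      activeGroupPartitions: Set[TopicPartition],
      pollAllPartitions: Boolean): Set[TopicPartition] =
    if (pollAllPartitions) allPartitions else activeGroupPartitions
}
```

The trade-off named above is visible here: with the flag on, the reported set grows to every partition in the cluster, which can be far larger than the group-owned set.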
