This repository has been archived by the owner on Mar 17, 2024. It is now read-only.

Lag reported as NaN for low volume topics #111

Closed
rkrage opened this issue Jan 7, 2020 · 4 comments · Fixed by #118
Labels
bug Something isn't working

Comments


rkrage commented Jan 7, 2020

Not sure if this is expected behavior, but I've observed these NaN stats when a topic stops getting new messages and a consumer group is completely caught up. I'd expect the value to be zero in this case.

If it's helpful, I'm running 0.5.5 against Kafka 1.1.1

Might be related to #37

Author

rkrage commented Jan 7, 2020

Might also be worth mentioning that if I restart the lag exporter, the value becomes zero.

Author

rkrage commented Jan 8, 2020

Looking at the code, it seems like this shouldn't be happening:

https://github.com/lightbend/kafka-lag-exporter/blob/master/src/main/scala/com/lightbend/kafkalagexporter/LookupTable.scala#L76

It only returns TooFewPoints if there are fewer than two points in the lookup table. But in this case the table should definitely contain at least two points for the same offset:

https://github.com/lightbend/kafka-lag-exporter/blob/master/src/main/scala/com/lightbend/kafkalagexporter/LookupTable.scala#L33-L38

Is it possible we're hitting this case?

https://github.com/lightbend/kafka-lag-exporter/blob/master/src/main/scala/com/lightbend/kafkalagexporter/LookupTable.scala#L19-L20
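
To make the question concrete, here is a minimal Python sketch of the kind of logic being discussed. It is not the exporter's Scala code: the function name `lag_seconds`, the `(offset, timestamp_ms)` point layout, and the use of `math.nan` to stand in for a TooFewPoints result are all assumptions made for illustration. It shows why a table with fewer than two points would yield NaN, and why a flat table (two points with the same offset) needs its own branch instead of going through the slope calculation.

```python
import math

def lag_seconds(points, group_offset, now_ms):
    """Estimate time lag from a hypothetical lookup table.

    points: list of (offset, timestamp_ms) tuples, oldest first.
    group_offset: last committed offset of the consumer group.
    now_ms: current time in milliseconds.
    """
    # Fewer than two points: nothing to interpolate between.
    # NaN stands in for a "too few points" result here.
    if len(points) < 2:
        return math.nan

    (x1, t1), (x2, t2) = points[0], points[-1]

    # Flat table: the latest offset has not moved between polls.
    # Dividing by (x2 - x1) would be a division by zero, so handle it directly.
    if x2 == x1:
        # A group that has reached the latest offset is caught up.
        return 0.0 if group_offset >= x2 else math.nan

    # Linear interpolation: when was group_offset produced?
    slope = (t2 - t1) / (x2 - x1)  # milliseconds per offset
    produced_at = t1 + slope * (group_offset - x1)
    return max(0.0, (now_ms - produced_at) / 1000.0)

# One point only: not enough data, so NaN.
print(lag_seconds([(100, 1000)], 100, 2000))
# Same offset seen twice and the group is caught up: 0.0.
print(lag_seconds([(100, 1000), (100, 2000)], 100, 2000))
# Normal case: offset 150 was produced at about t=1500 ms, so 0.5 s of lag.
print(lag_seconds([(100, 1000), (200, 2000)], 150, 2000))
```

In this sketch a caught-up group on an idle topic reports zero only if the flat case is recognized before interpolation, which matches the value I'd expect the exporter to report here.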

Owner

seglo commented Jan 9, 2020

Hi @rkrage. Thanks for the troubleshooting efforts. There are indeed some weird edge cases when extrapolating lag in time. It's been a challenge to satisfy all of them. The time metric kafka_consumergroup_group_lag_seconds will report NaN for several edge cases. Is this the metric you expect to see something different for? Or are you referring to the offset lag metric kafka_consumergroup_group_lag?

In either case, the best way to troubleshoot what's happening is to temporarily enable DEBUG logging so that you can see the raw group and offset metadata that Kafka Lag Exporter uses. See this comment for details on how to enable DEBUG.

#106 (comment)

Author

rkrage commented Jan 9, 2020

Hi @seglo, thanks for the response. Yes, I'm referring to the time metric here kafka_consumergroup_group_lag_seconds. I turned on debug logging and dumped the output as well as the stat values in this gist: https://gist.github.com/rkrage/03c730718b6d33e3de70f8b3e24ce61c

As you can see, once I turn off my test producer, time lag drops to zero initially, then changes to NaN in the next poll iteration (and stays that way until I start producing messages again).
