Lag is incorrectly calculated in 0.11.5 #1911

Closed
3 of 7 tasks
andreycha opened this issue Aug 1, 2018 · 4 comments

andreycha commented Aug 1, 2018

Description

We're using the Confluent Kafka driver for .NET, and after upgrading to 0.11.5 we started getting incorrect statistics. Since the statistics come from librdkafka, I opened the issue here. The problem is that lag is now calculated incorrectly when some of the offsets are set to negative values. Here is an example (look at the last two entries):

Partition: 0, QueryOffset: -2, NextOffset: 29, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 29, LowestOffset: 29, HighestOffset: 29, Lag: 1030
Partition: 1, QueryOffset: -2, NextOffset: 50, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 50, LowestOffset: 50, HighestOffset: 50, Lag: 1051
Partition: 2, QueryOffset: -2, NextOffset: 27, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 27, LowestOffset: 27, HighestOffset: 27, Lag: 1028
Partition: 3, QueryOffset: -2, NextOffset: 24, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 24, LowestOffset: 24, HighestOffset: 24, Lag: 1025
Partition: 4, QueryOffset: -2, NextOffset: 26, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 26, LowestOffset: 26, HighestOffset: 26, Lag: 1027
Partition: 5, QueryOffset: -2, NextOffset: 38, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 38, LowestOffset: 38, HighestOffset: 38, Lag: 1039
Partition: 6, QueryOffset: -1001, NextOffset: 45, AppOffset: 45, StoredOffset: 45, CommittedOffset: 44, EofOffset: 45, LowestOffset: 44, HighestOffset: 45, Lag: 0
Partition: 7, QueryOffset: -2, NextOffset: 29, AppOffset: -1001, StoredOffset: -1001, CommittedOffset: -1001, EofOffset: 29, LowestOffset: 29, HighestOffset: 29, Lag: 1030

As far as I can see, the issue comes from d41b086. Lag is now calculated as hi_offset - max(app_offset, commit_offset). The math should take into account situations where partitions are not consumed or messages are not committed, so that both app_offset and commit_offset are negative. In this case, max(lo_offset, 0) should probably be used for the lag calculation.
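To make the arithmetic concrete, here is a minimal Python sketch of the formula as described above, next to the suggested lo_offset fallback. The function names are hypothetical and this is not librdkafka's actual code; -1001 is taken from the log output as the value of an unset offset.

```python
INVALID_OFFSET = -1001  # value the log above shows for unset offsets


def lag_as_reported(hi_offset, app_offset, committed_offset):
    """Formula described in this issue: hi_offset - max(app_offset, commit_offset)."""
    return hi_offset - max(app_offset, committed_offset)


def lag_with_lo_fallback(hi_offset, lo_offset, app_offset, committed_offset):
    """Suggested variant (hypothetical): use max(lo_offset, 0) when both offsets are negative."""
    base = max(app_offset, committed_offset)
    if base < 0:
        base = max(lo_offset, 0)
    return hi_offset - base


# Partition 0 from the log: HighestOffset 29, LowestOffset 29, both offsets unset.
print(lag_as_reported(29, INVALID_OFFSET, INVALID_OFFSET))           # 29 - (-1001) = 1030
print(lag_with_lo_fallback(29, 29, INVALID_OFFSET, INVALID_OFFSET))  # 29 - 29 = 0
```

The first function reproduces the Lag values in the log (for example 1030 for partition 0 and 0 for partition 6), which is consistent with the sentinel leaking into the subtraction.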

It's also not clear why lag is 0 in the case AppOffset: 45, CommittedOffset: 44, HighestOffset: 45, Lag: 0. Shouldn't it be 1? The docs say that app offset is "Offset of last message passed to application + 1", which means the application has already processed offset 44 but has not yet processed offset 45.

How to reproduce

Start consuming any topic with autocommit off.

Checklist

  • librdkafka version (release number or git tag): 0.11.5
  • Apache Kafka version:
  • librdkafka client configuration: autocommit is off
  • Operating system: Win 10 x64
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
mhowlett added the bug label Aug 1, 2018

mhowlett commented Aug 1, 2018

For context, the associated PR is #1878 (with relevant discussion).

Yes, it does appear that the calculations are now incorrect in the case of special offsets. Thanks for reporting.


edenhill commented Aug 6, 2018

So there are two different issues here:

  1. consumer_lag does not take invalid/unset app_offset/committed_offset into account.
  2. the consumer_lag is off by one due to app_offset and committed_offset being +1.

Issue 2 is straightforward.
For issue 1, if both app_offset and committed_offset are invalid, it means either that there are no messages to consume or that no messages have been consumed yet. I think it might be better to let consumer_lag be -1 in this case to indicate that it is in fact unknown.
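A minimal Python sketch of the -1 proposal, for illustration only. The function and constant names are hypothetical, the check for "invalid" is assumed to be "negative", and the actual fix in librdkafka may differ.

```python
UNKNOWN_LAG = -1  # proposed marker for "lag cannot be determined"


def consumer_lag(hi_offset, app_offset, committed_offset):
    """Return hi_offset - max(app_offset, committed_offset), or -1 if both are unset."""
    # Assumption: any negative value (e.g. -1001 in the reported stats) means "invalid/unset".
    if app_offset < 0 and committed_offset < 0:
        return UNKNOWN_LAG
    return hi_offset - max(app_offset, committed_offset)


print(consumer_lag(29, -1001, -1001))  # -1: nothing consumed or committed yet
print(consumer_lag(45, 45, 44))        # 0: consumer is caught up
```

Returning a sentinel rather than a computed number keeps monitoring code from treating a never-consumed partition as heavily lagging.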

edenhill commented

The high watermark offset is the next offset of the partition, i.e., latest message offset + 1.
This value is the same as the committed offset for a caught-up consumer.
If 10 messages are produced, giving offsets 0..9, the high watermark will be 10, and after consumption the app and committed offsets will also be 10, thus a lag of 0.
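The same arithmetic as a small Python sketch. The helper name and its parameters are hypothetical, and offsets are assumed to start at 0 as in the example above.

```python
def lag_after_consuming(n_produced, n_consumed):
    """Lag for a partition whose offsets start at 0 (illustrative helper)."""
    high_watermark = n_produced   # latest message offset + 1
    committed_offset = n_consumed  # last consumed offset + 1
    return high_watermark - committed_offset


print(lag_after_consuming(10, 10))  # 0: offsets 0..9 produced and all consumed
print(lag_after_consuming(10, 7))   # 3: offsets 7, 8 and 9 are still pending
```

Because both values are "last offset + 1", the two +1 terms cancel, which is why AppOffset: 45 with HighestOffset: 45 gives a lag of 0 rather than 1.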

edenhill commented

Fixed on master
