-
Notifications
You must be signed in to change notification settings - Fork 126
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Extract consumer group member_count as a separate metric #105
Conversation
06332a8
to
d88a513
Compare
group.Protocol, | ||
group.ProtocolType, | ||
group.State, | ||
strconv.FormatInt(int64(coordinator), 10), | ||
) | ||
// total number of members in consumer groups | ||
if len(group.Members) > 0 { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain that condition? Why don't we report 0
for groups that don't have any active members?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thoughrt we can reduce no of metrics with this.
Empty groups can alearedy be monitored with consumer_group_info{state="Empty"}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What I'm afraid of is when you monitor your consumer group member count in a graph that you may experience gaps or inaccurate results due to the staleness of these specific groups. For instance imagine a job that does certain things periodically (like catching up the topic lag and then go to sleep for 10minutes until it will repeat). That graph would look weird / inaccurate if we suddenly stop reporting a number for it when it switches the state.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The core reason for avoiding exposing 0 members in the group is to reduce the amount of samples scraped by Prometheus, Martin.
As you know, Prometheus holds recently added samples for active time series in RAM, so its RAM usage highly depends on the number of active time series. On the flip-side, having sparse time series can lead to high churn in Prometheus when scraping. A
Now, when it comes to consumer groups a consumer group can have 0 members when the group is in Empty or Dead that usually last ~7 days since the consumer group was abandoned by an application. Reporting metrics for such long period of time is useless in my opinion and could put extra pressure on Prometheus (both scrape but also memory) and that'w why I preferred to avoid sending 0 values metrics.
To your example, with batch recurring job using the consumer group - the graph (if using Grafana) can be "corrected" using null-to-0 functions
wdyt
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw this is the same reasoning as in #102 :)
To give options to kminion users I could try to add a new config parameter, say activeGroupsOnly
(similar to allow/deny list) that can control reporting these metrics for all (default) or only active (Stable) groups
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What you say makes sense and I'm also often worried about high RAM usage. In that case I believe it can be neglected as the number of time series would only scale with the number of consumer groups. You said you have a lot of consumer groups, which I'd interpret as 1k-3k? I'd assume the majority of these are in a stable state then.
I think few thousand additional metric series can be neglected and I'd rather have kminion report all groups' member count (especially because users possibly would expect them for all consumer groups and not just for certain states).
If we don't want to make a trade off we could introduce further configurations which would allow users to configure what they are interested in. Some may not be interested in consumer group member count as well. I'm fine with adding configuration options as kminion is supposed to be a flexible and very featue rich prometheus exporter for Kafka
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You said you have a lot of consumer groups, which I'd interpret as 1k-3k? I'd assume the majority of these are in a stable state then.
I fact in our environment I have ~30k consumer groups. And only ~10% are active/stable - the rest are Empty as they are one-off consumer groups created by some ephemeral jobs that create unique consumer group names.
I'll remove the condition here and file a new issue to control this behaviour through a well defined configuration parameter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Filed #108
@weeco kind reminder to review this PR. thanks :) |
Added a new consumer_group_members topic to capture number of members in a consumer group and removed the `member_count` label from consumer_group_info Fixes redpanda-data#103
@weeco Rebased on master branch and fixed the conflicts. |
Thanks for your PR and the patience @amuraru |
Added a new consumer_group_members metric to capture number of members
in a consumer group and removed the
member_count
label from consumer_group_infoFixes #103