Aggregated queue_messages_published_total metric violates Prometheus expectations about counters #2783
Comments
Will be picking this one up first thing with @mkuratczyk, and working with @dcorbacho on the review so that she can continue focusing on her in-flight work.
Luckily we're not tracking totals on the channel but doing increments; see https://github.com/rabbitmq/rabbitmq-server/blob/master/deps/rabbit/src/rabbit_channel.erl#L205. So when … This should be a very simple addition. Tracking this metric per node means that no garbage collection is needed; the new table only needs to be included in the tables to be reset (API command) and ignored everywhere else. The Prometheus collector could stop scraping …
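A minimal sketch of what such a per-node totals table could look like, assuming a plain ETS counter keyed by metric name; the table name, key, and function names here are illustrative assumptions, not the actual RabbitMQ implementation:

```erlang
-module(node_totals_sketch).
-export([init/0, incr_published/1, published_total/0, reset/0]).

%% Hypothetical per-node totals table; totals survive channel
%% termination, so the aggregated value never decreases.
-define(TABLE, node_message_totals).

init() ->
    ets:new(?TABLE, [named_table, public, set, {write_concurrency, true}]).

%% Called from the publish path alongside the existing per-channel
%% increment.
incr_published(N) when is_integer(N), N > 0 ->
    ets:update_counter(?TABLE, queue_messages_published_total, N,
                       {queue_messages_published_total, 0}).

%% Read by the Prometheus collector: a monotonically increasing counter.
published_total() ->
    case ets:lookup(?TABLE, queue_messages_published_total) of
        [{_, Total}] -> Total;
        []           -> 0
    end.

%% Only the explicit reset (the API command mentioned above) clears it.
reset() ->
    ets:delete_all_objects(?TABLE).
```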
This is meant to address #2783, which began as #2781. This is just an experiment; don't get too excited. The first impression is that the OpenTelemetry API is really hard to work with, and the concepts involved don't make a lot of sense. We (+ @mkuratczyk) will continue as soon as we have dealt with #2785.
Signed-off-by: Gerhard Lazu <[email protected]>
This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, which is something many users have wanted for a really long time. The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for Streams, which have different rules from regular queues (for example, consuming messages is non-destructive, and deep queue backlogs, think billions of messages, are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho covering most of the ground. @dcorbacho had most of this in #3045, but the main branch went through a few changes in the meantime. Rather than resolving all the conflicts and then making the necessary changes, we (@gerhard + @kjnilsson) took all the learnings and started re-applying a lot of the existing code from #3045. We are confident in this approach and would like to see it through.

We expose these global counters in rabbitmq_prometheus via a new collector (see the sketch below). We don't want to keep modifying the existing collector, which grew really complex in parts, especially since we introduced aggregation, but rather start with a new namespace, rabbitmq_global_, and continue building on top of it. The idea is to build in parallel and slowly transition to the new metrics, because semantically the changes are too big since streams were introduced, and we have been discussing protocol-specific metrics with @kjnilsson, which makes me think that this approach is least disruptive and... simple.

While at this, we removed redundant empty return value handling in the channel; the function called no longer returns this. Also removed all DONE / TODO & other comments - we'll handle them when the time comes, no need to leave TODO reminders.

Pairs: @kjnilsson @dcorbacho (this is multiple commits squashed into one)

Next steps:
- Create a new PR and ask @mkuratczyk and @ansd for review - fresh 👀 👀
- This new PR closes #3045
- Back-port to 3.9.x as is (to my knowledge, this is the only feature missing before we can code freeze and cut an RC)
- Back-port parts of this to 3.8.x so that we can finally address #2783
- Fix & publish a new version of the RabbitMQ-Overview Grafana dashboard
- Fix & finally publish the new RabbitMQ-Streams Grafana dashboard

Signed-off-by: Gerhard Lazu <[email protected]>
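A rough sketch of how per-{protocol, queue_type} global counters could be kept, using Erlang's counters module with persistent_term for lookup. The module name, metric names, and label values below are assumptions for illustration, not the actual RabbitMQ implementation:

```erlang
-module(global_counters_sketch).
-export([init/1, messages_received/3, get/2]).

%% Index of each metric inside the counters array.
-define(MESSAGES_RECEIVED, 1).
-define(NUM_COUNTERS, 1).

%% One counters array per {Protocol, QueueType} pair, e.g.
%% {amqp091, classic} or {stream, stream} (assumed label set).
init(ProtoQueueType) ->
    Ref = counters:new(?NUM_COUNTERS, [write_concurrency]),
    persistent_term:put({?MODULE, ProtoQueueType}, Ref).

%% Hot path: a single atomic add, no ETS writes, never decreases.
messages_received(Protocol, QueueType, N) ->
    Ref = persistent_term:get({?MODULE, {Protocol, QueueType}}),
    counters:add(Ref, ?MESSAGES_RECEIVED, N).

%% Read side, e.g. to emit a rabbitmq_global_* sample with
%% protocol and queue_type labels.
get(Protocol, QueueType) ->
    Ref = persistent_term:get({?MODULE, {Protocol, QueueType}}),
    counters:get(Ref, ?MESSAGES_RECEIVED).
```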
Hi all, sorry if I'm posting in the wrong place. Currently we're seeing behaviour similar to this issue: our rabbitmq_channel_messages_published_total metric (with aggregation enabled) decreases, which causes false alarms and misleading graphs. I guess this is some sort of bug, or at least something not good for Prometheus. What is the proposed workaround or solution for this kind of situation? Going to per-object metrics?
The fix is already available in what will ship as RabbitMQ 3.9.0 via #3127. You will notice that the PR has 5 more outstanding tasks, the next one being a back-port to v3.8.x. Once this is done, the issue will be fixed for the 3.8 release series via an update to the dashboards (also one of those 5 tasks). Between RabbitMQ Summit - will you be joining us for today's pre-conference meetup @karanlik? 😉 - and a few other things in flight (3.9.0 being the most significant one by far), this back-port is going slower than I would like. The important takeaway is that it's coming and it definitely fixes the issue (we have been testing it for weeks in our 3.9 long-running environment). Hope that helps 👍🏻
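For anyone needing an interim workaround, switching to per-object metrics avoids the decreasing aggregate, since each channel's series simply disappears when it closes. A minimal rabbitmq.conf sketch, assuming your rabbitmq_prometheus version supports the return_per_object_metrics setting:

```ini
# Return individual (per-object) metrics instead of node-wide aggregates.
# Note: this increases the size of the scrape output on busy nodes.
prometheus.return_per_object_metrics = true
```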
Great! Thanks for your quick feedback. I'll give per-object metrics a try until this is available. Best regards. (PS: I've subscribed to the RabbitMQ Summit mailing list :-) )
Hi, do you think this will get backported to 3.8.x? Thanks
Closing, as this is available in 3.9 and 3.10, and 3.8 is in extended support.
See #2781 for the background.
Some metrics, e.g. `queue_messages_published_total`, are computed as a basic sum aggregation of samples from a local ETS table, e.g. `channel_queue_exchange_metrics`. When a channel is closed, its samples are removed from the table, decreasing the sum. This violates an expectation for `counter` metrics in Prometheus: they can only increment, stay flat, or reset to 0, but never decrease.

The issue is not present when per-object metrics are used: when a channel is closed, all of its metrics go away, which is what the user expects to happen.
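To illustrate the failure mode, here is a simplified sketch of a sum-on-scrape aggregation over a per-channel ETS table. The row shape and function names are assumptions for the sake of the example, not the exact schema RabbitMQ uses:

```erlang
-module(aggregation_sketch).
-export([aggregate_published/1, channel_closed/2]).

%% Each scrape recomputes the total from whatever rows currently exist.
%% Assumed row shape: {{ChannelPid, QueueExchange}, PublishCount}.
aggregate_published(Table) ->
    ets:foldl(fun({_Key, PublishCount}, Acc) -> Acc + PublishCount end,
              0, Table).

%% When a channel terminates its rows are deleted, so the next scrape
%% returns a smaller total. Prometheus treats the drop as a counter
%% reset, and rate()/increase() over it produce bogus spikes.
channel_closed(Table, ChannelPid) ->
    ets:match_delete(Table, {{ChannelPid, '_'}, '_'}).
```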
This is a side-effect of our quick-and-dirty switch to aggregated metrics. We need to retain this historical total, or delegate to the Prometheus client library, which will do most of the aggregation work and handle resets.

This is node-local state, so we can address this even with a significant rework of the Prometheus plugin and still ship it in a 3.8.x release.

Per discussion with @dcorbacho @gerhard @kjnilsson.
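One way to read "retain this historical total": before a closing channel's rows are deleted, fold their counts into a per-node retained-totals table, so the exposed sum never decreases. A hedged sketch with invented table and function names (not the actual RabbitMQ code), reusing the assumed row shape from the example above:

```erlang
-module(retained_totals_sketch).
-export([on_channel_close/3, published_total/2]).

%% Before deleting a closing channel's rows from the live table, add
%% their counts into a per-node retained-totals table so the total
%% survives the channel.
on_channel_close(LiveTable, TotalsTable, ChannelPid) ->
    Rows = ets:match_object(LiveTable, {{ChannelPid, '_'}, '_'}),
    Retired = lists:sum([Count || {_Key, Count} <- Rows]),
    ets:update_counter(TotalsTable, retired_published, Retired,
                       {retired_published, 0}),
    ets:match_delete(LiveTable, {{ChannelPid, '_'}, '_'}).

%% What the collector would expose: retained total + live samples.
%% This can only grow (or reset to 0 via an explicit reset command).
published_total(LiveTable, TotalsTable) ->
    Live = ets:foldl(fun({_Key, Count}, Acc) -> Acc + Count end,
                     0, LiveTable),
    Retired = case ets:lookup(TotalsTable, retired_published) of
                  [{_, R}] -> R;
                  []       -> 0
              end,
    Retired + Live.
```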