Global counters per protocol + protocol AND queue_type #3127
(force-pushed from b7b279b to a631d4b)
Awesome change!
I left some in-line comments.
I know it's not part of this PR, but I looked into seshat as well.
Should we use the `write_concurrency` option for the counters initialised in https://github.com/rabbitmq/seshat/blob/4e190ebec1707df8cfaba51753f92a7588fe0e2e/src/seshat_counters.erl#L34? Our use case fits that option very well:
- The channel processes and stream reader connection processes write concurrently.
- The writes are very frequent compared to reading via `seshat_counters:prometheus_format/1` and `seshat_counters:overview/1`.
- We don't require "absolute read consistency".
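For context, a minimal sketch of the trade-off, using Erlang's `counters` module directly rather than the seshat API (the variable names and values below are just for illustration):

```erlang
%% Minimal sketch (plain Erlang counters, not seshat): the default atomics
%% flavour gives fully consistent reads, while write_concurrency spreads
%% writes across per-scheduler slots so frequent concurrent add/3 calls
%% don't contend, at the cost of "absolute read consistency".
Consistent = counters:new(1, [atomics]),
Concurrent = counters:new(1, [write_concurrency]),
ok = counters:add(Concurrent, 1, 1),  %% hot path: channels / stream readers
Value = counters:get(Concurrent, 1).  %% cold path: the Prometheus scrape
```

Roughly speaking, `get/2` on a `write_concurrency` counter sums the per-scheduler slots, so a read taken while writers are active may lag slightly - which matches the "no absolute read consistency" point above.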
(resolved inline comment on deps/rabbitmq_prometheus/src/collectors/prometheus_rabbitmq_global_metrics_collector.erl)
That makes sense to me. WDYT @kjnilsson & @dcorbacho? If we reach soft consensus, do you want to contribute that PR to seshat @ansd?
(force-pushed from 42e89dc to b220004)
All these metrics, except publishers & consumers, are handled by rabbitmq_global_metrics, so we currently have duplicates. As I started removing these, I realised that the tests were written in Java - why not Erlang? - and they seemed way too complicated for what was needed. After the new rabbitmq_global_metrics, we are left with 2 metrics, and all the extra code simply doesn't justify them. I am proposing that we add them to rabbit_global_counters as gauges.

Let's discuss @dcorbacho @acogoluegnes

Signed-off-by: Gerhard Lazu <[email protected]>
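For illustration only (this is not the actual rabbit_global_counters or seshat API): a gauge in this sense is simply a counter that is decremented as well as incremented, which Erlang's `counters` module already supports:

```erlang
%% Hypothetical sketch: publisher/consumer "gauges" as counters that are
%% bumped on attach and decremented on detach, then read at scrape time.
Gauges = counters:new(1, [write_concurrency]),
ok = counters:add(Gauges, 1, 1),        %% a publisher attaches
ok = counters:sub(Gauges, 1, 1),        %% ...and later detaches
Publishers = counters:get(Gauges, 1).   %% exposed to Prometheus as a gauge
```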
(force-pushed from b220004 to fae836f)
Back-ported.
Hello! Any updates on the task "Update RabbitMQ-Overview Grafana dashboard to use the new global counters & update the public versions"? We got bitten by this one at work last week, so I took a look at remapping all the metrics myself.

Remap (grouped by how certain I am of each mapping): Certain / Almost Certain / Unknown.

Solution: this is what I ran to remap the values in the dashboard file:
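Purely as an illustration of the kind of rewrite involved (this is not the commenter's actual command, and the single old → new metric pair and file name below are assumptions, not a verified mapping), a remap like this can be scripted from an Erlang shell:

```erlang
%% Hypothetical sketch: rewrite old metric names to their rabbitmq_global_
%% equivalents in a dashboard JSON file. Mapping and file name are illustrative.
Mappings = [{<<"rabbitmq_channel_messages_published_total">>,
             <<"rabbitmq_global_messages_received_total">>}],
{ok, Dashboard} = file:read_file("rabbitmq-overview.json"),
Remapped = lists:foldl(fun({Old, New}, Acc) ->
                               binary:replace(Acc, Old, New, [global])
                       end, Dashboard, Mappings),
ok = file:write_file("rabbitmq-overview.json", Remapped).
```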
Hi. Thanks for sharing this. I'll update the dashboards in the next few days hopefully.
Any updates on this?
According to ZenHub, the Grafana dashboard update task is still open.
@ggustafsson I've just made the above PR to first get the dashboards into a healthier state - they've fallen behind some Grafana changes. My plan for next week is to then make the changes from aggregated to global metrics as you requested.
@michaelklishin @coro Thank you for the update! I'll take a closer look at the PR next week. I can test each revision of the dashboard at work.
Thanks @ggustafsson. As a heads up, I haven't pushed anything to grafana.com just yet, but as of the changes in that PR you should be able to copy the dashboard JSON straight from the repo and paste it into the import menu in Grafana in order to test them.
The draft PR containing the global metrics is here: #5463 |
@coro I tried out the first PR and from what I can see it looks good, but it is a bit hard for me to do a good test because our metrics values are all over the place when we don't use the global metric paths. I will keep an eye on the second PR and try it out early. We have been using the existing dashboard with my remapping applied in the meantime.
This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, which is something that many have asked for for a really long time.
The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for Streams, which have different rules from regular queues (for example, consuming messages is non-destructive, and deep queue backlogs - think billions of messages - are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams.
This has gone through a few cycles, with @mkuratczyk & @dcorbacho covering most of the ground. @dcorbacho had most of this in #3045, but the main branch went through a few changes in the meantime. Rather than resolving all the conflicts and then making the necessary changes, we (@gerhard + @kjnilsson) took all the learnings and started re-applying a lot of the existing code from #3045. We are confident in this approach and would like to see it through. We continued working on this with @dumbbell, and the most important changes are captured in rabbitmq/seshat#1.
We expose these global counters in rabbitmq_prometheus via a new collector. We don't want to keep modifying the existing collector, which grew really complex in parts, especially since we introduced aggregation, but rather start with a new namespace, `rabbitmq_global_`, and continue building on top of it. The idea is to build in parallel and slowly transition to the new metrics, because semantically the changes since streams are too big, and we have been discussing protocol-specific metrics with @kjnilsson, which makes me think that this approach is the least disruptive and... simple.

While at it, we removed redundant empty return value handling in the channel; the function called no longer returns this. We also removed all DONE / TODO & other comments - we'll handle them when the time comes, no need to leave TODO reminders.
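To make the shape of the new series concrete, this is roughly what counters in the new namespace look like when scraped (the exact metric names and label values here are illustrative samples, not a complete or verified list of what the collector emits):

```
# Illustrative sample, not actual collector output:
rabbitmq_global_messages_received_total{protocol="amqp091"} 1234567
rabbitmq_global_messages_received_total{protocol="stream"} 890123
rabbitmq_global_messages_acknowledged_total{protocol="amqp091",queue_type="classic"} 456789
rabbitmq_global_messages_acknowledged_total{protocol="amqp091",queue_type="quorum"} 123456
```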
Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)
Next steps
- Backport to v3.9.x as is.
- Backport to v3.8.x so that we can finally address Aggregated `queue_messages_published_total` metric violates Prometheus expectations about counters #2783.