Global counters per protocol + protocol AND queue_type #3127
(force-pushed from b7b279b to a631d4b)
Awesome change!
I left some in-line comments.
I know it's not part of this PR, but I looked into seshat as well.
Should we use the `write_concurrency` option for the counters initialised in https://github.com/rabbitmq/seshat/blob/4e190ebec1707df8cfaba51753f92a7588fe0e2e/src/seshat_counters.erl#L34? Our use case fits that option very well:
- The channel processes and stream reader connection processes write concurrently.
- The writes are very frequent compared to reading via `seshat_counters:prometheus_format/1` and `seshat_counters:overview/1`.
- We don't require "absolute read consistency".
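For context, a minimal sketch of the trade-off, using Erlang's `counters` module directly rather than the seshat API (the variable names and values below are just for illustration):

```erlang
%% Minimal sketch (plain Erlang counters, not seshat): the default atomics
%% flavour gives fully consistent reads, while write_concurrency spreads
%% writes across per-scheduler slots so frequent concurrent add/3 calls
%% don't contend, at the cost of "absolute read consistency".
Consistent = counters:new(1, [atomics]),
Concurrent = counters:new(1, [write_concurrency]),
ok = counters:add(Concurrent, 1, 1),  %% hot path: channels / stream readers
Value = counters:get(Concurrent, 1).  %% cold path: the Prometheus scrape
```

Roughly speaking, `get/2` on a `write_concurrency` counter sums the per-scheduler slots, so a read taken while writers are active may lag slightly - which matches the "no absolute read consistency" point above.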
(resolved inline comment on deps/rabbitmq_prometheus/src/collectors/prometheus_rabbitmq_global_metrics_collector.erl)
That makes sense to me. WDYT @kjnilsson & @dcorbacho? If we reach soft consensus, do you want to contribute that PR to seshat @ansd?
(force-pushed from 42e89dc to b220004)
All these metrics, except publishers & consumers, are handled by rabbitmq_global_metrics, so we currently have duplicates. As I started removing these, I realised that the tests were written in Java - why not Erlang? - and they seemed way too complicated for what was needed. After the new rabbitmq_global_metrics, we are left with 2 metrics, and all the extra code simply doesn't justify them. I am proposing that we add them to rabbit_global_counters as gauges.

Let's discuss @dcorbacho @acogoluegnes

Signed-off-by: Gerhard Lazu <[email protected]>
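For illustration only (this is not the actual rabbit_global_counters or seshat API): a gauge in this sense is simply a counter that is decremented as well as incremented, which Erlang's `counters` module already supports:

```erlang
%% Hypothetical sketch: publisher/consumer "gauges" as counters that are
%% bumped on attach and decremented on detach, then read at scrape time.
Gauges = counters:new(1, [write_concurrency]),
ok = counters:add(Gauges, 1, 1),        %% a publisher attaches
ok = counters:sub(Gauges, 1, 1),        %% ...and later detaches
Publishers = counters:get(Gauges, 1).   %% exposed to Prometheus as a gauge
```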
(force-pushed from b220004 to fae836f)
Back-ported.
Hello! Any updates on the task "Update RabbitMQ-Overview Grafana dashboard to use the new global counters & update the public versions"? We got bitten by this one at work last week, so I took a look at remapping all the metrics myself.

Remap (grouped by how certain I am of each mapping): Certain / Almost Certain / Unknown.

Solution: this is what I ran to remap the values in the dashboard file:
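Purely as an illustration of the kind of rewrite involved (this is not the commenter's actual command, and the single old → new metric pair and file name below are assumptions, not a verified mapping), a remap like this can be scripted from an Erlang shell:

```erlang
%% Hypothetical sketch: rewrite old metric names to their rabbitmq_global_
%% equivalents in a dashboard JSON file. Mapping and file name are illustrative.
Mappings = [{<<"rabbitmq_channel_messages_published_total">>,
             <<"rabbitmq_global_messages_received_total">>}],
{ok, Dashboard} = file:read_file("rabbitmq-overview.json"),
Remapped = lists:foldl(fun({Old, New}, Acc) ->
                               binary:replace(Acc, Old, New, [global])
                       end, Dashboard, Mappings),
ok = file:write_file("rabbitmq-overview.json", Remapped).
```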
Hi. Thanks for sharing this. I'll update the dashboards in the next few days hopefully.
Any updates on this?
According to ZenHub, the Grafana dashboard update task is still open.
@ggustafsson I've just made the above PR to first get the dashboards into a healthier state - they've fallen behind some Grafana changes. My plan for next week is to then make the changes from aggregated to global metrics as you requested.
@michaelklishin @coro Thank you for the update! I'll take a closer look at the PR next week. I can test each revision of the dashboard at work.
Thanks @ggustafsson. As a heads up, I haven't pushed anything to grafana.com just yet, but as of the changes in that PR you should be able to copy the dashboard JSON straight from the repo and paste it into the import menu in Grafana in order to test them.
The draft PR containing the global metrics is here: #5463 |
@coro I tried out the first PR and from what I can see it looks good, but it is a bit hard for me to do a good test because our metrics values are all over the place when we don't use the global metric paths. I will keep an eye on the second PR and try it out early. We have been using the existing dashboard with my remapping applied in the meantime.
This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, which is something that many have asked for for a really long time.
The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for Streams, which have different rules from regular queues (for example, consuming messages is non-destructive, and deep queue backlogs - think billions of messages - are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams.
This has gone through a few cycles, with @mkuratczyk & @dcorbacho covering most of the ground. @dcorbacho had most of this in #3045, but the main branch went through a few changes in the meantime. Rather than resolving all the conflicts and then making the necessary changes, we (@gerhard + @kjnilsson) took all the learnings and started re-applying a lot of the existing code from #3045. We are confident in this approach and would like to see it through. We continued working on this with @dumbbell, and the most important changes are captured in rabbitmq/seshat#1.
We expose these global counters in rabbitmq_prometheus via a new collector. We don't want to keep modifying the existing collector, which grew really complex in parts, especially since we introduced aggregation, but rather start with a new namespace, `rabbitmq_global_`, and continue building on top of it. The idea is to build in parallel and slowly transition to the new metrics, because semantically the changes since streams are too big, and we have been discussing protocol-specific metrics with @kjnilsson, which makes me think that this approach is the least disruptive and... simple.

While at it, we removed redundant empty return value handling in the channel; the function called no longer returns this. We also removed all DONE / TODO & other comments - we'll handle them when the time comes, no need to leave TODO reminders.
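To make the shape of the new series concrete, this is roughly what counters in the new namespace look like when scraped (the exact metric names and label values here are illustrative samples, not a complete or verified list of what the collector emits):

```
# Illustrative sample, not actual collector output:
rabbitmq_global_messages_received_total{protocol="amqp091"} 1234567
rabbitmq_global_messages_received_total{protocol="stream"} 890123
rabbitmq_global_messages_acknowledged_total{protocol="amqp091",queue_type="classic"} 456789
rabbitmq_global_messages_acknowledged_total{protocol="amqp091",queue_type="quorum"} 123456
```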
Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)
Next steps
- Backport to v3.9.x as is.
- Backport to v3.8.x so that we can finally address Aggregated `queue_messages_published_total` metric violates Prometheus expectations about counters #2783.