Global counters per protocol + protocol AND queue_type #3127

Merged
gerhard merged 2 commits into master from global-counters-simplified on Jun 22, 2021

Contributor

@gerhard gerhard commented Jun 21, 2021

This way we can show how many messages were received via a certain protocol (stream is the second real protocol besides the default amqp091 one), as well as by queue type, something many have been asking for for a long time.

The most important aspect is that we can also see them by protocol AND queue_type, which becomes very important for Streams, which have different rules from regular queues (for example, consuming messages is non-destructive, and deep queue backlogs - think billions of messages - are normal). Alerting and consumer scaling due to deep backlogs will now work correctly, as we can distinguish between regular queues & streams.
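
To illustrate why the protocol AND queue_type split matters for alerting, here is a minimal Python sketch (not part of this PR) that groups counter samples by label. The sample lines, their values, and the queue_type label values are made up for illustration, and `totals_by` is a hypothetical helper; only the rabbitmq_global_ prefix, the amqp091 and stream protocol names, and the two label names come from this description.

```python
import re
from collections import defaultdict

# Made-up samples in Prometheus text exposition format (illustrative only).
SAMPLE = """\
rabbitmq_global_messages_delivered_total{protocol="amqp091",queue_type="rabbit_classic_queue"} 1200
rabbitmq_global_messages_delivered_total{protocol="amqp091",queue_type="rabbit_stream_queue"} 300
rabbitmq_global_messages_delivered_total{protocol="stream",queue_type="rabbit_stream_queue"} 5000000
"""

LINE = re.compile(r'^(?P<name>[A-Za-z_:][A-Za-z0-9_:]*)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def totals_by(text, metric, *keys):
    """Sum the samples of `metric`, grouped by the given label names."""
    totals = defaultdict(float)
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if not match or match.group("name") != metric:
            continue
        labels = dict(LABEL.findall(match.group("labels")))
        group = tuple(labels.get(key, "") for key in keys)
        totals[group] += float(match.group("value"))
    return dict(totals)


by_queue_type = totals_by(SAMPLE, "rabbitmq_global_messages_delivered_total", "queue_type")

# An alert meant for regular queues can now leave streams out of the sum.
non_stream = sum(
    value for (queue_type,), value in by_queue_type.items()
    if queue_type != "rabbit_stream_queue"
)
```

Without the queue_type label, the stream traffic in this sample would dominate any total and make a threshold tuned for regular queues meaningless.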

This has gone through a few cycles, with @mkuratczyk & @dcorbacho covering most of the ground. @dcorbacho had most of this in #3045, but the main branch went through a few changes in the meantime. Rather than resolving all the conflicts, and then making the necessary changes, we (@gerhard + @kjnilsson) took all learnings and started re-applying a lot of the existing code from #3045. We are confident in this approach and would like to see it through. We continued working on this with @dumbbell, and the most important changes are captured in rabbitmq/seshat#1.

We expose these global counters in rabbitmq_prometheus via a new collector. Rather than keep modifying the existing collector, which has grown really complex in parts, especially since we introduced aggregation, we start with a new namespace, rabbitmq_global_, and continue building on top of it. The idea is to build in parallel and slowly transition to the new metrics, because the semantic changes since streams are too big. We have also been discussing protocol-specific metrics with @kjnilsson, which makes me think that this approach is the least disruptive and... simple.

While at it, we removed the redundant handling of an empty return value in the channel. The function being called no longer returns it.

Also removed all DONE / TODO & other comments - we'll handle them when the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Next steps

@ansd ansd self-requested a review June 22, 2021 08:30
Member

@ansd ansd left a comment

Awesome change!

I left some in-line comments.

I know it's not part of this PR, but I looked into seshat as well.
Should we use the write_concurrency option for the counters initialised in https://github.com/rabbitmq/seshat/blob/4e190ebec1707df8cfaba51753f92a7588fe0e2e/src/seshat_counters.erl#L34?
Our use case fits that option very well:

  1. The channel processes and stream reader connection processes write concurrently.
  2. The writes are very frequent compared to reading via seshat_counters:prometheus_format/1 and seshat_counters:overview/1.
  3. We don't require "absolute read consistency".
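
For readers unfamiliar with the option: in Erlang/OTP, counters:new/2 accepts write_concurrency, which trades read consistency (and some memory) for cheaper concurrent writes. The Python sketch below only illustrates that trade-off, assuming seshat builds on that module; it is not how OTP or seshat implement it, and `ShardedCounter` and `hammer` are hypothetical names.

```python
import itertools
import threading


class ShardedCounter:
    """Toy counter: each writer thread gets a shard, reads sum all shards."""

    def __init__(self, shards=8):
        self._values = [0] * shards
        self._locks = [threading.Lock() for _ in range(shards)]
        self._local = threading.local()
        self._next_shard = itertools.count()

    def _shard(self):
        # Assign shards round-robin the first time a thread writes.
        index = getattr(self._local, "index", None)
        if index is None:
            index = next(self._next_shard) % len(self._values)
            self._local.index = index
        return index

    def add(self, n=1):
        index = self._shard()
        # Writers only contend with other writers mapped to the same shard.
        with self._locks[index]:
            self._values[index] += n

    def read(self):
        # No global lock: a read taken while writers are active may miss
        # in-flight increments, i.e. there is no absolute read consistency.
        return sum(self._values)


def hammer(counter, writers=4, per_writer=1000):
    """Run several writer threads against the counter and wait for them."""
    def work():
        for _ in range(per_writer):
            counter.add()

    threads = [threading.Thread(target=work) for _ in range(writers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()


counter = ShardedCounter()
hammer(counter)
total = counter.read()  # exact here, because all writers have finished
```

Frequent writers, infrequent readers, and tolerance for slightly stale totals are the three conditions listed above, and they are the conditions under which this kind of design pays off.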

@gerhard
Contributor Author

gerhard commented Jun 22, 2021

I know it's not part of this PR, but I looked into seshat as well.
Should we use the write_concurrency option for the counters initialised in https://github.com/rabbitmq/seshat/blob/4e190ebec1707df8cfaba51753f92a7588fe0e2e/src/seshat_counters.erl#L34?
Our use case fits that option very well:

  1. The channel processes and stream reader connection processes write concurrently.
  2. The writes are very frequent compared to reading via seshat_counters:prometheus_format/1 and seshat_counters:overview/1.
  3. We don't require "absolute read consistency".

That makes sense to me. WDYT @kjnilsson & @dcorbacho?

If we reach soft consensus, do you want to contribute that PR to seshat @ansd ?

@gerhard gerhard force-pushed the global-counters-simplified branch from 42e89dc to b220004 Compare June 22, 2021 12:52
gerhard added 2 commits June 22, 2021 14:14
This way we can show how many messages were received via a certain
protocol (stream is the second real protocol besides the default amqp091
one), as well as by queue type, something many have been asking for for
a long time.

The most important aspect is that we can also see them by protocol AND
queue_type, which becomes very important for Streams, which have
different rules from regular queues (for example, consuming messages is
non-destructive, and deep queue backlogs - think billions of messages -
are normal). Alerting and consumer scaling due to deep backlogs will now
work correctly, as we can distinguish between regular queues & streams.

This has gone through a few cycles, with @mkuratczyk & @dcorbacho
covering most of the ground. @dcorbacho had most of this in
#3045, but the main
branch went through a few changes in the meantime. Rather than resolving
all the conflicts, and then making the necessary changes, we (@gerhard +
@kjnilsson) took all learnings and started re-applying a lot of the
existing code from #3045. We are confident in this approach and would
like to see it through. We continued working on this with @dumbbell, and
the most important changes are captured in
rabbitmq/seshat#1.

We expose these global counters in rabbitmq_prometheus via a new
collector. Rather than keep modifying the existing collector, which has
grown really complex in parts, especially since we introduced
aggregation, we start with a new namespace, `rabbitmq_global_`, and
continue building on top of it. The idea is to build in parallel and
slowly transition to the new metrics, because the semantic changes since
streams are too big. We have also been discussing protocol-specific
metrics with @kjnilsson, which makes me think that this approach is the
least disruptive and... simple.

While at it, we removed the redundant handling of an empty return value
in the channel. The function being called no longer returns it.

Also removed all DONE / TODO & other comments - we'll handle them when
the time comes, no need to leave TODO reminders.

Pairs @kjnilsson @dcorbacho @dumbbell
(this is multiple commits squashed into one)

Signed-off-by: Gerhard Lazu <[email protected]>
All these metrics, except publishers & consumers, are handled by
rabbitmq_global_metrics, so we currently have duplicates. As I started
removing these, I realised that tests were written in Java - why not
Erlang? - and they seemed way too complicated for what was needed. After
the new rabbitmq_global_metrics, we are left with 2 metrics, and all the
extra code simply doesn't justify them. I am proposing that we add them to
rabbit_global_counters as gauges. Let's discuss @dcorbacho @acogoluegnes

Signed-off-by: Gerhard Lazu <[email protected]>
@gerhard gerhard force-pushed the global-counters-simplified branch from b220004 to fae836f Compare June 22, 2021 13:15
@gerhard gerhard merged commit fda3c19 into master Jun 22, 2021
@gerhard gerhard deleted the global-counters-simplified branch June 22, 2021 13:39
@gerhard
Contributor Author

gerhard commented Jun 22, 2021

Back-ported to v3.9.x

@ggustafsson

Hello!

Any updates on task "Update RabbitMQ-Overview Grafana dashboard to use the new global counters & update the public versions"?

We got bitten by this one at work last week, so I took a look at all the *_total metrics in the dashboard and remapped them. It seems to work as intended, but I am not 100% sure, as not all metric names were directly translatable, so I had to guess.

Remap

Certain

rabbitmq_channel_get_empty_total                    -> rabbitmq_global_messages_get_empty_total
rabbitmq_channel_messages_acked_total               -> rabbitmq_global_messages_acknowledged_total
rabbitmq_channel_messages_confirmed_total           -> rabbitmq_global_messages_confirmed_total
rabbitmq_channel_messages_delivered_total           -> rabbitmq_global_messages_delivered_total
rabbitmq_channel_messages_redelivered_total         -> rabbitmq_global_messages_redelivered_total
rabbitmq_channel_messages_unroutable_dropped_total  -> rabbitmq_global_messages_unroutable_dropped_total
rabbitmq_channel_messages_unroutable_returned_total -> rabbitmq_global_messages_unroutable_returned_total

Almost Certain

rabbitmq_channel_get_ack_total                -> rabbitmq_global_messages_delivered_get_manual_ack_total
rabbitmq_channel_get_total                    -> rabbitmq_global_messages_delivered_get_auto_ack_total
rabbitmq_channel_messages_delivered_ack_total -> rabbitmq_global_messages_delivered_consume_manual_ack_total
rabbitmq_channel_messages_published_total     -> rabbitmq_global_messages_received_total

Unknown

rabbitmq_channels_closed_total          -> ?
rabbitmq_channels_opened_total          -> ?
rabbitmq_connections_closed_total       -> ?
rabbitmq_connections_opened_total       -> ?
rabbitmq_queue_messages_published_total -> ?
rabbitmq_queues_created_total           -> ?
rabbitmq_queues_declared_total          -> ?
rabbitmq_queues_deleted_total           -> ?

Solution

This is what I ran to remap the values in the dashboard file:

# Note: `-i ""` is the BSD/macOS form; with GNU sed, use `-i` without the empty argument.
sed -i "" \
  -e 's/rabbitmq_channel_get_ack_total/rabbitmq_global_messages_delivered_get_manual_ack_total/g' \
  -e 's/rabbitmq_channel_get_empty_total/rabbitmq_global_messages_get_empty_total/g' \
  -e 's/rabbitmq_channel_get_total/rabbitmq_global_messages_delivered_get_auto_ack_total/g' \
  -e 's/rabbitmq_channel_messages_acked_total/rabbitmq_global_messages_acknowledged_total/g' \
  -e 's/rabbitmq_channel_messages_confirmed_total/rabbitmq_global_messages_confirmed_total/g' \
  -e 's/rabbitmq_channel_messages_delivered_ack_total/rabbitmq_global_messages_delivered_consume_manual_ack_total/g' \
  -e 's/rabbitmq_channel_messages_delivered_total/rabbitmq_global_messages_delivered_total/g' \
  -e 's/rabbitmq_channel_messages_published_total/rabbitmq_global_messages_received_total/g' \
  -e 's/rabbitmq_channel_messages_redelivered_total/rabbitmq_global_messages_redelivered_total/g' \
  -e 's/rabbitmq_channel_messages_unroutable_dropped_total/rabbitmq_global_messages_unroutable_dropped_total/g' \
  -e 's/rabbitmq_channel_messages_unroutable_returned_total/rabbitmq_global_messages_unroutable_returned_total/g' \
  RabbitMQ-Overview.json
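
For anyone who prefers something that runs the same way on every platform, the remap above can also be sketched in Python. This is only an illustration using the "Certain" and "Almost Certain" pairs from this comment; `remap_metrics` and `remap_file` are hypothetical helper names, and the "Almost Certain" pairs remain guesses here just as they are above.

```python
import re

# Old -> new metric names, copied from the remap tables in this comment.
REMAP = {
    "rabbitmq_channel_get_ack_total": "rabbitmq_global_messages_delivered_get_manual_ack_total",
    "rabbitmq_channel_get_empty_total": "rabbitmq_global_messages_get_empty_total",
    "rabbitmq_channel_get_total": "rabbitmq_global_messages_delivered_get_auto_ack_total",
    "rabbitmq_channel_messages_acked_total": "rabbitmq_global_messages_acknowledged_total",
    "rabbitmq_channel_messages_confirmed_total": "rabbitmq_global_messages_confirmed_total",
    "rabbitmq_channel_messages_delivered_ack_total": "rabbitmq_global_messages_delivered_consume_manual_ack_total",
    "rabbitmq_channel_messages_delivered_total": "rabbitmq_global_messages_delivered_total",
    "rabbitmq_channel_messages_published_total": "rabbitmq_global_messages_received_total",
    "rabbitmq_channel_messages_redelivered_total": "rabbitmq_global_messages_redelivered_total",
    "rabbitmq_channel_messages_unroutable_dropped_total": "rabbitmq_global_messages_unroutable_dropped_total",
    "rabbitmq_channel_messages_unroutable_returned_total": "rabbitmq_global_messages_unroutable_returned_total",
}

# Match whole metric names only; longer names are tried first.
_PATTERN = re.compile(
    r"\b(" + "|".join(sorted(map(re.escape, REMAP), key=len, reverse=True)) + r")\b"
)


def remap_metrics(text):
    """Return `text` with every old metric name replaced by its new name."""
    return _PATTERN.sub(lambda match: REMAP[match.group(1)], text)


def remap_file(path):
    """Rewrite a dashboard file in place, e.g. remap_file("RabbitMQ-Overview.json")."""
    with open(path, encoding="utf-8") as handle:
        original = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(remap_metrics(original))
```

One difference from the sed command: the word-boundary match replaces whole metric names only, whereas sed replaces any matching substring. The metrics listed under "Unknown" are left untouched by both.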

@mkuratczyk
Contributor

Hi. Thanks for sharing this. I'll update the dashboards in the next few days, hopefully.

@ggustafsson

Hi. Thanks for sharing this. I'll update the dashboards in the next few days, hopefully.

Any updates on this?

@michaelklishin
Member

According to ZenHub, the Grafana dashboard update task is still open.

@coro
Contributor

coro commented Aug 5, 2022

@ggustafsson I've just made the above PR to first get the dashboards into a healthier state - they've fallen behind some Grafana changes. My plan for next week is to then make the changes from aggregated to global metrics as you requested.

@ggustafsson

@michaelklishin @coro Thank you for the update! I’ll take a closer look at the PR next week. I can test each revision of the dashboard at work.

@coro
Contributor

coro commented Aug 8, 2022

Thanks @ggustafsson. As a heads up, I haven't pushed anything to grafana.com just yet, but as of the changes in that PR you should be able to copy the dashboard JSON straight from the repo and paste it into the import menu in Grafana in order to test them.

@coro
Contributor

coro commented Aug 8, 2022

The draft PR containing the global metrics is here: #5463

@ggustafsson

@coro I tried out the first PR and from what I can see it looks good, but it is a bit hard for me to test it properly because our metric values are all over the place when we don't use the global metric names. I will keep an eye on the second PR and try it out early. We have been using the existing dashboard with my sed fix above for around two months now and we haven't seen anything odd with it (yet).

mergify bot pushed a commit that referenced this pull request Nov 1, 2023
Global counters for producers added in #3127 but never made it to this dashboard

(cherry picked from commit 0b2a94c)
mergify bot pushed a commit that referenced this pull request Nov 1, 2023
Global counters for producers added in #3127 but never made it to this dashboard

(cherry picked from commit 0b2a94c)
(cherry picked from commit 548e67c)