Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add specific stream protocol counters to track protocol errors #3157

Merged
merged 4 commits into from
Jul 1, 2021

Conversation

dcorbacho
Copy link
Contributor

Types of Changes

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)

Checklist

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • All tests pass locally with my changes
  • I have added tests that prove my fix is effective or that my feature works
  • I have added necessary documentation (if appropriate)
  • Any dependent changes have been merged and published in related repositories

@dcorbacho dcorbacho marked this pull request as draft June 28, 2021 15:22
@gerhard
Copy link
Contributor

gerhard commented Jun 29, 2021

This can now be rebased onto master, #3136 is in. Thank you @michaelklishin 👍🏻

@dcorbacho dcorbacho force-pushed the stream-protocol-counters branch from b956676 to 58e36b6 Compare June 29, 2021 10:50
@dcorbacho dcorbacho marked this pull request as ready for review June 29, 2021 10:50
@dcorbacho dcorbacho requested a review from gerhard June 29, 2021 10:50
-define(PROTOCOL_COUNTERS,
[
{
stream_error_stream_does_not_exist, ?STREAM_DOES_NOT_EXIST, counter,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If these are counters, then Prometheus format requires them to end with _total :

Note that an accumulating count has total as a suffix, in addition to the unit if applicable.

https://prometheus.io/docs/practices/naming/

@gerhard
Copy link
Contributor

gerhard commented Jun 29, 2021

Apart from the above _total change, this is missing before we can merge:

  • add metric to deps/rabbitmq_prometheus/metrics.md under GLOBAL COUNTERS

@Gsantomaggio do you want to test pivotalrabbitmq/rabbitmq:stream-protocol-counters-otp-max with https://github.com/rabbitmq/rabbitmq-stream-go-client/blob/main/examples/publishersError/publisherError.go ?

After that, if you want Gabriele, we can add these to the new RabbitMQ-Streams dashboard 😉

@Gsantomaggio
Copy link
Member

Sure !

@dcorbacho
Copy link
Contributor Author

@gerhard Those test suites can't check for the new metrics, as they only show up when the stream protocol is used. There is a check for them on rabbitmq_stream/test/rabbit_stream_SUITE

@dcorbacho dcorbacho requested a review from gerhard June 30, 2021 10:48
@gerhard
Copy link
Contributor

gerhard commented Jun 30, 2021

All looks good, starting QA in 3.9 LRE

@gerhard
Copy link
Contributor

gerhard commented Jun 30, 2021

While this works, writing queries for these metrics is really difficult. This is what a single query looks like:

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_access_refused_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_authentication_failure_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_frame_too_large_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_internal_error_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_precondition_failed_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_publisher_does_not_exist_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_sasl_authentication_failure_loopback_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_sasl_challenge_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_sasl_error_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_sasl_mechanism_not_supported_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_stream_already_exists_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_stream_does_not_exist_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_stream_not_available_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_subscription_id_already_exists_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_subscription_id_does_not_exist_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_unknown_frame_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"}) +

sum by(rabbitmq_cluster) (rabbitmq_global_stream_error_vhost_access_failure_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"})

Could we capture the error type in a label, so that we would end up with metrics like these:

# TYPE rabbitmq_global_stream_error_total counter
# HELP rabbitmq_global_stream_error_total Total number of errors
rabbitmq_global_stream_error_total{protocol="stream",error_type="access_refused"} 0
rabbitmq_global_stream_error_total{protocol="stream",error_type="sasl_error"} 0
rabbitmq_global_stream_error_total{protocol="stream",error_type="authentication_failure"} 0
...

This would transform the above query into this:

sum by(rabbitmq_cluster, error_type) (rabbitmq_global_stream_error_total{protocol="stream"} * on(instance) group_left(rabbitmq_cluster, rabbitmq_node) rabbitmq_identity_info{rabbitmq_cluster="$rabbitmq_cluster", namespace="$namespace"})

As I mentioned above, while queries are difficult, this works and produces the following end-result:

image

@gerhard
Copy link
Contributor

gerhard commented Jul 1, 2021

Having discussed this offline, it would be too difficult to capture the error types as labels. Leaving as is & merging. Thank you @dcorbacho!

@gerhard gerhard merged commit ef4303a into master Jul 1, 2021
@gerhard gerhard deleted the stream-protocol-counters branch July 1, 2021 16:56
gerhard added a commit that referenced this pull request Jul 1, 2021
Add specific stream protocol counters to track protocol errors

(cherry picked from commit ef4303a)
@gerhard
Copy link
Contributor

gerhard commented Jul 1, 2021

Backported to v3.9.x

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants