More/progressive health check commands #292

michaelklishin · 2019-01-14T16:11:09Z

Yes, even more. Per discussion with @gerhard:

There is no single health check command that would be "universal": too many things can go wrong, and different teams consider different things a failure
There are node-local and cluster-wide checks, and the distinction should be reflected in command names
Health checks are staged (just like human or animal health checks), so we need commands that perform increasingly comprehensive checks, which come with an increasing likelihood of false positives. For example, node_health_check today checks every channel process, which takes a long time with tens of thousands of channels

Below is a proposal draft that will be refined as we go.

Introduction

Since relatively few multi-service systems that use messaging can be considered
completely identical and different operators consider different things to be
within the normal parameters, team RabbitMQ (and some other folks who work on data services and their automation [1][2]) has long concluded that
there is no such thing as a "one true way to health check" a RabbitMQ node.

The Docker image maintainer community has arrived at a similar conclusion.

Things get even more involved with clusters since distributed system monitoring,
the level of fault tolerance acceptable for a given system,
and preferred ways of reacting/recovering can vary greatly from
ops team to ops team.

Another important aspect of node monitoring is how it should be altered during
upgrades. This proposal doesn't cover that part.

Two Types of Health Checks

The proposal is to classify every health check RabbitMQ offers into one of
two categories:

Node-local checks
Cluster checks

Each category will have a number of checks organized into stages, with
increasingly more aspects of the system checked. This means the probability
of false positives for higher stages will also be higher. Which stage
is used by a given deployment is a choice of that system's operators.

Node-local Checks

Stage 1

What rabbitmqctl ping offers today: it ensures that the runtime is running
and (indirectly) that CLI tools can authenticate to it.

This is the most basic check possible. Aside from CLI tool authentication
failures, the probability of false positives can be considered approaching 0
outside of upgrades and maintenance windows.

Stage 2

Includes all checks in stage 1 plus makes sure that rabbitmqctl status
(well, the function that backs it) succeeds.

This is a common way of sanity checking a node.
The probability of false positives can be considered approaching 0
except for upgrades and maintenance windows.

Stage 3

Includes all checks in stage 2 plus checks that the RabbitMQ application is running
(not stopped/"paused" with rabbitmqctl stop_app or the Pause Minority partition
handling strategy) and there are no resource alarms.

The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly.

Systems hovering around their max allowed memory usage will have a high
probability of false positives.

Stage 4

Includes all checks in stage 3 plus checks that there are no failing virtual hosts.

The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly.

Stage 5

Includes all checks in stage 4 plus a check on all enabled listeners
(using a temporary TCP connection).

The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly.

Stage 6

Includes all checks in stage 5 plus what rabbitmqctl node_health_check
does (it sanity checks every local queue master process and every channel).

The probability of false positives is moderate for systems under
above-average load or with a large number of queues and channels
(starting with 10s of thousands).
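
The cumulative stages above could be wired into a small wrapper along these lines. This is a minimal sketch, not the team's implementation: the function name run_node_checks, the DIAG and AMQP_PORT variables, and the stage-to-command mapping are assumptions made for illustration. It only uses rabbitmq-diagnostics subcommands mentioned in this issue, so stage 3 covers local alarms only and stage 5 checks a single port listener rather than opening a TCP connection to every listener.

```shell
#!/bin/sh
# Hypothetical wrapper around the staged node-local checks.
# DIAG and AMQP_PORT are illustrative variables, not RabbitMQ settings.
DIAG="${DIAG:-rabbitmq-diagnostics}"
AMQP_PORT="${AMQP_PORT:-5672}"

# Runs stages 1..$1 in order and stops at the first failing stage.
# Returns 0 if all stages pass, 1 on a failed check, 2 on an unknown stage.
run_node_checks() {
  max_stage="$1"
  stage=1
  while [ "$stage" -le "$max_stage" ]; do
    case "$stage" in
      1) cmd="ping" ;;
      2) cmd="status" ;;
      3) cmd="check_local_alarms" ;;              # approximation of stage 3
      4) cmd="check_virtual_hosts" ;;
      5) cmd="check_port_listener $AMQP_PORT" ;;  # approximation of stage 5
      6) cmd="node_health_check" ;;
      *) echo "unknown stage: $stage" >&2; return 2 ;;
    esac
    # $cmd is left unquoted on purpose so that a subcommand and its
    # argument are passed as separate words.
    if ! "$DIAG" -q $cmd >/dev/null; then
      echo "stage $stage failed: $cmd" >&2
      return 1
    fi
    stage=$((stage + 1))
  done
  return 0
}
```

For example, `run_node_checks 4` would run stages 1 through 4 and return non-zero at the first stage that fails, which lets operators pick the stage that matches their tolerance for false positives.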

Optional Check 1

Includes all checks in stage 4 plus checks that an expected set of plugins is
enabled.

The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly depending on the deployment
tools/strategies used (e.g. all plugins can be temporarily disabled).
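
A check for an expected set of plugins could be sketched as follows, assuming rabbitmq-plugins is_enabled exits with a non-zero code when a plugin is not enabled. The function name check_expected_plugins and the PLUGINS_CMD variable are hypothetical and exist only for this example.

```shell
#!/bin/sh
# PLUGINS_CMD is an illustrative override point, not a RabbitMQ setting.
PLUGINS_CMD="${PLUGINS_CMD:-rabbitmq-plugins}"

# Succeeds only if every plugin passed as an argument is enabled.
# Reports the first plugin that is not enabled and returns 1.
check_expected_plugins() {
  for plugin in "$@"; do
    if ! "$PLUGINS_CMD" -q is_enabled "$plugin" >/dev/null; then
      echo "expected plugin is not enabled: $plugin" >&2
      return 1
    fi
  done
}
```

A deployment that expects the management and Shovel plugins might call `check_expected_plugins rabbitmq_management rabbitmq_shovel`.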

Cluster Checks

Stage 1

Checks for the expected number of nodes in a cluster.

The probability of false positives can be considered approaching 0.

Stage 2

Checks for network partitions detected by a node.

The probability of false positives is a function of the partition
detection algorithm used. With a timer-based strategy it is moderate
(say, within the (0, 0.2] range). With adaptive accrual failure detectors it is lower (according to our team's anecdotal evidence).
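
The stage 1 cluster check boils down to comparing a node count with an expected value. The sketch below assumes the names of running nodes are supplied on standard input, one per line. How that list is produced depends on the RabbitMQ version and tooling and is deliberately left out. The function name check_node_count is hypothetical.

```shell
#!/bin/sh
# Reads running node names from stdin (one per line) and succeeds
# only if their number equals the expected count given as $1.
check_node_count() {
  expected="$1"
  # grep -c prints the number of non-empty lines; it exits non-zero
  # when there are none, which is not an error here.
  actual=$(grep -c . || true)
  if [ "$actual" -ne "$expected" ]; then
    echo "expected $expected nodes, found $actual" >&2
    return 1
  fi
}
```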

Tasks

rabbitmq-diagnostics check_virtual_hosts was extracted into a separate issue scheduled for 3.7.12.


michaelklishin · 2019-01-18T23:53:43Z

The first step towards addressing this was to come up with a number of health checks our team agrees on as a reasonable range of options and document them.

It joins the club of is_booted/1 and is_running/{0, 1}. This allows for a CLI command that checks if the node is still booting. References rabbitmq/rabbitmq-cli#292.


…ands This means that even in the "negative" response they exit with a 0 status code. In other words, they just tell the user the state of things without asserting on what constitutes a success or failure. This is consistent with some recently introduced diagnostics commands: some are "informational" (simply provide an insight into the state of the node) and others are checks (opinionated, consider certain conditions to be faulty and exit with a non-zero exit code). References #292.


kitchen · 2019-04-10T23:24:33Z

hmm.

I seem to be having trouble with this:

# rabbitmq-plugins list
Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rabbit@ip-10-182-95-144
 |/
[  ] rabbitmq_amqp1_0                  3.7.14
[  ] rabbitmq_auth_backend_cache       3.7.14
[  ] rabbitmq_auth_backend_http        3.7.14
[  ] rabbitmq_auth_backend_ldap        3.7.14
[  ] rabbitmq_auth_mechanism_ssl       3.7.14
[  ] rabbitmq_consistent_hash_exchange 3.7.14
[  ] rabbitmq_event_exchange           3.7.14
[  ] rabbitmq_federation               3.7.14
[  ] rabbitmq_federation_management    3.7.14
[  ] rabbitmq_jms_topic_exchange       3.7.14
[  ] rabbitmq_management               3.7.14
[  ] rabbitmq_management_agent         3.7.14
[  ] rabbitmq_mqtt                     3.7.14
[E*] rabbitmq_peer_discovery_aws       3.7.14
[e*] rabbitmq_peer_discovery_common    3.7.14
[  ] rabbitmq_peer_discovery_consul    3.7.14
[  ] rabbitmq_peer_discovery_etcd      3.7.14
[  ] rabbitmq_peer_discovery_k8s       3.7.14
[  ] rabbitmq_random_exchange          3.7.14
[  ] rabbitmq_recent_history_exchange  3.7.14
[  ] rabbitmq_sharding                 3.7.14
[  ] rabbitmq_shovel                   3.7.14
[  ] rabbitmq_shovel_management        3.7.14
[  ] rabbitmq_stomp                    3.7.14
[  ] rabbitmq_top                      3.7.14
[  ] rabbitmq_tracing                  3.7.14
[  ] rabbitmq_trust_store              3.7.14
[  ] rabbitmq_web_dispatch             3.7.14
[  ] rabbitmq_web_mqtt                 3.7.14
[  ] rabbitmq_web_mqtt_examples        3.7.14
[  ] rabbitmq_web_stomp                3.7.14
[  ] rabbitmq_web_stomp_examples       3.7.14

If I try, say, rabbitmq_top:

root@ip-10-182-95-144:~# rabbitmq-plugins -q is_enabled rabbitmq_top ; echo $?
Plugin rabbitmq_top is not enabled on node rabbit@ip-10-182-95-144. Enabled plugins and dependencies: rabbitmq_aws, rabbitmq_peer_discovery_aws, rabbitmq_peer_discovery_common
0

# rabbitmq-plugins -q is_enabled rabbitmq_peer_discovery_aws; echo $?
Plugin rabbitmq_peer_discovery_aws is enabled on node rabbit@ip-10-182-95-144
0

Am I doing something wrong or making assumptions or? I would expect the first one to exit non-zero, since the plugin isn't enabled.

What I'm trying to do is use is_enabled in an idempotency condition in a chef recipe to see if the plugin is enabled or not, and it doesn't seem like I can do that here without parsing some output somewhere.

Thanks!

gerhard · 2019-04-11T06:14:36Z

Your expectations are correct: this is exactly how rabbitmq-plugins is_enabled is supposed to behave. This is the behaviour that I am seeing:

docker run -it --rm --name is_enabled rabbitmq:3.7.14

# in a new shell
docker exec -it is_enabled bash

rabbitmq-plugins list
Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rabbit@b54e36bb3b4f
 |/
[  ] rabbitmq_amqp1_0                  3.7.14
[  ] rabbitmq_auth_backend_cache       3.7.14
[  ] rabbitmq_auth_backend_http        3.7.14
[  ] rabbitmq_auth_backend_ldap        3.7.14
[  ] rabbitmq_auth_mechanism_ssl       3.7.14
[  ] rabbitmq_consistent_hash_exchange 3.7.14
[  ] rabbitmq_event_exchange           3.7.14
[  ] rabbitmq_federation               3.7.14
[  ] rabbitmq_federation_management    3.7.14
[  ] rabbitmq_jms_topic_exchange       3.7.14
[  ] rabbitmq_management               3.7.14
[  ] rabbitmq_management_agent         3.7.14
[  ] rabbitmq_mqtt                     3.7.14
[  ] rabbitmq_peer_discovery_aws       3.7.14
[  ] rabbitmq_peer_discovery_common    3.7.14
[  ] rabbitmq_peer_discovery_consul    3.7.14
[  ] rabbitmq_peer_discovery_etcd      3.7.14
[  ] rabbitmq_peer_discovery_k8s       3.7.14
[  ] rabbitmq_random_exchange          3.7.14
[  ] rabbitmq_recent_history_exchange  3.7.14
[  ] rabbitmq_sharding                 3.7.14
[  ] rabbitmq_shovel                   3.7.14
[  ] rabbitmq_shovel_management        3.7.14
[  ] rabbitmq_stomp                    3.7.14
[  ] rabbitmq_top                      3.7.14
[  ] rabbitmq_tracing                  3.7.14
[  ] rabbitmq_trust_store              3.7.14
[  ] rabbitmq_web_dispatch             3.7.14
[  ] rabbitmq_web_mqtt                 3.7.14
[  ] rabbitmq_web_mqtt_examples        3.7.14
[  ] rabbitmq_web_stomp                3.7.14
[  ] rabbitmq_web_stomp_examples       3.7.14

rabbitmq-plugins enable rabbitmq_peer_discovery_aws
Enabling plugins on node rabbit@b54e36bb3b4f:
rabbitmq_peer_discovery_aws
The following plugins have been configured:
  rabbitmq_peer_discovery_aws
  rabbitmq_peer_discovery_common
Applying plugin configuration to rabbit@b54e36bb3b4f...
The following plugins have been enabled:
  rabbitmq_peer_discovery_aws
  rabbitmq_peer_discovery_common
started 2 plugins.

rabbitmq-plugins list
Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rabbit@b54e36bb3b4f
 |/
[  ] rabbitmq_amqp1_0                  3.7.14
[  ] rabbitmq_auth_backend_cache       3.7.14
[  ] rabbitmq_auth_backend_http        3.7.14
[  ] rabbitmq_auth_backend_ldap        3.7.14
[  ] rabbitmq_auth_mechanism_ssl       3.7.14
[  ] rabbitmq_consistent_hash_exchange 3.7.14
[  ] rabbitmq_event_exchange           3.7.14
[  ] rabbitmq_federation               3.7.14
[  ] rabbitmq_federation_management    3.7.14
[  ] rabbitmq_jms_topic_exchange       3.7.14
[  ] rabbitmq_management               3.7.14
[  ] rabbitmq_management_agent         3.7.14
[  ] rabbitmq_mqtt                     3.7.14
[E*] rabbitmq_peer_discovery_aws       3.7.14
[e*] rabbitmq_peer_discovery_common    3.7.14
[  ] rabbitmq_peer_discovery_consul    3.7.14
[  ] rabbitmq_peer_discovery_etcd      3.7.14
[  ] rabbitmq_peer_discovery_k8s       3.7.14
[  ] rabbitmq_random_exchange          3.7.14
[  ] rabbitmq_recent_history_exchange  3.7.14
[  ] rabbitmq_sharding                 3.7.14
[  ] rabbitmq_shovel                   3.7.14
[  ] rabbitmq_shovel_management        3.7.14
[  ] rabbitmq_stomp                    3.7.14
[  ] rabbitmq_top                      3.7.14
[  ] rabbitmq_tracing                  3.7.14
[  ] rabbitmq_trust_store              3.7.14
[  ] rabbitmq_web_dispatch             3.7.14
[  ] rabbitmq_web_mqtt                 3.7.14
[  ] rabbitmq_web_mqtt_examples        3.7.14
[  ] rabbitmq_web_stomp                3.7.14
[  ] rabbitmq_web_stomp_examples       3.7.14

rabbitmq-plugins -q is_enabled rabbitmq_top ; echo $?
Plugin rabbitmq_top is not enabled on node rabbit@b54e36bb3b4f. Enabled plugins and dependencies: rabbitmq_aws, rabbitmq_peer_discovery_aws, rabbitmq_peer_discovery_common
69

rabbitmq-plugins -q is_enabled rabbitmq_peer_discovery_aws; echo $?
Plugin rabbitmq_peer_discovery_aws is enabled on node rabbit@b54e36bb3b4f
0

Given the above, rabbitmq-plugins is_enabled behaves as expected in v3.7.14 on Docker. I would take a closer look at your environment. How did you install RabbitMQ on AWS? Which OS are you using? Are all file permissions correct? How did you enable rabbitmq_peer_discovery_aws in the first place? What does your enabled_plugins file look like?

I am intrigued as to what could possibly make rabbitmq-plugins is_enabled behave differently in your case 🤔

michaelklishin · 2019-04-11T10:09:37Z

This is mailing list material.

michaelklishin · 2019-04-11T10:12:24Z

There is no need to parse any output. rabbitmq-plugins is_enabled, like all checks, uses exit codes to communicate success or error. I'm also not sure why rabbitmq-plugins enable is not idempotent enough, in particular in --offline mode. Again, this is a great topic for a rabbitmq-users thread.
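
An idempotency guard built purely on exit codes might look like the sketch below. It assumes is_enabled exits with a non-zero code for a plugin that is not enabled, as in the Docker session above (exit code 69). The function name ensure_plugin_enabled and the PLUGINS_CMD variable are hypothetical.

```shell
#!/bin/sh
# PLUGINS_CMD is an illustrative override point, not a RabbitMQ setting.
PLUGINS_CMD="${PLUGINS_CMD:-rabbitmq-plugins}"

# Enables a plugin only if is_enabled reports (via its exit code)
# that the plugin is not enabled yet. No output parsing is involved.
ensure_plugin_enabled() {
  plugin="$1"
  if "$PLUGINS_CMD" -q is_enabled "$plugin" >/dev/null 2>&1; then
    echo "already enabled: $plugin"
  else
    "$PLUGINS_CMD" enable "$plugin"
  fi
}
```

In a Chef recipe the same condition could serve as a guard, with the exit code of `rabbitmq-plugins -q is_enabled <plugin>` deciding whether the enable step runs.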

michaelklishin · 2019-04-11T10:14:39Z

RabbitMQ Chef cookbook uses rabbitmq-plugins list --silent --minimal --enabled and grep. It predates this PR but is not meaningfully different, at least as far as Chef LWRPs go.
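
The list-and-grep approach can be sketched as below. This is not the cookbook's actual code: it assumes the command prints one plugin name per line with these flags, and the function name plugin_in_enabled_list and the PLUGINS_CMD variable are hypothetical.

```shell
#!/bin/sh
# PLUGINS_CMD is an illustrative override point, not a RabbitMQ setting.
PLUGINS_CMD="${PLUGINS_CMD:-rabbitmq-plugins}"

# Succeeds if the plugin name given as $1 appears as a whole line
# in the minimal list of enabled plugins.
plugin_in_enabled_list() {
  # -x matches whole lines only, -F treats the name as a fixed string,
  # -q suppresses output so only the exit code is used.
  "$PLUGINS_CMD" list --silent --minimal --enabled | grep -qxF "$1"
}
```

Matching whole lines avoids a false match when one plugin name is a prefix of another, such as rabbitmq_shovel and rabbitmq_shovel_management.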

michaelklishin · 2020-10-07T19:58:21Z

HTTP API versions of (most of) these checks have shipped in rabbitmq/rabbitmq-management#844.
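
Calling such an HTTP check from a script could look like the sketch below. The /api/health/checks/ path prefix is taken from the management plugin's HTTP API documentation rather than from this thread, and the base URL, credentials, function name http_health_check, and the CURL, MGMT_URL and MGMT_AUTH variables are all assumptions for illustration.

```shell
#!/bin/sh
# Illustrative defaults; adjust to the actual management listener and user.
CURL="${CURL:-curl}"
MGMT_URL="${MGMT_URL:-http://localhost:15672}"
MGMT_AUTH="${MGMT_AUTH:-guest:guest}"

# Succeeds if the named health check endpoint responds with a success
# status. --fail makes curl exit non-zero on HTTP error responses.
http_health_check() {
  "$CURL" --silent --fail --user "$MGMT_AUTH" \
    "$MGMT_URL/api/health/checks/$1" >/dev/null
}
```

Under these assumptions, `http_health_check alarms` would mirror the CLI alarm check by relying on the HTTP status instead of a process exit code.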

michaelklishin added enhancement effort-medium labels Jan 14, 2019

michaelklishin added this to the 3.7.11 milestone Jan 14, 2019

michaelklishin self-assigned this Jan 14, 2019

gerhard mentioned this issue Jan 14, 2019

Add healthcheck docker-library/rabbitmq#300

Closed

michaelklishin mentioned this issue Jan 14, 2019

Add a healthcheck script docker-library/rabbitmq#174

Closed

michaelklishin modified the milestones: 3.7.11, 3.7.12 Jan 16, 2019

michaelklishin changed the title ~~More health check commands~~ More/progressive health check commands Jan 18, 2019

michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this issue Jan 18, 2019

Monitoring guide: add a more extensive overview, section on health ch…

d14ce00

…ecks See rabbitmq/rabbitmq-cli#292 for an overview.

michaelklishin added a commit to rabbitmq/rabbitmq-server that referenced this issue Jan 20, 2019

Introduce rabbit:is_booted/0, is_booting/0

2677553

References rabbitmq/rabbitmq-cli#292.

michaelklishin added a commit to rabbitmq/rabbitmq-server that referenced this issue Jan 20, 2019

Introduce rabbit:is_booted/0, is_booting/0

2ba9e19

References rabbitmq/rabbitmq-cli#292. (cherry picked from commit 2677553)

michaelklishin mentioned this issue Jan 20, 2019

Introduce rabbitmq-diagnostics is_running, is_booting #294

Merged

michaelklishin added a commit that referenced this issue Jan 21, 2019

Introduce rabbitmq-plugins is_enabled [plugin 1] [plugin 2] [...]

5c45a78

Part of #292.

This was referenced Jan 21, 2019

Introduce rabbitmq-plugins is_enabled [plugin 1] [plugin 2] [...] #295

Merged

Introduce rabbitmq-diagnostics alarms #296

Merged

michaelklishin added a commit that referenced this issue Jan 21, 2019

Introduce rabbitmq-plugins is_enabled [plugin 1] [plugin 2] [...]

7044cb8

Part of #292. (cherry picked from commit 5c45a78)

michaelklishin added a commit that referenced this issue Jan 22, 2019

Introduce rabbitmq-diagnostics check[_local]_alarms

41244bd

References #292.

michaelklishin added a commit that referenced this issue Jan 22, 2019

Tests and revisions for 'rabbitmq-diagnostics check[_local]_alarms'

7506407

References #292.

michaelklishin added a commit that referenced this issue Jan 22, 2019

Initial version of the 'rabbitmq-diagnostics listeners' command

5efcb36

Part of #292.

michaelklishin added a commit that referenced this issue Jan 22, 2019

Initial version of the 'rabbitmq-diagnostics listeners' command

6bef934

Part of #292.

michaelklishin mentioned this issue Jan 22, 2019

Introduce rabbitmq-diagnostics listeners #298

Merged

michaelklishin added a commit that referenced this issue Jan 22, 2019

Introduce rabbitmq-diagnostics check[_local]_alarms

f247b4c

References #292.

michaelklishin added a commit that referenced this issue Jan 22, 2019

Tests and revisions for 'rabbitmq-diagnostics check[_local]_alarms'

bb88eff

References #292.

michaelklishin added a commit that referenced this issue Jan 22, 2019

Initial version of the 'rabbitmq-diagnostics listeners' command

b2cbc31

Part of #292. (cherry picked from commit 6bef934)

michaelklishin added a commit that referenced this issue Jan 22, 2019

Introduce rabbitmq-diagnostics check[_local]_alarms

3bd0820

References #292. (cherry picked from commit f247b4c)

michaelklishin added a commit that referenced this issue Jan 22, 2019

Tests and revisions for 'rabbitmq-diagnostics check[_local]_alarms'

cd5ce85

References #292. (cherry picked from commit bb88eff)

michaelklishin added a commit that referenced this issue Jan 23, 2019

Introduce 'rabbitmq-diagnostics check_protocol_listener'

6a09f39

Part of #292.

michaelklishin added a commit that referenced this issue Jan 23, 2019

Introduce 'rabbitmq-diagnostics check_port_listener <port>'

1fb7fd8

Part of #292.

michaelklishin mentioned this issue Jan 23, 2019

Introduces listener check commands #300

Merged

michaelklishin added a commit that referenced this issue Jan 24, 2019

Introduce 'rabbitmq-diagnostics check_protocol_listener'

2032810

Part of #292. (cherry picked from commit 6a09f39)

michaelklishin added a commit that referenced this issue Jan 24, 2019

Introduce 'rabbitmq-diagnostics check_port_listener <port>'

9fa17cd

Part of #292. (cherry picked from commit 1fb7fd8)

michaelklishin modified the milestones: 3.7.12, 3.7.11 Jan 24, 2019

michaelklishin closed this as completed Jan 24, 2019

michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this issue Feb 12, 2019

Monitoring guide: convert to Markdown, recommend rabbitmq-diagnostics

78922f1

now that 3.7.11 has shipped with rabbitmq/rabbitmq-cli#292 in it.

michaelklishin mentioned this issue Apr 10, 2019

More human-friendly output of rabbitmqctl status #340

Closed

rabbitmq locked as off-topic and limited conversation to collaborators Apr 11, 2019
