This repository has been archived by the owner on Nov 18, 2020. It is now read-only.

More/progressive health check commands #292

Closed
12 tasks done
michaelklishin opened this issue Jan 14, 2019 · 7 comments

Comments

@michaelklishin
Member

michaelklishin commented Jan 14, 2019

Yes, even more. Per discussion with @gerhard:

  • There is no single health check command that would be "universal": too many things can go wrong and would be considered a failure by different teams
  • There are node-local and cluster-wide checks, which should be reflected in command names
  • Health checks come in stages (just like human or animal health checks), so we need commands that perform increasingly comprehensive checks with an increasing likelihood of false positives, e.g. node_health_check today checks every channel process, which takes a long time with tens of thousands of channels

Below is a proposal draft that will be refined as we go.

Introduction

Few multi-service systems that use messaging are completely identical, and
different operators consider different things to be within normal parameters.
Team RabbitMQ (and some other folks who work on data services and their automation [1][2]) has therefore long concluded that
there is no such thing as a "one true way to health check" a RabbitMQ node.

The Docker image maintainer community has arrived at a similar conclusion.

Things get even more involved with clusters, since distributed system monitoring,
the level of fault tolerance acceptable for a given system,
and the preferred ways of reacting/recovering can vary greatly from
ops team to ops team.

Another important aspect of node monitoring is how it should be altered during
upgrades. This proposal doesn't cover that part.

Two Types of Health Checks

The proposal is to classify every health check RabbitMQ offers into one of
two categories:

  • Node-local checks
  • Cluster checks

Each category will have a number of checks organized into stages, with
each stage checking progressively more aspects of the system. This means the
probability of false positives will also be higher for higher stages. Which stage
a given deployment uses is up to that system's operators.

Node-local Checks

Stage 1

What rabbitmqctl ping offers today: it ensures that the runtime is running
and (indirectly) that CLI tools can authenticate to it.

This is the most basic check possible. Setting aside the CLI tool authentication
part, the probability of false positives can be considered to approach 0
outside of upgrades and maintenance windows.

Stage 2

Includes all checks in stage 1 plus makes sure that rabbitmqctl status
(well, the function that backs it) succeeds.

This is a common way of sanity checking a node.
The probability of false positives can be considered approaching 0
except for upgrades and maintenance windows.

Stage 3

Includes all checks in stage 2 plus checks that the RabbitMQ application is running
(not stopped/"paused" with rabbitmqctl stop_app or the Pause Minority partition
handling strategy) and there are no resource alarms.

The probability of false positives is generally low but can rise significantly
during upgrades and maintenance windows.

Systems hovering around their max allowed memory usage will have a high
probability of false positives.

Stage 4

Includes all checks in stage 3 plus checks that there are no failing virtual hosts.

The probability of false positives is generally low but can rise significantly
during upgrades and maintenance windows.

Stage 5

Includes all checks in stage 4 plus a check on all enabled listeners
(using a temporary TCP connection).

The probability of false positives is generally low but can rise significantly
during upgrades and maintenance windows.

Stage 6

Includes all checks in stage 5 plus what rabbitmqctl node_health_check
does (it sanity checks every local queue master process and every channel).

The probability of false positives is moderate for systems under
above-average load or with a large number of queues and channels
(starting with tens of thousands).
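
The node-local stages above could be driven by a small wrapper along these lines. This is only a sketch: the proposal names ping, status and node_health_check explicitly, while the mapping of stages 3 to 5 onto check_running, check_local_alarms, check_virtual_hosts and check_port_connectivity is an assumption about which rabbitmq-diagnostics commands cover them. The DIAG variable and both function names are hypothetical.

```shell
#!/bin/sh
# Sketch: run node-local checks up to a chosen stage, stopping at the first failure.
# ASSUMPTION: the stage-to-command mapping is illustrative, not part of the proposal.
DIAG="${DIAG:-rabbitmq-diagnostics}"

# Print the check command names that belong to a single stage.
stage_checks() {
  case "$1" in
    1) echo "ping" ;;
    2) echo "status" ;;
    3) echo "check_running check_local_alarms" ;;
    4) echo "check_virtual_hosts" ;;
    5) echo "check_port_connectivity" ;;
    6) echo "node_health_check" ;;
    *) return 1 ;;
  esac
}

# Run stages 1..N in order. Returns 0 if all pass, otherwise the number
# of the first stage that failed.
run_node_checks() {
  max_stage="$1"
  stage=1
  while [ "$stage" -le "$max_stage" ]; do
    for check in $(stage_checks "$stage"); do
      if ! $DIAG -q "$check" >/dev/null 2>&1; then
        echo "stage $stage failed: $check" >&2
        return "$stage"
      fi
    done
    stage=$((stage + 1))
  done
  return 0
}
```

An operator who accepts the false positive rate of stage 4 would call `run_node_checks 4`; the exit status then identifies the first failing stage.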

Optional Check 1

Includes all checks in stage 4 plus checks that an expected set of plugins is
enabled.

The probability of false positives is generally low but can rise significantly
during upgrades and maintenance windows, depending on the deployment
tools/strategies used (e.g. all plugins can be temporarily disabled).
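
One possible shape for the plugin part of this optional check is sketched below, assuming it is built on the exit code of rabbitmq-plugins is_enabled (the command discussed later in this thread). The PLUGINS_CLI variable and the function name are hypothetical.

```shell
#!/bin/sh
# Sketch: succeed only if every plugin passed as an argument is enabled.
# ASSUMPTION: "rabbitmq-plugins -q is_enabled <plugin>" exits non-zero
# when the plugin is not enabled.
PLUGINS_CLI="${PLUGINS_CLI:-rabbitmq-plugins}"

check_expected_plugins() {
  missing=""
  for plugin in "$@"; do
    if ! $PLUGINS_CLI -q is_enabled "$plugin" >/dev/null 2>&1; then
      missing="$missing $plugin"
    fi
  done
  if [ -n "$missing" ]; then
    echo "not enabled:$missing" >&2
    return 1
  fi
  return 0
}
```

For example, `check_expected_plugins rabbitmq_peer_discovery_aws rabbitmq_peer_discovery_common` reports every missing plugin at once rather than stopping at the first one.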

Cluster Checks

Stage 1

Checks for the expected number of nodes in a cluster.

The probability of false positives can be considered approaching 0.
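
A node count check might look roughly like the sketch below. The proposal does not say how the number of nodes is obtained, so count_running_nodes is a hypothetical helper; the rabbitmqctl eval expression inside it is an assumption and can be swapped for whatever source of truth a deployment prefers.

```shell
#!/bin/sh
# Sketch: compare the number of running cluster nodes with an expected value.

# ASSUMPTION: this prints the number of running nodes as a bare integer.
# Replace the body if your tooling exposes the count differently.
count_running_nodes() {
  rabbitmqctl -q eval 'length(rabbit_mnesia:cluster_nodes(running)).'
}

# Returns 0 when the actual count equals the expected one, 1 when it
# differs, and 2 when the count could not be obtained.
check_node_count() {
  expected="$1"
  actual="$(count_running_nodes)" || return 2
  [ "$actual" -eq "$expected" ]
}
```

Usage would be `check_node_count 3` on any node of a three-node cluster.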

Stage 2

Checks for network partitions detected by a node.

The probability of false positives is a function of the partition
detection algorithm used. With a timer-based strategy it is moderate
(say, within the (0, 0.2] range). With adaptive accrual failure detectors it is lower (according to our team's anecdotal evidence).

Tasks

rabbitmq-diagnostics check_virtual_hosts was extracted into a separate issue scheduled for 3.7.12.

  1. Add a healthcheck script docker-library/rabbitmq#174 (comment)
  2. HEALTHCHECK directive in Dockerfile docker-library/cassandra#76 (comment)
@michaelklishin michaelklishin added this to the 3.7.11 milestone Jan 14, 2019
@michaelklishin michaelklishin self-assigned this Jan 14, 2019
@michaelklishin michaelklishin modified the milestones: 3.7.11, 3.7.12 Jan 16, 2019
@michaelklishin michaelklishin changed the title More health check commands More/progressive health check commands Jan 18, 2019
michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this issue Jan 18, 2019
@michaelklishin
Member Author

The first step towards addressing this was to come up with a number of health checks that our team agrees are a reasonable range of options, and to document them.

michaelklishin added a commit to rabbitmq/rabbitmq-server that referenced this issue Jan 20, 2019
It joins the club of is_booted/1 and is_running/{0, 1}.
This allows for a CLI command that checks if the node is still
booting.

References rabbitmq/rabbitmq-cli#292.
michaelklishin added a commit to rabbitmq/rabbitmq-server that referenced this issue Jan 20, 2019
michaelklishin added a commit to rabbitmq/rabbitmq-server that referenced this issue Jan 20, 2019
It joins the club of is_booted/1 and is_running/{0, 1}.
This allows for a CLI command that checks if the node is still
booting.

References rabbitmq/rabbitmq-cli#292.

(cherry picked from commit c5ae45e)
michaelklishin added a commit to rabbitmq/rabbitmq-server that referenced this issue Jan 20, 2019
michaelklishin added a commit that referenced this issue Jan 21, 2019
michaelklishin added a commit that referenced this issue Jan 22, 2019
michaelklishin added a commit that referenced this issue Jan 22, 2019
michaelklishin added a commit that referenced this issue Jan 22, 2019
michaelklishin added a commit that referenced this issue Jan 22, 2019
References #292.

(cherry picked from commit f247b4c)
michaelklishin added a commit that referenced this issue Jan 22, 2019
michaelklishin added a commit that referenced this issue Jan 23, 2019
…ands

This means that even in the "negative" response they exit with a 0
status code. In other words, they just tell the user the state of things
without asserting on what constitutes a success or failure.

This is consistent with some recently introduced diagnostics commands:
some are "informational" (simply provide an insight into the state
of the node) and others are checks (opinionated, consider certain
conditions to be faulty and exit with a non-zero exit code).

References #292.
michaelklishin added a commit that referenced this issue Jan 23, 2019
…ands

This means that even in the "negative" response they exit with a 0
status code. In other words, they just tell the user the state of things
without asserting on what constitutes a success or failure.

This is consistent with some recently introduced diagnostics commands:
some are "informational" (simply provide an insight into the state
of the node) and others are checks (opinionated, consider certain
conditions to be faulty and exit with a non-zero exit code).

References #292.

(cherry picked from commit 8482380)
michaelklishin added a commit that referenced this issue Jan 24, 2019
michaelklishin added a commit that referenced this issue Jan 24, 2019
@michaelklishin michaelklishin modified the milestones: 3.7.12, 3.7.11 Jan 24, 2019
michaelklishin added a commit to rabbitmq/rabbitmq-website that referenced this issue Feb 12, 2019
@kitchen

kitchen commented Apr 10, 2019

hmm.

I seem to be having trouble with this:

# rabbitmq-plugins list
Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rabbit@ip-10-182-95-144
 |/
[  ] rabbitmq_amqp1_0                  3.7.14
[  ] rabbitmq_auth_backend_cache       3.7.14
[  ] rabbitmq_auth_backend_http        3.7.14
[  ] rabbitmq_auth_backend_ldap        3.7.14
[  ] rabbitmq_auth_mechanism_ssl       3.7.14
[  ] rabbitmq_consistent_hash_exchange 3.7.14
[  ] rabbitmq_event_exchange           3.7.14
[  ] rabbitmq_federation               3.7.14
[  ] rabbitmq_federation_management    3.7.14
[  ] rabbitmq_jms_topic_exchange       3.7.14
[  ] rabbitmq_management               3.7.14
[  ] rabbitmq_management_agent         3.7.14
[  ] rabbitmq_mqtt                     3.7.14
[E*] rabbitmq_peer_discovery_aws       3.7.14
[e*] rabbitmq_peer_discovery_common    3.7.14
[  ] rabbitmq_peer_discovery_consul    3.7.14
[  ] rabbitmq_peer_discovery_etcd      3.7.14
[  ] rabbitmq_peer_discovery_k8s       3.7.14
[  ] rabbitmq_random_exchange          3.7.14
[  ] rabbitmq_recent_history_exchange  3.7.14
[  ] rabbitmq_sharding                 3.7.14
[  ] rabbitmq_shovel                   3.7.14
[  ] rabbitmq_shovel_management        3.7.14
[  ] rabbitmq_stomp                    3.7.14
[  ] rabbitmq_top                      3.7.14
[  ] rabbitmq_tracing                  3.7.14
[  ] rabbitmq_trust_store              3.7.14
[  ] rabbitmq_web_dispatch             3.7.14
[  ] rabbitmq_web_mqtt                 3.7.14
[  ] rabbitmq_web_mqtt_examples        3.7.14
[  ] rabbitmq_web_stomp                3.7.14
[  ] rabbitmq_web_stomp_examples       3.7.14

If I try, say, rabbitmq_top:

root@ip-10-182-95-144:~# rabbitmq-plugins -q is_enabled rabbitmq_top ; echo $?
Plugin rabbitmq_top is not enabled on node rabbit@ip-10-182-95-144. Enabled plugins and dependencies: rabbitmq_aws, rabbitmq_peer_discovery_aws, rabbitmq_peer_discovery_common
0
# rabbitmq-plugins -q is_enabled rabbitmq_peer_discovery_aws; echo $?
Plugin rabbitmq_peer_discovery_aws is enabled on node rabbit@ip-10-182-95-144
0

Am I doing something wrong, or making a wrong assumption? I would expect the first one to exit non-zero, since the plugin isn't enabled.

What I'm trying to do is use is_enabled in an idempotency condition in a chef recipe to see if the plugin is enabled or not, and it doesn't seem like I can do that here without parsing some output somewhere.

Thanks!

@gerhard
Contributor

gerhard commented Apr 11, 2019

Your expectations are correct: a non-zero exit code for a plugin that is not enabled is exactly how rabbitmq-plugins is_enabled is supposed to behave. This is the behaviour that I am seeing:

docker run -it --rm --name is_enabled rabbitmq:3.7.14

# in a new shell
docker exec -it is_enabled bash

rabbitmq-plugins list
Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rabbit@b54e36bb3b4f
 |/
[  ] rabbitmq_amqp1_0                  3.7.14
[  ] rabbitmq_auth_backend_cache       3.7.14
[  ] rabbitmq_auth_backend_http        3.7.14
[  ] rabbitmq_auth_backend_ldap        3.7.14
[  ] rabbitmq_auth_mechanism_ssl       3.7.14
[  ] rabbitmq_consistent_hash_exchange 3.7.14
[  ] rabbitmq_event_exchange           3.7.14
[  ] rabbitmq_federation               3.7.14
[  ] rabbitmq_federation_management    3.7.14
[  ] rabbitmq_jms_topic_exchange       3.7.14
[  ] rabbitmq_management               3.7.14
[  ] rabbitmq_management_agent         3.7.14
[  ] rabbitmq_mqtt                     3.7.14
[  ] rabbitmq_peer_discovery_aws       3.7.14
[  ] rabbitmq_peer_discovery_common    3.7.14
[  ] rabbitmq_peer_discovery_consul    3.7.14
[  ] rabbitmq_peer_discovery_etcd      3.7.14
[  ] rabbitmq_peer_discovery_k8s       3.7.14
[  ] rabbitmq_random_exchange          3.7.14
[  ] rabbitmq_recent_history_exchange  3.7.14
[  ] rabbitmq_sharding                 3.7.14
[  ] rabbitmq_shovel                   3.7.14
[  ] rabbitmq_shovel_management        3.7.14
[  ] rabbitmq_stomp                    3.7.14
[  ] rabbitmq_top                      3.7.14
[  ] rabbitmq_tracing                  3.7.14
[  ] rabbitmq_trust_store              3.7.14
[  ] rabbitmq_web_dispatch             3.7.14
[  ] rabbitmq_web_mqtt                 3.7.14
[  ] rabbitmq_web_mqtt_examples        3.7.14
[  ] rabbitmq_web_stomp                3.7.14
[  ] rabbitmq_web_stomp_examples       3.7.14

rabbitmq-plugins enable rabbitmq_peer_discovery_aws
Enabling plugins on node rabbit@b54e36bb3b4f:
rabbitmq_peer_discovery_aws
The following plugins have been configured:
  rabbitmq_peer_discovery_aws
  rabbitmq_peer_discovery_common
Applying plugin configuration to rabbit@b54e36bb3b4f...
The following plugins have been enabled:
  rabbitmq_peer_discovery_aws
  rabbitmq_peer_discovery_common
started 2 plugins.

rabbitmq-plugins list
Listing plugins with pattern ".*" ...
 Configured: E = explicitly enabled; e = implicitly enabled
 | Status: * = running on rabbit@b54e36bb3b4f
 |/
[  ] rabbitmq_amqp1_0                  3.7.14
[  ] rabbitmq_auth_backend_cache       3.7.14
[  ] rabbitmq_auth_backend_http        3.7.14
[  ] rabbitmq_auth_backend_ldap        3.7.14
[  ] rabbitmq_auth_mechanism_ssl       3.7.14
[  ] rabbitmq_consistent_hash_exchange 3.7.14
[  ] rabbitmq_event_exchange           3.7.14
[  ] rabbitmq_federation               3.7.14
[  ] rabbitmq_federation_management    3.7.14
[  ] rabbitmq_jms_topic_exchange       3.7.14
[  ] rabbitmq_management               3.7.14
[  ] rabbitmq_management_agent         3.7.14
[  ] rabbitmq_mqtt                     3.7.14
[E*] rabbitmq_peer_discovery_aws       3.7.14
[e*] rabbitmq_peer_discovery_common    3.7.14
[  ] rabbitmq_peer_discovery_consul    3.7.14
[  ] rabbitmq_peer_discovery_etcd      3.7.14
[  ] rabbitmq_peer_discovery_k8s       3.7.14
[  ] rabbitmq_random_exchange          3.7.14
[  ] rabbitmq_recent_history_exchange  3.7.14
[  ] rabbitmq_sharding                 3.7.14
[  ] rabbitmq_shovel                   3.7.14
[  ] rabbitmq_shovel_management        3.7.14
[  ] rabbitmq_stomp                    3.7.14
[  ] rabbitmq_top                      3.7.14
[  ] rabbitmq_tracing                  3.7.14
[  ] rabbitmq_trust_store              3.7.14
[  ] rabbitmq_web_dispatch             3.7.14
[  ] rabbitmq_web_mqtt                 3.7.14
[  ] rabbitmq_web_mqtt_examples        3.7.14
[  ] rabbitmq_web_stomp                3.7.14
[  ] rabbitmq_web_stomp_examples       3.7.14

rabbitmq-plugins -q is_enabled rabbitmq_top ; echo $?
Plugin rabbitmq_top is not enabled on node rabbit@b54e36bb3b4f. Enabled plugins and dependencies: rabbitmq_aws, rabbitmq_peer_discovery_aws, rabbitmq_peer_discovery_common
69

rabbitmq-plugins -q is_enabled rabbitmq_peer_discovery_aws; echo $?
Plugin rabbitmq_peer_discovery_aws is enabled on node rabbit@b54e36bb3b4f
0

Given the above, rabbitmq-plugins is_enabled behaves as expected in v3.7.14 on Docker. I would take a closer look at your environment. How did you install RabbitMQ on AWS? Which OS are you using? Are all file permissions correct? How did you enable rabbitmq_peer_discovery_aws in the first place? What does your enabled_plugins file look like?

I am intrigued as to what could possibly make rabbitmq-plugins is_enabled behave differently in your case 🤔

@michaelklishin
Member Author

This is mailing list material.

@rabbitmq rabbitmq locked as off-topic and limited conversation to collaborators Apr 11, 2019
@michaelklishin
Member Author

There is no need to parse any output. rabbitmq-plugins is_enabled, like all checks, uses exit codes to communicate success or error. I'm also not sure why rabbitmq-plugins enable is not idempotent enough, in particular in --offline mode. Again, this is a great topic for a rabbitmq-users thread.
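
As a rough illustration of relying on exit codes alone, an "enable only if not already enabled" guard could be sketched as follows. PLUGINS_CLI and ensure_plugin_enabled are hypothetical names, and the sketch assumes is_enabled exits non-zero for a plugin that is not enabled, as in the Docker run shown above.

```shell
#!/bin/sh
# Sketch: idempotent plugin enablement driven purely by exit codes.
PLUGINS_CLI="${PLUGINS_CLI:-rabbitmq-plugins}"

# Prints "unchanged" if the plugin was already enabled, "enabled" if this
# call enabled it. Returns non-zero if enabling failed.
ensure_plugin_enabled() {
  plugin="$1"
  if $PLUGINS_CLI -q is_enabled "$plugin" >/dev/null 2>&1; then
    echo "unchanged"
  else
    $PLUGINS_CLI -q enable "$plugin" >/dev/null 2>&1 && echo "enabled"
  fi
}
```

A configuration management tool can use the same is_enabled call directly as its guard condition, with no output parsing involved.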

@michaelklishin
Member Author

The RabbitMQ Chef cookbook uses rabbitmq-plugins list --silent --minimal --enabled and grep. It predates this PR but is not meaningfully different, at least as far as Chef LWRPs go.

@michaelklishin
Member Author

HTTP API versions of (most of) these checks have shipped in rabbitmq/rabbitmq-management#844.
