More/progressive health check commands #292
…ecks See rabbitmq/rabbitmq-cli#292 for an overview.
First step towards addressing this was to come up with a number of health checks our team agrees on as a reasonable range of options and document it.
It joins the club of is_booted/1 and is_running/{0, 1}. This allows for a CLI command that checks if the node is still booting. References rabbitmq/rabbitmq-cli#292.
It joins the club of is_booted/1 and is_running/{0, 1}. This allows for a CLI command that checks if the node is still booting. References rabbitmq/rabbitmq-cli#292. (cherry picked from commit c5ae45e)
References rabbitmq/rabbitmq-cli#292. (cherry picked from commit 2677553)
…ands This means that even in the "negative" case they exit with a 0 status code. In other words, they just tell the user the state of things without asserting on what constitutes a success or failure. This is consistent with some recently introduced diagnostics commands: some are "informational" (simply provide an insight into the state of the node) and others are checks (opinionated: they consider certain conditions to be faulty and exit with a non-zero exit code). References #292.
…ands This means that even in the "negative" case they exit with a 0 status code. In other words, they just tell the user the state of things without asserting on what constitutes a success or failure. This is consistent with some recently introduced diagnostics commands: some are "informational" (simply provide an insight into the state of the node) and others are checks (opinionated: they consider certain conditions to be faulty and exit with a non-zero exit code). References #292. (cherry picked from commit 8482380)
now that 3.7.11 has shipped with rabbitmq/rabbitmq-cli#292 in it.
hmm. I seem to be having trouble with this:
If I try, say,
Am I doing something wrong, or making the wrong assumptions? I would expect the first one to exit non-zero, since the plugin isn't enabled. What I'm trying to do is use … Thanks!
Your expectations are correct, this is exactly how
Given the above, I am intrigued as to what could possibly make
This is mailing list material.
There is no need to parse any output.
The RabbitMQ Chef cookbook uses
HTTP API versions of (most of) these checks have shipped in rabbitmq/rabbitmq-management#844.
Yes, even more. Per discussion with @gerhard:
node_health_check today checks every channel process, which takes a long time with 10s of thousands of channels.

Below is a proposal draft that will be refined as we go.
Introduction
Since relatively few multi-service systems that use messaging can be considered
completely identical, and different operators consider different things to be
within normal parameters, team RabbitMQ (and some other folks who work on data services and their automation [1][2]) has long concluded that
there is no such thing as a "one true way to health check" a RabbitMQ node.
The Docker image maintainer community has arrived at a similar conclusion.
Things get even more involved with clusters since distributed system monitoring,
the level of fault tolerance acceptable for a given system,
and preferred ways of reacting/recovering can vary greatly from
ops team to ops team.
Another important aspect of node monitoring is how it should be altered during
upgrades. This proposal doesn't cover that part.
Two Types of Health Checks
The proposal is to classify every health check RabbitMQ offers into one of
two categories:
Each category will have a number of checks organized into stages, with
progressively more aspects of the system checked. This means the probability
of false positives for higher stages will also be higher. Which stage
is used by a given deployment is a choice of that system's operators.
Node-local Checks
Stage 1
What
rabbitmqctl ping
offers today: it ensures that the runtime is running and (indirectly) that CLI tools can authenticate to it.
This is the most basic check possible. Except for the CLI tool authentication
part, the probability of false positives can be considered approaching 0
except for upgrades and maintenance windows.
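A stage 1 probe boils down to inspecting the exit code of the ping command. A minimal sketch follows; the `check_stage1` wrapper is illustrative (not part of the CLI), and `true` stands in for `rabbitmqctl ping` so the sketch runs without a live node:

```shell
#!/bin/sh
# Illustrative stage 1 probe: run the ping command passed as arguments
# and translate its exit code into a human-readable line plus our own
# exit code.
check_stage1() {
  if "$@" >/dev/null 2>&1; then
    echo "stage1: ok"
  else
    echo "stage1: node unreachable"
    return 1
  fi
}

# `true` stands in for `rabbitmqctl ping`; in a deployment you would
# call: check_stage1 rabbitmqctl ping
check_stage1 true
```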
Stage 2
Includes all checks in stage 1 plus makes sure that
rabbitmqctl status
(well, the function that backs it) succeeds.
This is a common way of sanity checking a node.
The probability of false positives can be considered approaching 0
except for upgrades and maintenance windows.
Stage 3
Includes all checks in stage 2 plus checks that the RabbitMQ application is running
(not stopped/"paused" with
rabbitmqctl stop_app
or the Pause Minority partition handling strategy) and there are no resource alarms.
The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly.
Systems hovering around their max allowed memory usage will have a high
probability of false positives.
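Since each stage subsumes the previous ones, a probe can be expressed as a conjunction of checks run in order, failing fast on the first broken one. A sketch, with the `run_stage` helper being illustrative and `true` standing in for the real per-check commands (ping, status, app running, no resource alarms):

```shell
#!/bin/sh
# Illustrative cumulative probe: a stage N check runs checks 1..N in
# order and stops at the first failure.
run_stage() {
  stage=$1; shift
  n=1
  for check in "$@"; do
    if [ "$n" -gt "$stage" ]; then break; fi
    if ! $check; then
      echo "failed at check $n"
      return 1
    fi
    n=$((n + 1))
  done
  echo "stage $stage: ok"
}

# Each `true` stands in for one real check command; replace them with
# the corresponding rabbitmqctl / rabbitmq-diagnostics invocations.
run_stage 3 true true true true
```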
Stage 4
Includes all checks in stage 3 plus checks that there are no failing virtual hosts.
The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly.
Stage 5
Includes all checks in stage 4 plus a check on all enabled listeners
(using a temporary TCP connection).
The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly.
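The stage 5 listener sweep can be sketched as a loop over enabled listeners, each probed with a temporary TCP connection. In the sketch below, the sample text stands in for real `rabbitmq-diagnostics listeners` output, and `probe` stands in for an actual TCP dial (e.g. `nc -z -w 2 "$host" "$port"`):

```shell
#!/bin/sh
# Illustrative stage 5 sweep over enabled listeners.
listeners="amqp 5672
http 15672
clustering 25672"

probe() {
  true  # stand-in: always succeeds; replace with a real TCP connection
}

check_listeners() {
  echo "$listeners" | while read -r proto port; do
    if probe "$port"; then
      echo "listener $proto on $port: ok"
    else
      echo "listener $proto on $port: NOT accepting connections"
    fi
  done
}

check_listeners
```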
Stage 6
Includes all checks in stage 5 plus what
rabbitmqctl node_health_check
does (it sanity checks every local queue master process and every channel).
The probability of false positives is moderate for systems under
above average load or with a large number of queues and channels
(starting with 10s of thousands).
Optional Check 1
Includes all checks in stage 4 plus checks that an expected set of plugins is
enabled.
The probability of false positives is generally low but during upgrades and
maintenance windows can rise significantly depending on the deployment
tools/strategies used (e.g. all plugins can be temporarily disabled).
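The plugin check amounts to a set difference between the expected and the enabled plugin lists. A sketch, where the `enabled` variable stands in for the output of `rabbitmq-plugins list --enabled --minimal` (exact flags may vary by RabbitMQ version):

```shell
#!/bin/sh
# Illustrative plugin-set check: report plugins that are expected but
# not enabled.
expected="rabbitmq_management
rabbitmq_shovel"
enabled="rabbitmq_management"

exp_file=$(mktemp); got_file=$(mktemp)
printf '%s\n' "$expected" | sort > "$exp_file"
printf '%s\n' "$enabled"  | sort > "$got_file"
# comm -23: lines in the expected set that are absent from the enabled set
missing=$(comm -23 "$exp_file" "$got_file")
rm -f "$exp_file" "$got_file"

if [ -n "$missing" ]; then
  echo "missing plugins: $missing"
else
  echo "plugins: ok"
fi
```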
Cluster Checks
Stage 1
Checks for the expected number of nodes in a cluster.
The probability of false positives can be considered approaching 0.
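This check is a simple count comparison. A sketch, where `nodes` stands in for a node list extracted from `rabbitmqctl cluster_status` output and `expected_count` is operator-supplied:

```shell
#!/bin/sh
# Illustrative cluster stage 1 check: compare observed vs expected
# node counts.
expected_count=3
nodes="rabbit@host1
rabbit@host2"

actual_count=$(printf '%s\n' "$nodes" | wc -l | tr -d '[:space:]')
if [ "$actual_count" -eq "$expected_count" ]; then
  echo "cluster: ok ($actual_count nodes)"
else
  echo "cluster: expected $expected_count nodes, found $actual_count"
fi
```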
Stage 2
Checks for network partitions detected by a node.
The probability of false positives is a function of the partition
detection algorithm used. With a timer-based strategy it is moderate
(say, within the (0, 0.2] range). With adaptive accrual failure detectors it is lower (according to our team's anecdotal evidence).
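Mechanically, this check only needs to assert that the node's detected-partitions list is empty. A sketch, where the argument stands in for the partitions section of `rabbitmqctl cluster_status` output:

```shell
#!/bin/sh
# Illustrative cluster stage 2 check: an empty partition list is
# healthy; anything else is reported and fails the check.
check_partitions() {
  if [ -z "$1" ]; then
    echo "partitions: none detected"
  else
    echo "partitions detected: $1"
    return 1
  fi
}

check_partitions ""   # empty list: healthy
```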
Tasks
rabbitmq-diagnostics alarms (#296)
rabbitmq-diagnostics listeners (#298)
rabbitmq-diagnostics check_virtual_hosts was extracted into a separate issue scheduled for 3.7.12.