Skip to content

Monitor and remediate unhealthy Docker containers

License

Notifications You must be signed in to change notification settings

JourneyDocker/docker-autoheal

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Docker-Autoheal

GitHubRelease DockerPublishing DockerSize DockerPulls

A cross-platform tool to monitor and remediate unhealthy Docker containers

Written in Rust and designed to be OS agnostic, flexible, and performant in large environments via concurrency and multi-threading

The docker-autoheal binary may be executed in a native OS or from a Docker container

How to Use

You must first apply HEALTHCHECK to your docker images

Environment Variables

Variable Default Description
AUTOHEAL_CONNECTION_TYPE local This determines how docker-autoheal connects to Docker (One of: local, socket, http, ssl
AUTOHEAL_STOP_TIMEOUT 10 Docker waits n seconds for a container to stop before killing it during restarts (override via label; see below)
AUTOHEAL_INTERVAL 5 Check container health every n seconds
AUTOHEAL_START_DELAY 0 Wait n seconds before first health check
AUTOHEAL_POST_ACTION The absolute path of an executable to be run after restart attempts; container name, id and stop-timeout are passed as arguments in that order
AUTOHEAL_MONITOR_ALL FALSE Set to TRUE to simply monitor all containers on the host or leave as FALSE and control via autoheal.monitor.enable
AUTOHEAL_LOG_ALL FALSE Allow (TRUE/FALSE) logging (and webhook/apprise if set) for containers with autostart.restart.enable=FALSE
AUTOHEAL_LOG_PERSIST FALSE Allow (TRUE/FALSE) external persistent logging and reporting of historical data
AUTOHEAL_TCP_HOST localhost Address of Docker host
AUTOHEAL_TCP_PORT 2375 (ssl: 2376) Port on which to connect to the Docker host
AUTOHEAL_TCP_TIMEOUT 10 Time in n seconds before failing connection attempt
AUTOHEAL_PEM_PATH /opt/docker-autoheal/tls Absolute path to requisite ssl certificate files (key.pem, cert.pem, ca.pem) when AUTOHEAL_CONNECTION_TYPE=ssl
AUTOHEAL_APPRISE_URL URL to post messages to the apprise following actions on unhealthy container
AUTOHEAL_WEBHOOK_KEY KEY to post messages to the webhook following actions on unhealthy container
AUTOHEAL_WEBHOOK_URL URL to post messages to the webhook following actions on unhealthy container

Optional Container Labels

Label Default Description
autoheal.stop.timeout Per container override (in seconds) of AUTOHEAL_STOP_TIMEOUT during restart (e.g. Some container routinely takes longer to cleanly exit)
autoheal.monitor.enable FALSE Per container override (true/false) to control if should be monitored (e.g. If you have a large number of containers that you wish to monitor and restart, apply this label as FALSE to the few that you do not wish to monitor and set AUTOHEAL_MONITOR_ALL to TRUE)
autoheal.restart.enable TRUE Per container override (true/false) to control if should restart on unhealthy (e.g. If you have a large number of containers that you wish to monitor and restart, apply this label as FALSE to the few that you do not wish to restart and set AUTOHEAL_MONITOR_ALL to TRUE)

Binary Options

Used when executed in native OS (NOTE: The environment variables are also accepted)

Options:
    -a, --apprise-url <APPRISE_URL>
                        The apprise url
    -c, --connection-type <CONNECTION_TYPE>
                        One of local, socket, http, or ssl
    -d, --start-delay <START_DELAY>
                        Time in seconds to wait for first check
    -h, --help          Print help
    -i, --interval <INTERVAL>
                        Time in seconds to check health
    -j, --webhook-key <WEBHOOK_KEY>
                        The webhook json key string
    -k, --key-path <KEY_PATH>
                        The absolute path to requisite ssl PEM files
    -l, --log-all       Enable logging of unhealthy containers where restart
                        is disabled (WARNING, this could be chatty)
    -m, --monitor-all   Enable monitoring off all containers that have a
                        healthcheck
    -n, --tcp-host <TCP_HOST>
                        The hostname or IP address of the Docker host (when -c
                        http or ssl)
    -p, --tcp-port <TCP_PORT>
                        The tcp port number of the Docker host (when -c http
                        or ssl)
    -s, --stop-timeout <STOP_TIMEOUT>
                        Time in seconds to wait for action to complete
    -t, --tcp-timeout <TCP_TIMEOUT>
                        Time in seconds to wait for connection to complete
    -w, --webhook-url <WEBHOOK_URL>
                        The webhook url
    -L, --log-persist Enable external persistent logging and reporting of historical
                        data
    -P, --post-action <SCRIPT_PATH>
                        The absolute path to a script that should be executed
                        after container restart
    -V, --version       Print version information

Local

/usr/local/bin/docker-autoheal --monitor-all --log_persist > /var/log/docker-autoheal.log &

Will connect to the local Docker host, monitor all containers, and generate a persistent log at /opt/docker-autoheal/log.json

Socket

docker run -d --read-only \
    --user=[uid]:[gid]
    --name docker-autoheal \
    --network=none \
    --restart=always \
    --env="AUTOHEAL_CONNECTION_TYPE=socket" \
    --env="AUTOHEAL_MONITOR_ALL=true" \
    --env="AUTOHEAL_LOG_PERSIST=true" \
    --volume=/var/run/docker.sock:/var/run/docker.sock:ro \
    --volume=/opt/docker-autoheal/log.json:/opt/docker-autoheal/log.json:rw \
    journeyover/docker-autoheal:latest

Will connect to the Docker host via unix socket location /var/run/docker.sock or Windows named pipe location //./pipe/docker_engine, monitor all containers, and write persistent log data to /opt/docker-autoheal/log.json as the user with the specified uid:gid

HTTP

docker run -d --read-only \
    --user=[uid]:[gid]
    --name docker-autoheal \
    --restart=always \
    --env="AUTOHEAL_CONNECTION_TYPE=http" \
    --env="AUTOHEAL_TCP_HOST=MYHOST" \
    --env="AUTOHEAL_TCP_PORT=2375" \
    --env="AUTOHEAL_LOG_PERSIST=true" \
    --volume=/opt/docker-autoheal/log.json:/opt/docker-autoheal/log.json:rw \
    journeyover/docker-autoheal:latest

Will connect to the Docker host via hostname or IP and the specified port, monitor only containers with a label autoheal.monitor.enable=true, and write persistent log data to /opt/docker-autoheal/log.json as the user with the specified uid:gid

Logging

2024-01-23 03:03:23-0500 [WARNING] [nordvpn] Container (886d37fd9f5c) is unhealthy with 3 failures
2024-01-23 03:03:23-0500 [WARNING] [nordvpn] Container (886d37fd9f5c) last output: [4] Status: Unstable
2024-01-23 03:03:23-0500 [WARNING] [nordvpn] Restarting container (886d37fd9f5c) with 10s timeout
2024-01-23 03:03:34-0500 [   INFO] [nordvpn] Restart of container (886d37fd9f5c) was successful
2024-01-23 03:03:34-0500 [   INFO] [nordvpn] Container (886d37fd9f5c) has been unhealthy 1 time
2024-01-23 03:04:48-0500 [WARNING] [privoxy] Container (74f74eb7b2d0) is unhealthy with 3 failures
2024-01-23 03:04:48-0500 [WARNING] [privoxy] Container (74f74eb7b2d0) last output: [-1] Health check exceeded timeout (3s)
2024-01-23 03:04:48-0500 [WARNING] [privoxy] Restarting container (74f74eb7b2d0) with 10s timeout
2024-01-23 03:04:59-0500 [   INFO] [privoxy] Restart of container (74f74eb7b2d0) was successful
2024-01-23 03:04:59-0500 [   INFO] [privoxy] Container (74f74eb7b2d0) has been unhealthy 1 time

Example output when docker-autoheal is in action

Persistent Logging

Examples of working with log.json:

jq -s 'group_by(.name) | map({name: .[0].name, data: (group_by(.id) | map({id: .[0].id, data: .}))})' /opt/docker-autoheal/log.json

Group all entries by name and then group by container id

jq -s 'map(select(.name=="privoxy"))' /opt/docker-autoheal/log.json

Find all occurrences of 'privoxy'

jq -s 'map(select(.name=="privoxy")) | group_by(.name) | map({name: .[0].name, data: (group_by(.id) | map({id: .[0].id, data: .}))})' /opt/docker-autoheal/log.json

Find all occurrences of 'privoxy' and group by container id

Other Info

Docker Labels

a) Apply the label autoheal.monitor.enable=true to your container to have it watched

OR

b) Set ENV AUTOHEAL_MONITOR_ALL=true (or apply --monitor-all to the binary) to watch all running containers

SSL Connection Type

See https://docs.docker.com/engine/security/https/ for how to configure TCP with mTLS

The certificates and keys need these names:

  • ca.pem
  • cert.pem
  • key.pem

Docker Security

Additional security can be obtained by:

  • Use a unique user for monitoring and remediating
    • Create a new user
    • Add that user to the docker group
    • Execute the binary or docker container with that uid:gid
  • Run docker in rootless mode

Docker Timezone

If you need the docker-autoheal container timezone to match the local machine, you can map /etc/localtime

docker run ... -v /etc/localtime:/etc/localtime:ro

Webhook/Apprise

  • The payload includes the following separated by |: Docker system hostname, the last health output, and the result of restart action

A Word of Caution about Excluding from Restart and Logging of those Exclusions

  • Excluding a container from restarts and enabling logging for excluded containers will generate numerous log messages whenever that container becomes unhealthy
  • Additionally, if a webhook or apprise is also configured, they will be executed at each monitoring interval for those containers

Credits

About

Monitor and remediate unhealthy Docker containers

Resources

License

Stars

Watchers

Forks

Packages

 
 
 

Languages

  • Rust 98.0%
  • Dockerfile 2.0%