deadman-switch

This tool implements a deadman switch for software systems. It expects regular pings from each configured service and calls one or more webhooks when a service has not pinged within a configurable timeout.
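
The core check behind a deadman switch can be pictured as follows. This is only a minimal sketch of the idea, not the project's implementation: the helper name `is_overdue` and the use of Unix timestamps in seconds are assumptions made for illustration.

```shell
# Illustrative only: is_overdue is a hypothetical helper, not part of deadman-switch.
# Arguments: $1 = time of last ping, $2 = timeout, $3 = current time (all in seconds).
# Succeeds (exit status 0) when the service has been silent for longer than the timeout.
is_overdue() {
  [ $(( $3 - $1 )) -gt "$2" ]
}

# Example: last ping at t=100, timeout of 30 seconds, checked at t=131.
if is_overdue 100 30 131; then
  echo "service missed its deadline, notifications would fire"
fi
```

Whether the real service treats a ping that arrives exactly at the deadline as late is not stated here; the sketch assumes it does not.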

Features

  • alert you when your services are down
  • alert you when your services are up again
  • notifications can be sent to any webhook or to Slack
    • use custom URL, headers, body for webhooks
    • use custom key/value pairs on the Slack message
  • configurable message debouncing
  • dynamic configuration of services and notifications via HTTP API
    • secured with basic auth
  • scalable in both directions
    • from a small container with <32MB RAM
    • to a cluster that can handle thousands of pings and notifications per second
  • leader election in the cluster, so only one node checks deadlines and triggers notifications
  • notifications are queued, so they can be executed by the whole cluster
  • optionally supply a secret token when configuring your services, so the ping messages can't be spoofed easily

Quickstart

Up and running in less than 1 minute:

# start deadman-switch
docker run --name deadman-switch -d --rm -p 8080:8080 trusch/deadman-switch:latest

# configure service
curl -u admin:admin -XPOST --data-binary @- localhost:8080/config <<EOF
{
  "id": "service-1",
  "timeout": "30s",
  "debounce": "1m",
  "alertNotifications": [
    {
      "type": "webhook",
      "config": {
        "method": "GET",
        "url": "http://localhost:8080/log?service-1-alert"
      }
    }
  ]
}
EOF

# call the ping endpoint
curl http://localhost:8080/ping/service-1

# look at the logs
docker logs -f deadman-switch
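
In practice a service would send the ping on a schedule rather than once by hand. The following is a minimal sketch of such a heartbeat loop, assuming the quickstart container is reachable on localhost:8080. The names `DEADMAN_URL`, `ping_url` and `heartbeat` are hypothetical and not part of the project.

```shell
# Hypothetical helpers; DEADMAN_URL, ping_url and heartbeat are illustrative names.
DEADMAN_URL="${DEADMAN_URL:-http://localhost:8080}"

# Print the ping URL for a service id (path taken from the quickstart above).
ping_url() {
  printf '%s/ping/%s\n' "$DEADMAN_URL" "$1"
}

# Ping forever: $1 = service id, $2 = seconds between pings (default 10).
# Keep the interval well below the configured timeout ("30s" in the quickstart).
heartbeat() {
  while true; do
    # -f fails on HTTP errors, -sS stays quiet but still prints errors
    curl -fsS "$(ping_url "$1")" >/dev/null || echo "ping to $1 failed" >&2
    sleep "${2:-10}"
  done
}

# Usage (runs until interrupted):
#   heartbeat service-1 10
```

Pinging several times per timeout window means a single lost request does not trigger an alert.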

Build and run

Dependencies

This repo requires podman and buildah as its development toolset.

Ubuntu install commands:

. /etc/os-release
echo "deb https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/ /" | sudo tee /etc/apt/sources.list.d/devel:kubic:libcontainers:stable.list
curl -L https://download.opensuse.org/repositories/devel:/kubic:/libcontainers:/stable/xUbuntu_${VERSION_ID}/Release.key | sudo apt-key add -
sudo apt-get update
sudo apt-get -y upgrade
sudo apt-get -y install podman buildah skopeo
echo "${USER}:100000:65535" | sudo tee -a /etc/subuid
echo "${USER}:100000:65535" | sudo tee -a /etc/subgid

Build image

make image

Run a local test deployment

make run

This will bring up a pod with etcd as the storage backend, caddy as the ingress router and two instances of deadman-switch. The pod exposes port 8080 to serve the HTTP API. You can now, for example, list all configured services like this:

curl -u admin:admin http://localhost:8080/config | jq .

You can also POST or DELETE service config objects using this endpoint:

curl -XPOST -u admin:admin -d '{"id":"new_service", "timeout":"10s", "notifications":[{"webhook": {"url": "https://google.com", "method": "GET"}}]}' http://localhost:8080/config
curl -XDELETE -u admin:admin http://localhost:8080/config/new_service

To actually send a ping to the deadman switch do something like this:

curl 'http://localhost:8080/ping/svc1?token=secret1'
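
When pings are sent from a script, the token is usually appended only for services that have one configured. A small sketch of that, assuming the local test deployment on localhost:8080 and a token that needs no URL-encoding; `token_ping_url` is a hypothetical helper name.

```shell
# Hypothetical helper: $1 = service id, $2 = optional secret token.
# Assumes the token contains only URL-safe characters (letters, digits, "-", "_", ".", "~").
token_ping_url() {
  if [ -n "${2:-}" ]; then
    printf 'http://localhost:8080/ping/%s?token=%s\n' "$1" "$2"
  else
    printf 'http://localhost:8080/ping/%s\n' "$1"
  fi
}

# Quoting the URL keeps the shell from treating "?" as a glob character:
#   curl -fsS "$(token_ping_url svc1 secret1)"
```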

If no pings arrive, the application starts calling its configured webhooks after 30 seconds. You can see that in the logs: podman logs -fn deadman-switch-1 deadman-switch-2. Note that only one of the two nodes checks the deadlines, but both nodes are used to send out the actual notification webhooks.