RFC: health endpoint #46

Closed
63 changes: 63 additions & 0 deletions 050-health-endpoint/proposal.md
@@ -0,0 +1,63 @@
# Summary

A public HTTP endpoint that reports basic information about the health of a
Concourse cluster.


# Motivation

In some scenarios a Concourse cluster is part of a bigger automation setup, in
which other parts of the system rely on Concourse not only for CI/CD but for
other types of automation as well. To keep the whole process working, the system
monitors the availability of its parts and takes measures when some of them are
not operational (for example, it executes predefined steps: sends notification
emails, triggers alerts, runs remediation steps, etc.). In such cases the system
must be able to determine the state of every part, so it would help if all parts
of the system had a common way to report their health/availability status.

Currently there is no easy way for an external (monitoring) system to determine

Review comment (Member):

Adding some more context to this for those outside the team

The way that we (the Concourse team itself) have been dealing with the "is this installation good or not?" question has mostly been through the metrics that we expose, and through SLIs (https://github.com/concourse/oxygen-mask, which relies on another Concourse installation sending probes to the first, and https://github.com/cirocosta/slirunner, which doesn't require another installation but is k8s- and Prometheus-first). These verify healthiness by sending specific workloads to the installation that attest, in an end-to-end fashion, whether it's working properly.

What I like about this kind of approach is that you can keep it external to the main system and thus build the integrations however you want (e.g., slirunner exposes the information through Prometheus, but if you'd prefer to have an endpoint hit when things fail, you could build that too, all while still leveraging "the core concourse" under the hood). It also highlights which kinds of user workflows are currently broken or degraded 🤔

if a Concourse cluster is live and operational. It would be nice if Concourse
also reported its own health, so that in the (rare) case it is not healthy the
external system can react.


# Proposal

Concourse can expose a public HTTP "health" endpoint that gives

Review comment (Member):

Would you mind elaborating a bit more on how you see the detection of "healthiness" working?

Also, do you think this endpoint should be public (in the sense of requiring no credentials at all) and display per-node information? I think a problem there is that it'd conflict with the current auth requirements that we have for someone reading /api/v1/workers.

basic information about its health. Similar to the existing "info" endpoint, the
"health" endpoint can be found at <concourse_url>/health. It can return a JSON
object with very basic information about the health of its parts:

    {
      "db": <status>,
      "<node_id>": <status>,
      "<node_id>": <status>,
      ...
    }
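
To make the proposed response shape concrete, here is a minimal Go sketch of a handler that serves such an object. Everything in it is an assumption made for illustration: the `Status` values `"OK"`/`"NOK"`, the `Checker` hook, the `probe` helper, and the node IDs are hypothetical and do not correspond to existing Concourse code, and the sketch leaves open how health would actually be detected.

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Status is the coarse health value; "OK" and "NOK" are assumed values.
type Status string

const (
	StatusOK  Status = "OK"
	StatusNOK Status = "NOK"
)

// Checker is a hypothetical hook that reports the current status of each
// part, keyed by "db" or by node ID. It is not an existing Concourse type.
type Checker func() map[string]Status

// healthHandler serves the status map returned by check as a JSON object.
func healthHandler(check Checker) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		if err := json.NewEncoder(w).Encode(check()); err != nil {
			http.Error(w, err.Error(), http.StatusInternalServerError)
		}
	})
}

// probe calls the handler in-process and decodes the JSON response body.
func probe(h http.Handler) (int, map[string]Status, error) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	var body map[string]Status
	err := json.NewDecoder(rec.Body).Decode(&body)
	return rec.Code, body, err
}

func main() {
	// Illustrative, hard-coded statuses; a real check would query the parts.
	h := healthHandler(func() map[string]Status {
		return map[string]Status{
			"db":       StatusOK,
			"worker-1": StatusOK,
			"worker-2": StatusNOK,
		}
	})
	code, body, err := probe(h)
	fmt.Println(code, body, err)
}
```

In a real server the handler would be registered under the `/health` path; the in-process `probe` is only there so the sketch runs without opening a port.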


# Open Questions

For which parts should the JSON object contain information: all the VMs (DB, web
and worker nodes), or only the DB and the worker nodes? The question arises
because a web node serves the request, so at least one web node must be healthy
enough to return the response (if there is no healthy web node, the caller
receives an error anyway).

What should the status contain (level of detail): only "OK" and "NOK", or more
detailed information about the state of specific workers or web nodes?
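
As one possible way to picture the trade-off, the Go sketch below keeps a coarse status per part and adds an optional detail field that is omitted when empty, so a simple caller can still look only at "OK"/"NOK". The `PartHealth` type, its field names, the `overall` and `render` helpers, and the sample values are hypothetical and only illustrate one option; they do not answer the question.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// PartHealth is a hypothetical per-part record: a coarse status plus an
// optional detail string that is left out of the JSON when it is empty.
type PartHealth struct {
	Status string `json:"status"`
	Detail string `json:"detail,omitempty"`
}

// overall returns "OK" only if every part reports "OK", otherwise "NOK".
func overall(parts map[string]PartHealth) string {
	for _, p := range parts {
		if p.Status != "OK" {
			return "NOK"
		}
	}
	return "OK"
}

// render encodes the per-part records as a JSON object.
func render(parts map[string]PartHealth) (string, error) {
	b, err := json.Marshal(parts)
	return string(b), err
}

func main() {
	// Illustrative values only.
	parts := map[string]PartHealth{
		"db":       {Status: "OK"},
		"worker-1": {Status: "NOK", Detail: "stalled"},
	}
	out, err := render(parts)
	fmt.Println(overall(parts), out, err)
}
```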

Should there be a specific property to configure a response caching interval
(caching responses might help prevent DoS attacks)?
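
To illustrate what such a caching interval could mean in practice, here is a minimal Go sketch that reuses the last health result until a configurable time-to-live has passed, so repeated requests do not each trigger the underlying checks. The `cachedCheck` type, `newCachedCheck`, and the `ttl` parameter are hypothetical names chosen for this sketch, not an existing Concourse option.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// cachedCheck wraps a potentially expensive health check and reuses its
// last result until ttl has elapsed. All names here are hypothetical.
type cachedCheck struct {
	mu      sync.Mutex
	ttl     time.Duration
	check   func() map[string]string
	result  map[string]string
	expires time.Time
}

func newCachedCheck(ttl time.Duration, check func() map[string]string) *cachedCheck {
	return &cachedCheck{ttl: ttl, check: check}
}

// Get returns the cached result, running the underlying check again only
// when nothing is cached yet or the cached result has expired.
func (c *cachedCheck) Get() map[string]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	if c.result == nil || !time.Now().Before(c.expires) {
		c.result = c.check()
		c.expires = time.Now().Add(c.ttl)
	}
	return c.result
}

func main() {
	calls := 0
	c := newCachedCheck(10*time.Second, func() map[string]string {
		calls++ // count how often the underlying check really runs
		return map[string]string{"db": "OK"}
	})
	c.Get()
	c.Get()
	fmt.Println("underlying checks run:", calls)
}
```

The mutex serializes concurrent callers, so a burst of requests arriving after expiry results in a single refresh rather than one per request.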


# Answered Questions


# New Implications

This change is not intended to alter the users' general workflow, i.e.
creating/updating pipelines and executing jobs. It only aims to ease monitoring
of the Concourse cluster so that it can be better integrated into bigger
systems.