RFC: health endpoint #46
@@ -0,0 +1,63 @@
# Summary

A public HTTP endpoint that gives basic information about the health of a
Concourse cluster.

# Motivation

There are scenarios in which a Concourse cluster is part of a bigger automation
setup. In such scenarios other parts of the system rely on Concourse not only
for CI/CD, but for other kinds of automation as well. To keep the whole process
working, the system also monitors the availability of its parts and takes
measures if some of them are not operational (for example, it executes
predefined steps: sends notification emails, triggers alerts, executes
remediation steps, etc.). In such cases it is important for the system to be
able to determine the state of each of its parts, so it would help if the parts
of this system had a common way to report their health/availability status.

Currently there is no easy way for an external (monitoring) system to determine
whether a Concourse cluster is live and operational. It would help if Concourse
also reported its health, so that in the (rare) case it is not healthy the
external system can react.

# Proposal

Concourse can expose a public HTTP endpoint, called the "health" endpoint, that
gives basic information about its health. Similar to the existing "info"
endpoint, the "health" endpoint can be found at <concourse_url>/health. It can
return a JSON object with very basic information about the health of its parts:

    {
        db: <status>
        <node_id> : <status>
        <node_id> : <status>
        ...
    }

Review comment: Would you mind elaborating a bit more on what detecting "healthiness" would look like? Also, do you think this endpoint should be public (in the sense of requiring no credentials at all) and display per-node information? I think a problem there is that it'd conflict with the current auth requirements that we have for someone reading

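To make the proposed response shape concrete, here is a minimal Go sketch of a handler that could produce such a JSON object. It is an illustration only, not Concourse code: `Status`, `Checker`, `BuildReport`, and `HealthHandler` are hypothetical names, the checks are stubbed, and the RFC does not specify whether an unhealthy part should also change the HTTP status code (this sketch always answers 200).

```go
package main

import (
	"encoding/json"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// Status is the coarse health value mentioned in the RFC ("OK" / "NOK").
type Status string

const (
	StatusOK  Status = "OK"
	StatusNOK Status = "NOK"
)

// Checker is a hypothetical hook reporting whether one part is healthy.
type Checker func() bool

// BuildReport runs every checker and maps each part name to its status.
func BuildReport(checks map[string]Checker) map[string]Status {
	report := make(map[string]Status, len(checks))
	for name, check := range checks {
		if check() {
			report[name] = StatusOK
		} else {
			report[name] = StatusNOK
		}
	}
	return report
}

// HealthHandler serves the report as a JSON object.
func HealthHandler(checks map[string]Checker) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "application/json")
		// Encoding errors are ignored in this sketch.
		_ = json.NewEncoder(w).Encode(BuildReport(checks))
	}
}

func main() {
	// Stubbed checks; real ones would query the DB and the worker registry.
	checks := map[string]Checker{
		"db":       func() bool { return true },
		"worker-1": func() bool { return false },
	}
	// Call the handler in-process instead of starting a server.
	rec := httptest.NewRecorder()
	HealthHandler(checks)(rec, httptest.NewRequest(http.MethodGet, "/health", nil))
	fmt.Print(rec.Body.String())
}
```

Running this prints `{"db":"OK","worker-1":"NOK"}`. In a real server the handler would be registered with something like `http.Handle("/health", HealthHandler(checks))`.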
# Open Questions

For which parts should information be present in the JSON object: for all the
VMs (DB, web and worker nodes), or only for the DB and all worker nodes? This
question arises because a web node serves the request, so at least one web node
must be healthy enough to return the response (if there is no healthy web node,
the caller receives an error anyway).

What should the status contain (level of detail): only "OK" and "NOK", or more
detailed information about the state of specific workers or web nodes?

Should there be a property to configure the interval for caching responses
(caching responses might help prevent DoS attacks)?

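As one way to picture the caching question, the following Go sketch memoizes a health report for a configurable interval, so repeated requests within that interval do not re-run the underlying checks. `CachedReport`, `NewCachedReport`, and the `ttl` parameter are hypothetical names chosen for illustration; the RFC does not define such a property.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// CachedReport memoizes the result of a health computation for ttl.
type CachedReport struct {
	mu      sync.Mutex
	ttl     time.Duration
	compute func() map[string]string
	value   map[string]string
	expires time.Time
}

// NewCachedReport wraps compute so that it runs at most once per ttl.
func NewCachedReport(ttl time.Duration, compute func() map[string]string) *CachedReport {
	return &CachedReport{ttl: ttl, compute: compute}
}

// Get returns the cached report, recomputing it only when there is no
// cached value yet or the cached value has expired.
func (c *CachedReport) Get() map[string]string {
	c.mu.Lock()
	defer c.mu.Unlock()
	now := time.Now()
	if c.value == nil || !now.Before(c.expires) {
		c.value = c.compute()
		c.expires = now.Add(c.ttl)
	}
	return c.value
}

func main() {
	calls := 0
	cache := NewCachedReport(10*time.Second, func() map[string]string {
		calls++ // stands in for querying the DB and the workers
		return map[string]string{"db": "OK"}
	})
	cache.Get()
	cache.Get() // served from the cache
	fmt.Println("health checks executed:", calls)
}
```

The trade-off is staleness: with a cache interval of N seconds, a monitoring system may see a status that is up to N seconds old. A zero interval disables caching in this sketch.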
# Answered Questions

# New Implications

This change is not intended to alter the general user workflow, i.e.
creating/updating pipelines and executing jobs. It only aims to ease monitoring
of the Concourse cluster so that it integrates better into bigger systems.

Review comment: Adding some more context to this for those outside the team.

The way that we (the Concourse team itself) have been dealing with the "is this installation good or not?" question has mostly been through the metrics that we expose, and through SLIs (https://github.com/concourse/oxygen-mask, which relies on another Concourse installation sending probes to the first, and https://github.com/cirocosta/slirunner, which doesn't require another installation but is k8s- and Prometheus-first). These verify healthiness by sending specific workloads to the installation that attest, end to end, whether it's working properly.

What I like about this kind of approach is that you can keep it external to the main system and thus build the integrations however you want (e.g., `slirunner` exposes the information through Prometheus, but if you'd prefer to have an endpoint hit when things fail, you could build that too, all while still leveraging "the core Concourse" under the hood). It also highlights which kinds of user workflows are currently broken or degraded 🤔
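The external-probe approach described in this comment can be pictured with a small Go sketch: a runner executes end-to-end probes and calls a hook for each failure, which is where an integration (a metric, a webhook, an email) would plug in. This is not how oxygen-mask or slirunner are implemented; `Probe`, `RunProbes`, `ErrBuildFailed`, and the probe names are hypothetical, and the probes are stubbed.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrBuildFailed is a placeholder failure used by the demo probe below.
var ErrBuildFailed = errors.New("test build did not succeed")

// Probe is one end-to-end check, for example "trigger a job and wait for it".
type Probe struct {
	Name string
	Run  func() error
}

// RunProbes executes each probe once, calls onFailure for every failed
// probe, and returns the names of the failed probes in order.
func RunProbes(probes []Probe, onFailure func(name string, err error)) []string {
	var failed []string
	for _, p := range probes {
		if err := p.Run(); err != nil {
			failed = append(failed, p.Name)
			onFailure(p.Name, err)
		}
	}
	return failed
}

func main() {
	// Stubbed probes; real ones would drive an actual Concourse installation.
	probes := []Probe{
		{Name: "login", Run: func() error { return nil }},
		{Name: "run-build", Run: func() error { return ErrBuildFailed }},
	}
	failed := RunProbes(probes, func(name string, err error) {
		// A real integration might POST to a webhook or update a metric here.
		fmt.Printf("probe %s failed: %v\n", name, err)
	})
	fmt.Println("broken workflows:", failed)
}
```

Because each probe is named after a user workflow, the list of failed probes tells the caller which workflows are broken, which a single OK/NOK value cannot.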