Monitoring Grafana #3302

jaimegago · 2015-11-21T08:17:54Z

It's time to monitor the monitoring! It'd be great to have a /status or /health endpoint that returns grafana health data as json.

Things I'd like to get from a status endpoint are:

configured sources are reachable (when I configure a new graphite source I can test the connection, I'd love to have that via the /status API)
DB is available
configured authorization sources are reachable
version

e.g:

/status

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

anryko · 2015-11-21T08:48:21Z

++

kjedamzik · 2015-11-27T10:13:00Z

👍

torkelo · 2015-12-08T14:20:31Z

make sure the health url does not generate sessions

mattttt · 2016-01-08T20:58:05Z

👍

williamjoy · 2016-01-11T05:42:28Z

+1, this would be very useful for running Grafana behind a load balancer: the load balancer will call the /health HTTP endpoint to verify that Grafana returns HTTP 200 OK.

theangryangel · 2016-06-04T18:03:34Z

I've put together something dead simple, but I'm not particularly happy with it at the moment.

If anyone would like to take a look at current state vs master: master...theangryangel:feature/health_check

It returns something like:

{"current_timestamp":"2016-06-04T18:43:49+01:00","database_ok":true,"session_ok":true,"version":{"built":1464981754,"commit":"v3.0.4+158-g7cbaf06-dirty","version":"3.1.0"}}

For the database check I was originally returning some stats, but I've cut that out. I could switch the query to something much simpler like "select 1" and check that it doesn't error. Not sure if it's worth it.

I'm not particularly happy with the session check either. There doesn't seem to be an easy way to test it without standing up a test macaron server and recover()ing from the panic that it would throw when starting a session provider, or modifying macaron/session to add a test feature to each of the providers. As it is right now it irritatingly returns a Set-Cookie header, which I don't particularly want. I'd appreciate some input on where to take this from someone more experienced with macaron 😞

Checking for data sources doesn't seem particularly sane to try through this given how grafana is written. Probably more sane to add to your regular monitoring system.

wpt1313 · 2016-06-10T14:36:56Z

I was facing the same issue, and as a workaround I use an API call from the load balancer with a dedicated authentication API key. I'm using HAProxy, which has a useful "hidden" feature for setting custom HTTP headers in option httpchk:

option httpchk GET /api/org HTTP/1.0\r\nAccept:\ application/json\r\nContent-Type:\ application/json\r\nAuthorization:\ Bearer\ your_api_key\r\n

(I need to use HTTP/1.0 rather than 1.1, since the latter requires setting Host header and I can't get it dynamically in HAProxy config).

/api/org seems to be the simplest request with little overhead and returns HTTP 200, which is exactly what the load balancer needs -- and does not create any new sessions.
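For anyone probing from a script rather than from HAProxy, the same request can be sketched in Python. This is only an illustration of the workaround above: the function name, base URL and key are placeholders, and the request is built but not sent.

```python
import urllib.request


def build_org_check_request(base_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a GET /api/org request authenticated with an API key."""
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/api/org",
        headers={
            "Accept": "application/json",
            "Content-Type": "application/json",
            "Authorization": "Bearer " + api_key,
        },
        method="GET",
    )


# Hypothetical values for illustration only.
request = build_org_check_request("http://localhost:3000", "your_api_key")
print(request.get_method(), request.full_url)
```

Passing the request to urllib.request.urlopen with a short timeout would perform the actual check; a 200 response is what the load balancer is looking for.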

iceycake · 2016-07-07T16:47:52Z

Any progress or PR on this issue?

tuxtek · 2016-09-29T16:43:33Z

+1

JorritSalverda · 2016-09-29T17:07:20Z

I would split this into separate /liveness and /readiness endpoints, as is best practice in Kubernetes. /liveness only indicates whether Grafana itself is up and running; /readiness indicates whether it's ready to receive traffic and checks whether its dependencies are reachable.

In Kubernetes the liveness endpoint is probed, and when it fails a number of times to respond with 200 OK the container is killed and replaced with a new one. The readiness endpoint is used to make the container part of a service and send traffic its way, like adding it to and removing it from a load balancer.
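A minimal sketch of that split, in Python purely for illustration (Grafana itself is not written this way). The function names and the 503 failure code are assumptions, and database_reachable is a hypothetical stand-in for a real dependency check.

```python
from typing import Callable, Iterable


def liveness_status() -> int:
    # If the process can answer at all, it is alive; no dependencies are consulted.
    return 200


def readiness_status(dependency_checks: Iterable[Callable[[], bool]]) -> int:
    # Ready only when every dependency check passes; 503 is an assumed failure code.
    return 200 if all(check() for check in dependency_checks) else 503


def database_reachable() -> bool:  # hypothetical stand-in for a real check
    return False


print(liveness_status(), readiness_status([database_reachable]))
```

The point of the separation is that a failing dependency takes the instance out of rotation without getting the container restarted.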

marco-hoyer · 2016-10-12T12:57:40Z

+1

bigkraig · 2016-11-03T16:46:57Z

what about adding a /metrics Prometheus endpoint?

vinhlh · 2016-11-08T09:50:09Z

+1

vinhlh · 2016-11-08T11:22:59Z

For whoever needs health checks on some services like Amazon ECS:
Use this hack: Path /public/img/grafana_icon.svg, HTTP Code: 200.

philip-wernersbach · 2016-11-14T18:04:39Z

+1

envintus · 2016-12-05T21:09:20Z

In the meantime, if you're only looking for a simple HTTP code 200, then just use /login. My colleague and I just deployed Grafana to a Kubernetes cluster and using that endpoint worked just fine for the liveness/readiness probes. It also works for the Google Compute Engine load balancer.

andyfeller · 2016-12-05T21:16:14Z

I think everyone knows how to technically work around this, but the point is to explicitly support monitoring of service health, including external dependencies.


philip-wernersbach · 2016-12-06T04:50:57Z

I'd like to add our specific use case: we need a simple HTTP endpoint for checking if a user can login and display graphs. I know that we can use the static resources and endpoints such as /login to work around the absence of this, but we really need something that checks that the Grafana internals are running as expected. We don't necessarily need status checks for retrieving data from data sources, as we have separate health checks for those.

envintus · 2016-12-06T11:58:54Z

+1 to this.


torkelo · 2016-12-14T12:33:58Z

So there is currently in 4.0 a /api/metrics endpoint with some internal metrics.

But the issue requests something like this

{ "data_sources_ok": true, "database_ok": true, "authorization_ok": true, "grafana_version": "2.5.1" }

It would be good to have a more detailed description of what is expected here. Should the API health call do a live check against all data sources in all orgs? Should it be done on the fly as the /health API call is made?
What does "authorization ok" mean?

andyfeller · 2016-12-14T13:07:06Z

@torkelo going to toss out an idea but definitely think /health should allow for both grafana-server as well as installed plugins to register arbitrary things to report on:

{
	"ok": false,
	"items": {
		"datasources": {
			"ok": true
		},
		"database": {
			"ok": false,
			"msg": "Cannot communicate ###.###.###.###/XXXXXXX"
		},
		...
	}
}

By default, health checks perform live checks of everything when the endpoint is called. If people want to isolate health checks to specific things, you can do something like what Elasticsearch does for cluster health. When a thing is an external service (authorization, database, etc.), a connectivity test is done at minimum, plus any other sanity check that is reasonable for that thing (e.g. SELECT 1 for the database, an LDAP bind test for authorization, etc.).

Having output like this will allow monitoring checks to check holistically for issues while finding specific problems and output accordingly.
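To make the registration idea concrete, here is a rough Python sketch. HealthRegistry, its methods and the two example checks are all hypothetical names invented for illustration, not an existing Grafana API. A check signals failure by raising, and the report mirrors the JSON shape above with items keyed by name.

```python
from typing import Callable, Dict


class HealthRegistry:
    """Collects named checks; a check signals failure by raising an exception."""

    def __init__(self) -> None:
        self._checks: Dict[str, Callable[[], None]] = {}

    def register(self, name: str, check: Callable[[], None]) -> None:
        self._checks[name] = check

    def report(self) -> dict:
        items = {}
        for name, check in self._checks.items():
            try:
                check()
                items[name] = {"ok": True}
            except Exception as exc:
                # Record the failure reason instead of letting one check abort the report.
                items[name] = {"ok": False, "msg": str(exc)}
        return {"ok": all(item["ok"] for item in items.values()), "items": items}


def check_datasources() -> None:  # hypothetical check that passes
    pass


def check_database() -> None:  # hypothetical check that fails
    raise ConnectionError("Cannot communicate with database")


registry = HealthRegistry()
registry.register("datasources", check_datasources)
registry.register("database", check_database)
report = registry.report()
print(report)
```

A plugin would call register with its own name and check, and the top-level "ok" only stays true while every registered item passes.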

aseppala · 2017-01-24T12:10:46Z

+1

jaimegago · 2017-01-24T19:49:19Z

@torkelo sorry for the delayed answer, I just saw your questions.

TL;DR
@andyfeller did a good job in his comment and it's pretty much what I had in mind.

The end point (or end points) used to monitor Grafana should answer 2 questions with details:
A) Is this Grafana instance up and ready?
B) Is this Grafana instance running as expected according to its configuration intents?

"Configuration intents" is key here. What I mean by intent is that when, for example, the admin adds a data source, she expects it to be available regardless of whether or not the saved configuration is right. Thus if a configured data source is not available to Grafana, the monitoring end point should say so and why, in the same fashion the extremely useful "test" button works.

It helps me think in terms of a plane taking off, first I need to know the plane has finished taking off and is in the air, then I need to know the plane is flying towards its destination as expected (let's not get into what "reaching cruise altitude" means ;-) )

This can be somewhat compared to the /live and /ready endpoints others have pointed out, to /health (1) and /state (2) of the Elasticsearch model, or to /health and /info of Sensu (3).
IMHO one endpoint is enough but seeing 2 endpoints in most modern tools is kinda changing my mind; let's just say I'm not persuaded yet as I think B is a subset of A so I'd make the JSON returned reflect that instead of having 2 end points. Then one day when Grafana can be clustered a "/cluster_state" can be added.

Now regarding the details of each answer, here are my -non exhaustive- initial thoughts:
A details:

Status (e.g. red/yellow/green)
Status comment (e.g. "All is good"/"Couldn't start component Foo"/"Starting")
Version (e.g. v4.1.1-1)

B details:

DB Status (e.g. red/yellow/green)
DB details (e.g. "couldn't connect, bad auth", or "connection ok to MySQL v4.1 at xxx.yyy.zzz:3306, schema version v34132"; yes, SQL schemas should be versioned (4))
Authentication/Authorization (e.g. LDAP connection to xx.xx.xx:389 ok)
Data sources (e.g. Datasource 1, type Graphite, status Red, status comment "auth failure"; Datasource 2, type Elasticsearch, status Green, status comment "all good")

There is much more that can go in B which is why breaking the monitoring into 2 end points might make more sense, meh.

As to how to go about what happens when the end point is being queried (on the fly, APIs, etc.), I would defer to whoever ends up implementing it.

A couple of (obvious?) pieces of advice, though:

be very mindful of resources used to collect monitoring data and be very "protective" with the instrumentation code, help Grafana admins avoid "my monitoring of Grafana took Grafana down" or "Grafana has slowed down by X % since I started monitoring it" situations.
be as certain as you can on the provided monitoring data, alert fatigue is a plague
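
One way to follow the first piece of advice is to cache check results for a short time, so that frequent probing cannot hammer the database. The sketch below is only one possible approach, written in Python for illustration; CachedCheck and expensive_check are hypothetical names, and a fake clock is injected to keep the example deterministic.

```python
import time
from typing import Callable, Optional


class CachedCheck:
    """Wraps a check so it runs at most once per `ttl_seconds`."""

    def __init__(self, check: Callable[[], bool], ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._check = check
        self._ttl = ttl_seconds
        self._clock = clock
        self._last_run: Optional[float] = None
        self._last_result = False

    def __call__(self) -> bool:
        now = self._clock()
        # Re-run the underlying check only on first use or once the cached result has expired.
        if self._last_run is None or now - self._last_run >= self._ttl:
            self._last_result = self._check()
            self._last_run = now
        return self._last_result


calls = []
fake_now = [0.0]  # fake clock so the example is deterministic


def expensive_check() -> bool:  # hypothetical stand-in for e.g. a database query
    calls.append(1)
    return True


cached = CachedCheck(expensive_check, ttl_seconds=10, clock=lambda: fake_now[0])
cached()
cached()            # served from the cache, the check does not run again
fake_now[0] = 11.0
cached()            # cache expired, so the check runs again
print(len(calls))
```

The trade-off is staleness: a longer TTL protects Grafana better but delays detection of a failure by up to that interval.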

(1) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-health.html
(2) https://www.elastic.co/guide/en/elasticsearch/reference/current/cluster-state.html
(3) https://sensuapp.org/docs/0.23/api/health-and-info-api.html#the-info-api-endpoint
(4) https://blog.codinghorror.com/get-your-database-under-version-control/

dynek · 2017-03-23T13:35:14Z

So 4.2.0 just came out and there still is no way to probe the service? (think k8s cluster)

jaimegago · 2017-03-23T16:01:21Z

@torkelo I think @dynek has a point: this is not optional anymore. Whether it's a new section in the docs dedicated to "how to monitor Grafana", documenting what can be done today with the existing instrumentation (e.g. leveraging the admin or metrics page), or a fully fledged dedicated API like in this proposal, we need something yesterday.
Please don't take this the wrong way. I don't mean to tell you what the priorities should be; it's just that it's a tough sell for an application to be "Enterprise Ready" without a dedicated section on how to monitor it.

al-joshwilliams · 2017-04-07T19:01:54Z

+1


torkelo · 2017-04-25T15:22:53Z

Added a simple http endpoint to check grafana health:

GET /api/health 
{
  "commit": "349f3eb",
  "database": "ok",
  "version": "4.1.0"
}

If database (mysql/postgres/sqlite3) is not reachable it will return "failing" in the database field. Grafana will still answer with status code 200. Not sure what is correct in that case.

The most important thing about this endpoint is that it will never cause sessions to be created (Something other api calls might do if you do not call them with an api key or basic auth).
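
Since the endpoint as described here answers 200 even when the database is failing, a monitoring check has to look at the body rather than the status code. A small Python sketch of that check follows; database_ok is a hypothetical helper, and the sample bodies follow the response shown above.

```python
import json


def database_ok(health_body: str) -> bool:
    """Return True only if the JSON body reports "database": "ok"."""
    try:
        payload = json.loads(health_body)
    except json.JSONDecodeError:
        # An unparseable body is treated as unhealthy.
        return False
    return isinstance(payload, dict) and payload.get("database") == "ok"


sample = '{"commit": "349f3eb", "database": "ok", "version": "4.1.0"}'
failing = '{"commit": "349f3eb", "database": "failing", "version": "4.1.0"}'
print(database_ok(sample), database_ok(failing))
```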

ConorNevin · 2017-04-25T18:08:23Z

Wouldn't it be best to return with status code 503 when the database is unreachable?

adamcstephens · 2017-04-25T19:32:26Z

Kubernetes uses:

Any code greater than or equal to 200 and less than 400 indicates success. Any other code indicates failure.
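
Expressed as code, the quoted rule is a half-open range check. This is a Python illustration only, and probe_succeeded is a made-up name.

```python
def probe_succeeded(status_code: int) -> bool:
    # Success is any code in the half-open range [200, 400).
    return 200 <= status_code < 400


for code in (200, 302, 399, 400, 503):
    print(code, probe_succeeded(code))
```

So a 503 from the health endpoint counts as a failed probe, while redirects still count as success.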

torkelo · 2017-04-25T19:53:28Z

Yes, I think 503 status code when db status failed is best, will update
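
A sketch of the agreed behaviour, in Python for illustration rather than Grafana's actual Go code. health_response is a hypothetical name, and the "ok" / "failing" values are the ones mentioned earlier in this thread.

```python
from typing import Tuple


def health_response(database_reachable: bool) -> Tuple[int, dict]:
    """Map the database state to an HTTP status code and a JSON-like body."""
    if database_reachable:
        return 200, {"database": "ok"}
    # 503 Service Unavailable lets status-code-only probes notice the failure.
    return 503, {"database": "failing"}


print(health_response(True))
print(health_response(False))
```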



JorritSalverda · 2017-10-26T11:57:27Z

The 503 means the /api/health endpoint is best used only for the readiness check in Kubernetes. If this check is used for liveness, a database issue will lead to all pods getting killed. Is there a query parameter to leave out the database check?

bedrin · 2017-11-01T18:00:14Z

@JorritSalverda you could probably use tcpSocket check in livenessProbe

bergquist · 2017-11-01T18:15:20Z

/metrics will not create sessions or issue a db request.

micachen · 2018-08-21T18:41:53Z

We typically have aggressive readiness checks and relaxed liveness checks: 1 second and 1 failure for readiness, versus 60 seconds, 10 failures, and 1 success for liveness. This allows responsive rerouting when there is an issue while preventing unnecessary pod restarts if self-recovery is possible. A persistent DB issue would still cause a restart, which might actually help if it was due to some bad container state.
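
As rough arithmetic for those settings, here is a Python sketch. It assumes the numbers map onto Kubernetes' periodSeconds and failureThreshold probe fields, and it ignores probe timeouts and initial delays, so treat the results as estimates only.

```python
def seconds_until_action(period_seconds: int, failure_threshold: int) -> int:
    # Rough estimate: consecutive failures needed times the interval between probes.
    return period_seconds * failure_threshold


readiness = seconds_until_action(period_seconds=1, failure_threshold=1)
liveness = seconds_until_action(period_seconds=60, failure_threshold=10)
print(readiness, liveness)
```

Under these assumptions traffic is rerouted within about a second, while a restart only happens after roughly ten minutes of continuous failure.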


suridaddy · 2021-03-26T07:12:51Z

@finkr /api/health takes too long to respond with 503. Is there any way to make it respond more quickly?

andyfeller · 2021-03-26T11:01:45Z

@finkr /api/health takes too long to respond with 503. Is there any way to make it respond more quickly?

@suridaddy : it might be easier to visit the Grafana community forums or the more interactive support channels along with more information to troubleshoot your problem. This issue is for feature / improvement and is closed.

torkelo added the type/feature-request, prio/medium, and help wanted labels Nov 21, 2015

torkelo mentioned this issue Dec 8, 2015

need a health check entry path to Grafana that won't generate session #3457

Closed

bergquist mentioned this issue Dec 28, 2015

FR: HealthCheck for Datasources #3609

Closed

bergquist mentioned this issue Feb 9, 2016

HealthCheck endpoint on Grafana #3967

Closed

bergquist added this to the 4.1.0 milestone Nov 3, 2016

bergquist mentioned this issue Nov 18, 2016

session_life_time not working #6635

Closed

torkelo removed this from the 4.1.0 milestone Dec 14, 2016

torkelo added this to the 4.3.0 milestone Mar 27, 2017

torkelo added a commit that referenced this issue Apr 25, 2017

feat: added api health endpoint that does not require auth and never …

368e847

…creates sessions, returns db status as well. #3302

torkelo closed this as completed Apr 25, 2017

daniellee added a commit that referenced this issue May 10, 2017

api: health check returns 503 if db is failing

4a35126

ref #3302

sebglon mentioned this issue Jan 29, 2018

Helm Grafana - ingress fail with anonymous disabled prometheus-operator/prometheus-operator#934

Closed

gianrubio mentioned this issue Jan 30, 2018

Add readinessProbe for grafana Helm prometheus-operator/prometheus-operator#935

Merged

finkr added a commit to finkr/grafana that referenced this issue Jan 25, 2019

Document /api/health

1fc5ea1

Document the health check implemented in grafana#3302 (and grafana#935), see grafana#3302 (comment)

This was referenced Jan 25, 2019

Document /api/health finkr/grafana#1

Merged

Document /api/health #15068

Merged

jschill pushed a commit that referenced this issue Jan 28, 2019

Document /api/health

22f9f47

Document the health check implemented in #3302 (and #935), see #3302 (comment)

dghubble added a commit to poseidon/typhoon that referenced this issue Mar 24, 2019

Add liveness and readiness probes to Grafana

36e31fc

* grafana/grafana#3302

cnouguier mentioned this issue May 10, 2019

Add health check to services kalisio/kaabah#75

Closed


bergquist mentioned this issue Sep 11, 2020

Api: Add /healthz endpoint for health checks #27536

Merged
