[Stack Monitoring] Kibana should not report healthy when recent data is missing #126386
Comments
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
@ravikesarwani my instinct is to just change the health status to "Unknown" or "Unavailable" whenever there is no data -- I'm not sure what the query looks like but is that agreeable to you too, if we can figure out how to determine that?
UPDATE: this comment is no longer relevant because of how Stack Monitoring queries work today; see this one instead. OK, I'm thinking through this a bit more. I see a number of states we need to think through:
If these all make sense (roughly), we need to define:
The MAIN goal of this ticket is to make sure "{Unknown Status}" is represented, rather than just "last known status", which incorrectly tells a user their Kibanas are green when they are being crushed.
Related: #103816
@neptunian and I discussed this and she helped me realize that we don't currently have a way to tell that a Kibana instance has stopped reporting data. Here's an example of why:
If the time picker was configured to a range of "Last 15 minutes", the UI would show:
The query is basically tuned to look for all Kibana monitoring documents in the given range. It won't find any documents for instance 1, which stopped reporting between 1 and 2 hours ago, so that instance just won't be in the list. If, however, the time picker was set to "Last 24 hours", the UI would show:
Because the last document found for each instance has a status of "green". Ultimately, we probably want to drive (part of) this UI based on running alerting rules, which will be able to track when an instance "disappears" (it was reporting data but then stopped); we can use that to indicate instances that are down or unhealthy even if they aren't currently reporting data. In the meantime, we could probably consider a few simple changes to make this experience better:
@ravikesarwani let's chat about this at some point soon.
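As a hedged illustration of the range-bound query behavior described above: the index pattern, field names, and aggregation shape below are assumptions about the monitoring documents, not the exact query Stack Monitoring runs.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Fetch the latest monitoring document per Kibana instance inside the selected
// time range. An instance that stopped reporting before the range begins simply
// produces no buckets, so it silently disappears from the list.
async function fetchKibanaInstances(rangeStart: string, rangeEnd: string) {
  const response = await client.search({
    index: '.monitoring-kibana-*', // assumed index pattern
    size: 0,
    query: {
      range: { timestamp: { gte: rangeStart, lte: rangeEnd } },
    },
    aggs: {
      instances: {
        terms: { field: 'kibana_stats.kibana.uuid' }, // assumed field
        aggs: {
          latest: { top_hits: { size: 1, sort: [{ timestamp: { order: 'desc' } }] } },
        },
      },
    },
  });
  return response.aggregations;
}
```

With a range of the last 15 minutes, instance 1 from the example above would produce no bucket at all; with the last 24 hours, its newest (green) document would still come back, which is why the UI keeps showing the stale status.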
See #126709 for discussions on the rules that relate to this. I'm also wondering if adding entity-centric indices for nodes is an option?
You read (someone's) mind! This has been a topic bouncing around Elastic for a few years, having changed owners a few times due to people coming and going over that long of a timeline. The latest effort is called "entity model". I'll send you some information about it, but yes, we need a way to know about entity existence outside of a query for the monitoring documents created while monitoring that entity. |
Chatted with @ravikesarwani, let's go with the following for Kibana to start (we will need to investigate other products also):
Only thing I'm not sure about: is a hard-coded "2 minutes" going to work for everyone, always? What if a user has modified their monitoring to report once every 5 minutes? We need to check if this is possible. To be safe, let's make this 2-minute value configurable via kibana.yml just in case. UPDATE: I added this to the Acceptance Criteria in the description.
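A minimal sketch of that approach, assuming we already know the timestamp of each instance's newest monitoring document; the names and shapes here are illustrative, not the shipped implementation:

```ts
// Default threshold of 2 minutes, intended to be overridable via kibana.yml.
const DEFAULT_STALE_THRESHOLD_MS = 2 * 60 * 1000;

interface KibanaInstanceSummary {
  uuid: string;
  lastSeen: number; // epoch millis of the newest monitoring document
  lastReportedStatus: string; // e.g. "green"
}

// An instance whose newest document is older than the threshold should be
// surfaced as stale/unknown rather than keeping its last reported color.
function effectiveStatus(
  instance: KibanaInstanceSummary,
  now: number = Date.now(),
  thresholdMs: number = DEFAULT_STALE_THRESHOLD_MS
): string {
  return now - instance.lastSeen > thresholdMs ? 'stale' : instance.lastReportedStatus;
}
```

If a user has slowed collection down to once every 5 minutes, the threshold would need to be raised accordingly, which is exactly why the value shouldn't be hard-coded.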
In some places in the code, there is a flag marked as Update: The frontend code wasn't using this property correctly, hence never changing the color. |
We have these kinds of availability checks for other products too, but I'm leaving those out of this work; we might need to confirm whether they work as intended.
@jasonrhodes I'm a little confused by "if MAX(last seen) of all instances is > UNEXPECTED_DELAY". The way I've built it so far is: I'd expect users to want to be informed in the overview as soon as at least one instance is having issues. Or did you mean that we should show the warning only if ALL instances are suffering from the delay?
Yes that's what I was thinking. I think that would be how "MAX(last seen)" would work?
MAX(last seen) here should return "3 minutes" (for host 2). If we just compare that to the threshold (e.g. 2 min), the MAX would be greater and we would show the warning (even though host 2 is the only host that is over the threshold). If we removed host 2 from that data set, MAX(last seen) would return "1.5 min" (for host 4); compared to the threshold, we'd not show the warning, because we've essentially proved no host has exceeded the last-seen threshold. Makes sense?
Obviously if you've figured out a simpler way to query this, that's fine also -- I just wanted to explain my MAX thinking there. :)
All clear! I already had all the data from the current queries, so I did it in JS instead of a new ES aggregation (since I also need to mark which instance has the delay and which doesn't).
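To make the "MAX(last seen)" reasoning concrete, here is a hedged sketch of the client-side version: flag each instance individually, then derive the overview warning from the worst offender (names are illustrative):

```ts
const UNEXPECTED_DELAY_MS = 2 * 60 * 1000;

interface InstanceLastSeen {
  name: string;
  msSinceLastSeen: number; // time since the newest monitoring document
}

// Mark each instance individually so the UI can show which ones are delayed.
function markDelayedInstances(instances: InstanceLastSeen[]) {
  return instances.map((instance) => ({
    ...instance,
    isDelayed: instance.msSinceLastSeen > UNEXPECTED_DELAY_MS,
  }));
}

// The overview-level warning: MAX(time since last seen) > threshold is
// equivalent to "at least one instance exceeds the threshold".
function shouldShowOverviewWarning(instances: InstanceLastSeen[]): boolean {
  return instances.some((instance) => instance.msSinceLastSeen > UNEXPECTED_DELAY_MS);
}
```

Using the numbers from the comment above: a host last seen 3 minutes ago trips the warning, while a worst case of 1.5 minutes does not.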
I'm not sure these changes address the issue of "These are the times when a monitoring user will want to know that Kibana may not be behaving properly" when the instances still disappear out of the time window along with this new data and warnings. They would have to know to widen their time window. It'd be good for the user to have a more explicit way of being notified that data isn't being collected when it ought to be. Perhaps we can add a rule similar to "Missing Monitoring Data" for ES nodes, but for Kibana instances. @klacabane also mentioned an idea about having a kibana dataset called "error": if the Metricbeat kibana module is still enabled but unable to collect metrics, it could send errors to that metricset, which we could incorporate into "status".
In Metricbeat > v8, metricset collection errors (at both the transport and processing layers) are routed to the metricbeat-* indices, so we could already take advantage of that. These errors may not provide sufficient metadata to, for example, link the error to a specific node, but they can raise awareness that something fishy is going on.
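A hedged sketch of how those error documents could be surfaced; the index pattern and field names follow common Metricbeat/ECS conventions and are assumptions, not a confirmed Stack Monitoring query:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Look for recent Kibana-module metricset documents that carry an error,
// which would indicate Metricbeat could not collect metrics successfully.
async function recentKibanaCollectionErrors(since: string = 'now-15m') {
  const response = await client.search({
    index: 'metricbeat-*', // assumed index pattern
    size: 20,
    sort: [{ '@timestamp': { order: 'desc' } }],
    query: {
      bool: {
        filter: [
          { term: { 'event.module': 'kibana' } }, // assumed field values
          { exists: { field: 'error.message' } },
          { range: { '@timestamp': { gte: since } } },
        ],
      },
    },
  });
  return response.hits.hits;
}
```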
I agree with everything you've said in your comment, but I think it's a different issue and shouldn't block us from trying to make the more urgent problem better, imo. The urgent problem is around "these are the times when", which was referring to the times in which an instance appears within the time window but shows as green when it is actually unhealthy or down. When it's outside of the time window, we never have any information on it. We should probably solve that too, but it will likely involve a large rearchitecting of the data, or (as you mentioned) rely on alerting somehow. I particularly like the idea of having collectors report explicitly when they can no longer reach the component they are meant to be monitoring.
Do we capture that bigger problem in another issue somewhere? There were a lot of good points raised in the PR about applying this to other products and also how to avoid noise while still showing the status, so I feel like we have problems here that require thought and design effort.
* [Stack Monitoring] Add stale status reporting for Kibana (#126386)
* Fix stale message grammar and update stale indicator to use EuiBadge
* Fix i18n ids
* Remove unused i18n key
* Fix Jest tests
* Update exposeToBrowser test
* Update API integration tests
* Fix functional tests
* Fix API integration tests
* Update snapshots

Co-authored-by: Kibana Machine <[email protected]>
Summary
Currently, if you shut down the Kibana you are monitoring while running the Metricbeat kibana monitoring module, Metricbeat will fill its logs with errors about not being able to connect, but Kibana will still report as "Healthy" or "Status: Green".
Metricbeat Logs:
Monitoring UI:
This would be the case if Kibana was down, unreachable, or otherwise incapacitated. These are the times when a monitoring user will want to know that Kibana may not be behaving properly.
Acceptance Criteria
For Kibana,
Note: UNEXPECTED_DELAY should default to 2 minutes but be configurable via kibana.yml
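A sketch of how the threshold could be exposed through kibana.yml using Kibana's config schema; the setting name and nesting are hypothetical, and only the "default 2 minutes, configurable" requirement comes from this issue:

```ts
import { schema, TypeOf } from '@kbn/config-schema';

export const configSchema = schema.object({
  kibana: schema.object({
    // How long without a new monitoring document before a Kibana instance is
    // treated as stale. Defaults to 2 minutes; operators can raise it if they
    // have slowed down metrics collection.
    staleStatusThreshold: schema.duration({ defaultValue: '2m' }), // hypothetical setting name
  }),
});

export type MonitoringConfig = TypeOf<typeof configSchema>;
```

In kibana.yml this might then look something like `monitoring.ui.kibana.staleStatusThreshold: 5m` (again, a hypothetical key).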