
[Stack Monitoring] Kibana should not report healthy when recent data is missing #126386

Closed · Tracked by #127224 · Fixed by #132613
jasonrhodes opened this issue Feb 24, 2022 · 18 comments
Labels: bug (Fixes for quality problems that affect the customer experience), Feature:Stack Monitoring, Team:Infra Monitoring UI - DEPRECATED, v8.2.0


@jasonrhodes (Member) commented Feb 24, 2022

Summary

Currently, if you shut down the Kibana instance you are monitoring while running the Metricbeat kibana monitoring module, Metricbeat will fill its logs with errors about not being able to connect, but Kibana will still report as "Healthy" or "Status: Green".

Metricbeat Logs:

2022-02-24T14:54:48.859-0500	ERROR	[kibana.stats]	stats/stats.go:191	HTTP error 404 in : 404 Not Found
2022-02-24T14:54:58.871-0500	ERROR	[kibana.stats]	stats/stats.go:191	HTTP error 404 in : 404 Not Found
2022-02-24T14:55:08.847-0500	ERROR	[kibana.stats]	stats/stats.go:191	HTTP error 404 in : 404 Not Found

Monitoring UI screenshots (Feb 23, 2022): Kibana instances still shown with a green/healthy status.

This would be the case if Kibana were down, unreachable, or otherwise incapacitated. These are exactly the times when a monitoring user will want to know that Kibana may not be behaving properly.

Acceptance Criteria

For Kibana,

  • Change individual column "Status" to "Last Reported Status" in the instances table
  • Add new column to the instances table for "Last Seen" that has a relative time up to 6 hours, then show a date time
  • Next to the Last Seen value, if the value is > UNEXPECTED_DELAY, show a warning icon ⚠️ with a tooltip that explains what it means
  • For aggregate statuses (on the main overview page and at the top of the instances table) ... if MAX(last seen) of all instances is > UNEXPECTED_DELAY, we will show the same warning icon next to the status with a link to "View all instances"

Note: UNEXPECTED_DELAY should default to 2 minutes but be configurable via kibana.yml
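
For illustration, a minimal sketch of what that kibana.yml setting could look like; the key name below is a hypothetical placeholder for this sketch, not a confirmed setting name:

```yaml
# Hypothetical key name, for illustration only; the shipped setting may differ.
# Number of seconds without new monitoring data before an instance's status
# is treated as stale (default: 120 seconds, i.e. 2 minutes).
monitoring.ui.kibana.stale_status_threshold_seconds: 120
```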

@jasonrhodes added the bug, Team:Infra Monitoring UI - DEPRECATED, and Feature:Stack Monitoring labels on Feb 24, 2022
@elasticmachine (Contributor) commented:

Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)

@jasonrhodes (Member, Author) commented:

@ravikesarwani my instinct is to just change the health status to "Unknown" or "Unavailable" whenever there is no data. I'm not sure what the query looks like, but is that agreeable to you too, if we can figure out how to determine that?

@jasonrhodes (Member, Author) commented Mar 1, 2022

UPDATE: this comment is no longer relevant because of how Stack Monitoring queries work today. See this one instead.

OK, I'm thinking this through a bit more. I see a number of states we need to consider:

| Ref | State | Individual Status | Individual Color |
|-----|-------|-------------------|------------------|
| 1A | Reporting as green | Healthy | Green |
| 1B | Reporting as yellow | {Yellow Status} | Yellow |
| 1C | Reporting as red | {Red Status} | Red |
| 1D | Not reporting | {Unknown Status} | Grey |

| Ref | State | Overall Status | Overall Color |
|-----|-------|----------------|---------------|
| nA | All Kibana instances healthy | Healthy | Green |
| nB | All Kibana instances {Yellow Status} | {Yellow Status} | Yellow |
| nC | All Kibana instances {Red Status} | ?? | Red |
| nD | All Kibana instances {Unknown Status} | {Unknown Status} | Grey |
| nE | Kibana instances mixed, none are {Red Status} | {Mixed Status} | Yellow? Grey? |
| nF | Kibana instances mixed, at least one is {Red Status} | {Red Status} or {Mixed Status} | Red |

If these all make sense (roughly), we need to define:

  • {Yellow Status}
  • {Red Status}
  • {Unknown Status}
  • {Mixed Status} - do we want to introduce this?
  • should nE show its color as yellow, grey, or something else?
  • should nF show as {Red Status} or {Mixed Status}?

The MAIN goal of this ticket is to make sure "{Unknown Status}" is represented, rather than just "last known status", which incorrectly tells a user their Kibanas are green when they are being crushed.

@jasonrhodes (Member, Author) commented:

Related: #103816

@jasonrhodes (Member, Author) commented:

@neptunian and I discussed this and she helped me realize that we don't currently have a way to tell that a Kibana instance has stopped reporting data. Here's an example of why:

| Instance | 2 hours ago | 1 hour ago | 15 minutes ago | 1 minute ago |
|----------|-------------|------------|----------------|--------------|
| 1 | Green | x | x | x |
| 2 | Green | Green | Green | Green |
| 3 | x | x | Green | Green |
| 4 | Green | Green | Green | x |

If the time picker was configured to a range of "Last 15 minutes", the UI would show:

| Instance | Status |
|----------|--------|
| 2 | Green |
| 3 | Green |
| 4 | Green |

The query is basically tuned to look for all Kibana monitoring documents in the given range. It won't find any documents for instance 1, which stopped reporting between 1-2 hours ago, so that instance just won't be in the list.

If, however, the time picker was set to "Last 24 hours", the UI would show:

| Instance | Status |
|----------|--------|
| 1 | Green |
| 2 | Green |
| 3 | Green |
| 4 | Green |

This is because the last document found for each instance within that wider range has a status of "green".
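
To make the mechanics concrete, here is a rough TypeScript sketch of that kind of range-bound query, using the @elastic/elasticsearch client. The index pattern and field names are assumptions based on typical Stack Monitoring mappings, not the exact query Kibana runs:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Sketch: fetch the most recent status document per Kibana instance, but
// only considering documents inside the selected time range. An instance
// whose last document falls outside the range never shows up at all.
async function lastStatusPerInstance(from: string, to: string) {
  const response = await client.search({
    index: '.monitoring-kibana-*', // assumed index pattern
    size: 0,
    query: {
      range: { timestamp: { gte: from, lte: to } },
    },
    aggs: {
      instances: {
        terms: { field: 'kibana_stats.kibana.uuid' }, // assumed field name
        aggs: {
          last_doc: {
            top_hits: {
              size: 1,
              sort: [{ timestamp: { order: 'desc' } }],
              _source: ['kibana_stats.kibana.status', 'timestamp'],
            },
          },
        },
      },
    },
  });
  return response.aggregations;
}
```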

Ultimately, we probably want to drive (part of) this UI from running alerting rules, which can track when an instance "disappears" (it was reporting data but then stopped). That would let us flag instances that are down or unhealthy even when they aren't currently reporting data.

In the meantime, we could probably consider a few simple changes to make this experience better:

  1. Change the "Status" heading for the individual instances from "Status" to "Last Known Status" or "Last Reported Status"
  2. Add a column to the instance table that reports "Last Seen" and include a relative date for the timestamp of the last document in the range.
  3. Consider adding some sort of warning icon (or possibly adjusting the status itself) when a given instance's "Last Seen" value is greater than "x".
  4. Consider updating the "aggregate status" values to something that isn't "Green" when any of the reporting instances have a "Last Seen" value greater than "x".

@ravikesarwani let's chat about this at some point soon.

@miltonhultgren (Contributor) commented:

See #126709 for discussion of the rules that relate to this.

I'm also wondering if adding entity-centric indices for nodes is an option. Then we could query the "known nodes" first, then query the last known status for each of those nodes for a given time range, and show them as Unknown if no data comes back.
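
A rough sketch of that two-phase lookup, assuming a hypothetical entity-centric index named monitoring-entities-kibana (no such index exists today) and reusing the lastStatusPerInstance query sketched in an earlier comment:

```ts
// Phase 1: ask the entity index which nodes we know exist.
// Phase 2: fetch the last reported status per node within the time range.
// Phase 3: any known node without a document in range is shown as Unknown.
async function statusesWithUnknowns(from: string, to: string) {
  const known = await client.search<{ node: { uuid: string } }>({
    index: 'monitoring-entities-kibana', // hypothetical entity index
    size: 1000,
    _source: ['node.uuid'],
  });
  const knownUuids = known.hits.hits.map((hit) => hit._source!.node.uuid);

  const aggs = (await lastStatusPerInstance(from, to)) as any;
  const reported = new Map<string, string>(
    aggs.instances.buckets.map((bucket: any) => [
      bucket.key,
      bucket.last_doc.hits.hits[0]._source.kibana_stats.kibana.status,
    ])
  );

  return knownUuids.map((uuid) => ({
    uuid,
    status: reported.get(uuid) ?? 'Unknown',
  }));
}
```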

@jasonrhodes (Member, Author) commented:

> I'm also wondering if adding entity-centric indices for nodes is an option. Then we could query the "known nodes" first, then query the last known status for each of those nodes for a given time range, and show them as Unknown if no data comes back.

You read (someone's) mind! This has been a topic bouncing around Elastic for a few years, changing owners a few times as people came and went over that long a timeline. The latest effort is called the "entity model". I'll send you some information about it, but yes, we need a way to know about entity existence outside of a query for the monitoring documents created while monitoring that entity.

@jasonrhodes (Member, Author) commented Mar 15, 2022

I chatted with @ravikesarwani; let's go with the following for Kibana to start (we will need to investigate other products also):

  • Change individual column "Status" to "Last Reported Status"
  • Add new column for "Last Seen" that has a relative time up to 6 hours, then show a date time
  • Next to the Last Seen value, if the value is >2 minutes, show a warning icon next to it ⚠️
  • For aggregate statuses ... if MAX(last seen) of all instances is >2 minutes, we will show the same warning icon next to the status with a link to "View all instances"

The only thing I'm not sure about: is a hard-coded "2 minutes" going to work for everyone, always? What if a user has modified their monitoring to report once every 5 minutes? We need to check whether this is possible.

To be safe, let's make this 2-minute value configurable via kibana.yml.

UPDATE: I added this to the Acceptance Criteria in the description.
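
A minimal sketch of what the "Last Seen" cell with its warning (described in the criteria above) could look like. EuiToolTip and EuiIcon are real EUI components, but the props, threshold wiring, and formatting here are assumptions for illustration:

```tsx
import React from 'react';
import { EuiIcon, EuiToolTip } from '@elastic/eui';

interface LastSeenCellProps {
  lastSeenMs: number;       // epoch millis of the instance's last document
  staleThresholdMs: number; // e.g. 2 * 60 * 1000, configurable via kibana.yml
}

// Hypothetical cell: render a relative "Last Seen" value and, when the
// instance has not reported within the threshold, a warning icon with a
// tooltip explaining why the status may be out of date.
export function LastSeenCell({ lastSeenMs, staleThresholdMs }: LastSeenCellProps) {
  const ageMs = Date.now() - lastSeenMs;
  const isStale = ageMs > staleThresholdMs;
  return (
    <span>
      {Math.round(ageMs / 60000)} minutes ago{' '}
      {isStale && (
        <EuiToolTip content="This instance has not reported monitoring data recently; its last reported status may be out of date.">
          <EuiIcon type="warning" color="warning" />
        </EuiToolTip>
      )}
    </span>
  );
}
```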

@miltonhultgren miltonhultgren self-assigned this May 12, 2022
@miltonhultgren (Contributor) commented May 19, 2022

In some places in the code there is a flag named availability which is supposed to change the color of the status indicator to gray; it's set to false if the timestamp of the last Kibana document is more than 10 minutes old.

Update: The frontend code wasn't using this property correctly, hence never changing the color.
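
A sketch of the staleness check that flag implies; the names here are illustrative, not the actual source:

```ts
const AVAILABILITY_WINDOW_MS = 10 * 60 * 1000; // 10 minutes

// An instance counts as "available" only while its most recent monitoring
// document is newer than the availability window.
function isAvailable(lastDocumentTimestamp: string, now = Date.now()): boolean {
  return now - new Date(lastDocumentTimestamp).getTime() <= AVAILABILITY_WINDOW_MS;
}
```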

@miltonhultgren (Contributor) commented:

We have these kinds of availability checks for other products too. I'm leaving those out of this work, but we might need to confirm whether they work as intended.

@miltonhultgren (Contributor) commented:

@jasonrhodes I'm a little confused by "if MAX(last seen) of all instances is > UNEXPECTED_DELAY"

The way I've built it so far is:
For each Kibana instance, grab the timestamp of the last reported message.
If any of those timestamps is older than UNEXPECTED_DELAY, I show the aggregate status warning.

I'd expect users would like to be informed in the overview as soon as at least one instance is having issues.
Is that how you meant?

Or did you mean that we should show the warning only if ALL instances are suffering from UNEXPECTED_DELAY?

@jasonrhodes (Member, Author) commented:

> If any of those timestamps is older than UNEXPECTED_DELAY, I show the aggregate status warning.

Yes that's what I was thinking. I think that would be how "MAX(last seen)" would work?

| Host | Last Seen |
|------|-----------|
| 1 | 10s |
| 2 | 3m |
| 3 | 20s |
| 4 | 1.5m |
| 5 | 40s |

MAX(last seen) here should return "3 minutes" (for host 2). If we compare that to the threshold (e.g. 2 minutes), the MAX is greater, so we show the warning (even though host 2 is the only host over the threshold). If we removed host 2 from that data set, MAX(last seen) would return "1.5 minutes" (for host 4); compared to the threshold, we would not show the warning, because we've essentially proved that no host has exceeded the last-seen threshold.

Makes sense?
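
In code, that comparison could be as simple as this TypeScript sketch (names are illustrative):

```ts
interface InstanceLastSeen {
  host: string;
  lastSeenAgeMs: number; // how long ago this instance last reported
}

// Show the aggregate warning when the most-stale instance, i.e. the MAX of
// all last-seen ages, exceeds the configured threshold. With an empty list,
// Math.max() returns -Infinity, so no warning is shown.
function shouldShowAggregateWarning(
  instances: InstanceLastSeen[],
  unexpectedDelayMs: number
): boolean {
  const maxAgeMs = Math.max(...instances.map((i) => i.lastSeenAgeMs));
  return maxAgeMs > unexpectedDelayMs;
}
```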

@jasonrhodes (Member, Author) commented:

Obviously if you've figured out a simpler way to query this, that's fine also -- I just wanted to explain my MAX thinking there. :)

@miltonhultgren (Contributor) commented:

All clear! I already had all the data from the current queries, so I did it in JS instead of a new ES aggregation (since I also need to mark which instances have the delay and which don't).

@neptunian (Contributor) commented Jun 6, 2022

I'm not sure these changes address the issue of "these are the times when a monitoring user will want to know that Kibana may not be behaving properly" when the instances still disappear out of the time window along with this new data and warnings. Users would have to know to widen their time window. It would be good for the user to have a more explicit way of being notified that data isn't being collected when it ought to be. Perhaps we can add a rule similar to "Missing Monitoring Data" for ES nodes, but for Kibana instances. @klacabane also mentioned an idea about having a kibana dataset called "error": if the Metricbeat kibana module is still enabled but unable to collect metrics, it could send errors to that metricset, which we could incorporate into "status".

@klacabane (Contributor) commented:

In Metricbeat > v8, metricset collection errors (at both the transport and processing layers) are routed to the metricbeat-* indices, so we could already take advantage of that. These errors may not provide enough metadata to, for example, link the error to a specific node, but they can raise awareness that something fishy is going on.

@jasonrhodes (Member, Author) commented:

> I'm not sure these changes address the issue of "these are the times when a monitoring user will want to know that Kibana may not be behaving properly" when the instances still disappear out of the time window along with this new data and warnings.

I agree with everything you've said in your comment, but I think it's a different issue and shouldn't block us from trying to improve the more urgent problem, imo. The urgent problem is around "these are the times when", which referred to the times in which an instance appears within the time window but shows as green when it is actually unhealthy or down.

When it's outside of the time window, we never have any information on it. We should probably solve that too, but it will likely involve a large rearchitecting of the data, or (as you mentioned) relying on alerting somehow. I particularly like the idea of having collectors report explicitly when they can no longer reach the component they are meant to be monitoring.

@miltonhultgren (Contributor) commented Jun 8, 2022

Do we capture that bigger problem in another issue somewhere?

There were a lot of good points raised in the PR about applying this to other products, and also about how to avoid noise while still showing the status, so I feel like we have problems here that require thought and design effort.

miltonhultgren added a commit that referenced this issue Jun 8, 2022
* [Stack Monitoring] Add stale status reporting for Kibana (#126386)

* Fix stale message grammar and update stale indicator to use EuiBadge

* Fix i18n ids

* Remove unused i18n key

* Fix Jest tests

* Update exposeToBrowser test

* Update API integration tests

* Fix functional tests

* Fix API integration tests

* Update snapshots

Co-authored-by: Kibana Machine <[email protected]>