[Stack Monitoring] Kibana should not report healthy when recent data is missing #126386
Comments
Pinging @elastic/infra-monitoring-ui (Team:Infra Monitoring UI)
@ravikesarwani my instinct is to just change the health status to "Unknown" or "Unavailable" whenever there is no data -- I'm not sure what the query looks like but is that agreeable to you too, if we can figure out how to determine that?
UPDATE: this comment is no longer relevant because of how Stack Monitoring queries work today; see this one instead. OK, I'm thinking through this a bit more. I see a number of states we need to think through:
If these all make sense (roughly), we need to define:
The MAIN goal of this ticket is to make sure "{Unknown Status}" is represented, rather than just "last known status", which incorrectly tells a user their Kibanas are green when they are being crushed.
Related: #103816
@neptunian and I discussed this and she helped me realize that we don't currently have a way to tell that a Kibana instance has stopped reporting data. Here's an example of why:
If the time picker was configured to a range of "Last 15 minutes", the UI would show:
The query is basically tuned to look for all Kibana monitoring documents in the given range. It won't find any documents for instance 1, which stopped reporting between 1 and 2 hours ago, so that instance just won't be in the list. If, however, the time picker was set to "Last 24 hours", the UI would show:
Because the last document found for each instance has a status of "green". Ultimately, we probably want to drive (part of) this UI based on running alerting rules, which will be able to track when an instance "disappears" (it was reporting data but then stopped); we can use that to indicate instances that are down or unhealthy even if they aren't currently reporting data. In the meantime, we could probably consider a few simple changes to make this experience better:
@ravikesarwani let's chat about this at some point soon.
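As a hedged illustration of the range-bound query behavior described above: the index pattern, field names, and aggregation shape below are assumptions about the monitoring documents, not the exact query Stack Monitoring runs.

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Fetch the latest monitoring document per Kibana instance inside the selected
// time range. An instance that stopped reporting before the range begins simply
// produces no buckets, so it silently disappears from the list.
async function fetchKibanaInstances(rangeStart: string, rangeEnd: string) {
  const response = await client.search({
    index: '.monitoring-kibana-*', // assumed index pattern
    size: 0,
    query: {
      range: { timestamp: { gte: rangeStart, lte: rangeEnd } },
    },
    aggs: {
      instances: {
        terms: { field: 'kibana_stats.kibana.uuid' }, // assumed field
        aggs: {
          latest: { top_hits: { size: 1, sort: [{ timestamp: { order: 'desc' } }] } },
        },
      },
    },
  });
  return response.aggregations;
}
```

With a range of the last 15 minutes, instance 1 from the example above would produce no bucket at all; with the last 24 hours, its newest (green) document would still come back, which is why the UI keeps showing the stale status.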
See #126709 for discussions on the rules that relate to this. I'm also wondering if adding entity-centric indices for nodes is an option?
You read (someone's) mind! This has been a topic bouncing around Elastic for a few years, having changed owners a few times due to people coming and going over that long of a timeline. The latest effort is called "entity model". I'll send you some information about it, but yes, we need a way to know about entity existence outside of a query for the monitoring documents created while monitoring that entity. |
Chatted with @ravikesarwani, let's go with the following for Kibana to start (we will need to investigate other products also):
Only thing I'm not sure about: is a hard-coded "2 minutes" going to work for everyone, always? What if a user has modified their monitoring to report once every 5 minutes? We need to check if this is possible. To be safe, let's make this 2-minute value configurable via kibana.yml just in case. UPDATE: I added this to the Acceptance Criteria in the description.
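A minimal sketch of that approach, assuming we already know the timestamp of each instance's newest monitoring document; the names and shapes here are illustrative, not the shipped implementation:

```ts
// Default threshold of 2 minutes, intended to be overridable via kibana.yml.
const DEFAULT_STALE_THRESHOLD_MS = 2 * 60 * 1000;

interface KibanaInstanceSummary {
  uuid: string;
  lastSeen: number; // epoch millis of the newest monitoring document
  lastReportedStatus: string; // e.g. "green"
}

// An instance whose newest document is older than the threshold should be
// surfaced as stale/unknown rather than keeping its last reported color.
function effectiveStatus(
  instance: KibanaInstanceSummary,
  now: number = Date.now(),
  thresholdMs: number = DEFAULT_STALE_THRESHOLD_MS
): string {
  return now - instance.lastSeen > thresholdMs ? 'stale' : instance.lastReportedStatus;
}
```

If a user has slowed collection down to once every 5 minutes, the threshold would need to be raised accordingly, which is exactly why the value shouldn't be hard-coded.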
In some places in the code, there is a flag marked as Update: The frontend code wasn't using this property correctly, hence never changing the color. |
We have these kinds of availability checks for other products too, but I'm leaving those out of this work; we might need to confirm whether they work as intended.
@jasonrhodes I'm a little confused by "if MAX(last seen) of all instances is > UNEXPECTED_DELAY". The way I've built it so far is: I'd expect users to want to be informed in the overview as soon as at least one instance is having issues. Or did you mean that we should show the warning only if ALL instances are suffering from the delay?
Yes that's what I was thinking. I think that would be how "MAX(last seen)" would work?
MAX(last seen) here should return "3 minutes" (for host 2). If we just compare that to the threshold (e.g. 2 min), the MAX would be greater and we would show the warning (even though host 2 is the only host that is over the threshold). If we removed host 2 from that data set, MAX(last seen) would return "1.5 min" (for host 4); compared to the threshold, we'd not show the warning, because we've essentially proved no host has exceeded the last-seen threshold. Makes sense?
Obviously if you've figured out a simpler way to query this, that's fine also -- I just wanted to explain my MAX thinking there. :)
All clear! I already had all the data from the current queries, so I did it in JS instead of a new ES aggregation (since I also need to mark which instance has the delay and which doesn't).
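To make the "MAX(last seen)" reasoning concrete, here is a hedged sketch of the client-side version: flag each instance individually, then derive the overview warning from the worst offender (names are illustrative):

```ts
const UNEXPECTED_DELAY_MS = 2 * 60 * 1000;

interface InstanceLastSeen {
  name: string;
  msSinceLastSeen: number; // time since the newest monitoring document
}

// Mark each instance individually so the UI can show which ones are delayed.
function markDelayedInstances(instances: InstanceLastSeen[]) {
  return instances.map((instance) => ({
    ...instance,
    isDelayed: instance.msSinceLastSeen > UNEXPECTED_DELAY_MS,
  }));
}

// The overview-level warning: MAX(time since last seen) > threshold is
// equivalent to "at least one instance exceeds the threshold".
function shouldShowOverviewWarning(instances: InstanceLastSeen[]): boolean {
  return instances.some((instance) => instance.msSinceLastSeen > UNEXPECTED_DELAY_MS);
}
```

Using the numbers from the comment above: a host last seen 3 minutes ago trips the warning, while a worst case of 1.5 minutes does not.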
I'm not sure these changes address the issue of "These are the times when a monitoring user will want to know that Kibana may not be behaving properly" when the instances still disappear out of the time window along with this new data and warnings. They would have to know to widen their time window. It'd be good for the user to have a more explicit way of being notified that data isn't being collected when it ought to be. Perhaps we can add a rule similar to "Missing Monitoring Data" for ES nodes, but for Kibana instances. @klacabane also mentioned an idea about having a kibana dataset called "error": if the Metricbeat kibana module is still enabled but unable to collect metrics, it could send errors to that metricset, which we could incorporate into "status".
In Metricbeat > v8, metricset collection errors (at both the transport and processing layers) are routed to the metricbeat-* indices, so we could already take advantage of that. These errors may not provide sufficient metadata to, for example, link the error to a specific node, but they can raise awareness that something fishy is going on.
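A hedged sketch of how those error documents could be surfaced; the index pattern and field names follow common Metricbeat/ECS conventions and are assumptions, not a confirmed Stack Monitoring query:

```ts
import { Client } from '@elastic/elasticsearch';

const client = new Client({ node: 'http://localhost:9200' });

// Look for recent Kibana-module metricset documents that carry an error,
// which would indicate Metricbeat could not collect metrics successfully.
async function recentKibanaCollectionErrors(since: string = 'now-15m') {
  const response = await client.search({
    index: 'metricbeat-*', // assumed index pattern
    size: 20,
    sort: [{ '@timestamp': { order: 'desc' } }],
    query: {
      bool: {
        filter: [
          { term: { 'event.module': 'kibana' } }, // assumed field values
          { exists: { field: 'error.message' } },
          { range: { '@timestamp': { gte: since } } },
        ],
      },
    },
  });
  return response.hits.hits;
}
```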
I agree with everything you've said in your comment, but I think it's a different issue and shouldn't block us from trying to make the more urgent problem better, imo. The urgent problem is around "these are the times when", which was referring to the times in which an instance appears within the time window but shows as green when it is actually unhealthy or down. When it's outside of the time window, we never have any information on it. We should probably solve that too, but it will likely involve a large rearchitecting of the data, or (as you mentioned) rely on alerting somehow. I particularly like the idea of having collectors report explicitly when they can no longer reach the component they are meant to be monitoring.
Do we capture that bigger problem in another issue somewhere? There were a lot of good points raised in the PR about applying this to other products and also how to avoid noise while still showing the status, so I feel like we have problems here that require thought and design effort.
* [Stack Monitoring] Add stale status reporting for Kibana (#126386)
* Fix stale message grammar and update stale indicator to use EuiBadge
* Fix i18n ids
* Remove unused i18n key
* Fix Jest tests
* Update exposeToBrowser test
* Update API integration tests
* Fix functional tests
* Fix API integration tests
* Update snapshots

Co-authored-by: Kibana Machine <[email protected]>
Summary
Currently, if you shut down the Kibana you are monitoring while running the Metricbeat kibana monitoring module, Metricbeat will fill its logs with errors about not being able to connect, but Kibana will still report as "Healthy" or "Status: Green".
Metricbeat Logs:
Monitoring UI:
This would be the case if Kibana was down, unreachable, or otherwise incapacitated. These are the times when a monitoring user will want to know that Kibana may not be behaving properly.
Acceptance Criteria
For Kibana,
Note: UNEXPECTED_DELAY should default to 2 minutes but be configurable via kibana.yml
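A sketch of how the threshold could be exposed through kibana.yml using Kibana's config schema; the setting name and nesting are hypothetical, and only the "default 2 minutes, configurable" requirement comes from this issue:

```ts
import { schema, TypeOf } from '@kbn/config-schema';

export const configSchema = schema.object({
  kibana: schema.object({
    // How long without a new monitoring document before a Kibana instance is
    // treated as stale. Defaults to 2 minutes; operators can raise it if they
    // have slowed down metrics collection.
    staleStatusThreshold: schema.duration({ defaultValue: '2m' }), // hypothetical setting name
  }),
});

export type MonitoringConfig = TypeOf<typeof configSchema>;
```

In kibana.yml this might then look something like `monitoring.ui.kibana.staleStatusThreshold: 5m` (again, a hypothetical key).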