[Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud #83309
Comments
Pinging @elastic/stack-monitoring (Team:Monitoring) |
@ravikesarwani I'm curious about your thoughts here. Technically, the alert is working as intended. This query:
yields:
We can't really know if the Maybe we can talk to the cloud team and see if there are some APIs available to detect this scenario? |
@Kushmaro Any thoughts from the Cloud team? We need some way to identify and not alert in Cloud for this scenario. |
Thanks for the mention @ravikesarwani, looping in @zanbel as he's now the owner of the project (fka make-it-action). Currently, I'm not even sure whether Kibana can query Cloud APIs. Technically speaking (if I'm not mistaken), the GET /deployment API should return the status of all Kibana instances. |
@zanbel @Kushmaro I don't know what's possible, but it'd be great if the cloud Kibana plugin could expose APIs that return some data that can help us detect this. |
@chrisronline we're actually working on a way to allow Kibana to make API calls to Cloud in the cloud platform team. |
@chrisronline It looks like this issue also happens when a configuration change is applied to a Kibana instance in Cloud (based on new comments in SDH https://github.com/elastic/sdh-kibana/issues/958). Working with the Cloud team, we can figure out a solution and then enable the alert for Kibana. |
Before we head towards a particular solution, let's make sure we understand the problem. Does Cloud update Kibana the wrong way? Or does Kibana have the wrong expectations of how it gets updated? |
@jowiho thanks for your comments. Makes sense. |
@jowiho How does Cloud update Kibana right now? Kibana persists the |
After speaking briefly with @AlexP-Elastic, it doesn't seem like they intentionally persist It appears that the Elasticsearch It appears that APM's upgrade works the same way as Kibana, where a brand new The alert is working as intended, in that it will detect missing monitoring data, but we are not currently capturing the upgrade scenario (which doesn't just affect Cloud). I'm not sure whether we have a way of telling a legitimate upgrade apart from an instance/node going down. Maybe we can think about solving this by providing additional configuration for the alert, such as |
This alert is critical for Elasticsearch, and if it's working there, I would say we make it available for ES. I am not in favor of making the alert configuration complicated. As a next step, we can expand to other stack components. |
BTW, is this alert applicable to Beats? If it is, we need to understand the behavior there as well. |
It's indeed a bug, and the fix is #83646 |
As part of an upgrade, we should be backing up the data and config directories in Cloud. Cloud team, can we look at doing this in Cloud for Kibana and APM Server upgrades? Chris and I discussed this, and for 7.10.1 we will make this alert applicable only to Elasticsearch. |
This would be a massive change to our infrastructure. Currently we only persist data (other than the global YAML config) across containers where Elasticsearch does that for us (I mentioned to Chris, we don't even have the concept of "we are moving instance X as part of an upgrade", we view it as "we are creating some new instances and then deleting the old ones"). There is little prospect of this happening in the foreseeable future. (cc @andrew-moldovan ) A smaller change that might be useful (it's not clear to me, see below) would be to switch APM/Kibana/etc from "grow-shrink" by default to "rolling in place" by default - I think they work this way just for legacy reasons. (cc @anyasabo @jhalterman not sure if that is planned/in progress/done?). This would decrease the chance that any given configuration change would trigger an alert (but "moves" due to hardware failure and some capacity increases would still do this) |
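To make the difference between the two upgrade styles concrete, here is a minimal sketch of a uuid-keyed missing-data check. All names (`Instance`, `findMissing`, `growAndShrink`, `rollInPlace`) and the 15-minute limit are hypothetical and chosen for illustration; this is not the Stack Monitoring plugin's code. It also assumes that a rolling in-place change keeps the same uuid, which this thread has not confirmed.

```typescript
// Hypothetical types and names, for illustration only.
interface Instance {
  uuid: string;
  lastSeenMs: number; // timestamp of the most recent monitoring document
}

// Flag every uuid whose latest monitoring document is older than `limitMs`.
function findMissing(instances: Instance[], nowMs: number, limitMs: number): string[] {
  return instances
    .filter((i) => nowMs - i.lastSeenMs > limitMs)
    .map((i) => i.uuid);
}

// Grow-and-shrink: a new instance (new uuid) starts reporting and the old one stops.
function growAndShrink(old: Instance, newUuid: string, nowMs: number): Instance[] {
  return [old, { uuid: newUuid, lastSeenMs: nowMs }];
}

// Rolling in place: ASSUMED to keep the same uuid, so reporting resumes under it.
function rollInPlace(old: Instance, nowMs: number): Instance[] {
  return [{ ...old, lastSeenMs: nowMs }];
}

const upgradeAt = 1000000;
const checkAt = upgradeAt + 20 * 60000; // alert check runs 20 minutes after the upgrade
const limit = 15 * 60000; // hypothetical "missing" threshold of 15 minutes
const before: Instance = { uuid: "kibana-a", lastSeenMs: upgradeAt };

// The old uuid is flagged even though a healthy replacement is reporting.
const missingAfterGrowShrink = findMissing(growAndShrink(before, "kibana-b", checkAt), checkAt, limit);

// Nothing is flagged because the same uuid keeps reporting.
const missingAfterInPlace = findMissing(rollInPlace(before, checkAt), checkAt, limit);
```

Under these assumptions, grow-and-shrink leaves `kibana-a` looking like a dead instance, while the in-place path produces no missing uuids. Moves due to hardware failure would still follow the first path.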
Is there any other functionality for beats/apm/kibana that requires persistent data storage, other than the identifier? My mental model of beats and kibana is that they can be considered ephemeral and I can scale up and down as necessary. If we're supposed to be considering them stateful and want to persist data solely so we can make a particular alert work, that seems Not Great. Please help me out if my mental model here is wrong or if I'm misunderstanding something. |
@anyasabo I honestly don't know, but it sounds like adapting how we think about these products to serve a single stack monitoring alert doesn't make much sense. Perhaps we can solve this by relying on additional cloud APIs to give us more data. Cloud deployments know how many unique instances/nodes should exist and if we can access that data within the Stack Monitoring plugin, we can make smarter decisions about when to alert. |
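As a rough sketch of what "smarter decisions" could look like, the check below suppresses the alert when as many distinct instances are reporting as the deployment expects. `DeploymentInfo`, `expectedInstances`, and `shouldAlert` are hypothetical names; the thread does not say what data Cloud could actually expose, so treat this as one possible shape rather than a proposal for a real API.

```typescript
// Hypothetical shape of the data a Cloud API might provide.
interface DeploymentInfo {
  expectedInstances: number; // how many instances the deployment should have
}

function shouldAlert(
  missingUuids: string[],
  reportingUuids: string[],
  deployment?: DeploymentInfo
): boolean {
  // Nothing is missing, so there is nothing to alert on.
  if (missingUuids.length === 0) return false;
  // No deployment data available: fall back to the uuid-only behavior.
  if (!deployment) return true;
  // If the expected number of distinct instances is reporting, assume the
  // missing uuids were replaced (for example by an upgrade) and stay quiet.
  return new Set(reportingUuids).size < deployment.expectedInstances;
}

// An upgrade replaced "kibana-a" with "kibana-b"; one instance is expected.
const afterUpgrade = shouldAlert(["kibana-a"], ["kibana-b"], { expectedInstances: 1 });

// "kibana-a" stopped reporting and nothing took its place.
const afterOutage = shouldAlert(["kibana-a"], [], { expectedInstances: 1 });
```

The optional `deployment` argument keeps the current behavior wherever no such data is available, so the heuristic only changes results when the expected count is known.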
I also ask because it is not just ECE, ECK is in play as well. So I am not sure we want to rely on a cloud API as our first direction. |
Can we revisit this for 7.12? Beats is another area that is not affected by ESS, and this alert may be really helpful there. We need to discuss this in the 7.12 timeline and see what we can do. Some options we need to explore on our side and with the Cloud team:
|
I opened #86683 to track the work around creating separate alerts for each stack product to satisfy:
|
Right now, the missing monitoring data alert will fire when you upgrade a stack product in Cloud because the stack product's uuid also changes. I'm not exactly sure what we should do about that, but it's not a great UX for Cloud users.