[Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud #83309

Open
chrisronline opened this issue Nov 12, 2020 · 22 comments
Labels
Team:Monitoring Stack Monitoring team v7.12.0

Comments

@chrisronline
Contributor

Right now, the missing monitoring data alert will fire when you upgrade a stack product in cloud because the stack product's uuid also changes.

I'm not exactly sure what we should do about that, but it's not a great UX for cloud users.

@chrisronline chrisronline added bug Fixes for quality problems that affect the customer experience Team:Monitoring Stack Monitoring team labels Nov 12, 2020
@elasticmachine
Contributor

Pinging @elastic/stack-monitoring (Team:Monitoring)

@chrisronline
Contributor Author

@ravikesarwani I'm curious about your thoughts here. Technically, the alert is working as intended.

This query:

POST .monitoring-kibana-*/_search?filter_path=aggregations.versions.buckets
{
  "size": 0,
  "aggs": {
    "versions": {
      "terms": {
        "field": "kibana_stats.kibana.version",
        "size": 10
      },
      "aggs": {
        "uuids": {
          "terms": {
            "field": "kibana_stats.kibana.uuid",
            "size": 10
          }
        },
        "latest": {
          "max": {
            "field": "timestamp"
          }
        }
      }
    }
  }
}

yields:

{
  "aggregations" : {
    "versions" : {
      "buckets" : [
        {
          "key" : "7.9.2",
          "doc_count" : 23061,
          "latest" : {
            "value" : 1.605197028651E12,
            "value_as_string" : "2020-11-12T16:03:48.651Z"
          },
          "uuids" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cf491eb4-90fc-4430-8510-6fdb24402a16",
                "doc_count" : 23061
              }
            ]
          }
        },
        {
          "key" : "7.10.0",
          "doc_count" : 1513,
          "latest" : {
            "value" : 1.605212208223E12,
            "value_as_string" : "2020-11-12T20:16:48.223Z"
          },
          "uuids" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "662bf661-0add-43c2-956e-59ff218ed37e",
                "doc_count" : 1513
              }
            ]
          }
        }
      ]
    }
  }
}

We can't really know if the 7.9.2 instance is unavailable intentionally or accidentally from the data we have.
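
To illustrate why the data alone is ambiguous, here is a minimal sketch (not the alert's actual implementation) that walks the aggregation response above and flags every uuid whose version bucket stopped reporting before a cutoff. The function name `find_stale_uuids`, the 15-minute threshold, and the evaluation time are assumptions made for illustration.

```python
from datetime import datetime, timedelta, timezone

# Trimmed copy of the aggregation response shown above.
response = {
    "aggregations": {
        "versions": {
            "buckets": [
                {
                    "key": "7.9.2",
                    "latest": {"value": 1.605197028651e12},
                    "uuids": {"buckets": [{"key": "cf491eb4-90fc-4430-8510-6fdb24402a16"}]},
                },
                {
                    "key": "7.10.0",
                    "latest": {"value": 1.605212208223e12},
                    "uuids": {"buckets": [{"key": "662bf661-0add-43c2-956e-59ff218ed37e"}]},
                },
            ]
        }
    }
}


def find_stale_uuids(resp, now, threshold):
    """Return (uuid, version) pairs whose version bucket last reported before now - threshold."""
    stale = []
    for bucket in resp["aggregations"]["versions"]["buckets"]:
        # "latest" is epoch milliseconds; convert to an aware UTC datetime.
        latest = datetime.fromtimestamp(bucket["latest"]["value"] / 1000, tz=timezone.utc)
        if now - latest > threshold:
            for uuid_bucket in bucket["uuids"]["buckets"]:
                stale.append((uuid_bucket["key"], bucket["key"]))
    return stale


# Hypothetical evaluation time shortly after the last 7.10.0 document.
now = datetime(2020, 11, 12, 20, 17, tzinfo=timezone.utc)
stale = find_stale_uuids(response, now, timedelta(minutes=15))
print(stale)
```

The sketch flags the 7.9.2 uuid, yet nothing in the response says whether that instance was replaced during an upgrade or actually went down.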

Maybe we can talk to the cloud team and see if there are some APIs available to detect this scenario?

@ravikesarwani
Contributor

@Kushmaro Any thoughts from the Cloud team?
It looks like Kibana instances in Cloud are upgraded differently than ES, APM, etc.: new instances are created (at least from the perspective of the data that we have).
This causes the "Missing monitoring data" alert to fire. We look back over 1 day of monitoring data by default, so users will see this alert firing for 1 day after an upgrade in Cloud.
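
As a rough worked example of that 1-day window, assuming the alert treats every uuid seen inside the lookback as an instance that should still be reporting (the helper name `in_lookback` is hypothetical), the old 7.9.2 uuid from the response above keeps counting as "missing" until its last document ages out:

```python
from datetime import datetime, timedelta, timezone

# Last document from the replaced 7.9.2 instance, taken from the response above.
last_seen = datetime(2020, 11, 12, 16, 3, 48, 651000, tzinfo=timezone.utc)

# Default lookback of 1 day, as described above.
lookback = timedelta(days=1)


def in_lookback(last_seen, now, lookback):
    """True while the old instance's last document still falls inside the lookback window."""
    return now - last_seen <= lookback


# Moment at which the old uuid drops out of the window and the alert can clear.
falls_out_at = last_seen + lookback
print(falls_out_at.isoformat())
```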

We need some way to identify and not alert in Cloud for this scenario.

@sgrodzicki sgrodzicki removed the bug Fixes for quality problems that affect the customer experience label Nov 16, 2020
@Kushmaro

Thanks for the mention @ravikesarwani , looping in @zanbel as he's now the owner of the project. (fka make-it-action)
If this is the case for all Cloud Deployments, then yeah, it's definitely an issue I think.

Currently, I'm not even sure Kibana can query Cloud APIs. Technically speaking (if I'm not mistaken), the GET /deployment API should return the status of all Kibana instances.

@chrisronline
Contributor Author

@zanbel @Kushmaro I don't know what's possible, but it'd be great if the cloud Kibana plugin could expose APIs that return some data that can help us detect this.

@Kushmaro

@chrisronline in the Cloud platform team, we're actually working on a way to allow Kibana to make API calls to Cloud.
It's mainly for improving UX and making the experience more seamless, but this is of course also a very important case.

/cc @bevacqua (who's leading this project) & @jowiho

@ravikesarwani
Contributor

@chrisronline It looks like this issue also happens when a configuration change is applied to a Kibana instance in Cloud (based on new comments in SDH https://github.com/elastic/sdh-kibana/issues/958).
As a workaround, my take would be to exclude "Kibana" from this alert type. In my view this change should be made for the next 7.10.x release.

Working with the Cloud team, we can figure out a solution and then enable the alert for Kibana.

@ravikesarwani ravikesarwani changed the title [Monitoring] Missing monitoring data alert firing for version upgrade [Monitoring] Missing monitoring data alert firing for version upgrade and configuration changes for Kibana in Cloud Nov 17, 2020
@jowiho

jowiho commented Nov 17, 2020

Before we head towards a particular solution, let's make sure we understand the problem. Does Cloud update Kibana the wrong way? Or does Kibana have the wrong expectations of how it gets updated?

@ravikesarwani
Contributor

@jowiho thanks for your comments. Makes sense.
Can you or someone on the Cloud team comment on how Kibana is updated in ESS? The uuid is of particular interest, as that's what ties monitoring data to the instances we know of.

@chrisronline
Contributor Author

@jowiho How does Cloud update Kibana right now? Kibana persists the uuid inside of a data/uuid file but I'm not sure if Cloud maintains this during the upgrade. It seems to maintain it for APM and ES as we don't see this behavior for upgrades on those stack products on Cloud.
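
The persistence behavior described here can be sketched as follows. This is not Kibana's actual code; `load_or_create_uuid` is a hypothetical helper that only mimics the "reuse the data/uuid file if present, otherwise generate a new uuid" pattern, to show why an instance started with an empty data directory looks like a brand-new instance in the monitoring data.

```python
import tempfile
import uuid
from pathlib import Path


def load_or_create_uuid(data_dir: Path) -> str:
    """Reuse the uuid stored in data_dir/uuid, or generate and store a new one."""
    uuid_file = data_dir / "uuid"
    if uuid_file.exists():
        return uuid_file.read_text().strip()
    new_uuid = str(uuid.uuid4())
    data_dir.mkdir(parents=True, exist_ok=True)
    uuid_file.write_text(new_uuid)
    return new_uuid


with tempfile.TemporaryDirectory() as old_dir, tempfile.TemporaryDirectory() as new_dir:
    first = load_or_create_uuid(Path(old_dir))
    # Restart with the same data directory: the uuid survives.
    after_restart = load_or_create_uuid(Path(old_dir))
    # Replacement instance with an empty data directory: a new uuid appears.
    after_replace = load_or_create_uuid(Path(new_dir))
```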

@chrisronline
Contributor Author

After speaking briefly with @AlexP-Elastic, it doesn't seem like they intentionally persist uuids across upgrades.

It appears that the Elasticsearch node_id is persisted when upgraded in Cloud, but I can verify that the ephemeral_id changes (we don't check this in our alert though).

It appears that APM's upgrade works the same way as Kibana, where a brand new uuid is generated and it's most likely a bug on our side that the alert isn't working in that case.

The alert is working as intended, in that it will detect missing monitoring data, but we are not currently capturing the upgrade scenario (which doesn't just affect Cloud). I'm not sure if we have a way of detecting the difference between a legitimate upgrade and an instance/node going down.

Maybe we can think about solving this by providing additional configuration for the alert, such as "Only alert if a node/instance is not reporting AND the total number of nodes/instances is not x", so a user with three Kibana instances can configure x=3 and we can use that to ensure we do not alert unnecessarily.
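
A minimal sketch of that rule, assuming the alert can count both the uuids that stopped reporting and the uuids currently reporting (the function and parameter names are hypothetical):

```python
def should_alert(missing_count: int, reporting_count: int, expected_count: int) -> bool:
    """Alert only if something stopped reporting AND the reporting total differs from x."""
    return missing_count > 0 and reporting_count != expected_count


# Upgrade: 3 old uuids vanish, 3 new ones report, so the total still matches x=3.
upgrade = should_alert(missing_count=3, reporting_count=3, expected_count=3)
# Outage: 1 instance goes down and only 2 still report.
outage = should_alert(missing_count=1, reporting_count=2, expected_count=3)
```

Under these assumptions the upgrade case stays quiet while the outage case still alerts, at the cost of one extra setting the user has to keep in sync with the deployment size.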

@ravikesarwani
Contributor

This alert is critical for Elasticsearch, and if it's working there then I would say we make it available for ES. I am not in favor of making the alert configuration complicated.

As a next step we can expand to other stack components.
We should investigate if we can persist uuids across upgrade.
We should also investigate why APM upgrade works. As I understood we don't have any special code for APM server.

@ravikesarwani
Contributor

BTW, is this alert applicable to Beats? If it is, we need to understand the behavior there as well.

@bevacqua
Contributor

@chrisronline
Contributor Author

> We should also investigate why APM upgrade works. As I understood we don't have any special code for APM server.

It's indeed a bug, and the fix is #83646

@ravikesarwani
Contributor

As part of upgrade we should be backing up the data and config directories in Cloud.
In fact this is something that we ask users to do for beats upgrade.
"Back up the data and config directories by copying them to another location."

Cloud team, can we look at doing this in the Cloud for Kibana and APM server upgrade?
This makes logical sense as well. The "data" directory can be used by the processes to store temporary/cache data and backing up and restoring that directory helps to recreate the original state after the upgrade.

Chris and I discussed this, and for 7.10.1 we will make this alert applicable only to Elasticsearch.
We need to do this as a stopgap; otherwise these false positives may lead customers to disable these alerts.
Once we resolve the issue with the Cloud team, we can enable this alert for other products.
I think it's critical to have this alert for APM, Beats, Logstash, etc. as well.

@AlexP-Elastic

> Cloud team, can we look at doing this in the Cloud for Kibana and APM server upgrade?

This would be a massive change to our infrastructure. Currently we only persist data (other than the global YAML config) across containers where Elasticsearch does that for us (I mentioned to Chris, we don't even have the concept of "we are moving instance X as part of an upgrade", we view it as "we are creating some new instances and then deleting the old ones"). There is little prospect of this happening in the foreseeable future. (cc @andrew-moldovan )

A smaller change that might be useful (it's not clear to me, see below) would be to switch APM/Kibana/etc from "grow-shrink" by default to "rolling in place" by default - I think they work this way just for legacy reasons. (cc @anyasabo @jhalterman not sure if that is planned/in progress/done?).

This would decrease the chance that any given configuration change would trigger an alert (but "moves" due to hardware failure and some capacity increases would still do this)

@anyasabo

Is there any other functionality for beats/apm/kibana that requires persistent data storage, other than the identifier? My mental model of beats and kibana is that they can be considered ephemeral and I can scale up and down as necessary, if we're supposed to be considering them stateful and want to persist data solely so we can make a particular alert work, that seems Not Great. Please help me out if my mental model here is wrong or if I'm misunderstanding something.

@chrisronline
Contributor Author

@anyasabo I honestly don't know, but it sounds like adapting how we think about these products to serve a single stack monitoring alert doesn't make much sense.

Perhaps we can solve this by relying on additional cloud APIs to give us more data. Cloud deployments know how many unique instances/nodes should exist and if we can access that data within the Stack Monitoring plugin, we can make smarter decisions about when to alert.

@anyasabo

I also ask because it is not just ECE, ECK is in play as well. So I am not sure we want to rely on a cloud API as our first direction.

@ravikesarwani
Contributor

ravikesarwani commented Dec 18, 2020

Can we revisit this for 7.12?
It looks like we have issues with Kibana and APM (it's working, but that may be accidental because of an oversight) in ESS.
For ESS we need to solve the issue working with the Cloud team.

Beats is another area which is not affected by ESS, and this alert may really be helpful there.
My take is that in a Kubernetes environment this may be noisy.

We need to discuss this in the 7.12 timeline and see what we can do.

Some options we need to explore from our side and working with Cloud team:

  • User can enable/disable alert for each product independently: ES, Kibana, Logstash, APM, Beats
  • The default configuration is tailored for each product and default optimized for that environment
  • Remove false positives in ESS (don't alert in known cases)
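
The first two options could look roughly like the sketch below. The names `MissingDataAlertConfig`, `PRODUCT_DEFAULTS`, and `products_to_check` are hypothetical, and the defaults are only an assumption that mirrors the 7.10.1 stopgap discussed earlier (alert enabled for Elasticsearch only); the real defaults would need to be decided per product.

```python
from dataclasses import dataclass


@dataclass
class MissingDataAlertConfig:
    enabled: bool
    lookback_days: int = 1  # the thread mentions a 1-day default lookback


# Assumed defaults: only Elasticsearch is enabled out of the box.
PRODUCT_DEFAULTS = {
    "elasticsearch": MissingDataAlertConfig(enabled=True),
    "kibana": MissingDataAlertConfig(enabled=False),
    "logstash": MissingDataAlertConfig(enabled=False),
    "apm": MissingDataAlertConfig(enabled=False),
    "beats": MissingDataAlertConfig(enabled=False),
}


def products_to_check(config):
    """Return the products whose missing-data alert is currently enabled."""
    return [name for name, settings in config.items() if settings.enabled]


defaults_enabled = products_to_check(PRODUCT_DEFAULTS)

# A user outside ESS opts in to the Beats alert independently of the others.
user_config = dict(PRODUCT_DEFAULTS)
user_config["beats"] = MissingDataAlertConfig(enabled=True)
user_enabled = products_to_check(user_config)
```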

@chrisronline
Contributor Author

I opened #86683 to track the work around creating separate alerts for each stack product to satisfy:

> The default configuration is tailored for each product and default optimized for that environment

> User can enable/disable alert for each product independently: ES, Kibana, Logstash, APM, Beats

9 participants