7.6.0 Kibana Metrics/Logs waffle/Map view missing hosts - data is being ingested by beats. #57797

Closed
sgreszcz opened this issue Feb 17, 2020 · 10 comments
Labels: Feature:Metrics UI, Team:Infra Monitoring UI - DEPRECATED, triage_needed

Comments

@sgreszcz

Kibana version:
Elastic Cloud 7.6.0
Elasticsearch version:
Elastic Cloud 7.6.0
Server OS version:
On-prem beats/logstash shippers running official vanilla Docker images on Ubuntu 18.04
Browser version:
Google Chrome
Version 79.0.3945.130 (Official Build) (64-bit)
Browser OS version:
Mac OS Catalina 10.15.3 (19D76)
Original install method (e.g. download page, yum, from source, etc.):
Docker for beats/logstash, Elastic and Kibana 7.6.0 on Elastic.co cloud.
Describe the bug:

Upgraded Elastic Cloud to 7.6.0 as well as on-prem filebeat, metricbeat, and logstash to 7.6.0 in order to fix ILM issues.

Data is arriving in Elastic Cloud from Filebeat and Metricbeat on 9 servers. We can see the data in Kibana "Discover" as well as in "Monitor".

However in the Kibana Metrics and Logs "Waffle" views, we don't see all our infrastructure:

Expected behavior:

See all the Linux hosts as in 7.5.1. Also see all the kubernetes pods (as in 7.5.1).

Screenshots (if relevant):

Screenshot 2020-02-17 at 11 58 49

Screenshot 2020-02-17 at 11 58 20

Screenshot 2020-02-17 at 12 08 54

Screenshot 2020-02-17 at 12 09 48

@monfera added the Team:Infra Monitoring UI - DEPRECATED and triage_needed labels on Feb 17, 2020
@elasticmachine
Contributor

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

@simianhacker
Member

How many hosts do you have in total? Looks like there are 11 hosts showing up in the waffle map.

@sgreszcz
Author

sgreszcz commented Mar 3, 2020

How many hosts do you have in total? Looks like there are 11 hosts showing up in the waffle map.

Sorry for the late reply. We have 9 hosts so there should be 9 in the waffle view: 7 servers with hostname starting with cdc-, one with elk- and another with nfl-. The ones with k8s-* should not be shown as hosts, but instead kubernetes nodes.

As you can see here, we only see 11 hosts. 6 of them are kubernetes nodes (k8s-*) and the rest are 5 of the Ubuntu Linux servers that should be showing up. Therefore there are 4 missing Ubuntu Linux servers: cdc-aer-001, cdc-rtp-001, cdc-bgl-001, cdc-sng-001. I can see all of the beats for all of the cdc servers showing up under "monitoring" as well as in the elasticsearch indexes (see below).

In the last screenshot you can see that we are getting data from the 6 kubernetes hosts and the 9 Ubuntu Linux servers. However they are not showing up under "Metrics" view.

Screenshot 2020-03-03 at 14 24 36

Screenshot 2020-03-03 at 14 30 30

Screenshot 2020-03-03 at 14 33 24

Also as an aside, the response time for any of the beats data for SIEM, "Discover" and "Monitoring/Logs" is significantly slower than with the same setup using 7.5.2.

@simianhacker
Member

I think at this point you're gonna need to dig into why the hosts don't show up in the underlying terms aggregation (see below). I would find a host that's missing from these results and then start looking at the data to see why it doesn't show up.

  • Do they show up when logged in as a superuser?
  • Is there data for that host that matches the time range?
  • Is the host's clock set to the current time? I've seen this happen where the host's internal clock is behind by a day, and when the data is shipped it's so far in the past that it misses the query time range.
  • Is the user's clock set correctly? The time range used is relative to the user's clock. This is usually only an issue when it "works" for one person but not another.

If we can't get the hosts to show up in a terms agg for the last 5 minutes then they are not going to show up in the UI. Make sure to modify the index patterns in the query below to match the index patterns set in the Metrics UI's Settings tab. We are using both the Logs (Filebeat) and Metrics (Metricbeat) index patterns.

GET metricbeat-*,filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "nodes": {
      "terms": {
        "field": "host.name",
        "size": 20
      }
    }
  }
}
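
Once the query above returns, comparing its buckets against the hosts you expect is mechanical. Here is a minimal Python sketch, assuming the response has already been parsed into a dict. The helper name `missing_hosts`, the trimmed sample response and the expected-host list are illustrative only, not part of Kibana or Elasticsearch.

```python
def missing_hosts(response, expected, agg_name="nodes"):
    """Return expected host names that have no bucket in a terms aggregation response."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    seen = {bucket["key"] for bucket in buckets}
    # Keep the caller's ordering so the result is easy to scan.
    return [host for host in expected if host not in seen]


# Hypothetical, trimmed-down response shaped like a terms aggregation result.
sample_response = {
    "aggregations": {
        "nodes": {
            "buckets": [
                {"key": "cdc-alln-001", "doc_count": 13661},
                {"key": "elk-alln-001", "doc_count": 1955},
            ]
        }
    }
}

expected_hosts = ["cdc-alln-001", "cdc-rtp-001", "elk-alln-001"]
print(missing_hosts(sample_response, expected_hosts))
```

Any host this returns is one to inspect for the clock, time range and permission questions listed above.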

@sgreszcz
Author

sgreszcz commented Mar 4, 2020

Thanks for the detailed reply. My main concern is that this was working perfectly on 7.5.2 and previous releases back to when this metrics/infrastructure waffle view was first introduced. I'm using vanilla filebeat and metricbeat from Elastic Docker containers.

I guess I'm going to have to stop and delete the beats data shippers, delete all the indexes, redeploy the containers and see if this fixes things.

@simianhacker
Member

Deleting everything seems extreme and you shouldn't have to do that. This seems like some kind of data issue to me, specifically with the cdc-* hosts.

Looking at the bar chart visualization above (with agent.hostname on the X axis), I can see the missing hosts. I wonder what host.name is set to for those events? You could run this aggregation to get an idea:

GET metricbeat-*,filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "agentHostnames": {
      "terms": {
        "field": "agent.hostname",
        "size": 20
      },
      "aggs": {
        "hostnames": {
          "terms": {
            "field": "host.name",
            "size": 10
          }
        }
      }
    }
  }
}

Did you also upgrade the Beat shippers to 7.6? Have you made any changes to the Settings tab under Metrics UI?

@sgreszcz
Author

sgreszcz commented Mar 10, 2020

There are not many documents right now as I had to delete the filebeat-* and metricbeat-* indexes as they were named incorrectly and that was breaking ILM (elastic/beats#15424)

However, the waffle is still incomplete. Silly question: is there a maximum of 10 hosts that you can see in the waffle view?

Seems like some of the cdc-* servers are missing based on your query. This is really weird as the docker config for metricbeat and filebeat are exactly the same and the servers are the same (built with the same ansible templates). Also this worked fine in 7.5.1 with no changes to the config (just upgraded beats and the elastic cloud to 7.6 to fix ILM problems).

{
  "took" : 268,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agentHostnames" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "cdc-alln-stg",
          "doc_count" : 38226,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-alln-stg",
                "doc_count" : 38226
              }
            ]
          }
        },
        {
          "key" : "k8s-alln-002",
          "doc_count" : 20671,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-alln-002",
                "doc_count" : 20671
              }
            ]
          }
        },
        {
          "key" : "cdc-alln-001",
          "doc_count" : 13661,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-alln-001",
                "doc_count" : 13661
              }
            ]
          }
        },
        {
          "key" : "cdc-sjc-001",
          "doc_count" : 13650,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-sjc-001",
                "doc_count" : 13650
              }
            ]
          }
        },
        {
          "key" : "k8s-rcdn-002",
          "doc_count" : 5298,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-rcdn-002",
                "doc_count" : 5298
              }
            ]
          }
        },
        {
          "key" : "k8s-alln-003",
          "doc_count" : 5212,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-alln-003",
                "doc_count" : 5212
              }
            ]
          }
        },
        {
          "key" : "k8s-rcdn-003",
          "doc_count" : 4562,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-rcdn-003",
                "doc_count" : 4562
              }
            ]
          }
        },
        {
          "key" : "k8s-alln-001",
          "doc_count" : 4468,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-alln-001",
                "doc_count" : 4468
              }
            ]
          }
        },
        {
          "key" : "k8s-rcdn-001",
          "doc_count" : 4385,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-rcdn-001",
                "doc_count" : 4385
              }
            ]
          }
        },
        {
          "key" : "elk-alln-001",
          "doc_count" : 1955,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "elk-alln-001",
                "doc_count" : 1955
              }
            ]
          }
        },
        {
          "key" : "nfl-aer-001",
          "doc_count" : 1380,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "nfl-aer-001",
                "doc_count" : 1380
              }
            ]
          }
        },
        {
          "key" : "cdc-aer-001",
          "doc_count" : 751,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-aer-001",
                "doc_count" : 751
              }
            ]
          }
        },
        {
          "key" : "cdc-sng-001",
          "doc_count" : 368,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-sng-001",
                "doc_count" : 368
              }
            ]
          }
        }
      ]
    }
  }
}
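
Reading a response like this by eye is error-prone, so here is a small Python sketch of the two checks it supports: whether any agent.hostname carries a different host.name, and which of the hosts named earlier in the thread as missing have no bucket at all. The helper names are illustrative and the `localhost` example is hypothetical; only the bucket keys are copied from the response above.

```python
def hostname_mismatches(response, outer="agentHostnames", inner="hostnames"):
    """Map each agent.hostname to the host.name values that differ from it."""
    mismatches = {}
    for agent_bucket in response["aggregations"][outer]["buckets"]:
        agent = agent_bucket["key"]
        names = [b["key"] for b in agent_bucket[inner]["buckets"]]
        differing = [name for name in names if name != agent]
        # An agent with no host.name bucket at all is also worth flagging.
        if differing or not names:
            mismatches[agent] = differing
    return mismatches


def bucket(agent, host_names):
    """Build one outer bucket shaped like the response above (doc counts omitted)."""
    return {"key": agent, "hostnames": {"buckets": [{"key": n} for n in host_names]}}


# Outer bucket keys copied from the response above; each one has a single,
# identical host.name bucket there.
reported = [
    "cdc-alln-stg", "k8s-alln-002", "cdc-alln-001", "cdc-sjc-001", "k8s-rcdn-002",
    "k8s-alln-003", "k8s-rcdn-003", "k8s-alln-001", "k8s-rcdn-001", "elk-alln-001",
    "nfl-aer-001", "cdc-aer-001", "cdc-sng-001",
]
response = {"aggregations": {"agentHostnames": {"buckets": [bucket(h, [h]) for h in reported]}}}
print(hostname_mismatches(response))

# Hosts named earlier in the thread as missing from the waffle view.
named_missing = ["cdc-aer-001", "cdc-rtp-001", "cdc-bgl-001", "cdc-sng-001"]
still_absent = [h for h in named_missing if h not in reported]
print(still_absent)

# Hypothetical example of what a mismatch would look like.
odd = {"aggregations": {"agentHostnames": {"buckets": [bucket("cdc-rtp-001", ["localhost"])]}}}
print(hostname_mismatches(odd))
```

With the keys from the response above, the mismatch check comes back empty, and cdc-rtp-001 and cdc-bgl-001 are the two named hosts with no bucket in the queried window.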

@simianhacker
Member

simianhacker commented Mar 16, 2020

@sgelastic The query/aggregation above is NOT what we are running for the waffle map; if you need to increase the size from 20 to 100 you can do so. I was merely using that query to see whether the agent.hostname and host.name values matched what we expected, and it appears they do.

We are using a composite aggregation and paginating through the results for the display. Theoretically it supports an unlimited number of hosts; we have customers with thousands of hosts. The only filters being applied are what's set in the UI via the search box or indirectly via the groupings.
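
For readers unfamiliar with composite aggregations, the pagination pattern can be sketched in Python as below. This is not the query Kibana actually runs; it only illustrates paging with `after_key`. The `search` argument is a hypothetical callable that takes a request body and returns the parsed response, and `fake_search` is a stand-in so the sketch runs without a cluster.

```python
def iter_composite_hosts(search, page_size=100, field="host.name"):
    """Yield every host name by paging through a composite aggregation."""
    after = None
    while True:
        composite = {
            "size": page_size,
            "sources": [{"host": {"terms": {"field": field}}}],
        }
        if after is not None:
            # Resume from the last key returned by the previous page.
            composite["after"] = after
        body = {"size": 0, "aggs": {"nodes": {"composite": composite}}}
        result = search(body)["aggregations"]["nodes"]
        buckets = result.get("buckets", [])
        if not buckets:
            return
        for b in buckets:
            yield b["key"]["host"]
        after = result.get("after_key")
        if after is None:
            return


def fake_search(body, _hosts=("a", "b", "c", "d", "e")):
    """Stand-in for a real search call; serves sorted host names page by page."""
    composite = body["aggs"]["nodes"]["composite"]
    after = composite.get("after", {}).get("host")
    remaining = [h for h in _hosts if after is None or h > after]
    page = remaining[: composite["size"]]
    nodes = {"buckets": [{"key": {"host": h}, "doc_count": 1} for h in page]}
    if page:
        nodes["after_key"] = {"host": page[-1]}
    return {"aggregations": {"nodes": nodes}}


print(list(iter_composite_hosts(fake_search, page_size=2)))
```

Because each page resumes from the previous page's last key, the page size limits only the size of each request, not the total number of hosts returned.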

We recently changed (7.6.0) the waffle map "bucket size" to use event.dataset instead of the one-minute bucket that we had hard-coded. This means that if you didn't change the defaults in Metricbeat, it will try to make 10-second buckets. The issue with that is that not every Metricbeat agent's data is sent at the same time (consistently), which could potentially lead to a problem where a host misses the window (10 seconds * 5). Here is the PR to limit things back to 1 minute or greater: #58503
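
The window arithmetic can be made concrete with a few lines of Python. This is a rough sketch that assumes the lookback window is simply the bucket size multiplied by five, as the "(10 seconds * 5)" above suggests; the function names and the 55-second example are illustrative and do not describe Kibana's internals.

```python
def lookback_seconds(bucket_seconds, bucket_count=5):
    """Length of the lookback window, assuming it spans bucket_count buckets."""
    return bucket_seconds * bucket_count


def host_visible(last_event_age_seconds, bucket_seconds, bucket_count=5):
    """True if the host's most recent event still falls inside the window."""
    return last_event_age_seconds <= lookback_seconds(bucket_seconds, bucket_count)


print(lookback_seconds(10))   # 10-second buckets give a 50-second window
print(lookback_seconds(60))   # 1-minute buckets give a 300-second window
print(host_visible(55, 10))   # an event that is 55 seconds old misses the 50-second window
print(host_visible(55, 60))   # the same event is well inside the 300-second window
```

Under that assumption, a host whose latest document is slightly delayed drops out with 10-second buckets but stays visible with buckets of one minute or more.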

We could test this by changing the time range query to 50 seconds to see if those hosts are being missed. If you run that agg a few times consecutively and they show up and disappear then we know the above fix will probably solve this issue for you.

GET metricbeat-*,filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-50s",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host.name",
        "size": 100
      }
    }
  }
}
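
To judge whether hosts "show up and disappear" across consecutive runs of the query above, the host.name keys from each run can be compared. Here is a minimal Python sketch, assuming the bucket keys of each run have already been collected into a list; `flapping_hosts` and the sample runs are hypothetical.

```python
def flapping_hosts(runs):
    """Return hosts that appear in at least one run but not in every run."""
    run_sets = [set(run) for run in runs]
    if not run_sets:
        return set()
    seen_anywhere = set().union(*run_sets)
    seen_everywhere = set.intersection(*run_sets)
    return seen_anywhere - seen_everywhere


# Hypothetical bucket keys from three consecutive runs of the 50-second query.
runs = [
    ["cdc-alln-001", "elk-alln-001", "cdc-rtp-001"],
    ["cdc-alln-001", "elk-alln-001"],
    ["cdc-alln-001", "elk-alln-001", "cdc-rtp-001"],
]
print(sorted(flapping_hosts(runs)))
```

A non-empty result would point to the short bucket window described above rather than to missing data.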

@simianhacker
Member

What's the status of this issue? It's been almost a month since we've heard back. I'm going to close this issue on Friday, April 17th unless there is any new information.

@jasonrhodes
Member

@sgelastic feel free to re-open this and ping Chris and me if you have new information, thanks.
