7.6.0 Kibana Metrics/Logs waffle/Map view missing hosts - data is being ingested by beats. #57797

sgreszcz · 2020-02-17T12:11:07Z

Kibana version:
Elastic Cloud 7.6.0
Elasticsearch version:
Elastic Cloud 7.6.0
Server OS version:
On-prem beats/logstash shippers running official vanilla Docker images on Ubuntu 18.04
Browser version:
Google Chrome
Version 79.0.3945.130 (Official Build) (64-bit)
Browser OS version:
Mac OS Catalina 10.15.3 (19D76)
Original install method (e.g. download page, yum, from source, etc.):
Docker for beats/logstash, Elastic and Kibana 7.6.0 on Elastic.co cloud.
Describe the bug:

Upgraded Elastic Cloud to 7.6.0 as well as on-prem filebeat, metricbeat, and logstash to 7.6.0 in order to fix ILM issues.

Data is arriving in Elastic Cloud from Filebeat and Metricbeat on 9 servers. We can see the data in Kibana "Discover" as well as in "Monitoring".

However in the Kibana Metrics and Logs "Waffle" views, we don't see all our infrastructure:

Expected behavior:

See all the Linux hosts as in 7.5.1. Also see all the kubernetes pods (as in 7.5.1).

Screenshots (if relevant):

elasticmachine · 2020-02-17T16:53:14Z

Pinging @elastic/logs-metrics-ui (Team:logs-metrics-ui)

simianhacker · 2020-02-25T15:56:25Z

How many hosts do you have in total? Looks like there are 11 hosts showing up in the waffle map.

sgreszcz · 2020-03-03T14:36:02Z

> How many hosts do you have in total? Looks like there are 11 hosts showing up in the waffle map.

Sorry for the late reply. We have 9 hosts, so there should be 9 in the waffle view: 7 servers with hostnames starting with cdc-, one with elk-, and another with nfl-. The ones named k8s-* should not be shown as hosts but as Kubernetes nodes.

As you can see here, we only see 11 hosts. 6 of them are kubernetes nodes (k8s-*) and the rest are 5 of the Ubuntu Linux servers that should be showing up. Therefore there are 4 missing Ubuntu Linux servers: cdc-aer-001, cdc-rtp-001, cdc-bgl-001, cdc-sng-001. I can see all of the beats for all of the cdc servers showing up under "monitoring" as well as in the elasticsearch indexes (see below).

In the last screenshot you can see that we are getting data from the 6 kubernetes hosts and the 9 Ubuntu Linux servers. However they are not showing up under "Metrics" view.

Also as an aside, the response time for any of the beats data for SIEM, "Discover" and "Monitoring/Logs" is significantly slower than with the same setup using 7.5.2.

simianhacker · 2020-03-03T19:31:05Z

I think at this point you're gonna need to dig into why the hosts don't show up in the underlying terms aggregation (see below). I would find a host that's missing from these results and then start looking at the data to see why it doesn't show up.

Do they show up when logged in as a superuser?
Is there data for that host that matches the time range?
Is the host's clock set to the current time? I've seen this happen where the host's internal clock is behind by a day, and when the data is shipped it's so far in the past that it misses the query time range.
Is the user's clock set correctly? The time range used is relative to the user's clock. This is usually only an issue when it "works" for one person but not another.

If we can't get the hosts to show up in a terms agg for the last 5 minutes then they are not going to show up in the UI. Make sure to modify the index patterns in the query below to match the index patterns set in the Metrics UI's Settings tab. We are using both the Logs (Filebeat) and Metrics (Metricbeat) index patterns.

GET metricbeat-*,filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "nodes": {
      "terms": {
        "field": "host.name",
        "size": 20
      }
    }
  }
}
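To make the check above less error-prone, the buckets returned by the terms aggregation can be compared against the list of hosts you expect. The following is only a minimal sketch: the function name `missing_hosts`, the abridged `sample_response`, and its doc counts are made up for illustration, and the aggregation name `nodes` is taken from the query above.

```python
from typing import Iterable, List


def missing_hosts(response: dict, expected: Iterable[str], agg_name: str = "nodes") -> List[str]:
    """Return the expected host names that have no bucket in the terms aggregation."""
    buckets = response.get("aggregations", {}).get(agg_name, {}).get("buckets", [])
    seen = {bucket["key"] for bucket in buckets}
    return sorted(set(expected) - seen)


# Hypothetical, abridged response in the shape Elasticsearch returns for a terms agg.
sample_response = {
    "aggregations": {
        "nodes": {
            "buckets": [
                {"key": "cdc-alln-001", "doc_count": 120},
                {"key": "elk-alln-001", "doc_count": 45},
            ]
        }
    }
}

expected_hosts = ["cdc-alln-001", "cdc-aer-001", "cdc-rtp-001", "elk-alln-001"]
print(missing_hosts(sample_response, expected_hosts))
```

Any host name this prints is a candidate for the questions above (permissions, time range, clocks).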

sgreszcz · 2020-03-04T11:39:34Z

Thanks for the detailed reply. My main concern is that this was working perfectly on 7.5.2 and previous releases back to when this metrics/infrastructure waffle view was first introduced. I'm using vanilla filebeat and metricbeat from Elastic Docker containers.

I guess I'm going to have to stop and delete the beats data shippers, delete all the indexes, redeploy the containers and see if this fixes things.

simianhacker · 2020-03-04T15:02:37Z

Deleting everything seems extreme and you shouldn't have to do that. This seems like some kind of data issue to me, specifically with the cdc-* hosts.

Looking at the bar chart visualization above (with agent.hostname on the X axis), I can see the missing hosts. I wonder what host.name is set to for those events? You could run this aggregation to get an idea:

GET metricbeat-*,filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-5m",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "agentHostnames": {
      "terms": {
        "field": "agent.hostname",
        "size": 20
      },
      "aggs": {
        "hostnames": {
          "terms": {
            "field": "host.name",
            "size": 10
          }
        }
      }
    }
  }
}

Did you also upgrade the Beat shippers to 7.6? Have you made any changes to the Settings tab under Metrics UI?

sgreszcz · 2020-03-10T14:01:57Z

There are not many documents right now as I had to delete the filebeat-* and metricbeat-* indexes as they were named incorrectly and that was breaking ILM (elastic/beats#15424)

However, the waffle is still incomplete. Silly question: is there a maximum of 10 hosts that you can see in the waffle view?

Seems like some of the cdc-* servers are missing based on your query. This is really weird, as the Docker config for metricbeat and filebeat is exactly the same and the servers are the same (built with the same ansible templates). Also, this worked fine in 7.5.1 with no changes to the config (just upgraded beats and the elastic cloud to 7.6 to fix ILM problems).

{
  "took" : 268,
  "timed_out" : false,
  "_shards" : {
    "total" : 2,
    "successful" : 2,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 10000,
      "relation" : "gte"
    },
    "max_score" : null,
    "hits" : [ ]
  },
  "aggregations" : {
    "agentHostnames" : {
      "doc_count_error_upper_bound" : 0,
      "sum_other_doc_count" : 0,
      "buckets" : [
        {
          "key" : "cdc-alln-stg",
          "doc_count" : 38226,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-alln-stg",
                "doc_count" : 38226
              }
            ]
          }
        },
        {
          "key" : "k8s-alln-002",
          "doc_count" : 20671,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-alln-002",
                "doc_count" : 20671
              }
            ]
          }
        },
        {
          "key" : "cdc-alln-001",
          "doc_count" : 13661,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-alln-001",
                "doc_count" : 13661
              }
            ]
          }
        },
        {
          "key" : "cdc-sjc-001",
          "doc_count" : 13650,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-sjc-001",
                "doc_count" : 13650
              }
            ]
          }
        },
        {
          "key" : "k8s-rcdn-002",
          "doc_count" : 5298,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-rcdn-002",
                "doc_count" : 5298
              }
            ]
          }
        },
        {
          "key" : "k8s-alln-003",
          "doc_count" : 5212,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-alln-003",
                "doc_count" : 5212
              }
            ]
          }
        },
        {
          "key" : "k8s-rcdn-003",
          "doc_count" : 4562,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-rcdn-003",
                "doc_count" : 4562
              }
            ]
          }
        },
        {
          "key" : "k8s-alln-001",
          "doc_count" : 4468,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-alln-001",
                "doc_count" : 4468
              }
            ]
          }
        },
        {
          "key" : "k8s-rcdn-001",
          "doc_count" : 4385,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "k8s-rcdn-001",
                "doc_count" : 4385
              }
            ]
          }
        },
        {
          "key" : "elk-alln-001",
          "doc_count" : 1955,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "elk-alln-001",
                "doc_count" : 1955
              }
            ]
          }
        },
        {
          "key" : "nfl-aer-001",
          "doc_count" : 1380,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "nfl-aer-001",
                "doc_count" : 1380
              }
            ]
          }
        },
        {
          "key" : "cdc-aer-001",
          "doc_count" : 751,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-aer-001",
                "doc_count" : 751
              }
            ]
          }
        },
        {
          "key" : "cdc-sng-001",
          "doc_count" : 368,
          "hostnames" : {
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets" : [
              {
                "key" : "cdc-sng-001",
                "doc_count" : 368
              }
            ]
          }
        }
      ]
    }
  }
}

simianhacker · 2020-03-16T18:02:18Z

@sgelastic The query/aggregation above is NOT what we are running for the waffle map; if you need to increase the size from 20 to 100 you can do so. I was merely using that query to see if the agent.hostname and host.name values matched what we expected, and it appears they do.

We are using a composite aggregation and paginating through the results for the display. Theoretically it supports an unlimited number of hosts; we have customers with thousands of hosts. The only filters being applied are what's set in the UI via the search box or indirectly via the groupings.
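For readers unfamiliar with composite aggregations, here is a rough sketch of how paginating through one generally works. It is not the Metrics UI's actual code: the aggregation name `hosts`, the source name `host`, and the `search` callable (a stand-in for whatever client sends the request body to `_search`) are all assumptions made for illustration. `fake_search` only imitates the response shape so the sketch runs without a cluster.

```python
from typing import Callable, List, Optional


def composite_body(page_size: int, after: Optional[dict] = None) -> dict:
    """Build a search body with a composite aggregation on host.name."""
    composite = {
        "size": page_size,
        "sources": [{"host": {"terms": {"field": "host.name"}}}],
    }
    if after is not None:
        # Resume from the after_key returned with the previous page.
        composite["after"] = after
    return {"size": 0, "aggs": {"hosts": {"composite": composite}}}


def all_hosts(search: Callable[[dict], dict], page_size: int = 100) -> List[str]:
    """Collect every host name by following after_key until a page is empty."""
    hosts: List[str] = []
    after = None
    while True:
        result = search(composite_body(page_size, after))["aggregations"]["hosts"]
        hosts.extend(bucket["key"]["host"] for bucket in result["buckets"])
        after = result.get("after_key")
        if after is None or not result["buckets"]:
            return hosts


def fake_search(body: dict) -> dict:
    """Hypothetical stand-in for a cluster holding three hosts."""
    names = ["cdc-aer-001", "cdc-sng-001", "elk-alln-001"]
    composite = body["aggs"]["hosts"]["composite"]
    after = composite.get("after", {}).get("host", "")
    page = [name for name in names if name > after][: composite["size"]]
    agg = {"buckets": [{"key": {"host": name}, "doc_count": 1} for name in page]}
    if page:
        agg["after_key"] = {"host": page[-1]}
    return {"aggregations": {"hosts": agg}}


print(all_hosts(fake_search, page_size=2))
```

Because each page resumes from the previous `after_key`, the page size bounds the cost of a single request, not the total number of hosts that can be listed.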

We recently changed (in 7.6.0) the waffle map "bucket size" to use event.dataset instead of the one-minute bucket that we had hard-coded. This means that if you didn't change the defaults in Metricbeat, it will try to make 10-second buckets. The issue is that not all Metricbeat agents send their data at the same time (consistently), which could lead to a problem where a host misses the window (10 seconds * 5). Here is the PR to limit things back to 1 minute or greater: #58503

We could test this by changing the time range query to 50 seconds to see if those hosts are being missed. If you run that agg a few times consecutively and they show up and disappear then we know the above fix will probably solve this issue for you.

GET metricbeat-*,filebeat-*/_search
{
  "query": {
    "bool": {
      "filter": [
        {
          "range": {
            "@timestamp": {
              "gte": "now-50s",
              "lte": "now"
            }
          }
        }
      ]
    }
  },
  "size": 0,
  "aggs": {
    "hosts": {
      "terms": {
        "field": "host.name",
        "size": 100
      }
    }
  }
}
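To illustrate why the 50-second window matters, here is a small sketch of the arithmetic. The "bucket size times 5" multiplier comes from the "(10 seconds * 5)" note above; the function name, the `last_seen` timestamps, and the fixed `now` value are invented for the example.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def hosts_outside_window(
    last_seen: Dict[str, datetime],
    now: datetime,
    bucket_seconds: int = 10,
    buckets: int = 5,
) -> List[str]:
    """Return hosts whose most recent event is older than bucket_seconds * buckets."""
    window = timedelta(seconds=bucket_seconds * buckets)
    return sorted(host for host, ts in last_seen.items() if now - ts > window)


# Hypothetical timestamps: one host reported 20 s ago, the other 65 s ago.
now = datetime(2020, 3, 16, 18, 0, 0)
last_seen = {
    "cdc-alln-001": now - timedelta(seconds=20),
    "cdc-aer-001": now - timedelta(seconds=65),
}

# 10-second buckets give a 50-second window, so the slower host drops out.
print(hosts_outside_window(last_seen, now, bucket_seconds=10))
# One-minute buckets give a 5-minute window, so both hosts are inside it.
print(hosts_outside_window(last_seen, now, bucket_seconds=60))
```

A host that only falls outside the window on some runs would match the "show up and disappear" behaviour described above.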

simianhacker · 2020-04-13T18:26:58Z

What's the status of this issue? It's been almost a month since we've heard back. I'm going to close this issue on Friday, April 17th unless there is any new information.

jasonrhodes · 2020-04-20T15:20:20Z

@sgelastic feel free to re-open this and ping Chris and me if you have new information, thanks.

monfera added the Team:Infra Monitoring UI - DEPRECATED and triage_needed labels Feb 17, 2020

simianhacker added the Feature:Metrics UI label Feb 18, 2020

sgrodzicki added [zube]: Ready and removed [zube]: Ready labels Feb 18, 2020

sgrodzicki assigned simianhacker Feb 18, 2020

simianhacker mentioned this issue Feb 25, 2020

[Metrics UI] Inventory View interval should be larger to ensure data shows up. #58494

Closed

jasonrhodes closed this as completed Apr 20, 2020

zube bot added [zube]: Done and removed [zube]: Investigate labels Apr 20, 2020

zube bot removed the [zube]: Done label Oct 13, 2020
