[Infra UI] Improve waffle map custom "group by" fields #31405

skh · 2019-02-18T10:40:49Z

Currently, it is possible to enter a custom value in the "Group by" input field. Depending on what the user enters, it is possible that the documents that are returned don't contain the metrics selected for display in the "Metric" input field.

Example:

User enters service.type as field to group by
User wants to see a system metric on the waffle map squares (until [Infra UI] Support custom fields in Metric-dropdown #30197 is implemented that's all they can select anyway)
The documents that contain a service.type other than "System" do not contain system metrics, so the waffle map is not displayed as expected

Proposed solution:

Split the query performed for the waffle map in two steps:
- First, group by custom field to get a list of node ids
- Second, query by node ids to get metric values to display on the waffle map squares

Caveats:

How can we deal gracefully with very large numbers of node ids? (Related: [Infra UI] Waffle map doesn't scale well to show many nodes in groupings #27500 )
Should the two-step query logic happen on the server or on the client?
- The first would mean that the waffle map view can stay mostly unchanged and the graphql endpoint serving waffle map data will perform several queries to elastic search
- The second would mean that the two-step query logic is reflected in the component / container structure and the graphql endpoints stay (relatively) simple
How will all this affect performance?

The text was updated successfully, but these errors were encountered:

elasticmachine · 2019-02-18T10:40:51Z

Pinging @elastic/infrastructure-ui

roncohen · 2019-02-18T12:31:16Z

ping @simianhacker for input

weltenwort · 2019-02-18T14:07:00Z

Other questions that came to mind:

Assuming the querying logic is split into two steps, which will the query bar filters apply to? (intuitively the grouping query?)
Which time ranges are used by the two queries? (a larger interval for the grouping query and a small one for the metrics?)

weltenwort · 2019-02-18T18:36:37Z

In a recent discussion the following querying procedure seemed feasible. As in the original proposal above, it relies on performing two queries, but reduces the risk of overly large request body sizes.

Procedure

For a waffle map of node type T and grouping fields [G1, G2] with metric M...

perform two queries, potentially in parallel:
- query all metrics docs with
  - time range filter
  - size 0
  - terms aggregation over G1
  - therein nested terms aggregation over G2
  - therein nested terms aggregation over T
- query all docs containing M
  - time ranger filter
  - size 0
  - terms aggregation over T
  - therein nested metric aggregation for M
render waffle map structure based on the first query
render node attributes (color, size, text, etc) based on values obtained in the second query

Limitations

To ensure correctness with a large number of nodes, the composite aggregation should probably be used in both queries.
The field for T must be present in all metricbeat documents (e.g. host.name), which limits the set of possible future node types.

Advantages

parallel execution possible
separate refresh schedule possible
request sizes don't grow with node count (responses obviously do)
scales to requesting multiple unrelated metrics at once in the second query

simianhacker · 2019-02-19T15:00:49Z

@weltenwort I had planned to change this over to a composite aggregation soon because it would simplify the code some along with reducing the load on the ES cluster for the request. Instead of sending n / TotalNodes in parallel it would send n / TotalNodes in series because of how composite aggregations work with the after key. It's a little slower but it's less likely to crash the server.

With some creative work on the aggregation, we could use the composite aggregation to get all the data in one pass. I would key off the node id (eg host.name) for the composite then add 2 sub aggregations; one for metric, one for groupings. When the nodes come back you can process each node similar to what we do today. The main caveat for this would be the filter is applied to both the metric and the groupings. Although we could just add a filter aggregation above the groupings, we would need to do some performance testing to decide approach works the best; filter aggregations can sometimes be a little slow.

The aggregation tree would look something like:

composite(host.name)
  - dateHistogram(@timestamp)
    - avg(system.cpu.user.pct)
  - filter(search term)
    - terms(G1)
      - terms(G2)

If we don't apply the filter to the metric the user won't be able to do things like system.cpu.user.pct: >0.5 which is probably fine since I doubt too many people have that use case.

I would avoid moving these requests & processing to the client. Testing stuff in the browser is a total nightmare, IMHO. This stuff gets pretty complex and dealing with that complexity on the server is just easier. Furthermore, if you have to make multiple round trips, which this requires, you usually have lower latency from the server to ES then you do with the client since they typically sit on the same network.

@weltenwort What's the motivation to move this to the client?

weltenwort · 2019-02-19T15:11:53Z

Applying the filter to the metric would mean it the results might be incorrect if the grouping field is not present on every metric field (which is exactly what causes the initial problem). So wrapping only the grouping aggregation into the filter makes sense 👍

You're right, there is no need to move it to the browser (unless we absolutely need different refresh rates or incremental rendering). I forgot to make the distinction between browser/"elasticsearch client" in my description.

skh · 2019-02-20T14:23:34Z

composite(host.name)
  - dateHistogram(@timestamp)
    - avg(system.cpu.user.pct)
  - filter(search term)
    - terms(G1)
      - terms(G2)

That sample query in elasticsearch syntax (thanks @simianhacker):

POST metricbeat-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  },
  "size": 0, 
  "aggs": {
    "myNodes": {
      "composite": {
        "sources": [
          {
            "myNode": {
              "terms": {
                "field": "host.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "myTimeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "1m"
          },
          "aggs": {
            "metric": {
              "avg": {
                "field": "system.cpu.user.pct"
              }
            }
          }
        },
        "myGroupings": {
          "filter": {
            "match_all": {}
          },
          "aggs": {
            "myGroup1": {
              "terms": {
                "field": G1
              },
              "aggs": {
                "myGroup2": {
                  "terms": {
                    "field": G2
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

weltenwort · 2019-02-20T14:35:15Z

I wonder whether the inner terms aggs for G1 and G2 should also be wrapped in composites to ensure all groups can be paginated over. OTOH I don't know what that nested pagination would look like.

skh · 2019-02-20T14:44:36Z

I'm not sure repeating the whole query would be the way to go if we want to add pagination to a single group box. I don't have a better idea right now but it feels wrong.

skh · 2019-02-20T15:18:01Z

Also, with the query above, the myGroupings result is present in every bucket in myNodes, so that would the wrong place to paginate anyway, if I understood your comment correctly.

weltenwort · 2019-02-20T15:29:36Z

I was talking about pagination on the ES api level, not the UI. There is a limit on the number of buckets ES will return per aggregation, so we might have to do multiple round-trips if we exceed that number. Looks like we would have to perform three levels of pagination.

Or we detect that and somehow communicate to the user that we can't display all data.

simianhacker · 2019-02-20T15:47:40Z

How likely will the grouping be high enough cardinality that we need to paginate the groupings in ES to get all of them? I haven't seen anything that was more than 5~10 at each level, and typically they only belong to 1 group at each level. The high cardinality grouping is a pretty rare edge case and will likely result in an unusable UX experience. I think we could safely limit those terms aggs to 10 and be just fine.

skh · 2019-02-20T15:52:38Z

I think it wouldn't work anyway, I get this when I try:

"error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "[composite] aggregation cannot be used with a parent aggregation of type: [CompositeAggregationFactory]"
      }
    ]
   ...

weltenwort · 2019-02-20T15:53:45Z

I can easily see someone wanting to group containers by host with more than 10 containers running on each host.

simianhacker · 2019-02-20T16:03:32Z

Then we should include the group by's in the source keys and make two requests. One request where the source is groupings + nodeId without any child aggregations, one request where sources is just the nodeId but it has the date_histogram and metrics as a child aggregation.

weltenwort · 2019-02-20T16:50:07Z

Both could be sent at once using an _msearch if we're concerned about the round-trip latency.

skh added the Feature:Metrics UI Metrics UI feature label Feb 18, 2019

makwarth mentioned this issue Feb 18, 2019

[Infra UI] Services overview page #26167

Closed

skh self-assigned this Feb 18, 2019

alvarolobato added the [zube]: Ready label Feb 18, 2019

weltenwort added the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Feb 18, 2019

alvarolobato added the v7.2.0 label Feb 19, 2019

skh added [zube]: In Progress and removed [zube]: Ready labels Feb 20, 2019

skh mentioned this issue Apr 1, 2019

[Infra UI] Add new graphql endpoint for snapshot data #34264

Merged

skh mentioned this issue Apr 25, 2019

[Infra UI] Use new snapshot endpoint #34938

Merged

2 tasks

skh closed this as completed Apr 29, 2019

skh added [zube]: Done and removed [zube]: In Progress labels Apr 29, 2019

alvarolobato removed the [zube]: Done label Apr 29, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Infra UI] Improve waffle map custom "group by" fields #31405

[Infra UI] Improve waffle map custom "group by" fields #31405

skh commented Feb 18, 2019

elasticmachine commented Feb 18, 2019

roncohen commented Feb 18, 2019

weltenwort commented Feb 18, 2019 •

edited

Loading

weltenwort commented Feb 18, 2019 •

edited

Loading

simianhacker commented Feb 19, 2019

weltenwort commented Feb 19, 2019

skh commented Feb 20, 2019 •

edited

Loading

weltenwort commented Feb 20, 2019

skh commented Feb 20, 2019 •

edited

Loading

skh commented Feb 20, 2019

weltenwort commented Feb 20, 2019

simianhacker commented Feb 20, 2019

skh commented Feb 20, 2019

weltenwort commented Feb 20, 2019 •

edited

Loading

simianhacker commented Feb 20, 2019

weltenwort commented Feb 20, 2019

[Infra UI] Improve waffle map custom "group by" fields #31405

[Infra UI] Improve waffle map custom "group by" fields #31405

Comments

skh commented Feb 18, 2019

elasticmachine commented Feb 18, 2019

roncohen commented Feb 18, 2019

weltenwort commented Feb 18, 2019 • edited Loading

weltenwort commented Feb 18, 2019 • edited Loading

simianhacker commented Feb 19, 2019

weltenwort commented Feb 19, 2019

skh commented Feb 20, 2019 • edited Loading

weltenwort commented Feb 20, 2019

skh commented Feb 20, 2019 • edited Loading

skh commented Feb 20, 2019

weltenwort commented Feb 20, 2019

simianhacker commented Feb 20, 2019

skh commented Feb 20, 2019

weltenwort commented Feb 20, 2019 • edited Loading

simianhacker commented Feb 20, 2019

weltenwort commented Feb 20, 2019

weltenwort commented Feb 18, 2019 •

edited

Loading

weltenwort commented Feb 18, 2019 •

edited

Loading

skh commented Feb 20, 2019 •

edited

Loading

skh commented Feb 20, 2019 •

edited

Loading

weltenwort commented Feb 20, 2019 •

edited

Loading