Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Infra UI] Improve waffle map custom "group by" fields #31405

Closed
skh opened this issue Feb 18, 2019 · 16 comments
Closed

[Infra UI] Improve waffle map custom "group by" fields #31405

skh opened this issue Feb 18, 2019 · 16 comments
Assignees
Labels
Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.2.0

Comments

@skh
Copy link
Contributor

skh commented Feb 18, 2019

Currently, it is possible to enter a custom value in the "Group by" input field. Depending on what the user enters, it is possible that the documents that are returned don't contain the metrics selected for display in the "Metric" input field.

Example:

  • User enters service.type as field to group by
  • User wants to see a system metric on the waffle map squares (until [Infra UI] Support custom fields in Metric-dropdown #30197 is implemented that's all they can select anyway)
  • The documents that contain a service.type other than "System" do not contain system metrics, so the waffle map is not displayed as expected

Proposed solution:

  • Split the query performed for the waffle map in two steps:
    • First, group by custom field to get a list of node ids
    • Second, query by node ids to get metric values to display on the waffle map squares

Caveats:

  • How can we deal gracefully with very large numbers of node ids? (Related: [Infra UI] Waffle map doesn't scale well to show many nodes in groupings #27500 )
  • Should the two-step query logic happen on the server or on the client?
    • The first would mean that the waffle map view can stay mostly unchanged and the graphql endpoint serving waffle map data will perform several queries to elastic search
    • The second would mean that the two-step query logic is reflected in the component / container structure and the graphql endpoints stay (relatively) simple
  • How will all this affect performance?
@skh skh added the Feature:Metrics UI Metrics UI feature label Feb 18, 2019
@elasticmachine
Copy link
Contributor

Pinging @elastic/infrastructure-ui

@roncohen
Copy link
Contributor

ping @simianhacker for input

@weltenwort
Copy link
Member

weltenwort commented Feb 18, 2019

Other questions that came to mind:

  • Assuming the querying logic is split into two steps, which will the query bar filters apply to? (intuitively the grouping query?)
  • Which time ranges are used by the two queries? (a larger interval for the grouping query and a small one for the metrics?)

@weltenwort
Copy link
Member

weltenwort commented Feb 18, 2019

In a recent discussion the following querying procedure seemed feasible. As in the original proposal above, it relies on performing two queries, but reduces the risk of overly large request body sizes.

Procedure

For a waffle map of node type T and grouping fields [G1, G2] with metric M...

  • perform two queries, potentially in parallel:
    • query all metrics docs with
      • time range filter
      • size 0
      • terms aggregation over G1
      • therein nested terms aggregation over G2
      • therein nested terms aggregation over T
    • query all docs containing M
      • time ranger filter
      • size 0
      • terms aggregation over T
      • therein nested metric aggregation for M
  • render waffle map structure based on the first query
  • render node attributes (color, size, text, etc) based on values obtained in the second query

Limitations

  • To ensure correctness with a large number of nodes, the composite aggregation should probably be used in both queries.
  • The field for T must be present in all metricbeat documents (e.g. host.name), which limits the set of possible future node types.

Advantages

  • parallel execution possible
  • separate refresh schedule possible
  • request sizes don't grow with node count (responses obviously do)
  • scales to requesting multiple unrelated metrics at once in the second query

@weltenwort weltenwort added the Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services label Feb 18, 2019
@simianhacker
Copy link
Member

@weltenwort I had planned to change this over to a composite aggregation soon because it would simplify the code some along with reducing the load on the ES cluster for the request. Instead of sending n / TotalNodes in parallel it would send n / TotalNodes in series because of how composite aggregations work with the after key. It's a little slower but it's less likely to crash the server.

With some creative work on the aggregation, we could use the composite aggregation to get all the data in one pass. I would key off the node id (eg host.name) for the composite then add 2 sub aggregations; one for metric, one for groupings. When the nodes come back you can process each node similar to what we do today. The main caveat for this would be the filter is applied to both the metric and the groupings. Although we could just add a filter aggregation above the groupings, we would need to do some performance testing to decide approach works the best; filter aggregations can sometimes be a little slow.

The aggregation tree would look something like:

composite(host.name)
  - dateHistogram(@timestamp)
    - avg(system.cpu.user.pct)
  - filter(search term)
    - terms(G1)
      - terms(G2)

If we don't apply the filter to the metric the user won't be able to do things like system.cpu.user.pct: >0.5 which is probably fine since I doubt too many people have that use case.

I would avoid moving these requests & processing to the client. Testing stuff in the browser is a total nightmare, IMHO. This stuff gets pretty complex and dealing with that complexity on the server is just easier. Furthermore, if you have to make multiple round trips, which this requires, you usually have lower latency from the server to ES then you do with the client since they typically sit on the same network.

@weltenwort What's the motivation to move this to the client?

@weltenwort
Copy link
Member

Applying the filter to the metric would mean it the results might be incorrect if the grouping field is not present on every metric field (which is exactly what causes the initial problem). So wrapping only the grouping aggregation into the filter makes sense 👍

You're right, there is no need to move it to the browser (unless we absolutely need different refresh rates or incremental rendering). I forgot to make the distinction between browser/"elasticsearch client" in my description.

@skh
Copy link
Contributor Author

skh commented Feb 20, 2019

composite(host.name)
  - dateHistogram(@timestamp)
    - avg(system.cpu.user.pct)
  - filter(search term)
    - terms(G1)
      - terms(G2)

That sample query in elasticsearch syntax (thanks @simianhacker):

POST metricbeat-*/_search
{
  "query": {
    "range": {
      "@timestamp": {
        "gte": "now-1h",
        "lte": "now"
      }
    }
  },
  "size": 0, 
  "aggs": {
    "myNodes": {
      "composite": {
        "sources": [
          {
            "myNode": {
              "terms": {
                "field": "host.name"
              }
            }
          }
        ]
      },
      "aggs": {
        "myTimeseries": {
          "date_histogram": {
            "field": "@timestamp",
            "interval": "1m"
          },
          "aggs": {
            "metric": {
              "avg": {
                "field": "system.cpu.user.pct"
              }
            }
          }
        },
        "myGroupings": {
          "filter": {
            "match_all": {}
          },
          "aggs": {
            "myGroup1": {
              "terms": {
                "field": G1
              },
              "aggs": {
                "myGroup2": {
                  "terms": {
                    "field": G2
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

@weltenwort
Copy link
Member

I wonder whether the inner terms aggs for G1 and G2 should also be wrapped in composites to ensure all groups can be paginated over. OTOH I don't know what that nested pagination would look like.

@skh
Copy link
Contributor Author

skh commented Feb 20, 2019

I'm not sure repeating the whole query would be the way to go if we want to add pagination to a single group box. I don't have a better idea right now but it feels wrong.

@skh
Copy link
Contributor Author

skh commented Feb 20, 2019

Also, with the query above, the myGroupings result is present in every bucket in myNodes, so that would the wrong place to paginate anyway, if I understood your comment correctly.

@weltenwort
Copy link
Member

I was talking about pagination on the ES api level, not the UI. There is a limit on the number of buckets ES will return per aggregation, so we might have to do multiple round-trips if we exceed that number. Looks like we would have to perform three levels of pagination.

Or we detect that and somehow communicate to the user that we can't display all data.

@simianhacker
Copy link
Member

How likely will the grouping be high enough cardinality that we need to paginate the groupings in ES to get all of them? I haven't seen anything that was more than 5~10 at each level, and typically they only belong to 1 group at each level. The high cardinality grouping is a pretty rare edge case and will likely result in an unusable UX experience. I think we could safely limit those terms aggs to 10 and be just fine.

@skh
Copy link
Contributor Author

skh commented Feb 20, 2019

I think it wouldn't work anyway, I get this when I try:

"error": {
    "root_cause": [
      {
        "type": "illegal_argument_exception",
        "reason": "[composite] aggregation cannot be used with a parent aggregation of type: [CompositeAggregationFactory]"
      }
    ]
   ...

@weltenwort
Copy link
Member

weltenwort commented Feb 20, 2019

I can easily see someone wanting to group containers by host with more than 10 containers running on each host.

@simianhacker
Copy link
Member

Then we should include the group by's in the source keys and make two requests. One request where the source is groupings + nodeId without any child aggregations, one request where sources is just the nodeId but it has the date_histogram and metrics as a child aggregation.

@weltenwort
Copy link
Member

Both could be sent at once using an _msearch if we're concerned about the round-trip latency.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Feature:Metrics UI Metrics UI feature Team:Infra Monitoring UI - DEPRECATED DEPRECATED - Label for the Infra Monitoring UI team. Use Team:obs-ux-infra_services v7.2.0
Projects
None yet
Development

No branches or pull requests

6 participants