Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Composite aggs seems to sort too slowly with filter queries #70035

Closed
benwtrent opened this issue Mar 5, 2021 · 6 comments
Closed

Composite aggs seems to sort too slowly with filter queries #70035

benwtrent opened this issue Mar 5, 2021 · 6 comments
Labels
>enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)

Comments

@benwtrent
Copy link
Member

benwtrent commented Mar 5, 2021

Piggy-backing off of previous work: #28745

During the work in #69970 some troubling performance data has reared its ugly head.

Given the following query:

{"bool":{"filter":[{"term":{"event.dataset":"nginx.access"}}]}}

The following composite agg moves at an almost glacial pace:

"aggs": {
    "buckets": {
      "composite": {
        "size": 1000,
        "sources": [
          {
            "date": {
              "date_histogram": {
                "field": "@timestamp",
                "fixed_interval": "15m"
              }
            }
          },
          {
            "source.address": {
              "terms": {
                "field": "source.address"
              }
            }
          }
        ]
      },
      "aggregations": {
        "@timestamp": {
          "max": {
            "field": "@timestamp"
          }
        }
      }
    }
  }

Here are some doc stats:

total_hits: 14479391
cardinality(source.address): 851502
max_timestamp: "2017-03-11T23:59:56.537Z"
min_timestamp: "2017-02-01T00:00:00.189Z"

In datafeeds we "chunk" through when scrolling through data. Consequently, we hit every document and make multiple queries. This is because sorting by timestamp can be costly when hitting many docs.

So, our scrolling datafeed had the following performance:

search_count | 16,649
bucket_count | 935
average_search_time_per_bucket_ms | 81.901
~4.5 ms per search (bucket_count * average_search_time_per_bucket_ms)/search_count

Job finished in ~6 minutes

Doing composite agg without chunking:
🐌 🐌 🐌

search_count | 3,795
bucket_count | 935
average_search_time_per_bucket_ms | 2,705.224
~666.5 ms per search

🐌 🐌 🐌
job finished in 40+ mintes

It seems to me that the composite agg is doing WAY too much work. I think it may be sorting WAY too many documents given the sources.

As an experiment, I added some time based query chunking in 25264688ms intervals (calculated based on term cardinality, count, and total time range)
🔥 🔥 🔥

search_count | 4,124
bucket_count | 935
average_search_time_per_bucket_ms | 112.775
~25 ms per search 

🔥 🔥 🔥
Job finished in ~4 minutes

Datafeeds (and transforms) will ALWAYS be a filter based query (ignoring scores). These queries are user provided, so they could definitely be anything. But it seems to me that there is still room for improvement in the composite agg.

@elasticmachine elasticmachine added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label Mar 5, 2021
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-analytics-geo (Team:Analytics)

@benwtrent
Copy link
Member Author

OK, for curiousity's sake, I ran the datafeed without that simple filter query:
⚡ ⚡ ⚡ ⚡

search_count | 3,795
bucket_count | 935
average_search_time_per_bucket_ms | 59.694

⚡ ⚡ ⚡ ⚡
So much faster than my garbage chunking.
And the job finished in 3 minutes.

Oh man, if we could get these speeds with filter queries!!!

@benwtrent
Copy link
Member Author

Note: for machine learning datafeeds, the first composite agg source will always be a date_histogram. We do this to make sure we get the buckets in a known time order.

@benwtrent
Copy link
Member Author

Digging into the code some and discussed the various execution paths with @nik9000 .

} else {
final LeafBucketCollector inner = queue.getLeafCollector(ctx, getFirstPassCollector(docIdSetBuilder, sortPrefixLen));
return new LeafBucketCollector() {
@Override
public void collect(int doc, long zeroBucket) throws IOException {
assert zeroBucket == 0L;
inner.collect(doc);
}
};
}

Is the path we are hitting as:

  • The source index is NOT sorted
  • We have more than a range query in our user provided query

But, it does seem weird to hit every document.

Assume that the top source is a date_histogram and you have the after_key and size in hand. That seems to provide ample opportunity for reducing the number of docs hit as one can bring down the range of docs considered. Obviously, this is easier said than done.

Making that "slow path" faster will greatly improve throughput for transforms (which constantly uses composite aggs with range queries and term filter queries) and datafeeds (which allow arbitrary user provided filter queries).

The nice thing about datafeeds is that the top source will ALWAYS be a date_histogram.

For transforms AND datafeeds, the composite agg will be the only top level aggregation.

I leave this in more capable hands than mine :).

@dimitris-athanasiou
Copy link
Contributor

I've raised #92197 which might be able to help with this issue.

@wchaparro
Copy link
Member

closing as not planned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>enhancement Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo)
Projects
None yet
Development

No branches or pull requests

5 participants