Skip to content

Commit

Permalink
Optimize composite aggregation based on index sorting (#48399) (#50272)
Browse files Browse the repository at this point in the history
Co-authored-by: Daniel Huang <[email protected]>

This is a spinoff of #48130 that generalizes the proposal to allow early termination with the composite aggregation when leading sources match a prefix or the entire index sort specification.
In such case the composite aggregation can use the index sort natural order to early terminate the collection when it reaches a composite key that is greater than the bottom of the queue.
The optimization is also applicable when a query other than match_all is provided. However the optimization is deactivated for sources that match the index sort in the following cases:
  * Multi-valued source, in such case early termination is not possible.
  * missing_bucket is set to true
  • Loading branch information
jimczi authored Dec 20, 2019
1 parent 1c7bfeb commit 2acafd4
Show file tree
Hide file tree
Showing 18 changed files with 660 additions and 127 deletions.
125 changes: 124 additions & 1 deletion docs/reference/aggregations/bucket/composite-aggregation.asciidoc
Original file line number Diff line number Diff line change
Expand Up @@ -117,6 +117,7 @@ Example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand All @@ -135,6 +136,7 @@ Like the `terms` aggregation it is also possible to use a script to create the v
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down Expand Up @@ -170,6 +172,7 @@ Example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand All @@ -188,6 +191,7 @@ The values are built from a numeric field or a script that return numerical valu
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down Expand Up @@ -220,6 +224,7 @@ is specified by date/time expression:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down Expand Up @@ -249,6 +254,7 @@ the format specified with the format parameter:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down Expand Up @@ -291,6 +297,7 @@ For example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand All @@ -313,6 +320,7 @@ in the composite buckets.
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down Expand Up @@ -342,6 +350,7 @@ For example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand All @@ -368,6 +377,7 @@ It is possible to include them in the response by setting `missing_bucket` to
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand All @@ -393,7 +403,7 @@ first 10 composite buckets created from the values source.
The response contains the values for each composite bucket in an array containing the values extracted
from each value source.

==== After
==== Pagination

If the number of composite buckets is too high (or unknown) to be returned in a single response
it is possible to split the retrieval in multiple requests.
Expand All @@ -407,6 +417,7 @@ For example:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down Expand Up @@ -472,6 +483,7 @@ round of result can be retrieved with:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand All @@ -489,6 +501,116 @@ GET /_search

<1> Should restrict the aggregation to buckets that sort **after** the provided values.

==== Early termination

For optimal performance the <<index-modules-index-sorting,index sort>> should be set on the index so that it matches
parts or fully the source order in the composite aggregation.
For instance the following index sort:

[source,console]
--------------------------------------------------
PUT twitter
{
"settings" : {
"index" : {
"sort.field" : ["username", "timestamp"], <1>
"sort.order" : ["asc", "desc"] <2>
}
},
"mappings": {
"properties": {
"username": {
"type": "keyword",
"doc_values": true
},
"timestamp": {
"type": "date"
}
}
}
}
--------------------------------------------------

<1> This index is sorted by `username` first then by `timestamp`.
<2> ... in ascending order for the `username` field and in descending order for the `timestamp` field.

.. could be used to optimize these composite aggregations:

[source,console]
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "user_name": { "terms" : { "field": "user_name" } } } <1>
]
}
}
}
}
--------------------------------------------------

<1> `user_name` is a prefix of the index sort and the order matches (`asc`).

[source,console]
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "user_name": { "terms" : { "field": "user_name" } } }, <1>
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } } <2>
]
}
}
}
}
--------------------------------------------------

<1> `user_name` is a prefix of the index sort and the order matches (`asc`).
<2> `timestamp` matches also the prefix and the order matches (`desc`).

In order to optimize the early termination it is advised to set `track_total_hits` in the request
to `false`. The number of total hits that match the request can be retrieved on the first request
and it would be costly to compute this number on every page:

[source,console]
--------------------------------------------------
GET /_search
{
"size": 0,
"track_total_hits": false,
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "user_name": { "terms" : { "field": "user_name" } } },
{ "date": { "date_histogram": { "field": "timestamp", "calendar_interval": "1d", "order": "desc" } } }
]
}
}
}
}
--------------------------------------------------

Note that the order of the source is important, in the example below switching the `user_name` with the `timestamp`
would deactivate the sort optimization since this configuration wouldn't match the index sort specification.
If the order of sources do not matter for your use case you can follow these simple guidelines:

* Put the fields with the highest cardinality first.
* Make sure that the order of the field matches the order of the index sort.
* Put multi-valued fields last since they cannot be used for early termination.

WARNING: <<index-modules-index-sorting,index sort>> can slowdown indexing, it is very important to test index sorting
with your specific use case and dataset to ensure that it matches your requirement. If it doesn't note that `composite`
aggregations will also try to early terminate on non-sorted indices if the query matches all document (`match_all` query).

==== Sub-aggregations

Like any `multi-bucket` aggregations the `composite` aggregation can hold sub-aggregations.
Expand All @@ -501,6 +623,7 @@ per composite bucket:
--------------------------------------------------
GET /_search
{
"size": 0,
"aggs" : {
"my_buckets": {
"composite" : {
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -235,7 +235,8 @@ protected AggregatorFactory doBuild(QueryShardContext queryShardContext, Aggrega
} else {
afterKey = null;
}
return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size, configs, afterKey);
return new CompositeAggregationFactory(name, queryShardContext, parent, subfactoriesBuilder, metaData, size,
configs, afterKey);
}


Expand Down
Loading

0 comments on commit 2acafd4

Please sign in to comment.