Remove support for sorting terms aggregation by ascending count #17614

colings86 · 2016-04-08T09:44:22Z

We try to be as flexible as possible when it comes to sorting terms aggregations. However, sorting by anything but by _term or descending _count makes it very hard to return the correct top buckets and counts, which is disappointing to users. For this reason we should remove the ability to sort the terms aggregation by ascending count (split out from #17588)

rashidkpc · 2016-04-08T15:36:10Z

If and when you decide if you're going to do this, please open an issue on kibana. We currently support this in the UI so we'll need to issue a deprecation notice well in advance

rashidkpc · 2016-04-08T19:40:35Z

Oh, also, I'm +1 on this, its confusing and rarely useful anyway. If we're going to deprecate it pre-5.0.0, which I'd prefer, let me know.

Relates to #17614

jimczi · 2016-06-17T15:35:05Z

Since this requires some changes in Kibana I've reverted the removal. Though it is now deprecated in 2.x/2.4.

jccq · 2016-08-23T12:37:25Z

Frankly i know people who were using this actually i'd say rely on this e.g. to spot anomalies, if the only problem is that it wasnt working very well, could you just not have clarified in the docs? i mean so many things in Elasticsearch dont really return the right number

jccq · 2016-08-23T12:42:56Z

unless there is an alternative for that use cases of course, is there? e.g. what are the least frequent MD5s executed across the logs

djschny · 2016-08-30T14:19:01Z

For folks that have small datasets this is still useful and does not hurt anything to my knowledge. I do not understand why this is being removed as well. I plead to have this kept and enhance the docs and/or add a check based upon the size or cardinality of the field perhaps.

nich07as · 2016-09-01T02:01:39Z

I've worked with a few customers who uses the ascending count to determine what are the least popular items and also the occasional outliers. Agree that it might be inaccurate over large datasets but would be helpful on smaller samples and doesn't hurt to keep it around as @djschny suggests.

jccq · 2016-09-01T06:01:50Z

The point is.. what is the replacement really? sure its inaccurate but is
there any other way to achieve something like this? i dont think so. Thus
the big damage in removing vs... simply explaining the limitation.

On Thu, Sep 1, 2016 at 3:02 AM, nich07as [email protected] wrote:

I've worked with a few customers who uses the ascending count to determine
what are the least popular items and also the occasional outliers. Agree
that it might be inaccurate over large datasets but would be helpful on
smaller samples and doesn't hurt to keep it around as @djschny
https://github.com/djschny suggests.

—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
#17614 (comment),
or mute the thread
https://github.com/notifications/unsubscribe-auth/AEuGyLkuCM3oiAkuqValXtEn93axFcJnks5qljJGgaJpZM4IC2Tl
.

Giovanni Tummarello
CEO - SIREn Solutions

jayswan · 2016-09-01T14:20:50Z

We use this extensively to find outliers in log data. Usually in large data sets you can filter out the most common items before performing the aggregation, so loss of accuracy isn't a big problem.

Removing this feature is a huge problem for logging use cases -- nearly crippling, IMO.

jayswan · 2016-09-01T15:12:28Z

Furthermore: in exploratory log analysis the exact content of the tails often doesn't matter as much as the general kinds of things in the tails.

As an example, I just used this feature to reduce the set "interesting" documents in an index of about 6 million logs from 1.5 million to about 100 over the course of 5 minutes by iteratively excluding categories of things I found in the tails.

henrikjohansen · 2016-09-01T15:13:24Z

@colings86 @jimferenczi ... this is a rather poor decision for a number of use-cases where finding rare occasions is important. It severely impacts security analytics for example.

colings86 · 2016-09-01T15:22:35Z

very hard to return the correct top buckets and counts

I want to explain this a bit more as I don't think its really clear on the description above (apologies for that)

The problem here isn't that the counts can be wrong, the problem is that there is currently no bound on how wrong the counts can be (and no way to know what the error might be). To explain this consider the following example.

Imagine we are looking for the top 5 terms in a field across 3 shards ordered by ascending count. The terms aggregation goes and retrieves the top 5 terms from each shard and then merges them together (in practice it actually retrieves more than the top 5 from each shard but for the purposes of this example lets assume size and shard_size are the same). The shards might return the following:

Shard 1	Shard 2	Shard 3
a (1)	a (1)	a (1)
b (1)	c (1)	b (1)
d (1)	d (1)	f (1)
g (2)	f (2)	i (1)
h (9)	g (2)	j (2)

When merged on the reduce node this will produce the final list of:

c (1)
i (1)
d (2)
b (2)
j (2)
a (3)
f (3)
g (4)
h (9)

So the final top 5 will be:

c (1)
i (1)
d (2)
b (2)
j (2)

Which seems great until you look into the results from the shards a bit closer.

The counts returned from each shard are 100% accurate so if a shard says it has 1 document with the term a it only has one document with the term a. But its the information thats not returned from the shard that leads to issues. From the shard results above we can see that a is returned from every shard so we know that the document count for a is completely accurate. But if we now look at d we can see that it was only returned form shards 1 and 2. We don't know whether Shard 3 doesn't have any documents containing d or whether d just didn't make it into the top 5 terms on Shard 3 (i.e. whether there are 0 or > 2 documents containing d on the shard). It could be that shard 3 happens to have 100 documents containing d but it just wasn't returned in the top N list.

The terms aggregation documentation tries to explain how with descending count ordering we can calculate the worst case error in the doc_count by using the doc_count of the last term returned on each shard. But in the case of ascending order the doc_count a terms could have on a shard that didn't return it could be anything, all we know is that it's either 0 or greater than or equal to the doc_count of the last term returned by each shard.

djschny · 2016-09-01T15:27:58Z

This is still relevant and accurate when you are searching an index with only 1 shard correct?

epixa · 2017-05-20T15:34:48Z

@clintongormley @colings86 @jimczi Are there any plans to continue with this for 6.0? It looks stalled at the moment, but I want to make sure we remove the feature from Kibana if it is being removed from Elasticsearch as well.

jimczi · 2017-05-22T06:48:15Z

@epixa there is no plan and it seems that this functionality is important for some use cases.
As @jpountz said we can solve this with documentation, explaining that ascending count sort is not accurate and does not give any hint regarding the level of errors.

epixa · 2017-05-22T13:57:19Z

@jimczi Thanks for the update. Any reason why we can't close this then?

IdanWo · 2017-07-18T20:03:13Z

Hey, @colings86 . Can you please explain why increasing shard_size isn't good enough for ordering in all kinds of ways? I know that ElasticSearch can't tell about the error bounds in some cases, but if I don't rely on it and make my own tests and know how many documents I need to take from each shard by using shard_size - then will I be able to always be 100% accurate? For example: making the shard_size equal to the size (on which sum_other_doc_count is always 0). To be concrete, we make terms aggregations, which is ordered by a sub reverse_nested aggregation and then sub value_count aggregation.

In addition, why using the shard_size isn't scaled horizontally? If I need to take a lot of documents from each shard, I can split the index to more shards and assign them to more nodes (at setup time, of course). The shard_size would be the same, that's true, but each shard would be less in size/documents count.

colings86 · 2017-07-19T09:37:59Z

@IdanWo Increasing shard_size is only good enough if you can guarantee that the shard_size is big enough that all of the terms are returned from each shard. Although this may work for low cardinality fields and/or when the number of shards is relatively small, it does not scale well with number of shards or cardinality of the field. It is true though that in the single shard case and in the case where shard_size > number_terms_buckets_created on every shard the results will be 100% accurate with any ordering.

Although you can indeed split the data across shards you still need to return number_of_shards * shard_size buckets to the coordinating node for the reduce phase in order to get the final result. This means that even though the work on the shards is decreased by splitting the work across more of them, for the same shard_size more shards means that the coordinating node has to hold more in memory (the response form each shard) and do more work during the reduce phase.

IdanWo · 2017-07-21T11:38:58Z

@colings86 , thanks for the excellent explanation! I understand the circumstances, BUT I believe that something isn't right with the design decisions made: I don't understand why terms aggregation is considered a memory intensive operation where as cross cluster search - which is potentially much memory intensive since it obviously involves multiple indices with multiple shards that return responses to the coordinating node, is considered okay. In cross cluster search the design decision wasn't to limit the request (the request's query or the number of involved shards in the request), but to return batched results to the coordinating node. Why can't we make something similar here? And why actually this improvement doesn't help solving the current issue with a large shard_size in terms aggregation?

Therefore, it seems that there is a motivation to support cross cluster search but a low motivation to support full terms aggregations - although technologically they are quite the same in aspects of performance issues. It seems to me that increasing the shard_size in a request to a single index, is by far less dangerous than making a request to unlimited number of shards at once. How come sometimes the default is unlimited (action.search.shard_count.limit is unlimited) and sometime its nothing and has to be configured (size:0 in terms aggregations is deprecated). Making a terms aggregation in 1000 shards for size:300 is worse than a terms aggregation with size: 0 on 1 index with 10 shards and 200 unique buckets only.

This is taken from the Elasticsearch 5.4.0 released blog post (talking about #23946):

That said, it is quite easy to reach the 1,000 shard limit, especially with the recent release of Cross Cluster Search. As of 5.4.0, Top-N search results and aggregations are reduced in batches of 512, which puts an upper limit on the amount of memory used on the coordinating node, which has allowed us to set the shard soft limit to unlimited by default.

This is taken from the Tribe Nodes & Cross-Cluster Search blog post (pay attention to what is considered a good user experience here):

Now with the addition of cross cluster search where we are emphasizing searches across many, many shards, having such a soft limit isn’t providing a good user experience after all. In order to eliminate the impact of querying a large number of shards, the coordinating node now reduces aggregations in batches of 512 results down to a single aggregation object while it’s waiting for more results to arrive. This upcoming improvement was the initial step to eventually remove the soft limit altogether. The upcoming 5.4.0 release will also allow to reduce the top-N documents in batches. With these two major memory consumers under control, we now default the action.search.shard_count.limit setting to unlimited. This allows users to still limit and protect their searches in the number of shards while providing a good user experience for other users. When you perform a search request, the node which receives the request becomes the coordinating node which is in charge of forwarding the shard-level requests to the appropriate data nodes, collecting the results, and merging them into a single result set. The memory use on the coordinating node varies according to the number of involved shards. Previous, we had added a 1,000 shard soft limit to try to prevent coordinating nodes from using too much memory.

Remark:
I can agree that sometimes it's better to stop the use of a bad-practice, instead of enabling it and support its consequences. Personally, I sort terms in descending order but via a sub aggregation - which is also discouraged. But, I really have to do it. So I i'm keep using shard_size: 20000, depending on the terms field, which acts okay until now in our environment (600-800 ms at most for a query, but in most of the times the real number of buckets is considerately lower than 20,000, and is more like 300).

theflakes · 2017-09-17T19:31:19Z

As others have noted, sorting in ascending order is critical for exploratory data analysis and simple cyber security hunting for LFO events.

I'm probably going to expose my ignorance of ES under the hood cause I am ignorant there. Could a recursive comparison of the aggregate of all shard results help bring more confidence to the returned results? Possibly with sub-queries of certain results in question to specific shards capped at X number of allowed sub-aggregations? I'm sure there are serious performance considerations involved there that my ignorance of ES doesn't make me fully appreciative of.

Again, with all that said, I'll re-emphasize the criticality of ascending search order in many data types I deal with in my work. What concerns me here are not just removing the capability but also the confidence level of the currently returned results when querying across multiple shards, if I'm understanding this correctly.

markharwood · 2018-03-16T11:45:02Z

Since 5.2 elasticsearch has had support for partitioning in the terms aggs.

For those searching for low-frequency events with accuracy they should use multiple search requests, using an appropriate choice for number of partitions (the documentation for terms partitioning describes how to do this). Essentially you have to ensure numTermsReturned < size setting in the terms agg.

@colings86 given people have a genuine use for this and we have a workaround which maintains accuracy maybe we can keep the reverse-sort feature but error if we determine accuracy is potentially compromised and point to the solution?

cc @elastic/es-search-aggs

polyfractal · 2018-06-01T13:42:55Z

Hey all. I linked to #20586 a while ago, but never explicitly commented about it.

We think we've devised an aggregation that will allow aggregating "Rare Terms" in a way that, while not 100% accurate, will provide bounded errors (unlike sorting by count ascending). Our plan is to implement the Rare Terms agg so that there's a path to providing this functionality in a more predictable, bounded manner... and then look into deprecating sorting ascending from Terms agg.

Still no timeline/ETA, but wanted to update everyone about what we were thinking.

jccq · 2018-06-01T13:48:04Z

uber cool!

epixa · 2018-06-01T15:25:18Z

cc @elastic/kibana-visualizations

danielo515 · 2019-01-15T20:06:21Z

by the way, and while you decide if you remove it or not, how are you making queries to get the terms with less doc_count ?

arusanescu · 2019-04-05T02:21:04Z

I don't see any updates here for a while, what is the current status/path going forward with this? Adding @colings86 to get some traction 👍

polyfractal · 2019-04-05T13:06:19Z

We're still working on RareTerms aggregation outlined in #20586 (WIP PR here: #35718), so the plan outlined in #17614 (comment) above is still valid.

In short, we want to implement RareTerms first, then deprecate sorting by terms agg ascending, then remove at a later date.

arusanescu · 2019-04-08T03:00:52Z

@polyfractal Awesome, thank you for the insight on this.
I did want to confirm if the statement in the top comment by colings86: "sorting by anything but by _term or descending _count makes it very hard to return the correct top buckets and counts" means that such results are NOT correct in general, or if there is a level of discrepancy then what are we looking at exactly? If it is known for a fact that said sorting doesn't return correct top buckets/counts, I just want to know where we stand with the usage of such sorting. While it is deprecated, if above is true shouldn't there be a stronger message to align with the risks of using this type of sorting?

polyfractal · 2019-04-08T14:49:29Z

@arusanescu When sorting by count ascending (or sub-agg metrics in many cases), there is a possibility of error. Whether an error creeps into the count or not is dependent on the aggregation (size, shard_size), number of shards and data distribution.

We report the potential worst-case error with a doc_count_error_upper_bound field for the entire aggregation. This returns the highest doc count of a term which potentially didn't make the top-n list, and can be used to help judge how inaccurate the results are. This value can also be enabled on a per-bucket level to get more detailed error reporting.

The idea is that users can determine if they are comfortable with the error being reported, and either accept it or adjust shard_size to get the preferred error (or to not use the particular aggregation at all).

E.g. if the reported error shows a doc_count that could potentially be ranked 95 out of 100 requested results, that might be an error that the user is comfortable with. If the error shows a doc_count that could potentially be ranked 5 out of 100, that's probably unacceptable and they should reconsider.

While it is deprecated, if above is true shouldn't there be a stronger message to align with the risks of using this type of sorting?

If you look at the terms aggregation documentation, something like 50% of the page is dedicated to explaining the errors, how to interpret it, and warning users off sorting :)

Detailed explanation of how doc counts can be approximate, including a worked example with sample shard distributions
Explanation of how shard_size can improve the accuracy at expense of more overhead
How error bounds are calculated, and how to enable for per-bucket error reporting
An explicit warning saying that users should not sort by _count ascending or sub aggs. Note the phrase "[...] errors are unbounded" ;)

I think we've done everything we can do to document the behavior and warn users off of bad sort orders. :)

arusanescu · 2019-04-09T01:21:09Z

@polyfractal Thank you for the very detailed explanation! While I had read some of the documents I did also miss some and so this has helped me to better understand the problem and how to deal with it! 👍

nik9000 · 2020-07-23T17:24:39Z

Now that rare_terms has well and truly landed I think it is time to talk about this again. Maybe deprecating the option in a 7.x release and removing it entirely in 8.0.

nik9000 · 2020-07-29T18:45:39Z

A bunch of us got together and asked what it'd take to make sorting by ascending count accurate. We weren't sure, but we have some interest in giving it a shot, just not any time soon.

polyfractal · 2020-11-04T16:56:55Z

We had another discussion :)

As much as we'd like to remove ordering by ascending... we don't think that's a practical reality. Too many people and products rely on the functionality, despite it potentially returning unbounded error (sigh). So I'm going to close this ticket since we don't see ourselves deprecating the functionality.

I'm going to open a subsequent ticket to explore if we can possibly support sort ascending in the terms agg more accurately. RareTerms is still our recommended method if you want fast and reasonably accurate accounting of the long-tail... but we might be able to implement some new kind of execution mode or something for terms agg. It would undoubtedly be slower -- potentially a lot slower -- so it's unclear if we want to introduce such a feature. But that discussion should take place on a different issue.

I'll cross-link the tickets when it is open.

colings86 added >breaking discuss :Analytics/Aggregations Aggregations labels Apr 8, 2016

colings86 mentioned this issue Apr 8, 2016

Simplify ordering support on terms aggregations #17588

Closed

rashidkpc mentioned this issue Apr 8, 2016

[4.x] Deprecate ascending _count sort from terms agg elastic/kibana#6833

Closed

rashidkpc mentioned this issue Apr 28, 2016

Plans for adding terms agg support? elastic/timelion#15

Closed

colings86 added help wanted adoptme and removed discuss labels Jun 10, 2016

jimczi mentioned this issue Jun 17, 2016

Remove support for sorting terms aggregation by ascending count #18940

Merged

jimczi removed the help wanted adoptme label Jun 17, 2016

jimczi closed this as completed in #18940 Jun 17, 2016

epixa mentioned this issue Jun 17, 2016

Remove ascending _count sort from terms agg elastic/kibana#7494

Closed

jimczi added a commit that referenced this issue Jun 17, 2016

Deprecates the ability to sort terms aggregation by ascending count

a942b8d

Relates to #17614

jimczi added a commit that referenced this issue Jun 17, 2016

Deprecates the ability to sort terms aggregation by ascending count

691e76d

Relates to #17614

jimczi reopened this Jun 17, 2016

jimczi self-assigned this Jun 17, 2016

colings86 added the discuss label Sep 1, 2016

rashidkpc mentioned this issue May 22, 2017

[Timelion] Support ascending order sort on terms agg elastic/kibana#11622

Closed

tomcallahan added stalled and removed discuss labels Jun 1, 2018

$@polyfractal$ polyfractal mentioned this issue Aug 2, 2018

Possibility to add new ordering scheme for Aggregation results. #26570

Closed

epixa mentioned this issue Sep 21, 2018

Remove deprecation notice from ascending sort for terms elastic/kibana#23409

Closed

alexfrancoeur mentioned this issue Nov 28, 2018

Add support for RareTerms aggregation elastic/kibana#26340

Closed

rjernst added the Team:Analytics Meta label for analytical engine team (ESQL/Aggs/Geo) label May 4, 2020

$@polyfractal$ polyfractal closed this as completed Nov 4, 2020

wchaparro assigned nik9000 and unassigned nik9000 Jun 16, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Remove support for sorting terms aggregation by ascending count #17614

Remove support for sorting terms aggregation by ascending count #17614

colings86 commented Apr 8, 2016

rashidkpc commented Apr 8, 2016

rashidkpc commented Apr 8, 2016

jimczi commented Jun 17, 2016

jccq commented Aug 23, 2016

jccq commented Aug 23, 2016

djschny commented Aug 30, 2016

nich07as commented Sep 1, 2016

jccq commented Sep 1, 2016

jayswan commented Sep 1, 2016

jayswan commented Sep 1, 2016

henrikjohansen commented Sep 1, 2016

colings86 commented Sep 1, 2016 •

edited

Loading

djschny commented Sep 1, 2016

epixa commented May 20, 2017

jimczi commented May 22, 2017

epixa commented May 22, 2017 •

edited

Loading

IdanWo commented Jul 18, 2017 •

edited

Loading

colings86 commented Jul 19, 2017

IdanWo commented Jul 21, 2017 •

edited

Loading

theflakes commented Sep 17, 2017 •

edited

Loading

markharwood commented Mar 16, 2018 •

edited

Loading

polyfractal commented Jun 1, 2018

jccq commented Jun 1, 2018

epixa commented Jun 1, 2018

danielo515 commented Jan 15, 2019

arusanescu commented Apr 5, 2019 •

edited

Loading

polyfractal commented Apr 5, 2019

arusanescu commented Apr 8, 2019 •

edited

Loading

polyfractal commented Apr 8, 2019

arusanescu commented Apr 9, 2019

nik9000 commented Jul 23, 2020

nik9000 commented Jul 29, 2020

polyfractal commented Nov 4, 2020

Remove support for sorting terms aggregation by ascending count #17614

Remove support for sorting terms aggregation by ascending count #17614

Comments

colings86 commented Apr 8, 2016

rashidkpc commented Apr 8, 2016

rashidkpc commented Apr 8, 2016

jimczi commented Jun 17, 2016

jccq commented Aug 23, 2016

jccq commented Aug 23, 2016

djschny commented Aug 30, 2016

nich07as commented Sep 1, 2016

jccq commented Sep 1, 2016

jayswan commented Sep 1, 2016

jayswan commented Sep 1, 2016

henrikjohansen commented Sep 1, 2016

colings86 commented Sep 1, 2016 • edited Loading

djschny commented Sep 1, 2016

epixa commented May 20, 2017

jimczi commented May 22, 2017

epixa commented May 22, 2017 • edited Loading

IdanWo commented Jul 18, 2017 • edited Loading

colings86 commented Jul 19, 2017

IdanWo commented Jul 21, 2017 • edited Loading

theflakes commented Sep 17, 2017 • edited Loading

markharwood commented Mar 16, 2018 • edited Loading

polyfractal commented Jun 1, 2018

jccq commented Jun 1, 2018

epixa commented Jun 1, 2018

danielo515 commented Jan 15, 2019

arusanescu commented Apr 5, 2019 • edited Loading

polyfractal commented Apr 5, 2019

arusanescu commented Apr 8, 2019 • edited Loading

polyfractal commented Apr 8, 2019

arusanescu commented Apr 9, 2019

nik9000 commented Jul 23, 2020

nik9000 commented Jul 29, 2020

polyfractal commented Nov 4, 2020

colings86 commented Sep 1, 2016 •

edited

Loading

epixa commented May 22, 2017 •

edited

Loading

IdanWo commented Jul 18, 2017 •

edited

Loading

IdanWo commented Jul 21, 2017 •

edited

Loading

theflakes commented Sep 17, 2017 •

edited

Loading

markharwood commented Mar 16, 2018 •

edited

Loading

arusanescu commented Apr 5, 2019 •

edited

Loading

arusanescu commented Apr 8, 2019 •

edited

Loading