
Remove support for sorting terms aggregation by ascending count #17614

Closed
colings86 opened this issue Apr 8, 2016 · 40 comments · Fixed by #18940
Labels: :Analytics/Aggregations, >breaking, stalled, Team:Analytics

Comments

@colings86
Contributor

We try to be as flexible as possible when it comes to sorting terms aggregations. However, sorting by anything but _term or descending _count makes it very hard to return the correct top buckets and counts, which is disappointing to users. For this reason we should remove the ability to sort the terms aggregation by ascending count (split out from #17588).
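For context, the kind of request being discussed here looks roughly like the following. This is a minimal sketch, not taken from the issue; the index and field names are made up:

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "least_common_terms": {
      "terms": {
        "field": "category",
        "size": 5,
        "order": { "_count": "asc" }
      }
    }
  }
}
```

The "order": { "_count": "asc" } line is the part under discussion; ordering by _term or by descending _count is unaffected.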

@rashidkpc

If and when you decide you're going to do this, please open an issue on Kibana. We currently support this in the UI, so we'll need to issue a deprecation notice well in advance.

@rashidkpc

Oh, also, I'm +1 on this; it's confusing and rarely useful anyway. If we're going to deprecate it pre-5.0.0, which I'd prefer, let me know.

@jimczi
Contributor

jimczi commented Jun 17, 2016

Since this requires some changes in Kibana, I've reverted the removal, though it is now deprecated in 2.x/2.4.

@jimczi reopened this Jun 17, 2016
@jimczi self-assigned this Jun 17, 2016
@jccq

jccq commented Aug 23, 2016

Frankly, I know people who were using this (actually, I'd say relying on it), e.g. to spot anomalies. If the only problem is that it wasn't working very well, couldn't you just have clarified that in the docs? I mean, so many things in Elasticsearch don't really return the right number.

@jccq

jccq commented Aug 23, 2016

Unless there is an alternative for those use cases, of course. Is there? E.g. what are the least frequent MD5s executed across the logs?

@djschny
Contributor

djschny commented Aug 30, 2016

For folks that have small datasets this is still useful and, to my knowledge, does not hurt anything. I also do not understand why this is being removed. I plead to have this kept; enhance the docs and/or perhaps add a check based upon the size or cardinality of the field.

@nich07as

nich07as commented Sep 1, 2016

I've worked with a few customers who use the ascending count to determine the least popular items and also to spot the occasional outliers. Agreed that it might be inaccurate over large datasets, but it would be helpful on smaller samples and it doesn't hurt to keep it around, as @djschny suggests.

@jccq

jccq commented Sep 1, 2016

The point is: what is the replacement, really? Sure it's inaccurate, but is there any other way to achieve something like this? I don't think so. Hence the big damage in removing it versus simply explaining the limitation.


@jayswan

jayswan commented Sep 1, 2016

We use this extensively to find outliers in log data. Usually in large data sets you can filter out the most common items before performing the aggregation, so loss of accuracy isn't a big problem.

Removing this feature is a huge problem for logging use cases -- nearly crippling, IMO.
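A hedged sketch of the workflow described above, assuming hypothetical log fields and values: filter out the categories already known to be common, then aggregate what remains by ascending count.

```json
POST /logs-*/_search
{
  "size": 0,
  "query": {
    "bool": {
      "must_not": [
        { "terms": { "event_type": ["heartbeat", "dns_query", "http_request"] } }
      ]
    }
  },
  "aggs": {
    "uncommon_events": {
      "terms": {
        "field": "event_type",
        "size": 50,
        "order": { "_count": "asc" }
      }
    }
  }
}
```

Each iteration adds more known-common values to the must_not clause, shrinking the candidate set the ascending-count aggregation has to rank.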

@jayswan

jayswan commented Sep 1, 2016

Furthermore: in exploratory log analysis the exact content of the tails often doesn't matter as much as the general kinds of things in the tails.

As an example, I just used this feature to reduce the set of "interesting" documents in an index of about 6 million logs from 1.5 million to about 100 over the course of 5 minutes, by iteratively excluding categories of things I found in the tails.

@henrikjohansen

@colings86 @jimferenczi ... this is a rather poor decision for a number of use cases where finding rare occurrences is important. It severely impacts security analytics, for example.

@colings86
Contributor Author

colings86 commented Sep 1, 2016

very hard to return the correct top buckets and counts

I want to explain this a bit more, as I don't think it's really clear in the description above (apologies for that).

The problem here isn't that the counts can be wrong; the problem is that there is currently no bound on how wrong the counts can be (and no way to know what the error might be). To explain this, consider the following example.

Imagine we are looking for the top 5 terms in a field across 3 shards, ordered by ascending count. The terms aggregation retrieves the top 5 terms from each shard and then merges them together (in practice it actually retrieves more than the top 5 from each shard, but for the purposes of this example let's assume size and shard_size are the same). The shards might return the following:

Shard 1   Shard 2   Shard 3
a (1)     a (1)     a (1)
b (1)     c (1)     b (1)
d (1)     d (1)     f (1)
g (2)     f (2)     i (1)
h (9)     g (2)     j (2)

When merged on the reduce node this will produce the final list of:

  • c (1)
  • i (1)
  • d (2)
  • b (2)
  • j (2)
  • a (3)
  • f (3)
  • g (4)
  • h (9)

So the final top 5 will be:

  • c (1)
  • i (1)
  • d (2)
  • b (2)
  • j (2)

Which seems great, until you look at the results from the shards a bit more closely.

The counts returned from each shard are 100% accurate, so if a shard says it has 1 document with the term a, it only has one document with the term a. But it's the information that's not returned from the shard that leads to issues. From the shard results above we can see that a is returned from every shard, so we know that the document count for a is completely accurate. But if we now look at d we can see that it was only returned from shards 1 and 2. We don't know whether Shard 3 doesn't have any documents containing d or whether d just didn't make it into the top 5 terms on Shard 3 (i.e. whether there are 0 or >= 2 documents containing d on that shard). It could be that Shard 3 happens to have 100 documents containing d but it just wasn't returned in the top N list.

The terms aggregation documentation explains how, with descending count ordering, we can calculate the worst-case error in the doc_count by using the doc_count of the last term returned from each shard. But in the case of ascending order, the doc_count a term could have on a shard that didn't return it could be anything; all we know is that it's either 0 or greater than or equal to the doc_count of the last term returned by that shard.

@djschny
Contributor

djschny commented Sep 1, 2016

This is still relevant and accurate when you are searching an index with only 1 shard, correct?

@epixa
Contributor

epixa commented May 20, 2017

@clintongormley @colings86 @jimczi Are there any plans to continue with this for 6.0? It looks stalled at the moment, but I want to make sure we remove the feature from Kibana if it is being removed from Elasticsearch as well.

@jimczi
Contributor

jimczi commented May 22, 2017

@epixa there is no plan, and it seems that this functionality is important for some use cases.
As @jpountz said, we can solve this with documentation, explaining that ascending count sort is not accurate and does not give any hint regarding the level of error.

@epixa
Contributor

epixa commented May 22, 2017

@jimczi Thanks for the update. Any reason why we can't close this then?

@IdanWo

IdanWo commented Jul 18, 2017

Hey @colings86, can you please explain why increasing shard_size isn't good enough for ordering in all kinds of ways? I know that Elasticsearch can't report the error bounds in some cases, but if I don't rely on that, run my own tests, and know how many documents I need to take from each shard by using shard_size, will I always be able to be 100% accurate? For example, making the shard_size equal to the size (at which sum_other_doc_count is always 0). To be concrete, we make terms aggregations ordered by a sub reverse_nested aggregation and then a sub value_count aggregation.

In addition, why doesn't using shard_size scale horizontally? If I need to take a lot of documents from each shard, I can split the index into more shards and assign them to more nodes (at setup time, of course). The shard_size would be the same, that's true, but each shard would be smaller in size/document count.

@colings86
Contributor Author

@IdanWo Increasing shard_size is only good enough if you can guarantee that the shard_size is big enough that all of the terms are returned from each shard. Although this may work for low-cardinality fields and/or when the number of shards is relatively small, it does not scale well with the number of shards or the cardinality of the field. It is true, though, that in the single-shard case, and in the case where shard_size > number_terms_buckets_created on every shard, the results will be 100% accurate with any ordering.

Although you can indeed split the data across shards, you still need to return number_of_shards * shard_size buckets to the coordinating node for the reduce phase in order to get the final result. This means that even though the work on each shard is decreased by splitting it across more of them, for the same shard_size more shards means that the coordinating node has to hold more in memory (the response from each shard) and do more work during the reduce phase.
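To make the accuracy condition above concrete, here is a rough sketch (the index name, field, and numbers are illustrative, not a recommendation). The ascending-count results are only guaranteed to be exact if every shard has fewer distinct values of the field than shard_size, and the coordinating node still has to reduce up to number_of_shards * shard_size buckets:

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "least_common": {
      "terms": {
        "field": "status_code",
        "size": 5,
        "shard_size": 1000,
        "order": { "_count": "asc" }
      }
    }
  }
}
```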

@IdanWo

IdanWo commented Jul 21, 2017

@colings86, thanks for the excellent explanation! I understand the circumstances, BUT I believe that something isn't right with the design decisions made: I don't understand why the terms aggregation is considered a memory-intensive operation whereas cross-cluster search, which is potentially much more memory intensive since it obviously involves multiple indices with multiple shards that return responses to the coordinating node, is considered okay. In cross-cluster search the design decision wasn't to limit the request (the request's query or the number of shards involved in the request), but to return batched results to the coordinating node. Why can't we do something similar here? And why doesn't that improvement actually help solve the current issue with a large shard_size in the terms aggregation?

Therefore, it seems that there is motivation to support cross-cluster search but low motivation to support full terms aggregations, although technologically they are quite similar in terms of performance concerns. It seems to me that increasing the shard_size in a request to a single index is far less dangerous than making a request to an unlimited number of shards at once. How come sometimes the default is unlimited (action.search.shard_count.limit is unlimited) and sometimes it has to be configured (size: 0 in terms aggregations is deprecated)? Making a terms aggregation across 1000 shards with size: 300 is worse than a terms aggregation with size: 0 on 1 index with 10 shards and only 200 unique buckets.

This is taken from the Elasticsearch 5.4.0 release blog post (talking about #23946):

That said, it is quite easy to reach the 1,000 shard limit, especially with the recent release of Cross Cluster Search. As of 5.4.0, Top-N search results and aggregations are reduced in batches of 512, which puts an upper limit on the amount of memory used on the coordinating node, which has allowed us to set the shard soft limit to unlimited by default.

This is taken from the Tribe Nodes & Cross-Cluster Search blog post (pay attention to what is considered a good user experience here):

Now with the addition of cross cluster search where we are emphasizing searches across many, many shards, having such a soft limit isn’t providing a good user experience after all. In order to eliminate the impact of querying a large number of shards, the coordinating node now reduces aggregations in batches of 512 results down to a single aggregation object while it’s waiting for more results to arrive. This upcoming improvement was the initial step to eventually remove the soft limit altogether. The upcoming 5.4.0 release will also allow to reduce the top-N documents in batches. With these two major memory consumers under control, we now default the action.search.shard_count.limit setting to unlimited. This allows users to still limit and protect their searches in the number of shards while providing a good user experience for other users. When you perform a search request, the node which receives the request becomes the coordinating node which is in charge of forwarding the shard-level requests to the appropriate data nodes, collecting the results, and merging them into a single result set. The memory use on the coordinating node varies according to the number of involved shards. Previous, we had added a 1,000 shard soft limit to try to prevent coordinating nodes from using too much memory.

Remark:
I can agree that sometimes it's better to stop the use of a bad practice instead of enabling it and supporting its consequences. Personally, I sort terms in descending order but via a sub-aggregation, which is also discouraged. But I really have to do it. So I keep using shard_size: 20000, depending on the terms field, which has worked okay so far in our environment (600-800 ms at most for a query, and most of the time the real number of buckets is considerably lower than 20,000, more like 300).

@theflakes

theflakes commented Sep 17, 2017

As others have noted, sorting in ascending order is critical for exploratory data analysis and simple cyber security hunting for LFO events.

I'm probably going to expose my ignorance of ES under the hood, because I am ignorant there. Could a recursive comparison of the aggregate of all shard results help bring more confidence to the returned results? Possibly with follow-up queries for the results in question sent to specific shards, capped at X number of allowed sub-aggregations? I'm sure there are serious performance considerations involved that my ignorance of ES keeps me from fully appreciating.

Again, with all that said, I'll re-emphasize how critical ascending sort order is for many of the data types I deal with in my work. What concerns me here is not just removing the capability but also the confidence level of the currently returned results when querying across multiple shards, if I'm understanding this correctly.

@markharwood
Contributor

markharwood commented Mar 16, 2018

Since 5.2, Elasticsearch has had support for partitioning in the terms agg.

Those searching for low-frequency events with accuracy should use multiple search requests, with an appropriate choice for the number of partitions (the documentation for terms partitioning describes how to do this). Essentially you have to ensure numTermsReturned < the size setting in the terms agg.
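A minimal sketch of that partitioning workaround, with a made-up field name and partition count: run one request per partition, make size large enough that every term in the partition comes back, then combine the per-partition results on the client.

```json
POST /logs-*/_search
{
  "size": 0,
  "aggs": {
    "md5_partition_0": {
      "terms": {
        "field": "md5",
        "include": { "partition": 0, "num_partitions": 20 },
        "size": 10000,
        "order": { "_count": "asc" }
      }
    }
  }
}
```

Repeat for partitions 1 through 19. As long as each partition's term count stays below size (and the derived shard_size), every term in the partition is returned, so the counts, and therefore the ascending order within a partition, stay accurate.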

@colings86 given people have a genuine use for this and we have a workaround which maintains accuracy, maybe we can keep the reverse-sort feature but raise an error if we determine accuracy is potentially compromised, and point to the solution?

cc @elastic/es-search-aggs

@polyfractal
Contributor

Hey all. I linked to #20586 a while ago, but never explicitly commented about it.

We think we've devised an aggregation that will allow aggregating "Rare Terms" in a way that, while not 100% accurate, will provide bounded errors (unlike sorting by count ascending). Our plan is to implement the Rare Terms agg so that there's a path to providing this functionality in a more predictable, bounded manner... and then look into deprecating sorting ascending from Terms agg.

Still no timeline/ETA, but wanted to update everyone about what we were thinking.

@jccq

jccq commented Jun 1, 2018

uber cool!

@epixa
Contributor

epixa commented Jun 1, 2018

cc @elastic/kibana-visualizations

@danielo515

By the way, while you decide whether to remove it or not, how are you making queries to get the terms with the lowest doc_count?

@arusanescu
Contributor

arusanescu commented Apr 5, 2019

I haven't seen any updates here for a while; what is the current status/path going forward with this? Adding @colings86 to get some traction 👍

@polyfractal
Contributor

We're still working on the RareTerms aggregation outlined in #20586 (WIP PR here: #35718), so the plan outlined in #17614 (comment) above is still valid.

In short, we want to implement RareTerms first, then deprecate sorting by terms agg ascending, then remove at a later date.

@arusanescu
Contributor

arusanescu commented Apr 8, 2019

@polyfractal Awesome, thank you for the insight on this.
I did want to confirm whether the statement in the top comment by colings86, "sorting by anything but _term or descending _count makes it very hard to return the correct top buckets and counts", means that such results are NOT correct in general, or, if there is a level of discrepancy, what exactly are we looking at? If it is known for a fact that said sorting doesn't return correct top buckets/counts, I just want to know where we stand with the usage of such sorting. While it is deprecated, if the above is true, shouldn't there be a stronger message to align with the risks of using this type of sorting?

@polyfractal
Contributor

@arusanescu When sorting by count ascending (or by sub-agg metrics in many cases), there is a possibility of error. Whether an error creeps into the count or not depends on the aggregation settings (size, shard_size), the number of shards, and the data distribution.

We report the potential worst-case error with a doc_count_error_upper_bound field for the entire aggregation. This returns the highest doc count of a term which potentially didn't make the top-n list, and can be used to help judge how inaccurate the results are. This value can also be enabled on a per-bucket level to get more detailed error reporting.

The idea is that users can determine if they are comfortable with the error being reported, and either accept it or adjust shard_size to get the preferred error (or to not use the particular aggregation at all).

E.g. if the reported error shows a doc_count that could potentially be ranked 95 out of 100 requested results, that might be an error that the user is comfortable with. If the error shows a doc_count that could potentially be ranked 5 out of 100, that's probably unacceptable and they should reconsider.
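To make that concrete, here is a rough sketch of a request that enables per-bucket error reporting, together with an abridged response (the index, field, and numbers are illustrative):

```json
POST /my-index/_search
{
  "size": 0,
  "aggs": {
    "products": {
      "terms": {
        "field": "product",
        "size": 5,
        "show_term_doc_count_error": true
      }
    }
  }
}
```

```json
{
  "aggregations": {
    "products": {
      "doc_count_error_upper_bound": 46,
      "sum_other_doc_count": 79,
      "buckets": [
        { "key": "Product A", "doc_count": 100, "doc_count_error_upper_bound": 0 },
        { "key": "Product Z", "doc_count": 52, "doc_count_error_upper_bound": 2 }
      ]
    }
  }
}
```

Here the top-level doc_count_error_upper_bound is the worst-case error for the aggregation as a whole, and the per-bucket values bound how much each reported doc_count could be understating the true count.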

While it is deprecated, if above is true shouldn't there be a stronger message to align with the risks of using this type of sorting?

If you look at the terms aggregation documentation, something like 50% of the page is dedicated to explaining the errors, how to interpret them, and warning users off these sort orders :)

  • Detailed explanation of how doc counts can be approximate, including a worked example with sample shard distributions
  • Explanation of how shard_size can improve the accuracy at expense of more overhead
  • How error bounds are calculated, and how to enable for per-bucket error reporting
  • An explicit warning saying that users should not sort by _count ascending or sub aggs. Note the phrase "[...] errors are unbounded" ;)

I think we've done everything we can do to document the behavior and warn users off of bad sort orders. :)

@arusanescu
Contributor

@polyfractal Thank you for the very detailed explanation! While I had read some of the documentation I had also missed some, so this has helped me to better understand the problem and how to deal with it! 👍

@rjernst added the Team:Analytics label May 4, 2020
@nik9000
Member

nik9000 commented Jul 23, 2020

Now that rare_terms has well and truly landed I think it is time to talk about this again. Maybe deprecating the option in a 7.x release and removing it entirely in 8.0.
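For reference, now that the aggregation exists, a minimal rare_terms request might look roughly like this (the index and field names are made up); it returns the terms that appear in at most max_doc_count documents, with bounded error, instead of relying on ascending-count sorting:

```json
POST /logs-*/_search
{
  "size": 0,
  "aggs": {
    "rare_process_hashes": {
      "rare_terms": {
        "field": "process_md5",
        "max_doc_count": 2
      }
    }
  }
}
```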

@nik9000
Member

nik9000 commented Jul 29, 2020

A bunch of us got together and asked what it'd take to make sorting by ascending count accurate. We weren't sure, but we have some interest in giving it a shot, just not any time soon.

@polyfractal
Contributor

We had another discussion :)

As much as we'd like to remove ordering by ascending... we don't think that's a practical reality. Too many people and products rely on the functionality, despite it potentially returning unbounded error (sigh). So I'm going to close this ticket since we don't see ourselves deprecating the functionality.

I'm going to open a subsequent ticket to explore if we can possibly support sort ascending in the terms agg more accurately. RareTerms is still our recommended method if you want fast and reasonably accurate accounting of the long-tail... but we might be able to implement some new kind of execution mode or something for terms agg. It would undoubtedly be slower -- potentially a lot slower -- so it's unclear if we want to introduce such a feature. But that discussion should take place on a different issue.

I'll cross-link the tickets when it is open.

@wchaparro assigned nik9000 and unassigned nik9000 Jun 16, 2021