Remove support for sorting terms aggregation by ascending count #17614
Comments
If and when you decide if you're going to do this, please open an issue on Kibana. We currently support this in the UI, so we'll need to issue a deprecation notice well in advance. |
Oh, also, I'm +1 on this; it's confusing and rarely useful anyway. If we're going to deprecate it pre-5.0.0, which I'd prefer, let me know. |
Since this requires some changes in Kibana I've reverted the removal. Though it is now deprecated in 2.x/2.4. |
Frankly, I know people who were using this, and I'd say relying on it, e.g. to spot anomalies. If the only problem is that it wasn't working very well, couldn't you just have clarified that in the docs? I mean, so many things in Elasticsearch don't really return the right number. |
Unless there is an alternative for that use case, of course. Is there? E.g. what are the least frequent MD5s executed across the logs? |
For folks that have small datasets this is still useful and does not hurt anything to my knowledge. I do not understand why this is being removed as well. I plead to have this kept and enhance the docs and/or add a check based upon the size or cardinality of the field perhaps. |
I've worked with a few customers who use the ascending count to determine what the least popular items are, and also the occasional outliers. I agree that it might be inaccurate over large datasets, but it would be helpful on smaller samples and it doesn't hurt to keep it around, as @djschny suggests. |
The point is: what is the replacement, really? Sure, it's inaccurate, but is
Giovanni Tummarello |
We use this extensively to find outliers in log data. Usually in large data sets you can filter out the most common items before performing the aggregation, so loss of accuracy isn't a big problem. Removing this feature is a huge problem for logging use cases -- nearly crippling, IMO. |
Furthermore: in exploratory log analysis the exact content of the tails often doesn't matter as much as the general kinds of things in the tails. As an example, I just used this feature to reduce the set of "interesting" documents in an index of about 6 million logs from 1.5 million to about 100 over the course of 5 minutes by iteratively excluding categories of things I found in the tails. |
@colings86 @jimferenczi ... this is a rather poor decision for a number of use-cases where finding rare occasions is important. It severely impacts security analytics for example. |
I want to explain this a bit more, as I don't think it's really clear in the description above (apologies for that). The problem here isn't that the counts can be wrong; the problem is that there is currently no bound on how wrong the counts can be (and no way to know what the error might be). To explain this, consider the following example. Imagine we are looking for the top 5 terms in a field across 3 shards, ordered by ascending count. The terms aggregation retrieves the top 5 terms from each shard and then merges them together (in practice it actually retrieves more than the top 5 from each shard, but for the purposes of this example let's assume
When merged on the reduce node this will produce the final list of:
So the final top 5 will be:
Which seems great until you look into the results from the shards a bit closer. The counts returned from each shard are 100% accurate so if a shard says it has 1 document with the term The |
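To make the merge problem described above concrete, here is a minimal Python sketch. It is not Elasticsearch code: the shard contents, the term names, and the helper functions are all hypothetical, and the merge is a simplified model of "take the lowest-count terms per shard, then sum whatever came back". It only illustrates how a term that is rare on one shard but common on the others can end up at the top of an ascending list with a badly undercounted total.

```python
from collections import Counter


def shard_top_ascending(shard_counts, shard_size):
    """Return the shard_size least frequent terms of one shard.

    The local counts are exact; only the selection is partial.
    """
    ranked = sorted(shard_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return dict(ranked[:shard_size])


def merge_ascending(shards, size, shard_size):
    """Simplified reduce step: sum only the counts each shard returned."""
    merged = Counter()
    for shard in shards:
        merged.update(shard_top_ascending(shard, shard_size))
    ranked = sorted(merged.items(), key=lambda kv: (kv[1], kv[0]))
    return ranked[:size]


def true_counts(shards):
    """Exact totals, computed by looking at every term on every shard."""
    total = Counter()
    for shard in shards:
        total.update(shard)
    return total


# Hypothetical per-shard term counts (invented for illustration).
shards = [
    {"a": 1, "b": 2, "c": 50, "d": 60},
    {"a": 100, "b": 3, "e": 1, "d": 70},
    {"a": 200, "c": 2, "e": 2, "b": 90},
]

reported = merge_ascending(shards, size=2, shard_size=2)
exact = true_counts(shards)

for term, count in reported:
    print(f"{term}: reported={count}, actual={exact[term]}")
```

With this made-up data, term `a` is reported with a count of 1 although it actually occurs 301 times, because the two shards where it is common never included it in their "least frequent" lists. Nothing in the merged result signals that the count is off, which is the unbounded-error problem.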
This is still relevant and accurate when you are searching an index with only 1 shard, correct? |
@clintongormley @colings86 @jimczi Are there any plans to continue with this for 6.0? It looks stalled at the moment, but I want to make sure we remove the feature from Kibana if it is being removed from Elasticsearch as well. |
@jimczi Thanks for the update. Any reason why we can't close this then? |
Hey, @colings86 . Can you please explain why increasing In addition, why using the |
@IdanWo Increasing Although you can indeed split the data across shards you still need to return |
@colings86 , thanks for the excellent explanation! I understand the circumstances, BUT I believe that something isn't right with the design decisions made: I don't understand why Therefore, it seems that there is a motivation to support cross cluster search but a low motivation to support full terms aggregations - although technologically they are quite the same in aspects of performance issues. It seems to me that increasing the This is taken from the Elasticsearch 5.4.0 released blog post (talking about #23946):
This is taken from the Tribe Nodes & Cross-Cluster Search blog post (pay attention to what is considered a good user experience here):
Remark: |
As others have noted, sorting in ascending order is critical for exploratory data analysis and simple cyber security hunting for LFO events. I'm probably going to expose my ignorance of ES under the hood, because I am ignorant there. Could a recursive comparison of the aggregate of all shard results help bring more confidence to the returned results? Possibly with sub-queries of certain results in question to specific shards, capped at X number of allowed sub-aggregations? I'm sure there are serious performance considerations involved that my ignorance of ES keeps me from fully appreciating. With all that said, I'll re-emphasize the criticality of ascending sort order for many of the data types I deal with in my work. What concerns me here is not just the removal of the capability but also the confidence level of the currently returned results when querying across multiple shards, if I'm understanding this correctly. |
Since 5.2, Elasticsearch has had support for partitioning in the terms aggs. Those searching for low-frequency events with accuracy should use multiple search requests, with an appropriate choice for the number of partitions (the documentation for terms partitioning describes how to do this). Essentially you have to ensure numTermsReturned < @colings86 given people have a genuine use for this and we have a workaround which maintains accuracy, maybe we can keep the reverse-sort feature but error if we determine accuracy is potentially compromised, and point to the solution? cc @elastic/es-search-aggs |
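As a rough illustration of the partitioning workaround mentioned above, the sketch below builds one search body per partition using the terms aggregation's `include` option with `partition` and `num_partitions`. The function name, the aggregation name `low_frequency`, and the field name `md5` are hypothetical, and the right values for `num_partitions` and `size` depend on the field's cardinality, so treat this as a starting point and follow the terms partitioning documentation for the sizing rule.

```python
def partitioned_terms_requests(field, num_partitions, size):
    """Yield one search body per partition of the field's terms.

    Each request covers a disjoint slice of the terms; the caller is
    expected to run every request and merge the buckets client-side.
    """
    for partition in range(num_partitions):
        yield {
            "size": 0,  # no hits needed, only the aggregation
            "aggs": {
                "low_frequency": {  # hypothetical aggregation name
                    "terms": {
                        "field": field,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": size,
                        "order": {"_count": "asc"},
                    }
                }
            },
        }


# Example: 4 partitions over a hypothetical "md5" keyword field.
bodies = list(partitioned_terms_requests("md5", num_partitions=4, size=1000))
print(len(bodies))
```

Each body would be sent as a separate search request; once all partitions have been fetched, sorting the combined buckets by `doc_count` ascending on the client gives the low-frequency terms.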
Hey all. I linked to #20586 a while ago, but never explicitly commented about it. We think we've devised an aggregation that will allow aggregating "Rare Terms" in a way that, while not 100% accurate, will provide bounded errors (unlike sorting by count ascending). Our plan is to implement the Rare Terms agg so that there's a path to providing this functionality in a more predictable, bounded manner... and then look into deprecating sorting ascending from Terms agg. Still no timeline/ETA, but wanted to update everyone about what we were thinking. |
uber cool! |
cc @elastic/kibana-visualizations |
By the way, while you decide whether to remove it or not, how are you making queries to get the terms with the lowest doc_count? |
I don't see any updates here for a while, what is the current status/path going forward with this? Adding @colings86 to get some traction 👍 |
We're still working on RareTerms aggregation outlined in #20586 (WIP PR here: #35718), so the plan outlined in #17614 (comment) above is still valid. In short, we want to implement RareTerms first, then deprecate sorting by terms agg ascending, then remove at a later date. |
@polyfractal Awesome, thank you for the insight on this. |
@arusanescu When sorting by count ascending (or sub-agg metrics in many cases), there is a possibility of error. Whether an error creeps into the count or not is dependent on the aggregation (size, shard_size), number of shards and data distribution. We report the potential worst-case error with a The idea is that users can determine if they are comfortable with the error being reported, and either accept it or adjust E.g. if the reported error shows a doc_count that could potentially be ranked 95 out of 100 requested results, that might be an error that the user is comfortable with. If the error shows a doc_count that could potentially be ranked 5 out of 100, that's probably unacceptable and they should reconsider.
If you look at the terms aggregation documentation, something like 50% of the page is dedicated to explaining the errors, how to interpret it, and warning users off sorting :)
I think we've done everything we can do to document the behavior and warn users off of bad sort orders. :) |
@polyfractal Thank you for the very detailed explanation! While I had read some of the documents I did also miss some and so this has helped me to better understand the problem and how to deal with it! 👍 |
Now that |
A bunch of us got together and asked what it'd take to make sorting by ascending count accurate. We weren't sure, but we have some interest in giving it a shot, just not any time soon. |
We had another discussion :) As much as we'd like to remove ordering by ascending... we don't think that's a practical reality. Too many people and products rely on the functionality, despite it potentially returning unbounded error (sigh). So I'm going to close this ticket since we don't see ourselves deprecating the functionality. I'm going to open a subsequent ticket to explore if we can possibly support sort ascending in the terms agg more accurately. RareTerms is still our recommended method if you want fast and reasonably accurate accounting of the long-tail... but we might be able to implement some new kind of execution mode or something for terms agg. It would undoubtedly be slower -- potentially a lot slower -- so it's unclear if we want to introduce such a feature. But that discussion should take place on a different issue. I'll cross-link the tickets when it is open. |
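For readers who want to try the Rare Terms approach recommended above, here is a small sketch of what such a request body might look like in Elasticsearch versions that ship the `rare_terms` aggregation. The helper name, the aggregation name `rare`, and the field name are hypothetical; `max_doc_count` is the threshold at or below which a term is considered rare.

```python
def rare_terms_request(field, max_doc_count=1):
    """Build a search body that asks for terms occurring at most max_doc_count times."""
    return {
        "size": 0,  # aggregation only, no hits
        "aggs": {
            "rare": {  # hypothetical aggregation name
                "rare_terms": {
                    "field": field,
                    "max_doc_count": max_doc_count,
                }
            }
        },
    }


# Example: terms of a hypothetical "md5" field seen in at most 2 documents.
body = rare_terms_request("md5", max_doc_count=2)
print(body)
```

Unlike a terms aggregation sorted by ascending count, this takes a document-count threshold rather than a number of buckets, so it answers "which terms appear at most N times" rather than "which N terms appear least often".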
We try to be as flexible as possible when it comes to sorting terms aggregations. However, sorting by anything other than _term or descending _count makes it very hard to return the correct top buckets and counts, which is disappointing to users. For this reason we should remove the ability to sort the terms aggregation by ascending count (split out from #17588).