Add support for "missing" to all bucket aggregations #5324
Comments
@roytmana we're considering adding it, though can't promise it for 1.2 (we're currently working on several other enhancements/features for aggs, so we'll see if it'll fit) |
+1 |
Any plans to pick this up? I am having the same issue and currently work around it with multi search requests. That is OK so far, since I only have a couple of aggregation levels, but I can see it getting messy if I go deeper. |
Any news on this issue? |
I just ran into this -- we're trying to migrate from the deprecated facets to aggregations and are finding this a real gap in the supposed new and better way of doing things. |
I'm trying to move to aggregations and this is really holding us back. Forced to use facets for now... |
Same here - lack of support for _missing as a bucket as well as lack of _other as a bucket (however inefficient it may be it still will be better than a convoluted multi-step process I would be forced to use to calculate it myself) prevents me from moving from facets to aggs |
Deprecating facets without a viable solution via aggregations is madness. If/when this happens, some of us won't be able to update to the new ES version. |
Exactly my feeling. I commented on the deprecation change request a while ago, and it looks like the ES team does not feel this way. They consider facets a legacy that makes it harder to move the product forward. That is understandable, but very unpleasant for people who do heavy-duty (and generic) analytics with facets. I wish they looked at aggs from a productivity standpoint and considered how well they suit traditional data-mart-style applications, which typically operate on a consistent dataset regardless of grouping, rollups etc. This is where MISSING and OTHER are really handy... I also wish they would look at metrics that are expressions against other metrics in query aggs... |
In facets, In the end, we can't return the document count for other terms. However, we could imagine returning the sum of the document counts for other terms, would that be enough? Note that in the single-valued case, it would be equal to the number of documents that have another term, the issue that I described above only occurs with multi-valued fields. Here is a suggested format for the response:
{
"aggregations": {
"colors": {
"buckets": [
{
"key": "red",
"doc_count": 5
},
{
"key": "green",
"doc_count": 3
}
],
"sum_of_other_buckets": {
"doc_count": 3
}
}
}
}
Something else that might be possible would be to return sub aggregations for the other terms, but it would suffer from the same issues in the multi-valued case, and I would like to keep it for later as it would require significant work. |
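To make the suggested format concrete, here is a minimal Python sketch of how a client might fold `sum_of_other_buckets` into its per-key counts. The helper name `counts_with_other` and the `_other` key are assumptions made for illustration; they are not part of the proposal, and the key could collide with a real term.

```python
# The response shape suggested above, as a plain Python dict.
response = {
    "aggregations": {
        "colors": {
            "buckets": [
                {"key": "red", "doc_count": 5},
                {"key": "green", "doc_count": 3},
            ],
            "sum_of_other_buckets": {"doc_count": 3},
        }
    }
}


def counts_with_other(agg, other_key="_other"):
    """Return {key: doc_count} for the visible buckets plus one entry for the rest.

    `other_key` is a hypothetical, client-chosen label.
    """
    counts = {bucket["key"]: bucket["doc_count"] for bucket in agg["buckets"]}
    counts[other_key] = agg["sum_of_other_buckets"]["doc_count"]
    return counts


colors = counts_with_other(response["aggregations"]["colors"])
```

For a single-valued field the values of `colors` add up to the number of documents that have a value; for a multi-valued field they add up to the number of values, as noted above.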
+1 |
The ability to sub-aggregate is absolutely necessary. I think the proposal to have the "other" bucket mean something slightly different is OK, as long as we document its issue with multi-valued fields. As Adrien mentioned, this is not an issue with single-valued properties. However, I think it's important to treat the "other" bucket as equivalent to the regular buckets. Aggregations have a defined nested structure, and breaking that will harm the ability to process them in a cleanly recursive fashion. I'd rather see the other bucket added to the bucket array, perhaps with a customizable key as suggested by the original proposal, something like: Request:
Response:
|
sum_of_other_buckets makes sense and its name clearly describes what it is. Maybe documents_in_other_buckets is clearer? |
Like @rashidkpc, I would love it if _other were a bucket aggregation, not just a metric. My primary interest is to be able to calculate metrics for the _other bucket, not just counts. Say my analytics shows a $ sales breakdown by store. I would like to show sales for the top 20 stores and then lump all other sales into _other, so that the total rollup for the entire company does not depend on the number of "visible" buckets. If we could also support bucket sub-aggs within the _other bucket, it would be fantastic.

For a single-valued field the logic of _missing and _other is rather clear, and it is the most common case. For multi-valued fields it is not so clear, as @jpountz noted. Maybe if ES provided an _other bucket and let me pick the metrics within it and how to interpret it, that would cover a more generic use case? There could be a bucket aggregation called Distinct which takes the parent document set and de-duplicates it, so that any metrics within such a bucket would not double-count.

My example above would not work very well with a multi-valued field, as it would double-count $, but it would double-count just the same if I tried to roll up the visible buckets. Still, I would argue that the single-valued use case is the more important one to have a complete and very productive implementation for (in my mind, that means support for _missing and _other as buckets). If I need to do analytics like in my example on a multi-valued data element, I should probably structure my data so that each value of the "multi" is a document carrying its fraction of the $, or accept double-counting in some shape or form.

My current solution for _other (even with facets, since facets do not support _other on anything but count) is to calculate the same metric for the entire dataset and then subtract the sum of the metric for the "visible" facets. That of course does not work for multi-valued fields. but I can |
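The subtraction workaround described in this comment can be sketched in a few lines of Python. The function name, the `sales` field and the numbers are made up for illustration; the sketch assumes a single-valued field, since with multi-valued fields the visible buckets may count the same document more than once.

```python
def other_metric(total, visible_buckets, metric="sales"):
    """Derive the metric for the implicit "other" bucket.

    total:            the metric computed over the entire dataset
    visible_buckets:  the top-N buckets returned by the facet/aggregation
    Only meaningful for single-valued fields.
    """
    return total - sum(bucket[metric] for bucket in visible_buckets)


# Hypothetical example: company-wide sales and the two visible stores.
top_stores = [
    {"key": "store-1", "sales": 400},
    {"key": "store-2", "sales": 250},
]
other_sales = other_metric(1000, top_stores)
```

With these numbers `other_sales` is 350, so the visible buckets plus the derived bucket roll up to the company total regardless of how many buckets are shown.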
@rashidkpc I'm concerned that it requires knowing a term that doesn't exist among your documents, otherwise there could be a collision. This might not always be easy? We might be able to do something about sub aggregations for the other bucket, but this requires much more work, so I would like to do it in several steps and start with just the count (however, the format should allow for adding data for sub-buckets in the future). |
+1 |
+1. If my ELK stack can't help me to solidly point the finger at something --- because signals get weakened if I can't tell how significant the top-N things are --- then it becomes much harder to use Kibana (4) effectively. It effectively makes me want to keep Kibana 3 around. |
+1 |
+1 for the "other_bucket" |
looks like a lot of people have the same need as I do (bucketing on _missing and _other) |
The reason why this feature has not been implemented entirely is that it is very challenging. Your complaint feels like we are leaving users in a dead end but I don't think it's true. As a follow-up of this issue we added the ability to get the document count for other buckets (#8213) and for documents that miss a value for the field, it is still possible to use the In addition, we are currently exploring new ways to work on top of aggregations in #9876. It is not clear yet whether it will help on this issue but it will at least open new doors. |
@jpountz I appreciate the update, and I realize that it is a complex matter, but it is very much needed for this rather common use case (whether for developers using ES or for ELK users). I am not really complaining. I am just making the observation that it has been a year since I submitted the issue, it has seen fairly high interest from the user community, and yet it has not received a target version number, so it is quite possible that it will not get implemented (for what could be very valid technical or strategic reasons). If that's so, I personally would like to know, so that I can concentrate on finding a solution. As you said, missing + total aggregations together with some transformation logic would allow calculating _other and _missing, but for complex scenarios of UI-driven dynamic multilevel aggregations such queries quickly become really huge and complicated, and a lot more expensive than even an unoptimized built-in solution. So if the consensus of the ES team is that implementing _other and _missing buckets is not feasible, I would like to know it and start working on a solution that expands a concise "logical" query, defining which aggs should take _missing and/or _other into account, into a large ("physical") ES one, and then transforms the results to calculate and inject _other and _missing buckets on all levels of the result tree transparently. Thank you |
Really need this too and it's the biggest barrier in moving to kibana 4 IMO. +1 |
I think that without that functionality the "top N" feature is really not that useful. +1 |
+1 |
Most aggregations (terms, histogram, stats, percentiles, geohash-grid) now support a new `missing` option which defines the value to consider when a field does not have a value. This can be handy if you e.g. want a terms aggregation to handle documents that have "N/A" or no value for a `tag` field the same way. This works in a very similar way to the `missing` option on the `sort` element. One known issue is that this option sometimes cannot make the right decision in the unmapped case: it needs to replace all values with the `missing` value but might not know what kind of values source should be produced (numerics, strings, geo points?). For this reason, we might want to add an `unmapped_type` option in the future like we did for sorting. Related to elastic#5324
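As an illustration of the `tag` example in that change description, a request body using the `missing` option on a terms aggregation might look like the sketch below. It is built as a plain Python dict so that it can be sent with any HTTP client; the aggregation name `tags`, the helper name and the `size` values are assumptions for illustration.

```python
import json


def terms_with_missing(field, missing_key, size=10):
    """Build a terms aggregation that buckets documents lacking `field`
    under `missing_key` instead of skipping them."""
    return {"terms": {"field": field, "missing": missing_key, "size": size}}


# Documents with tag "N/A" and documents with no tag at all share one bucket.
body = {
    "size": 0,  # only the aggregation results are wanted, no hits
    "aggs": {"tags": terms_with_missing("tag", "N/A")},
}

print(json.dumps(body, indent=2))
```

Because the key is chosen by the caller, it is the caller's job to pick a value that cannot clash with a real term, unless merging the two is intended as it is here.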
+1 |
Update: Now that most aggregations support a missing option (#11042) and that the terms aggregation returns counts for the While I'm pretty happy with the way we deal with missing, I'm concerned about adding an |
My main reason for +1 this issue was that I wanted an other bucket for filter aggs. I found it inconvenient to have to duplicate the entire filter agg (which can be very complex) within a parallel filter agg that has an enclosing 'not' filter to get the remainder bucket. Afaik that's still not taken care of, so no simple way to get the remainder/other bucket for a filter agg. |
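For readers who have not hit this, the duplication described in this comment looks roughly like the following Python sketch: the same filter is wrapped a second time in a negating `bool`/`must_not` clause to obtain the remainder. The helper name, the `_other` suffix and the example filter are assumptions for illustration.

```python
def filter_with_remainder(name, flt):
    """Return two sibling filter aggregations: one for `flt` and one for
    everything that does not match it. The filter body appears twice."""
    return {
        name: {"filter": flt},
        name + "_other": {"filter": {"bool": {"must_not": [flt]}}},
    }


# Hypothetical filter; in practice this can be large and deeply nested.
errors = {"term": {"level": "error"}}
aggs = filter_with_remainder("errors", errors)
```

Generating the pair programmatically hides the duplication from the caller, but the request sent to the cluster still carries the full filter twice.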
Indeed |
I would say "other" is very useful as well, particularly when people implement data-mart-like systems with elasticsearch. They tend to present a complete dataset rollup to the user, where "other" is needed to represent the complete dataset. Also, in traditional data marts with star-like data models they would avoid any multi-valued relations, modeling them instead as multiple fact "tables", so the issue of counting "other" multiple times would not be especially relevant to these use cases. And when people facet on a multi-valued field, they need to understand the multiplicity issue, just as they need to understand that the sum of the counts of its buckets is not the number of docs but the number of values. I would like "other" to be a fully supported bucket, enabled only if the user explicitly asks for it to be calculated by providing a key value for it. I think the case of single-valued aggs, where it works as intuitively expected, is so useful in itself as to justify this feature, and the multi-valued case would need to be understood and dealt with by each elastic user. Just my 2c Jist |
+1 |
+1 |
+1 |
+1 |
This is actually now implemented via #11042 and will be available in elasticsearch 2.0. |
Hmm, just remembered that in spite of the title, this issue was not only about missing but also about an other bucket. I consider the |
Hi there, let me ask one thing. My scenario is this: let's say I send 3 requests to system A and expect 3 responses from system A. E.g. raw msg:
The system A logs above are streamed to logstash. E.g. fields:
In the request and response logs, no response log is found for the second request. Can anybody help me out? |
NEED: In many (if not the majority of) cases, when presenting users with business analytics, the user wants to see numbers for the complete data set. No matter how you aggregate, it should present the same data with the same number of documents. The inability to handle "missing" values excludes them from the analysis, making the analyzed data set incomplete and the grand totals dependent on which field(s) the aggregation is done on. It is impossible to explain to users why the lower-level totals do not add up to the upper-level ones!
WORKAROUND: Currently, field-based bucket aggregations (terms, range etc.) have no way to aggregate missing values. The only way is to use a missing aggregation on the same level and the same field as the terms aggregation itself. That is easy enough for one-level aggregations, but with a 2-3 level aggregation the number of "missing" aggregations (and of complete lower-level aggregations to be repeated inside them) mushrooms very quickly, to the point that the query is huge, convoluted and not debuggable. It may affect performance as well. Also, the fetched data needs to be heavily post-processed to extract the multiple levels of aggregation buckets from under the various "missing" elements and put them inline with the regular aggregation values. Below please see a simple query to do 2 level aggregation with just one sum metric
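The query originally attached here is not reproduced in this copy of the issue. As a stand-in, the following Python sketch (not the author's query) generates a request of the shape the workaround describes: at every level a `terms` aggregation and a sibling `missing` aggregation on the same field, each repeating the complete lower-level aggregation. The field names `store`, `category` and `amount` and the `by_`/`missing_` naming are assumptions for illustration.

```python
def build(fields, metric_field):
    """Nest one level per field; each level needs a terms agg AND a missing agg,
    and both must carry the full sub-aggregation tree."""
    if not fields:
        return {"total": {"sum": {"field": metric_field}}}
    head, rest = fields[0], fields[1:]
    sub = build(rest, metric_field)
    return {
        "by_" + head: {"terms": {"field": head}, "aggs": sub},
        "missing_" + head: {"missing": {"field": head}, "aggs": sub},
    }


def count_metrics(aggs):
    """Count how many times the single sum metric ends up in the request."""
    n = 0
    for body in aggs.values():
        if "sum" in body:
            n += 1
        else:
            n += count_metrics(body["aggs"])
    return n


two_levels = build(["store", "category"], "amount")
three_levels = build(["store", "category", "region"], "amount")
```

The single metric appears 4 times in the two-level request and 8 times in the three-level one, i.e. it doubles with each level, which is the "mushrooming" described above; the response needs the mirror-image post-processing to merge the `missing_*` branches back into the regular buckets.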
PROPOSAL: I would suggest that any aggregation operating on a field should have a missing option. If missing config is specified, aggregation should accumulate missing values under that value and honor any nested aggregations within. It should never assume any value like 0 or _missing since it may clash with actual keys. If it is not specified the aggregation should skip missing values as it does now.
This approach makes it entirely compatible with the existing logic and gives developers complete control over whether to aggregate missing values and under what key. In cases when it is not needed (and not specified) there will be no performance overhead. But when it is needed it will work faster, as we would not need to do the missing aggregation and the aggregations under it separately (same goes for an "other" aggregation)
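The proposed semantics can be modeled in plain Python, independent of Elasticsearch. This is only a sketch of the intended behaviour, assuming single-valued fields and a sum metric; the function name `terms_sum` and the sample documents are made up for illustration.

```python
from collections import defaultdict


def terms_sum(docs, field, metric, missing=None):
    """Sum `metric` per value of `field`.

    If `missing` is given, documents without a value are accumulated under
    that caller-chosen key; otherwise they are skipped, as happens today.
    """
    totals = defaultdict(int)
    for doc in docs:
        key = doc.get(field)
        if key is None:
            if missing is None:
                continue  # existing behaviour: ignore documents with no value
            key = missing
        totals[key] += doc[metric]
    return dict(totals)


docs = [
    {"store": "A", "amount": 10},
    {"store": "B", "amount": 5},
    {"amount": 7},  # no store recorded
]

without_missing = terms_sum(docs, "store", "amount")
with_missing = terms_sum(docs, "store", "amount", missing="N/A")
```

Only the second result adds up to the grand total of 22, which is the completeness property the NEED section asks for.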
To be honest, I would love to see the same handling for "other": documents that have not been included in the aggregation due to the aggregation size constraints. Again, the rationale is the same: the ability to slice the complete data set regardless of aggregation structure. It is just as needed as "missing" and just as troublesome to calculate, but I could understand if you did not add it, as it may not be compatible with your algorithms. But PLEASE PLEASE add "missing" handling at least
cc @uboness, @jpountz