Track min/max on numerics in field data per segment #5829
Comments
If no one else is currently working on this, I'd like to attempt it as a first contribution. Is there a particular milestone this would be useful for? Thanks.
Note that we added this to Lucene, in https://issues.apache.org/jira/browse/LUCENE-5610 which will be available when ES upgrades to Lucene 4.9. So in ES we just need to call the methods in NumericUtils and then act accordingly...
Thanks. I'll keep an eye on this for when the 4.9 upgrade is happening.
There seems to be activity related to this issue at https://issues.apache.org/jira/browse/LUCENE-5860
Hi, I just wanted to ask, what was the fix for this?
#10523 already exposed the min/max APIs added in LUCENE-5860 at the index level, but nothing has been done for this issue itself, e.g. to optimize range filters based on the min/max of a segment. It's currently too costly for Lucene's postings APIs to compute the max numeric value: doing so requires a binary search over the terms because of how the numeric prefix terms are encoded. Once we cut over to auto-prefix encoding for numeric terms, this becomes much cheaper and I think optimizations like this become more realistic. I think higher-level optimizations could be very worthwhile: for time-based indices, for example, knowing that a given index won't have any hits because of a top-level range filter should be a big speedup in many cases. There is a separate issue to explore this but I can't find it right now.
The discussed optimizations have been implemented in 5.0.
If we track the min/max of numeric values in field data per segment, we can potentially use it in several places to optimize execution.
For example, in a range filter, if the field data for a field is loaded, the min/max can be used to check whether the term / range filter needs to be executed at all, or whether it can work as a match all. We could potentially also improve the boolean filter so that it has a special case for match all.
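To make the range filter idea concrete, here is a minimal sketch of the check, not Elasticsearch or Lucene code. The names `Relation` and `classify_range` are hypothetical, the query range is assumed to be inclusive on both ends, and the "match all" outcome assumes every document in the segment has a value for the field (otherwise it only means "matches every document that has a value").

```python
from enum import Enum


class Relation(Enum):
    """Hypothetical outcomes of comparing a segment's min/max with a query range."""
    DISJOINT = "match_none"   # no value in the segment can fall in the range
    WITHIN = "match_all"      # every value in the segment falls in the range
    INTERSECTS = "execute"    # the filter has to be evaluated per document


def classify_range(seg_min, seg_max, lower, upper):
    """Compare tracked [seg_min, seg_max] with the inclusive range [lower, upper]."""
    if seg_max < lower or seg_min > upper:
        # The segment's values lie entirely outside the range: skip the filter.
        return Relation.DISJOINT
    if lower <= seg_min and seg_max <= upper:
        # The range covers all values in the segment: treat as match all
        # (assumption: every document has a value for this field).
        return Relation.WITHIN
    # Partial overlap: min/max alone cannot decide, run the real filter.
    return Relation.INTERSECTS
```

With this shape, a boolean filter could drop `WITHIN` clauses from a conjunction and short-circuit on `DISJOINT`, which is the match-all special case mentioned above.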
Another option is to use this in aggs, where the min/max can be used to do bucket estimations.
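As a rough illustration of bucket estimation, the sketch below derives an upper bound on the number of buckets a fixed-interval histogram could produce from a segment's min/max. The function name `estimate_histogram_buckets` and the bucketing rule (bucket key = `floor(value / interval)`) are assumptions for this example, not how any particular aggregation is implemented.

```python
import math


def estimate_histogram_buckets(seg_min, seg_max, interval):
    """Upper bound on the bucket count for values in [seg_min, seg_max].

    Assumes each value lands in bucket floor(value / interval). The real
    count can be lower, since buckets between min and max may be empty.
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if seg_min > seg_max:
        raise ValueError("seg_min must not exceed seg_max")
    first_bucket = math.floor(seg_min / interval)
    last_bucket = math.floor(seg_max / interval)
    return last_bucket - first_bucket + 1
```

An estimate like this could be used to pre-size per-bucket data structures before collecting documents.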