Use the new points API to index numeric fields. #17746
Conversation
That's not really true; it's that Lucene doesn't need horrible range APIs with tons of booleans and nulls. If you really want to do this, you can adjust the endpoints with NumericUtils.add() and NumericUtils.subtract() yourself: these work on byte[]. But I would not do this without still deprecating this brokenness in the query DSL. It's just brain damage from the inverted index; this stuff does not make sense for network addresses!
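(Illustrative only, not from the PR: a minimal sketch of the kind of endpoint adjustment being described, incrementing a fixed-width sortable byte[] encoding by one so that an exclusive bound becomes an inclusive one. NumericUtils.add()/subtract() operate on the same kind of representation; the IP example below is an assumption made purely for demonstration.)

```java
import java.net.InetAddress;
import java.util.Arrays;

import org.apache.lucene.document.InetAddressPoint;

public class AdjustEndpoint {

    /** Returns a copy of the big-endian unsigned value, incremented by one. */
    static byte[] increment(byte[] value) {
        byte[] result = Arrays.copyOf(value, value.length);
        for (int i = result.length - 1; i >= 0; i--) {
            if (result[i] != (byte) 0xFF) {
                result[i]++;       // no carry needed past this byte, we are done
                return result;
            }
            result[i] = 0;         // 0xFF wraps to 0x00, carry into the next byte
        }
        throw new ArithmeticException("overflow: the bound is already the maximum value");
    }

    public static void main(String[] args) throws Exception {
        // Exclusive lower bound 10.0.0.5 becomes inclusive lower bound 10.0.0.6,
        // in the same fixed-length (16 bytes), sortable encoding used for indexing.
        byte[] exclusiveLower = InetAddressPoint.encode(InetAddress.getByName("10.0.0.5"));
        byte[] inclusiveLower = increment(exclusiveLower);
        System.out.println(Arrays.toString(inclusiveLower));
    }
}
```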
I do think there is a potential trap here wrt global ordinals. We've got potentially large data sizes using this encoding, and global ordinals are not needed for anything it does, right? Sorting doesn't need them, and e.g. range faceting wouldn't need them (if implemented): it could just look up range->ordinal per segment up front. Can we avoid building global ordinals for IP fields? :)
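(A rough sketch, not from this PR, of the per-segment range-to-ordinal lookup suggested above, assuming SORTED_SET doc values; the class and method names are invented for illustration.)

```java
import java.io.IOException;

import org.apache.lucene.index.DocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

final class SegmentOrdinalRange {
    final long minOrd;
    final long maxOrd; // inclusive; if minOrd > maxOrd the range matches nothing in this segment

    private SegmentOrdinalRange(long minOrd, long maxOrd) {
        this.minOrd = minOrd;
        this.maxOrd = maxOrd;
    }

    /** Resolves encoded range endpoints to this segment's ordinal space once, up front. */
    static SegmentOrdinalRange resolve(LeafReader reader, String field,
                                       BytesRef min, BytesRef max) throws IOException {
        SortedSetDocValues values = DocValues.getSortedSet(reader, field);
        long minOrd = values.lookupTerm(min);
        if (minOrd < 0) {
            minOrd = -1 - minOrd; // not present: insertion point is the first ord >= min
        }
        long maxOrd = values.lookupTerm(max);
        if (maxOrd < 0) {
            maxOrd = -2 - maxOrd; // not present: the last ord <= max is insertionPoint - 1
        }
        return new SegmentOrdinalRange(minOrd, maxOrd);
    }
}
```

Per-segment counts bucketed against such an ordinal interval could then be merged across segments without ever building global ordinals.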
I took a pass through and this looks great. I had difficulty navigating these mapper APIs the way they were structured before, and I like how clean the integration now is. It's also good that the legacy* stuff is isolated without being entangled everywhere. I took a look through each of the data types: dates, ip addresses, primitives, and didn't spot any issues.
We don't have range faceting implemented for SORTED_SET doc values yet (I opened #17700 about whether we should), but if we do it, I agree this should not use global ordinals. Sorting would not use global ordinals; only terms aggregations would (e.g. figuring out the top recurring ip addresses that hit a web page). We could add the ability for terms aggs to not build global ordinals, but terms aggs already have lots of specializations so I am reluctant to add one more. Additionally, the cost of building global ords is the same as merging counts from segments in the end if the query matches most values, so I don't think it's too bad?
Yeah, if we want to do a terms agg by frequency, we need them. It is true there will be a merging cost, and this part will be done in uncompressed space, which we can expect to be a turtle if cardinality is high (with Java 9 APIs we can make it better). I'd hate for global ordinals to become common though for cases where they're unnecessary: e.g. complicated aggregations/processing built off the terms agg (demanding global ordinals) all trying to work around the lack of an efficient range faceting if that is really what is wanted.
If users do not have range faceting, I think the fallback will rather be on the
OK. I don't mean for it to hold up the issue; it is just something to mention since we are using the Sorted type for a "numeric-like" type here, which is different from anything before. But the problem is probably also not unique: if we want to add a BigInteger type, I think we will need the same thing (and we can make even fewer assumptions about it)?
Can you explain this statement more? I find it a little odd: Mike has operations like KNN going at ~1500 QPS against large datasets (LUCENE-7069), points have an efficient newSetQuery to replace "termsquery"-type cases, etc. So I'm wondering exactly what operations we expect to be slower?
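(For reference, the kind of calls being alluded to; the field names and values here are made up.)

```java
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.search.Query;

public class PointQueriesExample {
    public static void main(String[] args) {
        // Points equivalent of a multi-valued "terms" lookup on a numeric field.
        Query statusCodes = LongPoint.newSetQuery("status", 404, 410, 451);
        // Range query: the case where points are expected to shine.
        Query recent = LongPoint.newRangeQuery("timestamp", 1460000000000L, Long.MAX_VALUE);
        System.out.println(statusCodes + " " + recent);
    }
}
```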
Admittedly I haven't run any benchmark. This assertion was based on the fact that if some values have high document frequencies, then exact queries will have to visit all matching docs with points, while with the inverted index you can leverage the skip lists to skip over documents that do not match other required clauses. I can remove it if you think this is confusing.
I definitely think we should remove it. I think it's better not to make assumptions here: it's a different data structure and the rules are different. So far we have only been able to make everything faster.
Also, it is true that in some extreme cases (massive boolean AND of high-cardinality features) things could conceptually be slower than skip-list intersection. But in many such cases (e.g. AND of latitude and longitude), multiple dimensions may be the better/faster solution anyway. I definitely agree we should not go there here, but it is possible with these data types in Lucene, so we should keep it in mind. For most common uses like date ranges, I think people will only see points as faster, so we shouldn't set ourselves up for failure.
Wow, this change looks wonderful, thank you @jpountz! You didn't hit any new bugs in points? :) We can also now support larger integers (the equivalent of "long long", 128 bits, is available in Lucene's sandbox). I've also wondered about exact/set points performance vs TermQuery/TermsQuery but haven't run any tests yet...
LGTM
Fortunately no, which is good news since the release is out already. :) Thanks all for having a look. I will merge shortly.
This makes all numeric fields including `date`, `ip` and `token_count` use points instead of the inverted index as a lookup structure. This is expected to perform worse for exact queries, but faster for range queries. It also requires less storage.

Notes about how the change works:
- Numeric mappers have been split into a legacy version that is essentially the current mapper, and a new version that uses points, e.g. LegacyDateFieldMapper and DateFieldMapper.
- Since new and old fields have the same names, the decision about which one to use is made based on the index creation version.
- If you try to force using a legacy field on a new index or a field that uses points on an old index, you will get an exception.
- IP addresses now support IPv6 via Lucene's InetAddressPoint and are stored in SORTED_SET doc values using the same encoding (fixed length of 16 bytes and sortable).
- The internal MappedFieldType that is stored by the new mappers does not have any of the points-related properties set. Instead, it keeps setting the index options when parsing the `index` property of mappings and does `if (fieldType.indexOptions() != IndexOptions.NONE) { // add point field }` when parsing documents (see the sketch after this description).

Known issues that we won't fix:
- You can't use numeric fields in significant terms aggregations anymore, since this requires document frequencies, which points do not record.
- Term queries on numeric fields will now return constant scores instead of giving better scores to the rare values.

Known issues that we could work around (in follow-up PRs, this one is too large already):
- Range queries on `ip` addresses only work if both the lower and upper bounds are inclusive (exclusive bounds are not exposed in Lucene). We could either decide to implement it, or drop range support entirely and tell users to query subnets using the CIDR notation instead.
- Since IP addresses now use a different representation for doc values, aggregations will fail when running a terms aggregation on an ip field on a list of indices that contains both pre-5.0 and 5.0 indices.
- The ip range aggregation does not work on the new ip field. We need to either implement range aggs for SORTED_SET doc values or drop support for ip ranges and tell users to use filters instead. elastic#17700

Closes elastic#16751
Closes elastic#17007
Closes elastic#11513
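The following is a rough, hypothetical sketch of the parse-time pattern described in the last note of "Notes about how the change works", using the new `ip` field as an example. The helper name, signature and flags are illustrative assumptions rather than the PR's actual mapper code; the point is the index-options check guarding the point field and the reuse of InetAddressPoint's fixed-length encoding for doc values.

```java
import java.net.InetAddress;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.document.SortedSetDocValuesField;
import org.apache.lucene.document.StoredField;
import org.apache.lucene.index.IndexOptions;
import org.apache.lucene.index.IndexableField;
import org.apache.lucene.util.BytesRef;

final class IpFieldSketch {

    /** Builds the Lucene fields for one parsed ip value (illustrative, not the PR's code). */
    static List<IndexableField> createFields(String name, InetAddress address,
                                             IndexOptions indexOptions,
                                             boolean hasDocValues, boolean isStored) {
        List<IndexableField> fields = new ArrayList<>();
        if (indexOptions != IndexOptions.NONE) {
            // add point field
            fields.add(new InetAddressPoint(name, address));
        }
        if (hasDocValues) {
            // same fixed-length (16 bytes), sortable encoding as the point field
            fields.add(new SortedSetDocValuesField(name, new BytesRef(InetAddressPoint.encode(address))));
        }
        if (isStored) {
            fields.add(new StoredField(name, InetAddressPoint.encode(address)));
        }
        return fields;
    }
}
```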