
Use the new points API to index numeric fields. #17746

Merged (1 commit) on Apr 14, 2016

Conversation

@jpountz (Contributor) commented Apr 14, 2016

This makes all numeric fields including `date`, `ip` and `token_count` use
points instead of the inverted index as a lookup structure. This is expected
to perform worse for exact queries, but faster for range queries. It also
requires less storage.

Notes about how the change works:

  • Numeric mappers have been split into a legacy version that is essentially
    the current mapper, and a new version that uses points, e.g.
    LegacyDateFieldMapper and DateFieldMapper.
  • Since new and old fields have the same names, the decision about which one
    to use is made based on the index creation version.
  • If you try to force using a legacy field on a new index or a field that uses
    points on an old index, you will get an exception.
  • IP addresses now support IPv6 via Lucene's InetAddressPoint and store them
    in SORTED_SET doc values using the same encoding (fixed length of 16 bytes
    and sortable).
  • The internal MappedFieldType that is stored by the new mappers does not have
    any of the points-related properties set. Instead, it keeps setting the index
    options when parsing the `index` property of mappings and does
    `if (fieldType.indexOptions() != IndexOptions.NONE) { // add point field }`
    when parsing documents (see the sketch just after this list).
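
For illustration, here is a minimal sketch of that parsing step (the field name and value are hypothetical, and it uses the plain Lucene 6.x `LongPoint` and `SortedNumericDocValuesField` APIs rather than the actual mapper code):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexOptions;

// Sketch: add one numeric value to a document the way the new mappers do.
// The field name "timestamp" is hypothetical.
static void addValue(Document doc, IndexOptions indexOptions, long value) {
    if (indexOptions != IndexOptions.NONE) {
        // The point field replaces the inverted index as the lookup structure.
        doc.add(new LongPoint("timestamp", value));
    }
    // Doc values are added independently of indexing, for sorting and aggregations.
    doc.add(new SortedNumericDocValuesField("timestamp", value));
}
```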

Known issues that we won't fix:

  • You can't use numeric fields in significant terms aggregations anymore since
    this requires document frequencies, which points do not record.
  • Term queries on numeric fields will now return constant scores instead of
    giving better scores to the rare values.

Known issues that we could work around (in follow-up PRs, this one is too large
already):

  • Range queries on `ip` addresses only work if both the lower and upper bounds
    are inclusive (exclusive bounds are not exposed in Lucene). We could either
    decide to implement it, or drop range support entirely and tell users to
    query subnets using the CIDR notation instead (see the sketch after this
    list).
  • Since IP addresses now use a different representation for doc values,
    aggregations will fail when running a terms aggregation on an ip field on a
    list of indices that contains both pre-5.0 and 5.0 indices.
  • The ip range aggregation does not work on the new ip field. We need to either
    implement range aggs for SORTED_SET doc values or drop support for ip ranges
    and tell users to use filters instead ("IP range aggregation will not work
    anymore once points are integrated", #17700).
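
For reference, the CIDR alternative maps directly onto Lucene's `InetAddressPoint.newPrefixQuery`; a minimal sketch with a hypothetical field name:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.search.Query;

// Matches every address in 10.0.0.0/8. The field name "ip" is hypothetical.
static Query subnetQuery() throws UnknownHostException {
    return InetAddressPoint.newPrefixQuery("ip", InetAddress.getByName("10.0.0.0"), 8);
}
```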

Closes #16751
Closes #17007
Closes #11513

@rmuir (Contributor) commented Apr 14, 2016

Range queries on ip addresses only work if both the lower and upper bounds are inclusive (exclusive bounds are not exposed in Lucene). We could either decide to implement it, or drop range support entirely and tell users to query subnets using the CIDR notation instead.

That's not really true; it's that Lucene doesn't need horrible range APIs with tons of booleans and nulls. If you really want to do this, you can adjust the endpoints with NumericUtils.add() and NumericUtils.subtract() yourself: these work on byte[].

But I would not do this without still deprecating this brokenness in the query DSL. It's just brain damage from the inverted index; this stuff does not make sense for network addresses!
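
To sketch the idea (hand-rolled for illustration, not the NumericUtils helper itself): on a fixed-width, unsigned, big-endian byte[] encoding, turning an exclusive lower bound into an inclusive one is a single increment:

```java
// Illustrative only: increment a fixed-width, unsigned, big-endian byte[]
// by one, e.g. to turn an exclusive lower bound into an inclusive one.
static byte[] nextUp(byte[] value) {
    byte[] result = value.clone();
    for (int i = result.length - 1; i >= 0; --i) {
        if (++result[i] != 0) {
            return result; // no carry needed, done
        }
        // the byte wrapped from 0xFF to 0x00: carry into the next position
    }
    throw new ArithmeticException("overflow: bound was already the maximum value");
}
```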

@rmuir (Contributor) commented Apr 14, 2016

IP addresses now support IPv6 via Lucene's InetAddressPoint and store them in SORTED_SET doc values using the same encoding (fixed length of 16 bytes and sortable).

I do think there is a potential trap here wrt global ordinals. We've got potentially large data sizes using this encoding, and global ordinals are not needed for anything it does, right? Sorting doesn't need it, and e.g. range faceting wouldn't need it (if implemented): it could just lookup range->ordinal per segment up front. Can we avoid building global ordinals for IP fields? :)
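
A rough sketch of that per-segment lookup idea, against the Lucene 6.x SortedSetDocValues API (the field name and the encoded bounds are hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Resolve an encoded value range [lo, hi] to an ordinal range once per
// segment, so counting needs no global ordinals at all.
static long[] resolveRange(LeafReader reader, byte[] lo, byte[] hi) throws IOException {
    SortedSetDocValues dv = reader.getSortedSetDocValues("ip");
    long loOrd = dv.lookupTerm(new BytesRef(lo));
    if (loOrd < 0) {
        loOrd = -1 - loOrd; // lo not found: first ordinal whose term is greater
    }
    long hiOrd = dv.lookupTerm(new BytesRef(hi));
    if (hiOrd < 0) {
        hiOrd = -2 - hiOrd; // hi not found: last ordinal whose term is smaller
    }
    // Per-doc counting then only compares ordinals against [loOrd, hiOrd].
    return new long[] { loOrd, hiOrd };
}
```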

@rmuir (Contributor) commented Apr 14, 2016

I took a pass through and this looks great. I had difficulty navigating these mapper APIs the way they were structured before, and I like how clean the integration now is. It's also good that the legacy* stuff is isolated without being entangled everywhere. I took a look through each of the data types: dates, ip addresses, primitives, and didn't spot any issues.

@jpountz (Contributor, Author) commented Apr 14, 2016

I do think there is a potential trap here wrt global ordinals. We've got potentially large data sizes using this encoding, and global ordinals are not needed for anything it does, right? Sorting doesn't need it, and e.g. range faceting wouldn't need it (if implemented): it could just lookup range->ordinal per segment up front. Can we avoid building global ordinals for IP fields?

We don't have range faceting implemented for SORTED_SET doc values yet (I opened #17700 about whether we should), but if we do it, I agree this should not use global ordinals. Sorting would not use global ordinals; only terms aggregations would (e.g. figuring out the top recurring ip addresses that hit a web page). We could add the ability for terms aggs to not build global ordinals, but terms aggs already have lots of specializations so I am reluctant to add one more. Additionally, the cost of building global ords is the same as merging counts from segments in the end if the query matches most values, so I don't think it's too bad?

@rmuir (Contributor) commented Apr 14, 2016

We don't have range faceting implemented for SORTED_SET doc values yet (I opened #17700 about whether we should), but if we do it, I agree this should not use global ordinals. Sorting would not use global ordinals; only terms aggregations would (e.g. figuring out the top recurring ip addresses that hit a web page). We could add the ability for terms aggs to not build global ordinals, but terms aggs already have lots of specializations so I am reluctant to add one more. Additionally, the cost of building global ords is the same as merging counts from segments in the end if the query matches most values, so I don't think it's too bad?

Yeah, if we want to do a terms agg by frequency, we need them. It is true there will be a merging cost, and this part will be done in uncompressed space, which we can expect to be a turtle if cardinality is high (with Java 9 APIs we can make it better).

I'd hate for global ordinals to become common though for cases where it's unnecessary: e.g. complicated aggregations/processing built off terms agg (demanding global ordinals) all trying to work around the lack of an efficient range faceting if that is really what is wanted.

@jpountz (Contributor, Author) commented Apr 14, 2016

e.g. complicated aggregations/processing built off terms agg (demanding global ordinals) all trying to work around the lack of an efficient range faceting if that is really what is wanted.

If users do not have range faceting, I think the fallback will be the filters aggregation rather than the terms aggregation, which should work pretty well thanks to filter caching if there are not too many ranges and the ranges are always the same (which I think would be common).

@rmuir (Contributor) commented Apr 14, 2016

OK. I don't mean for it to hold up the issue; it is just something to mention since we are using the Sorted type for a "numeric-like" type here, which is different from anything before. But the problem is probably also not unique: if we want to add a BigInteger type, I think we will need the same thing (and we can make even fewer assumptions about it)?

@rmuir (Contributor) commented Apr 14, 2016

This is expected to perform worse for exact queries

Can you explain this statement more? I find it a little odd: Mike has operations like KNN going at ~1500 QPS against large datasets (LUCENE-7069), points have an efficient newSetQuery to replace "termsquery"-type cases, etc. So I'm wondering exactly what operations we expect to be slower?

@jpountz (Contributor, Author) commented Apr 14, 2016

Admittedly I haven't run any benchmark. This assertion was based on the fact that if some values have high document frequencies, then exact queries will have to visit all matching docs with points, while with the inverted index you can leverage the skip lists to skip over documents that do not match other required clauses. I can remove it if you think this is confusing.

@rmuir (Contributor) commented Apr 14, 2016

I definitely think we should remove it. I think it's better not to make assumptions here: it's a different data structure and the rules are different. So far we have only been able to make everything faster.

@rmuir (Contributor) commented Apr 14, 2016

Also, it is true that in some extreme cases (a massive boolean AND of high-cardinality features) things could conceptually be slower than skip-list intersection. But in many such cases (e.g. an AND of latitude and longitude), multiple dimensions may be the better/faster solution anyway. I definitely agree we should not go there here, but it is possible with these data types in Lucene; we should keep it in mind.
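
(As a concrete instance of the multi-dimensional alternative, a sketch with Lucene's sandbox LatLonPoint; the field name and bounding box are made up:)

```java
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.search.Query;

// One 2-D box query instead of an AND of two 1-D range queries on latitude
// and longitude. The field name "location" and the box are hypothetical.
static Query boxQuery() {
    return LatLonPoint.newBoxQuery("location",
            40.5, 40.9,    // min/max latitude
            -74.3, -73.7); // min/max longitude
}
```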

For most common uses like date ranges, I think people will only see points as faster, so we shouldn't set ourselves up for failure.

@mikemccand (Contributor) commented

Wow, this change looks wonderful, thank you @jpountz! You didn't hit any new bugs in points? :)

We can also now support larger integers (the equivalent of a 128-bit "long long" is available in Lucene's sandbox BigIntegerPoint), but in a separate issue: #17006.
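
(A minimal sketch of that sandbox API, with a hypothetical field name:)

```java
import java.math.BigInteger;
import org.apache.lucene.document.BigIntegerPoint; // Lucene sandbox module
import org.apache.lucene.document.Document;

// Index a 128-bit integer as a point. The field name "big" is hypothetical.
static void addBig(Document doc) {
    doc.add(new BigIntegerPoint("big",
            new BigInteger("170141183460469231731687303715884105727"))); // 2^127 - 1
}
```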

I've also wondered about exact/set points performance vs TermQuery/TermsQuery but haven't run any tests yet...

@rjernst (Member) commented Apr 14, 2016

LGTM

@jpountz (Contributor, Author) commented Apr 14, 2016

You didn't hit any new bugs in points?

Fortunately no, which is good news since the release is out already. :)

Thanks all for having a look. I will merge shortly.
