
Use the new points API to index numeric fields. #17746

Merged (1 commit) on Apr 14, 2016

Conversation

@jpountz (Contributor) commented Apr 14, 2016

This makes all numeric fields including `date`, `ip` and `token_count` use
points instead of the inverted index as a lookup structure. This is expected
to perform worse for exact queries, but faster for range queries. It also
requires less storage.

Notes about how the change works:

  • Numeric mappers have been split into a legacy version that is essentially
    the current mapper, and a new version that uses points, e.g.
    LegacyDateFieldMapper and DateFieldMapper.
  • Since new and old fields have the same names, the decision about which one
    to use is made based on the index creation version.
  • If you try to force using a legacy field on a new index or a field that uses
    points on an old index, you will get an exception.
  • IP addresses now support IPv6 via Lucene's InetAddressPoint and store them
    in SORTED_SET doc values using the same encoding (fixed length of 16 bytes
    and sortable).
  • The internal MappedFieldType that is stored by the new mappers does not have
    any of the points-related properties set. Instead, it keeps setting the index
    options when parsing the `index` property of mappings and does
    `if (fieldType.indexOptions() != IndexOptions.NONE) { // add point field }`
    when parsing documents (see the sketch just after this list).
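
For illustration, here is a minimal sketch of that parsing step (the field name and value are hypothetical, and it uses the plain Lucene 6.x `LongPoint` and `SortedNumericDocValuesField` APIs rather than the actual mapper code):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.SortedNumericDocValuesField;
import org.apache.lucene.index.IndexOptions;

// Sketch: add one numeric value to a document the way the new mappers do.
// The field name "timestamp" is hypothetical.
static void addValue(Document doc, IndexOptions indexOptions, long value) {
    if (indexOptions != IndexOptions.NONE) {
        // The point field replaces the inverted index as the lookup structure.
        doc.add(new LongPoint("timestamp", value));
    }
    // Doc values are added independently of indexing, for sorting and aggregations.
    doc.add(new SortedNumericDocValuesField("timestamp", value));
}
```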

Known issues that we won't fix:

  • You can't use numeric fields in significant terms aggregations anymore since
    this requires document frequencies, which points do not record.
  • Term queries on numeric fields will now return constant scores instead of
    giving better scores to the rare values.

Known issues that we could work around (in follow-up PRs, this one is too large
already):

  • Range queries on `ip` addresses only work if both the lower and upper bounds
    are inclusive (exclusive bounds are not exposed in Lucene). We could either
    decide to implement it, or drop range support entirely and tell users to
    query subnets using the CIDR notation instead (see the sketch after this
    list).
  • Since IP addresses now use a different representation for doc values,
    aggregations will fail when running a terms aggregation on an ip field on a
    list of indices that contains both pre-5.0 and 5.0 indices.
  • The ip range aggregation does not work on the new ip field. We need to either
    implement range aggs for SORTED_SET doc values or drop support for ip ranges
    and tell users to use filters instead ("IP range aggregation will not work
    anymore once points are integrated", #17700).
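
For reference, the CIDR alternative maps directly onto Lucene's `InetAddressPoint.newPrefixQuery`; a minimal sketch with a hypothetical field name:

```java
import java.net.InetAddress;
import java.net.UnknownHostException;
import org.apache.lucene.document.InetAddressPoint;
import org.apache.lucene.search.Query;

// Matches every address in 10.0.0.0/8. The field name "ip" is hypothetical.
static Query subnetQuery() throws UnknownHostException {
    return InetAddressPoint.newPrefixQuery("ip", InetAddress.getByName("10.0.0.0"), 8);
}
```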

Closes #16751
Closes #17007
Closes #11513

@rmuir (Contributor) commented Apr 14, 2016

Range queries on ip addresses only work if both the lower and upper bounds are inclusive (exclusive bounds are not exposed in Lucene). We could either decide to implement it, or drop range support entirely and tell users to query subnets using the CIDR notation instead.

That's not really true; it's that Lucene doesn't need horrible range APIs with tons of booleans and nulls. If you really want to do this, you can adjust the endpoints with NumericUtils.add() and NumericUtils.subtract() yourself: these work on byte[].

But I would not do this without still deprecating this brokenness in the query DSL. It's just brain damage from the inverted index; this stuff does not make sense for network addresses!
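
To sketch the idea (hand-rolled for illustration, not the NumericUtils helper itself): on a fixed-width, unsigned, big-endian byte[] encoding, turning an exclusive lower bound into an inclusive one is a single increment:

```java
// Illustrative only: increment a fixed-width, unsigned, big-endian byte[]
// by one, e.g. to turn an exclusive lower bound into an inclusive one.
static byte[] nextUp(byte[] value) {
    byte[] result = value.clone();
    for (int i = result.length - 1; i >= 0; --i) {
        if (++result[i] != 0) {
            return result; // no carry needed, done
        }
        // the byte wrapped from 0xFF to 0x00: carry into the next position
    }
    throw new ArithmeticException("overflow: bound was already the maximum value");
}
```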

@rmuir (Contributor) commented Apr 14, 2016

IP addresses now support IPv6 via Lucene's InetAddressPoint and store them in SORTED_SET doc values using the same encoding (fixed length of 16 bytes and sortable).

I do think there is a potential trap here wrt global ordinals. We've got potentially large data sizes using this encoding, and global ordinals are not needed for anything it does, right? Sorting doesn't need it, and e.g. range faceting wouldn't need it (if implemented): it could just lookup range->ordinal per segment up front. Can we avoid building global ordinals for IP fields? :)
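
A rough sketch of that per-segment lookup idea, against the Lucene 6.x SortedSetDocValues API (the field name and the encoded bounds are hypothetical):

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.SortedSetDocValues;
import org.apache.lucene.util.BytesRef;

// Resolve an encoded value range [lo, hi] to an ordinal range once per
// segment, so counting needs no global ordinals at all.
static long[] resolveRange(LeafReader reader, byte[] lo, byte[] hi) throws IOException {
    SortedSetDocValues dv = reader.getSortedSetDocValues("ip");
    long loOrd = dv.lookupTerm(new BytesRef(lo));
    if (loOrd < 0) {
        loOrd = -1 - loOrd; // lo not found: first ordinal whose term is greater
    }
    long hiOrd = dv.lookupTerm(new BytesRef(hi));
    if (hiOrd < 0) {
        hiOrd = -2 - hiOrd; // hi not found: last ordinal whose term is smaller
    }
    // Per-doc counting then only compares ordinals against [loOrd, hiOrd].
    return new long[] { loOrd, hiOrd };
}
```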

@rmuir (Contributor) commented Apr 14, 2016

I took a pass through and this looks great. I had difficulty navigating these mapper APIs the way they were structured before, and I like how clean the integration now is. It's also good that the legacy* stuff is isolated without being entangled everywhere. I took a look through each of the data types: dates, ip addresses, primitives, and didn't spot any issues.

@jpountz (Contributor, Author) commented Apr 14, 2016

I do think there is a potential trap here wrt global ordinals. We've got potentially large data sizes using this encoding, and global ordinals are not needed for anything it does, right? Sorting doesn't need it, and e.g. range faceting wouldn't need it (if implemented): it could just lookup range->ordinal per segment up front. Can we avoid building global ordinals for IP fields?

We don't have range faceting implemented for SORTED_SET doc values yet (I opened #17700 about whether we should), but if we do it, I agree this should not use global ordinals. Sorting would not use global ordinals; only terms aggregations would (e.g. figuring out the top recurring ip addresses that hit a web page). We could add the ability for terms aggs to not build global ordinals, but terms aggs already have lots of specializations so I am reluctant to add one more. Additionally, the cost of building global ords is the same as merging counts from segments in the end if the query matches most values, so I don't think it's too bad?

@rmuir (Contributor) commented Apr 14, 2016

We don't have range faceting implemented for SORTED_SET doc values yet (I opened #17700 about whether we should), but if we do it, I agree this should not use global ordinals. Sorting would not use global ordinals; only terms aggregations would (e.g. figuring out the top recurring ip addresses that hit a web page). We could add the ability for terms aggs to not build global ordinals, but terms aggs already have lots of specializations so I am reluctant to add one more. Additionally, the cost of building global ords is the same as merging counts from segments in the end if the query matches most values, so I don't think it's too bad?

Yeah, if we want to do a terms agg by frequency, we need them. It is true there will be a merging cost, and this part will be done in uncompressed space, which we can expect to be a turtle if cardinality is high (with Java 9 APIs we can make it better).

I'd hate for global ordinals to become common though for cases where it's unnecessary: e.g. complicated aggregations/processing built off terms agg (demanding global ordinals) all trying to work around the lack of an efficient range faceting if that is really what is wanted.

@jpountz (Contributor, Author) commented Apr 14, 2016

e.g. complicated aggregations/processing built off terms agg (demanding global ordinals) all trying to work around the lack of an efficient range faceting if that is really what is wanted.

If users do not have range faceting, I think the fallback will be the filters aggregation rather than the terms aggregation, which should work pretty well thanks to filter caching if there are not too many ranges and the ranges are always the same (which I think would be common).

@rmuir (Contributor) commented Apr 14, 2016

OK. I don't mean for it to hold up the issue; it is just something to mention since we are using the Sorted type for a "numeric-like" type here, which is different from anything before. But the problem is probably also not unique: if we want to add a BigInteger type, I think we will need the same thing (and we can make even fewer assumptions about it)?

@rmuir (Contributor) commented Apr 14, 2016

This is expected to perform worse for exact queries

Can you explain this statement more? I find it a little odd: Mike has operations like KNN going at ~1500 QPS against large datasets (LUCENE-7069), points have an efficient newSetQuery to replace "termsquery"-type cases, etc. So I'm wondering exactly what operations we expect to be slower?

@jpountz (Contributor, Author) commented Apr 14, 2016

Admittedly I haven't run any benchmark. This assertion was based on the fact that if some values have high document frequencies, then exact queries will have to visit all matching docs with points, while with the inverted index you can leverage the skip lists to skip over documents that do not match other required clauses. I can remove it if you think this is confusing.

@rmuir (Contributor) commented Apr 14, 2016

I definitely think we should remove it. I think it's better not to make assumptions here: it's a different data structure and the rules are different. So far we have only been able to make everything faster.

@rmuir (Contributor) commented Apr 14, 2016

Also, it is true that in some extreme cases (a massive boolean AND of high-cardinality features) things could conceptually be slower than skip-list intersection. But in many such cases (e.g. an AND of latitude and longitude), multiple dimensions may be the better/faster solution anyway. I definitely agree we should not go there here, but it is possible with these data types in Lucene; we should keep it in mind.
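
(As a concrete instance of the multi-dimensional alternative, a sketch with Lucene's sandbox LatLonPoint; the field name and bounding box are made up:)

```java
import org.apache.lucene.document.LatLonPoint;
import org.apache.lucene.search.Query;

// One 2-D box query instead of an AND of two 1-D range queries on latitude
// and longitude. The field name "location" and the box are hypothetical.
static Query boxQuery() {
    return LatLonPoint.newBoxQuery("location",
            40.5, 40.9,    // min/max latitude
            -74.3, -73.7); // min/max longitude
}
```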

For most common uses like date ranges, I think people will only see points as faster, so we shouldn't set ourselves up for failure.

@mikemccand (Contributor) commented

Wow, this change looks wonderful, thank you @jpountz! You didn't hit any new bugs in points? :)

We can also now support larger integers (the equivalent of a 128-bit "long long" is available in Lucene's sandbox BigIntegerPoint), but in a separate issue: #17006.
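
(A minimal sketch of that sandbox API, with a hypothetical field name:)

```java
import java.math.BigInteger;
import org.apache.lucene.document.BigIntegerPoint; // Lucene sandbox module
import org.apache.lucene.document.Document;

// Index a 128-bit integer as a point. The field name "big" is hypothetical.
static void addBig(Document doc) {
    doc.add(new BigIntegerPoint("big",
            new BigInteger("170141183460469231731687303715884105727"))); // 2^127 - 1
}
```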

I've also wondered about exact/set points performance vs TermQuery/TermsQuery but haven't run any tests yet...

@rjernst (Member) commented Apr 14, 2016

LGTM

@jpountz (Contributor, Author) commented Apr 14, 2016

You didn't hit any new bugs in points?

Fortunately no, which is good news since the release is out already. :)

Thanks all for having a look. I will merge shortly.
