Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support for IPv6 #17007

Closed
clintongormley opened this issue Mar 8, 2016 · 13 comments · Fixed by #17746
Closed

Support for IPv6 #17007

clintongormley opened this issue Mar 8, 2016 · 13 comments · Fixed by #17746
Labels
discuss >feature :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@clintongormley
Copy link
Contributor

Lucene now has sandbox support for IPv6 (LUCENE-7043).

InetAddressPoint is 1-dimensional by nature: of course you can have multiple values per field, thats different. Since we prefix-compress values, we can just map IPv4 addresses to IPv6 space and it works for both types. This is consistent with what InetAddress does itself anyway.

We should look at what needs to be done to support IPv6 in Elasticsearch. We should add a new ipv6 field which handles both IPv4 and IPv6.

The only problem I can see is that users might expect to get an IPv4 address back (eg 192.168.1.5) and instead it'd be returned as 0:0:0:0:0:ffff:c0a8:105. Perhaps we should have an ipv4 field as well that handles only IPv4 addresses, or add some option to render as IPv4 when possible?

@clintongormley clintongormley added >feature discuss :Search Foundations/Mapping Index mappings, including merging and defining field types labels Mar 8, 2016
@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

The only problem I can see is that users might expect to get an IPv4 address back (eg 192.168.1.5) and instead it'd be returned as 0:0:0:0:0:ffff:c0a8:105. Perhaps we should have an ipv4 field as well that handles only IPv4 addresses, or add some option to render as IPv4 when possible?

That is not true. This is only something that happens internally in a BKD tree. Users do not see it. You do not "get things back" from that API.

Also note: if you follow the RFC correctly, then calling https://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#getByAddress%28byte[]%29 with a 128-bit mapped IPv4 address returns back a 32-bit Inet4Address: it does this round tripping for you. That is why all Query.toString()'s and so on work intuitively with the lucene support.

@jpountz
Copy link
Contributor

jpountz commented Mar 8, 2016

While we don't have this issue with the bkd tree, I think it's something we will have to think about for doc values, eg. for computing the top ip addresses that hit a web server.

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

its really not at all. just encode the same way, and use InetAddress to decode for any presentation form.

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

Its this part that is super-important: calling https://docs.oracle.com/javase/7/docs/api/java/net/InetAddress.html#getByAddress%28byte[]%29 with a 128-bit mapped IPv4 address returns back a 32-bit Inet4Address: it does this round tripping for you

so if you want to display to the user in any way, make an InetAddress from the bytes. it does the right thing.

Please don't make ipv4 vs ipv6 types or any other craziness when this all works exactly as the user expects using basic java apis!

@jpountz
Copy link
Contributor

jpountz commented Mar 8, 2016

I agree there should not be two types and I'm not concerned about the Java API. The question was more about what we do on the rest layer eg. is it ok for aggs to render ip addresses differently from the way they were formatted at index time. I think it is, but it did not happen with the v4-only field so I think it's also important that we raise the question and document the behaviour.

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

I don't think you guys get how the transition mechanism works inside the java InetAddress class. I encourage you to play with it.

@clintongormley
Copy link
Contributor Author

Please don't make ipv4 vs ipv6 types or any other craziness when this all works exactly as the user expects using basic java apis!

Relax, Rob. The only reason I suggested it was because I was unaware of the information that you have provided. Thanks for the link.

Isn't it true though that if a user indexes an IPv4 address in IPv6 format, doc values will return the address in IPv4 format? (ie not what they indexed?).

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

If someone supplies an ipv4 address (mapped or not), its an ipv4 address. thats what that reserved space in the ipv4 region is for. If someone supplies an ipv6 address, its an ipv6 address.

note that ipv6 addresses can be represented in many ways. i recommend emitting canonical representations. So if someone supplies 0:0:0:0:0:0:0:1 they can expect to get it back as ::1.

Its always "what they indexed". its the same address. Its like returning 16.0 vs returning 16.

Please, lets not try to overcomplicate this. Its not complicated.

@clintongormley
Copy link
Contributor Author

Please, lets not try to overcomplicate this. Its not complicated.

I wasn't trying to overcomplicate it, I was trying to understand it. Thank you for explaining.

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

as far as docvalues, i would just use SortedSetDocValues class for internet addresses. This is critical for ipv6 data because (like the BKD tree) it will provide prefix compression.

For network addresses, "range" operations are a common operation (e.g. cidr/prefix operations), so having ordinal indirection is not useless, and can speed things up.

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

And for that to work, docvalues needs a little tweak before the compression will happen: we currently always 'bail' on prefix-compression for fixed-length data: https://github.com/apache/lucene-solr/blob/master/lucene/core/src/java/org/apache/lucene/codecs/lucene54/Lucene54DocValuesConsumer.java#L423-L425

that is not always the best tradeoff...

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

We may even want to turn that logic off completely. basically that logic was originally added to "save space" (= save on addresses) before we had prefix compression at all...

Unfortunately, we really need to solve this to avoid "my ip address field quadrupled in size!"

@rmuir
Copy link
Contributor

rmuir commented Mar 8, 2016

jpountz added a commit to jpountz/elasticsearch that referenced this issue Apr 14, 2016
This makes all numeric fields including `date`, `ip` and `token_count` use
points instead of the inverted index as a lookup structure. This is expected
to perform worse for exact queries, but faster for range queries. It also
requires less storage.

Notes about how the change works:
 - Numeric mappers have been split into a legacy version that is essentially
   the current mapper, and a new version that uses points, eg.
   LegacyDateFieldMapper and DateFieldMapper.
 - Since new and old fields have the same names, the decision about which one
   to use is made based on the index creation version.
 - If you try to force using a legacy field on a new index or a field that uses
   points on an old index, you will get an exception.
 - IP addresses now support IPv6 via Lucene's InetAddressPoint and store them
   in SORTED_SET doc values using the same encoding (fixed length of 16 bytes
   and sortable).
 - The internal MappedFieldType that is stored by the new mappers does not have
   any of the points-related properties set. Instead, it keeps setting the index
   options when parsing the `index` property of mappings and does
   `if (fieldType.indexOptions() != IndexOptions.NONE) { // add point field }`
   when parsing documents.

Known issues that won't fix:
 - You can't use numeric fields in significant terms aggregations anymore since
   this requires document frequencies, which points do not record.
 - Term queries on numeric fields will now return constant scores instead of
   giving better scores to the rare values.

Known issues that we could work around (in follow-up PRs, this one is too large
already):
 - Range queries on `ip` addresses only work if both the lower and upper bounds
   are inclusive (exclusive bounds are not exposed in Lucene). We could either
   decide to implement it, or drop range support entirely and tell users to
   query subnets using the CIDR notation instead.
 - Since IP addresses now use a different representation for doc values,
   aggregations will fail when running a terms aggregation on an ip field on a
   list of indices that contains both pre-5.0 and 5.0 indices.
 - The ip range aggregation does not work on the new ip field. We need to either
   implement range aggs for SORTED_SET doc values or drop support for ip ranges
   and tell users to use filters instead. elastic#17700

Closes elastic#16751
Closes elastic#17007
Closes elastic#11513
@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discuss >feature :Search Foundations/Mapping Index mappings, including merging and defining field types Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants