Support for IPv6 mapping type #3714

Closed
bodgit opened this issue Sep 17, 2013 · 42 comments
Labels
>feature high hanging fruit :Search Foundations/Mapping Index mappings, including merging and defining field types stalled Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch

Comments

@bodgit

bodgit commented Sep 17, 2013

Currently I can't use the ip mapping type as I have fields that can be either IPv4 or IPv6. However, being able to use range queries is really useful but I can't make use of them because I have to treat the field as a string to handle the case when the field contains an IPv6 value.

Obviously this causes extra hassle as storage would then require 128 bits and when searching, range queries using IPv6 addresses shouldn't match IPv4 addresses, unless you're using the ::ffff:d.d.d.d notation, and IPv6 addresses shouldn't match IPv4 range queries at all.

(I found this thread when this has been raised previously)

@abh
Contributor

abh commented Nov 26, 2013

I'd like to convert a postgresql based application to use ES but got hung up on missing this feature, too. The queries are using netmasks/cidrs so just having the IPv6 address as a string won't be "good enough".

@dadoonet
Member

For IP V6, just mark your field as not_analyzed in mapping.

@abh
Contributor

abh commented Nov 26, 2013

@dadoonet That doesn't make any sense.

@dadoonet
Member

Do you mean that you don't understand my answer, or that my answer does not answer your question?

@abh
Contributor

abh commented Nov 26, 2013

@dadoonet What's the point of the "ip type" if a reasonable answer to supporting IPv6 is "just make it a not analyzed string"? They're not the same thing, I'd hope.

@dadoonet
Member

IP type is only for IP v4. Type name should be ipv4 instead of ip.
For ipv6 I don't think a special type is needed. Keeping ipv6 as non tokenized string should do the job.

How do you expect IPv6 content to be converted?

@abh
Contributor

abh commented Nov 26, 2013

It could be converted to a number, for instance, and then allow range searches etc similar to the "ipv4" type.

Better yet the "ip type" should "just work" for both (similar to what postgresql does, for example).
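
For readers following along, here is a minimal sketch of the "convert to a number, then range search" idea, using Python's standard `ipaddress` module. The helper names `cidr_to_int_range` and `in_range` are hypothetical and only illustrate the arithmetic; this is not how Elasticsearch implements anything.

```python
import ipaddress
from typing import Tuple

def cidr_to_int_range(cidr: str) -> Tuple[int, int]:
    """Return the inclusive (low, high) integer bounds of a CIDR block."""
    net = ipaddress.ip_network(cidr, strict=False)
    return int(net.network_address), int(net.broadcast_address)

def in_range(address: str, cidr: str) -> bool:
    """Range check on integers; address families are never mixed."""
    addr = ipaddress.ip_address(address)
    net = ipaddress.ip_network(cidr, strict=False)
    if addr.version != net.version:
        return False  # an IPv4 address never matches an IPv6 range, and vice versa
    low, high = cidr_to_int_range(cidr)
    return low <= int(addr) <= high
```

The same comparison works for both families because each address is a plain integer (32 bits for IPv4, 128 bits for IPv6); only the width differs.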

@bodgit
Author

bodgit commented Nov 26, 2013

There are ways of expressing IPv6 addresses that would likely fail a simple string-based match, the whole '::' expansion for one.
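
To illustrate the point, a short sketch with Python's standard `ipaddress` module: several spellings of one address compare unequal as strings but equal once parsed. The helper name `canonical` is hypothetical.

```python
import ipaddress

def canonical(text: str) -> str:
    """Parse an address and return its compressed textual form."""
    return ipaddress.ip_address(text).compressed

# Three spellings of the same IPv6 address.
spellings = [
    "2001:0DB8:0000:0000:0000:0000:0000:0001",
    "2001:db8:0:0::1",
    "2001:db8::1",
]

# As raw strings they are all different, so an exact string match fails.
distinct_raw = len(set(spellings))
# After parsing they collapse to a single canonical form.
distinct_parsed = len({canonical(s) for s in spellings})
```

A not_analyzed string field would only behave well if every producer normalised addresses like this before indexing and before querying.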

@dadoonet
Member

@bodgit Very good point! I'm going to think about it a bit more.

@ghost ghost assigned dadoonet Nov 27, 2013
@lifo101

lifo101 commented Dec 17, 2013

I have an app that stores iPv4/6 addresses as DECIMAL(39,0) in a mysql database which allows for very easy range searching. I wait for the day when ES will support something similar for IPv6 so I can finally use ES for indexing my database.

@jvbrandis

When storing IPv6 addresses, I store it as a "fully formatted" IPv6-string, i.e. XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX:XXXX, regardless of zeros (so, never shortening a segment to less than four digits, and never shortcutting segments with ::).
This way, all IPv6 addresses are fully sortable and searchable, so this should work with ES (using not-analyzed mapping). However, it is very space-consuming when comparing to what an IPv6 address really is, which is 16 bytes (while this becomes 40 bytes...)
Also, if putting IPv4-addresses into this mix, sorting/filtering on a range will lead to problems mixing IPv4 and IPv6.
(This could be solved by using the IPv4-mapped format of IPv6, that is, all IPv4 addresses are stored as IPv6 in the form ::FFFF:XXXX:XXXX, the last four bytes being the IPv4 address.)
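
A minimal sketch of the fully expanded, fixed-width string form described above, with IPv4 addresses stored in IPv4-mapped form. The helper name `sortable_key` is hypothetical; it uses lowercase hex, which sorts the same way as long as every key uses the same case.

```python
import ipaddress

V4_MAPPED_BASE = 0xFFFF << 32  # the ::ffff:0:0 prefix as an integer

def sortable_key(text: str) -> str:
    """Return a 39-character, zero-padded form whose string order is numeric order."""
    addr = ipaddress.ip_address(text)
    value = int(addr)
    if addr.version == 4:
        value |= V4_MAPPED_BASE  # store IPv4 as ::ffff:a.b.c.d
    digits = format(value, "032x")  # always 32 hex digits
    return ":".join(digits[i:i + 4] for i in range(0, 32, 4))

addresses = ["2001:db8::10", "2001:db8::9", "192.0.2.1"]
by_key = sorted(addresses, key=sortable_key)  # numeric order
by_text = sorted(addresses)                   # plain text order, for comparison
```

Sorting the raw text puts `2001:db8::10` before `2001:db8::9`, while the padded keys keep them in numeric order.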

The other approach is to store as a binary field, using 16 bytes, (still storing IPv4-addresses in IPv4-mapped IPv6-format). My approach to this in mysql is actually a BINARY(16) column.
However, this is inconvenient as manually browsing/inspecting the data becomes cumbersome.
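
The 16-byte binary variant can be sketched the same way. The helper names `to_binary16` and `from_binary16` are hypothetical; the second one addresses the readability complaint by turning the stored bytes back into text.

```python
import ipaddress

def to_binary16(text: str) -> bytes:
    """Pack any IP address into 16 bytes, IPv4 in IPv4-mapped form."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        # 80 zero bits, 16 one bits, then the 4 IPv4 bytes
        return b"\x00" * 10 + b"\xff\xff" + addr.packed
    return addr.packed

def from_binary16(raw: bytes) -> str:
    """Render stored bytes for display; mapped addresses are shown as IPv4.

    Note: an address that was originally written as ::ffff:a.b.c.d is also
    shown as plain IPv4, because the stored bytes are identical.
    """
    addr = ipaddress.IPv6Address(raw)
    mapped = addr.ipv4_mapped
    return str(mapped) if mapped is not None else str(addr)
```

Byte strings compare lexicographically, so the packed values sort in the same order as the underlying 128-bit integers.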

So; I am also eagerly awaiting ES support for IPv6, storing IPv6-addresses in numeric format, but with support for properly displaying them and accepting query parameters in IP-format.

@cpdean

cpdean commented Mar 28, 2014

+1. would like this feature

@ioc32

ioc32 commented Mar 29, 2014

As @abh pointed out the fact ES is not currently supporting both protocols equally is a show stopper for many applications - to be ported or to be implemented from scratch.

ES is pretty much becoming a de facto standard when it comes to scalable event storage, search and analysis. In my particular use case, and I do not think I am the only one here, I deal with IPv4 just as much as with IPv6. Having both address families under a single, coherent data type is to be desired. Mappings, queries, indexing... would become unified and consequently easier to use for everyone.

@dadoonet I wonder what's the reason for ES to support IPv4-only data types in the first place. Was it a technical decision due to implementation difficulties, was it a matter of priorities? Or, on the other hand, was it a consequence of you guys perceiving ES users did not care about IPv6? Is it at least on your roadmap?

@dadoonet
Member

@ioc32 It is on my TODO list for sure! I need to find some quiet time to work on it.

@ioc32

ioc32 commented Mar 29, 2014

@dadoonet great! Thank you for updating us!

@kimchy
Member

kimchy commented Mar 29, 2014

the reason is simple, ipv4 can easily be translated to 64bit long, which supports range constructs, ipv6 is more complex.

@cpdean

cpdean commented Mar 30, 2014

definitely looking forward to this. It'll really round out the ELK stack for feature complete network analysis. thanks!

@kimchy
Member

kimchy commented Mar 31, 2014

understood, though for now, if you can get around with prefix checks, you can map the IP as string.
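
A sketch of what a prefix check on a string field can and cannot express, assuming addresses are indexed in fully expanded lowercase form. The helper names `exploded` and `string_prefix` are hypothetical. Only networks whose prefix length is a multiple of 4 bits map onto a string prefix, because each hex digit covers 4 bits.

```python
import ipaddress

def exploded(text: str) -> str:
    """Fully expanded, zero-padded lowercase form of an IPv6 address."""
    digits = format(int(ipaddress.IPv6Address(text)), "032x")
    return ":".join(digits[i:i + 4] for i in range(0, 32, 4))

def string_prefix(cidr: str) -> str:
    """Return the string prefix shared by every expanded address in the network."""
    net = ipaddress.IPv6Network(cidr, strict=False)
    if net.prefixlen % 4 != 0:
        raise ValueError("only nibble-aligned prefix lengths map to a string prefix")
    digits = format(int(net.network_address), "032x")[: net.prefixlen // 4]
    return ":".join(digits[i:i + 4] for i in range(0, len(digits), 4))

prefix = string_prefix("2001:db8::/32")
inside = exploded("2001:db8:1::5").startswith(prefix)
outside = exploded("2001:db9::5").startswith(prefix)
```

A network such as a /33 has no string prefix at all, which is one reason prefix checks are only a stopgap for real range queries.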

@zachfi

zachfi commented Apr 1, 2014

+1 to defending @dadoonet's quiet time. I'd love to see this happen.

@Dunaeth

Dunaeth commented Apr 3, 2014

Wouldn't it be possible to use fixed-length Lucene binary field types for IPs and use binary sorting (I read about binary UTF-8 sorting in Lucene, but I lack some skills on the subject)?

@jpountz
Contributor

jpountz commented Apr 3, 2014

It is indeed possible to encode IPv6 addresses as binary fields: Lucene doesn't require index terms to be UTF-8 sequences, they can be anything. The challenge here is more that for IPs, we need to support efficient ranges because that's typically how these fields are filtered. Lucene provides support for efficient ranges with numeric fields (see NumericRangeQuery): basically every field gets indexed with different precision levels, and this allows range queries to visit few terms no matter how large the range is (the fewer terms are visited, the more efficient queries are). So we would need a similar mechanism for storing IPv6 addresses.

@seti123

seti123 commented Apr 3, 2014

+1

@clintongormley
Contributor

@avleen

avleen commented Jul 14, 2015

I see we're still blocked on Lucene's support for BigInt for this. But that ticket hasn't seen any action in a while either.
Any updates for this @clintongormley? IPv6 is becoming a real thing, so this would be really handy :-)

@zachfi

zachfi commented Jul 15, 2015

It's a 14-year-old protocol. We're well beyond 'real thing' :)

@jpountz
Contributor

jpountz commented Jul 15, 2015

The Lucene issue is stalled indeed, as it proved very hard to integrate... The feature is currently exposed as an experimental postings format which is not supported in terms of backward compatibility.

With small numbers (up to 64 bits) today we have static pre-computed ranges, which is probably fine. For instance for ints (32 bits) we have a default precision step of 8 bits which means that we pre-compute ranges for all numbers that have the same 24, 16 or 8 upper bits (0-256, 256-512, 512-768, ..., 0-65536, 65536-131072, 131072-196608, ..., 0-16777216, 16777216-33554432, 33554432-50331648, ...). Any arbitrary range can be translated to a union of these pre-computed ranges, and this is the way we manage to have fast ranges on numerics.
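
The "union of pre-computed ranges" idea can be sketched as follows. This is a simplified greedy illustration, not Lucene's actual splitting code; the function name `split_range` and its `(shift, prefix)` output format are assumptions made for the example. Each returned pair stands for the aligned block of values that share the same upper bits once the lowest `shift` bits are dropped.

```python
def split_range(lo: int, hi: int, bits: int = 32, step: int = 8):
    """Cover the inclusive range [lo, hi] with aligned blocks.

    Block sizes are 2**shift for shift in 0, step, 2*step, ... up to
    bits - step, mirroring the precision levels indexed per value.
    Returns a list of (shift, prefix) pairs; a pair covers the values
    prefix << shift through ((prefix + 1) << shift) - 1.
    """
    blocks = []
    while lo <= hi:
        shift = 0
        # Grow the block while it stays aligned and inside the range.
        while shift + step < bits:
            size = 1 << (shift + step)
            if lo % size != 0 or lo + size - 1 > hi:
                break
            shift += step
        blocks.append((shift, lo >> shift))
        lo += 1 << shift
    return blocks

blocks = split_range(250, 770)
covered = sum(1 << shift for shift, _ in blocks)
```

For the range 250 to 770 this yields 11 blocks instead of 521 individual terms: six single values (250 to 255), two 256-wide blocks (256 to 767), and three single values (768 to 770).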

With high numbers of bits, like 128 here, the space-time trade-off becomes tricky I think. For instance with a precision step of 16, we would have to index 8 tokens per value while range queries would still visit hundreds of thousands of terms in the worst-case.
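
The trade-off can be put into numbers with a small sketch. The token count per value follows directly from the comment (value width divided by precision step). The worst-case term count uses the bound commonly quoted for trie-encoded numeric ranges, `(levels - 1) * (2**step - 1) * 2 + (2**step - 1)`; treat that formula as an assumption here, since the thread itself does not state it. Both function names are hypothetical.

```python
def tokens_per_value(bits: int, step: int) -> int:
    """Number of terms indexed per value: one per precision level."""
    return -(-bits // step)  # ceiling division

def worst_case_terms(bits: int, step: int) -> int:
    """Assumed upper bound on terms visited by a single range query."""
    levels = tokens_per_value(bits, step)
    per_level = (1 << step) - 1
    return (levels - 1) * per_level * 2 + per_level

ipv6_tokens = tokens_per_value(128, 16)
ipv6_worst = worst_case_terms(128, 16)
int_worst = worst_case_terms(32, 8)
```

Under that assumption, a 128-bit value with a 16-bit step indexes 8 tokens and can visit up to 983,025 terms, compared with 1,785 for a 32-bit int with the default 8-bit step.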

Given that ipv6 addresses tend to use the lower bytes less, maybe that would be fine, but I'm a bit reluctant to expose a new field type for ipv6 addresses that would not perform well for range queries. An option could be to have a new type for ipv6 addresses that would only support sorting and aggs but not queries, however I'm not sure how useful it would be?

@avleen

avleen commented Jul 16, 2015

Agreed.
/64's are the smallest allocations that are generally given out, so
searching for a range may not (initially) need more precision than that.
If we see an IPv6 address, we can store the range of the /64 it is in, and
then work up from there?
/64, /32, /16, /8, /4, /2, /1, /0
That's 8 bits there, and from a practical perspective it might be
sufficient. Most end users get a /64, which makes searching in that easy.
ISPs get at least /32 sized blocks.

Most IPv6 addresses allocated today, when converted to decimal are about 38
bytes. That means 76 bytes (upper and lower bounds) to store each range. So
about 600 bytes of storage required for the precision, per address, in
addition to the ~38 bytes for the address itself.. That's quite a lot, but
that's really just the way it is - we can't make these numbers smaller ;-)

If we restrict range searches to at least /64, could this then work out?
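
A sketch of the proposal above, using Python's standard `ipaddress` module: for one address, compute the enclosing network at each listed prefix length together with its integer bounds. The names `PREFIX_LEVELS`, `enclosing_networks` and `bounds` are hypothetical, and this only illustrates the arithmetic, not any index format.

```python
import ipaddress

# The prefix lengths listed in the comment above.
PREFIX_LEVELS = (64, 32, 16, 8, 4, 2, 1, 0)

def enclosing_networks(text: str, levels=PREFIX_LEVELS):
    """Return the network containing the address at each prefix length."""
    addr = ipaddress.IPv6Address(text)
    return [ipaddress.IPv6Network(f"{addr}/{plen}", strict=False) for plen in levels]

def bounds(net: ipaddress.IPv6Network):
    """Inclusive integer (low, high) bounds of a network."""
    return int(net.network_address), int(net.broadcast_address)

nets = enclosing_networks("2001:db8:1234:5678::1")
low64, high64 = bounds(nets[0])
decimal_digits = len(str(int(ipaddress.IPv6Address("2001:db8:1234:5678::1"))))
```

For this example address the decimal form has 38 digits, in line with the estimate in the comment; the largest possible 128-bit value needs 39.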


@bodgit
Author

bodgit commented Jul 16, 2015

How would this work with a type that handles both IPv4 and IPv6? As I originally stated in my use case I don't know the address family ahead of time, only that it is "an IP address" so I would prefer a type that can handle both. If that meant storing IPv4 addresses as IPv6-mapped it means that for such addresses, you do care about the lesser significant bits more as the address is ::ffff:d.d.d.d and so the first 96 bits are always going to be the same.
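
A sketch of how a single 128-bit type could treat both families, assuming IPv4 values are stored in IPv4-mapped form as described: an IPv4 address gains the fixed 96-bit prefix, and an IPv4 /n network becomes a /(96 + n) network. The helper names `as_v6` and `as_v6_network` are hypothetical.

```python
import ipaddress

V4_MAPPED_BASE = 0xFFFF << 32  # integer value of ::ffff:0:0

def as_v6(text: str) -> ipaddress.IPv6Address:
    """Return the address as IPv6, mapping IPv4 into ::ffff:0:0/96."""
    addr = ipaddress.ip_address(text)
    if addr.version == 4:
        return ipaddress.IPv6Address(V4_MAPPED_BASE | int(addr))
    return addr

def as_v6_network(cidr: str) -> ipaddress.IPv6Network:
    """Return the network as IPv6; an IPv4 /n becomes a mapped /(96 + n)."""
    net = ipaddress.ip_network(cidr, strict=False)
    if net.version == 4:
        base = V4_MAPPED_BASE | int(net.network_address)
        return ipaddress.IPv6Network((base, 96 + net.prefixlen))
    return net

v4_block = as_v6_network("192.0.2.0/24")
all_v4 = as_v6_network("0.0.0.0/0")
```

With this mapping an IPv4 range query only ever covers addresses under `::ffff:0:0/96`, so native IPv6 addresses never match it, and every distinguishing bit of an IPv4 value sits in the lowest 32 bits.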

@clintongormley clintongormley added >feature high hanging fruit :Search Foundations/Mapping Index mappings, including merging and defining field types labels Sep 21, 2015
@avleen

avleen commented Sep 25, 2015

FWIW, ARIN announced depletion of their free IP pool today:
http://teamarin.net/category/ipv4-depletion/

@hanej

hanej commented Sep 28, 2015

Our access logs use a combination of IPv6 and IPv4 in the same field so we're in the same situation as @bodgit

@avleen

avleen commented Sep 30, 2015

The Lucene ticket mentioned above isn't being worked on.
Instead they implemented a different way of doing things, which could enable an ipv6 type:
https://issues.apache.org/jira/browse/LUCENE-5879
But I think it might be up to Elasticsearch to implement that on top of the work they did on the auto-prefix terms?

@rmuir
Contributor

rmuir commented Sep 30, 2015

That's not really true. @mikemccand and @nknize are hard at work, and have been for a long time, adding all kinds of experimental data structures to lucene: to better solve the issues of numeric-like fields, spatial data structures, etc.

Another one that is promising for cases like this is https://issues.apache.org/jira/browse/LUCENE-6697

But there is still work to do, to graduate them from the sandbox: for example (this is not criticism, these guys are iterating and that is how it goes), some of these formats create large files in /tmp during merge. This kind of "sandy" stuff has to be cleaned up before they are production-strength.

Furthermore integrating them is a little tricky, in the past everyone has jumped to build numerics/spatial on top of what lucene already had (things like inverted index structures), and currently I see them still "wedging" the new stuff behind those apis.

I think in order to fix it properly, we have to expand the index format (Codec apis) with abstractions for these kinds of data structures, simple ones we can live with, improve for users over minor releases, and support backwards compatibility for. We can't just shove this stuff out there quickly: exposing these kinds of features means we are committing ourselves to long-term backwards compatibility of the format, that is one reason it takes longer.

I am not really following all that closely, nobody can keep up with those guys, so I might be wrong, but this is just my high level view on the thing. It's not that we are lazy and don't care about IPv6 or anything like that.

@avleen

avleen commented Sep 30, 2015

Robert, I don't for a moment think you, or anyone working on ES or Lucene
is lazy.
You folks all do incredible work and give it to us for free. We're very
grateful for your efforts.

I think ipv6 is just a big deal to a lot of people, which is why we see so
much interest in this issue, and we're just waiting for the technology to
catch up to our needs :)


@kjelle

kjelle commented Feb 4, 2016

+1

@kkirsche
Contributor

+1 This would be extremely helpful

@nknize
Contributor

nknize commented Feb 24, 2016

You're in luck. Thx to @rmuir this is getting closer. https://issues.apache.org/jira/browse/LUCENE-7043

@damm

damm commented Mar 28, 2016

Yes it should be closer; I hope ES 5? :)

@jpountz
Contributor

jpountz commented Apr 17, 2016

Fixed via #17746

@jpountz jpountz closed this as completed Apr 17, 2016
@zachfi

zachfi commented Apr 17, 2016

Thank you for the effort(s).

@kkirsche
Contributor

Thank you for the work on this!

@bodgit
Author

bodgit commented Apr 18, 2016

🍰 🎉 👍

@bananabr

+1

@javanna javanna added the Team:Search Foundations Meta label for the Search Foundations team in Elasticsearch label Jul 16, 2024