
BigInteger/BigDecimal support #5683

Closed · wants to merge 6 commits

Conversation

@jprante (Contributor) commented Apr 4, 2014

For XContentBuilder/XContentParser and document mapping, this will add support for "big" numeric types BigInteger/BigDecimal.

BigInteger/BigDecimal support for XContentBuilder/XContentParser is implemented using Jackson's existing support for the "big" numeric types. A new method losslessDecimals() switches the XContentParser into recognizing BigInteger/BigDecimal in preference to the primitive numeric types, which is convenient when using the Java API to parse document sources with BigInteger/BigDecimal field values.

For the document mapping, new core types biginteger and bigdecimal are introduced. With a new flag lossless_numeric_detection, the precedence of BigInteger/BigDecimal over primitive numeric types can be controlled in the mapping. When set to true, new dynamic numeric fields are assigned to "big" numeric types first. Default is false, where primitive numeric types still take precedence.
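A sketch of what a mapping using the proposed types and flag might look like, per the description above (the field names are illustrative, and the exact placement of the `lossless_numeric_detection` flag is my reading of this PR, not confirmed API):

```json
{
  "properties": {
    "serial": {
      "type": "biginteger"
    },
    "amount": {
      "type": "bigdecimal",
      "lossless_numeric_detection": true
    }
  }
}
```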

Caveat: BigInteger/BigDecimal support is only meant for search and indexing/storing. The "big" numeric types are degraded to their .longValue() and .doubleValue() components when used in NumericRangeQuery and related contexts, so using values larger than Long.MAX_VALUE or Double.MAX_VALUE in analytical queries such as facets and aggregations is not recommended; strange cut-offs or underflows/overflows may occur.
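The cut-offs described in this caveat are just the JDK's own narrowing behavior; a minimal standalone illustration (the class name is mine, nothing here is from the patch):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class NarrowingDemo {
    public static void main(String[] args) {
        // 2^64 + 5 does not fit in a long; longValue() keeps only the
        // low-order 64 bits, silently yielding 5.
        BigInteger big = BigInteger.valueOf(2).pow(64).add(BigInteger.valueOf(5));
        System.out.println(big.longValue());     // prints 5

        // doubleValue() rounds to the nearest representable double,
        // losing the low-order digits of this 35-digit value.
        BigDecimal exact = new BigDecimal("19999999999999999999999999999999999");
        System.out.println(exact.doubleValue()); // prints 2.0E34
    }
}
```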

@jpountz (Contributor) commented Apr 15, 2014

> The "big" numeric types are degraded to their .longValue() and .doubleValue() components when they are used in NumericRangeQuery and related contexts

FYI, there is some discussion on https://issues.apache.org/jira/browse/LUCENE-5596 about adding range support for types wider than 64 bits.

@jpountz (Contributor) commented Aug 22, 2014

Quick update: most of this change is good and would be a good start toward supporting big integers/decimals in the future. I added the stalled label, since I think it is important to support efficient range queries on such types without information loss (either via https://issues.apache.org/jira/browse/LUCENE-5879 or https://issues.apache.org/jira/browse/LUCENE-5596). Some other thoughts/open questions:

  • these types should probably be forbidden in the numeric metrics aggregations; otherwise we would either need to use big decimals there, which would kill performance, or the information loss would make results unusable
  • these types should probably be opt-in only, since they would have different capabilities than the other numeric fields
  • for sorting, should we use SORTED or BINARY doc value types? (I would lean towards SORTED, which would make sorting faster)
  • should they be specified as strings or numbers in the _source document? (would there be compatibility issues with some languages/JSON parsers/JSON generators with numbers?)

@jprante (Contributor, Author) commented Aug 22, 2014

@jpountz

> these types should probably be forbidden in the numeric metrics aggregations, otherwise we would either need to use big decimals there which would kill performance, or the information loss would make results unusable

I agree that numeric metrics aggregations must never use BigInteger/BigDecimal types. One thought is to add a special aggregation type, like a "monetary/financial" aggregation, where performance is less important than the exactness/correctness of the numeric results and BigDecimal is not converted to double/float.

> should they be specified as strings or numbers in the _source document? (would there be compatibility issues with some languages/json parsers/json generators with numbers?)

The Jackson library maps them to the JSON number type (http://wiki.fasterxml.com/JacksonDataBinding). There are mechanisms to let the parser auto-detect BigInteger (no fraction), but BigDecimal (with fraction) must be explicitly configured to take precedence over double/float.

@clintongormley (Contributor) commented:

> "should they be specified as strings or numbers in the _source document? (would there be compatibility issues with some languages/json parsers/json generators with numbers?)"
>
> The Jackson library maps it to "JSON Type number" http://wiki.fasterxml.com/JacksonDataBinding — there are some mechanisms to let the parser auto-detect BigInteger (no fraction), but BigDecimal must be configured to override double/float (with fraction).

My concern here is more with other languages, e.g. JavaScript can't support big integers/decimals, and we'll find lots of similar issues. It may be OK to accept them as numbers, as long as we also support coercing from strings. That way users of languages without support can still use them.

@jprante (Contributor, Author) commented Aug 22, 2014

The problem with JavaScript is its poor support for numbers; even 64-bit ints fail (and ES/Lucene has supported 64-bit longs for a while now). BigInteger/BigDecimal can be added as an extension, at least in Node.js: https://www.npmjs.org/package/json-bignum

@kul (Contributor) commented Jan 14, 2015

👍 much awaited.

@mikemccand (Contributor) commented:

I think https://issues.apache.org/jira/browse/LUCENE-6697 (just released in Lucene 5.3.0) is a compelling way to allow fast range filters on BigInteger/Decimal values.

Values for the field must be indexed as a SortedSetDocValuesField (with the BigInteger/BigDecimal value converted to a byte[]), and the field must use the RangeTreeDocValuesFormat; then use NumericRangeTreeQuery at search time.

Some care must be taken with the byte[] encoding so that byte-wise sort order matches numeric order. I think this means the BigInteger field must have a max allowed value (set once up front in the mapping), the BigDecimal field may need the same up-front scale across all values (?), and the sign bit needs to be flipped like we do for NumericField.

But I think it should work well, and from my limited perf testing on the original issue, the resulting index is smaller and filters are faster than NumericField/RangeQuery.

One caveat: because this code is very new, it lives in the sandbox for now, and there's no guarantee of back-compat for the file format it writes. But then, the file format is also ridiculously simple ...
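One way the fixed-width, sort-order-preserving encoding Mike describes could look: sign-extend the two's-complement bytes to a width fixed up front, then flip the sign bit so unsigned byte order matches numeric order. This is a sketch of the general technique, not code from the patch; the class, method, and chosen width are mine.

```java
import java.math.BigInteger;
import java.util.Arrays;

public class SortableBigIntEncoding {
    // Encode v into exactly `width` bytes whose unsigned lexicographic
    // order equals numeric order. Assumes every value fits in `width`
    // bytes, i.e. a max value fixed once up front in the mapping.
    public static byte[] encode(BigInteger v, int width) {
        byte[] src = v.toByteArray();  // big-endian two's complement
        if (src.length > width) {
            throw new IllegalArgumentException("value exceeds configured width");
        }
        byte[] out = new byte[width];
        // Sign-extend into the leading bytes.
        byte pad = (byte) (v.signum() < 0 ? 0xFF : 0x00);
        Arrays.fill(out, 0, width - src.length, pad);
        System.arraycopy(src, 0, out, width - src.length, src.length);
        out[0] ^= 0x80;  // flip the sign bit so negatives sort before positives
        return out;
    }

    public static void main(String[] args) {
        byte[] a = encode(BigInteger.valueOf(-5), 16);
        byte[] b = encode(BigInteger.valueOf(3), 16);
        byte[] c = encode(new BigInteger("99999999999999999999999999999999"), 16);
        // Unsigned byte-wise comparison now agrees with numeric order.
        System.out.println(Arrays.compareUnsigned(a, b) < 0);  // prints true
        System.out.println(Arrays.compareUnsigned(b, c) < 0);  // prints true
    }
}
```

A BigDecimal variant would first normalize all values to one scale (as noted above) and then encode the unscaled BigInteger the same way.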

@muelli commented Sep 29, 2015

Big integers are also interesting for cryptographic applications.

@SKumarMN commented Oct 5, 2015

@jprante Does the above fix support range and filter queries too? Any idea when Elasticsearch is going to add BigDecimal/BigInteger support officially?

@jprante (Contributor, Author) commented Oct 5, 2015

From what I can see, BigDecimal/BigInteger support is implemented in Lucene 5.3, which will appear in Elasticsearch 2.x (not 2.0).

@SKumarMN commented Oct 5, 2015

@jprante

Hey, I have applied the fix mentioned in this post, but when I index or fetch data, it is getting rounded off. I am using the REST API calls. Am I doing anything wrong here?

Here is my mapping:

{
  "tweety": {
    "properties": {
      "message": {
        "type": "string"
      },
      "post_date": {
        "type": "date",
        "format": "dateOptionalTime"
      },
      "newint": {
        "type": "biginteger",
        "lossless_numeric_detection": true
      }
    }
  }
}

Data:

{
  "newint": 19999999999999999999999999999999999,
  "post_date": "2009-11-15T14:12:12",
  "message": "trying out Elastic-search"
}

Get result:

{
  "_index": "twitter",
  "_type": "tweety",
  "_id": "1",
  "_version": 1,
  "found": true,
  "_source": {
    "newint": 2e+34,
    "post_date": "2009-11-15T14:12:12",
    "message": "trying out Elastic-search"
  }
}

@jprante (Contributor, Author) commented Oct 5, 2015

@SKumarMN the patch is only 50% of the required work. It only means that BigInteger/BigDecimal is accepted as JSON input. The default is to downgrade the accepted values to double/float wherever possible; otherwise, the change would not be compatible with existing ES applications. REST actions would have to be changed to prefer BigInteger/BigDecimal.
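The 2e+34 that came back in the get result above is exactly what downgrading to double produces; a quick standalone JDK check (the class name is mine, not part of the patch):

```java
import java.math.BigInteger;

public class RoundingDemo {
    public static void main(String[] args) {
        String value = "19999999999999999999999999999999999";
        // Parsed as a double, the 35-digit integer rounds to the nearest
        // representable value, printed as 2.0E34 -- matching the _source above.
        System.out.println(Double.parseDouble(value));  // prints 2.0E34
        // Parsed as a BigInteger, it round-trips exactly.
        System.out.println(new BigInteger(value));
    }
}
```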

@clintongormley (Contributor) commented:

> From what I can see BigDecimal/BigInteger is implemented in Lucene 5.3 which will appear in Elasticsearch 2.x (not 2.0)

This code is in the Lucene sandbox only. We need to wait until it graduates to core before we can start using it.

@mikemccand (Contributor) commented:

> We need to wait until it graduates to core before we can start using it.

I'm working on graduating this to Lucene's core ... here's the first step: https://issues.apache.org/jira/browse/LUCENE-6825

@clintongormley (Contributor) commented:

w00t!

@SKumarMN commented:
@jpountz

Hi,

I have used the fix https://github.com//pull/5758 in my 1.4.4 code to support big integers by changing the IPv6 mapper. Search and range queries work fine. Our application needs support for BigDecimal too. Could you please give me pointers on how I can implement BigDecimal support with range functionality as well?

@clintongormley (Contributor) commented:

Closing in favour of #17006

Labels: high hanging fruit; :Search Foundations/Mapping (Index mappings, including merging and defining field types)
7 participants