Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

possible performance issue with very big doubles #32

Closed
pjfanning opened this issue Dec 18, 2022 · 25 comments
Closed

possible performance issue with very big doubles #32

pjfanning opened this issue Dec 18, 2022 · 25 comments

Comments

@pjfanning
Copy link
Contributor

JavaDoubleParser seems to be slower than Double.parseDouble for very large numbers (thousands of digits).
Malicious actors often create input files with large numbers to try to cause denial of service issues.

I have a jmh benchmark at https://github.com/pjfanning/jackson-number-parse-bench

./gradlew jmh

It's worth checking the build.gradle file as I have a param that controls which benchmark to run.

jmh {
    includes = ['org.example.jackson.bench.DoubleParserBench']
}

I'm wondering if it would be possible to disregard the least significant digits. If there are 1000 digits, only the first 30 or 40 digits should really impact the double value - even if you were conservative and limited it 100 or 200, this would limit the risk vector.

@wrandelshofer
Copy link
Owner

I get the following results on my Mac mini 2018:

# JMH version: 1.35
# VM version: JDK 19, OpenJDK 64-Bit Server VM, 19+36-2238

Benchmark                             Mode  Cnt         Score         Error  Units
DoubleParserBench.double10           thrpt    5  38852169.446 ± 2051582.451  ops/s
DoubleParserBench.double1000         thrpt    5   1615511.309 ±   52510.462  ops/s
DoubleParserBench.double1000000      thrpt    5      1591.276 ±      42.500  ops/s
DoubleParserBench.fastDouble10       thrpt    5  52012968.516 ± 4261938.881  ops/s
DoubleParserBench.fastDouble1000     thrpt    5    658186.310 ±   86431.775  ops/s
DoubleParserBench.fastDouble1000000  thrpt    5       669.971 ±      17.593  ops/s

The FastDoubleParser is slower for long digit sequences, because it falls back to Double.parseDouble(). Therefore it scans the digits twice. This is not a useful vector for a denial of service attack, because it is a linear overhead.

For double numbers, we indeed stop parsing the significand after 768 digits¹. This is not visible in the code, because of the fallback to Double.parseDouble(). What you can see in the code, is that we scan all digits and do some multiplications that we then throw away.

To demonstrate this, I have included the JavaBigDecimalParser in your benchmark. I also added the annotations @OutputTimeUnit(TimeUnit.NANOSECONDS) @BenchmarkMode(Mode.AverageTime). This yields the following results:

Benchmark                                Mode  Cnt          Score         Error  Units
DoubleParserBench.double10               avgt    5         25.480 ±       0.437  ns/op
DoubleParserBench.double1000             avgt    5        623.658 ±      22.041  ns/op
DoubleParserBench.double1000000          avgt    5     623618.515 ±   24134.389  ns/op
DoubleParserBench.fastDouble10           avgt    5         18.165 ±       0.181  ns/op
DoubleParserBench.fastDouble1000         avgt    5       1371.018 ±       9.989  ns/op
DoubleParserBench.fastDouble1000000      avgt    5    1408092.544 ±    8572.876  ns/op
DoubleParserBench.fastBigDecimal10       avgt    5         19.451 ±       0.703  ns/op
DoubleParserBench.fastBigDecimal1000     avgt    5       6786.623 ±      77.915  ns/op
DoubleParserBench.fastBigDecimal1000000  avgt    5  228543520.443 ± 4037012.871  ns/op

We can see that only the JavaBigDecimalParser shows the massive slow down for very long digit sequences. For BigDecimal numbers, we have to parse all those digits. If we did the same for Double numbers, we would see the same slow down.

I am aware that the JavaDoubleParser is slower than Double.parseDouble for long digit sequences. My intention is to provide a faster implementation once the pull request for JDK-8205592 is integrated into the JDK.

¹ See 'Daniel Lemire, Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021, Chapter 11 Processing Numbers Quickly': "it may be necessary to read tens or even hundreds of digits (up to 768 digits in the worst case)"

@wrandelshofer
Copy link
Owner

Shouldn't JSON number inputs with more than 24 characters be rejected anyway?

(I think Double.toString(double) never produces more than 24 characters. I haven't tested this though).

@pjfanning
Copy link
Contributor Author

pjfanning commented Dec 19, 2022

Thanks @wrandelshofer - with jackson-core, we are introducing checks on number sizes. From your detailed analysis, it looks like we should probably have a few different sizes that we allow. So far, the unreleased dev code has one limit that defaults to 1000 chars. This is fine for BigDecimal/BigInteger. For Floats and Doubles, we should have different and much smaller limits.

@cowtowncoder
Copy link

cowtowncoder commented Dec 20, 2022

@wrandelshofer You are considering that JSON is limited to what Javascript supports but I think this is not really the case -- format is specified in terms of textual representation and thereby languages/platforms can and do support higher precision too: for Java BigDecimal supports arbitrary lengths (or at least way past 24 characters).
So I don't think 24 would be a good limit in practice as it may impact some use cases.

But at the same time we do want to impose limits based on performance characteristics.

@cowtowncoder
Copy link

Ok, another probably naive question: would it make sense, from Jackson side, to truncate "too long" input Strings for double / float parsing: so we'd have higher max limit for allowed tokens (... because checks are done at the point where eventually desired type [ float, double, BigDecimal ] is not yet known), but only use "useful portion"?
Or the problem really that due to algorithm needed for base-2-fp-from-decimal-string may actually up to hundreds of digits in certain cases, meaning that for full accuracy this is not an option?

@wrandelshofer
Copy link
Owner

@wrandelshofer You are considering that JSON is limited to what Javascript supports but I think this is not really the case ...

No, no, I just conjectured this from the topic of this issue 😅: this issue is about parsing double values with many digits, right?

I believe it is sensible to assume that the producer and the conumser of data agree on the format of the data.
Producing JSON numbers with superfluous digits is most likely a bug if the consumer is known to interpret them as double values. So, it would make sense to accept no more than 24 characters by default. Being more lenient should need extra configuration.

Likewise, if someone needs to parse BigDecimal with hundreds of digits, then they have to provision their system for this workload. But they may not want to provision their system for BigDecimals with thousands or even millions of digits.

@wrandelshofer
Copy link
Owner

Or the problem really that due to algorithm needed for base-2-fp-from-decimal-string may actually up to hundreds of digits in certain cases, meaning that for full accuracy this is not an option?

These are crafted string representations. Libraries (like the JDK) do not produce strings with that many digits from a double. Therefore, In my opinion, this should not be accepted by default.

@wrandelshofer
Copy link
Owner

wrandelshofer commented Dec 20, 2022

Ok, another probably naive question: would it make sense, from Jackson side, to truncate "too long" input Strings for double / float parsing: so we'd have higher max limit for allowed tokens (... because checks are done at the point where eventually desired type [ float, double, BigDecimal ] is not yet known), but only use "useful portion"?

No, because this is what Double.parseDouble and JavaDoubleParser already do.

They have to scan through all digits though, because there can be an exponent after the digits.
Scanning is very fast in Double.parseDouble, and somewhat slower in JavaDoubleParser.

@cowtowncoder
Copy link

cowtowncoder commented Dec 21, 2022

Ok, thank you @wrandelshofer -- very good points.

Now, as to maximums: one reason why I definitely would not want to add failing-beyond-24-digits case is that we have no idea of how many users would now get failure on previously working use cases -- that is, getting an exception where there used to be none. Never mind that usage was sub-optimal and most of, say, 60-digit value was ignored.
My gut feeling is that there'd be enough to get lots of bug reports for the change, and for most users it comes down to "why did you break my system" from their POV. In a way it is also case of "if it ain't broke don't fix it", if we leave out legit DoS concerns; there's not much upside from maintainer perspective. Breaking perceived backwards-compatibility is generally a bad idae.

However. I'd be quite interested in knowing about quietly truncating unused digits if there is specific limit -- that sounds like something that would be backwards-compatible approach. If we can quietly truncate reminder of useless tail for double/float parsing before calling FDP (or JDK equivalent), sure: that would avoid even more of waste.

So.... what kind of limit would be safe? Since JSON does not allow leading zeroes in integral part, truncating beginning of the String would work. Although leading zeroes in fractional part would count so maybe it's not that trivial.

@cowtowncoder
Copy link

Or the problem really that due to algorithm needed for base-2-fp-from-decimal-string may actually up to hundreds of digits in certain cases, meaning that for full accuracy this is not an option?

These are crafted string representations. Libraries (like the JDK) do not produce strings with that many digits from a double. Therefore, In my opinion, this should not be accepted by default.

Yes, but they may produce them from BigDecimal, no? (or anything else on other platforms that produces/uses higher precision).

But what I was trying to get at was specifically performance differences between BigDecimal (which internally is 10-based, right?) and double (etc). I realize that as per its name FastDoubleParser focuses on its area, but Jackson supports both and as such there are places where stricter limits can only be imposed when we know actual usage via API calls -- but not yet at point of decoding where verification just checks that JSON number representation is followed (which does not, at spec level, impose any specific limits but just states implementations may impose their limits).

@wrandelshofer
Copy link
Owner

wrandelshofer commented Dec 21, 2022

However. I'd be quite interested in knowing about quietly truncating unused digits if there is specific limit -- that sounds like something that would be backwards-compatible approach. If we can quietly truncate reminder of useless tail for double/float parsing before calling FDP (or JDK equivalent), sure: that would avoid even more of waste.

So.... what kind of limit would be safe? Since JSON does not allow leading zeroes in integral part, truncating beginning of the String would work. Although leading zeroes in fractional part would count so maybe it's not that trivial.

Truncating involves parsing the number. We have to truncate from the first non-zero digit. That digit can be before or after the decimal point in the significand. So, it is best done inside the parser. Otherwise we would have to scan and parse the number twice.

Of course, we would have to be faster than Double.parseDouble(). And that's no easy feat. On my trusty Mac mini 2018, I get the following performance with Double.parseDouble():

digits ns/op ns/digit op/sec
24 181 7.56 5'512'709.5519
1 14 13.59 73'561'865.5289
10 22 2.22 45'069'406.8866
100 277 2.77 3'610'668.8042
1'000 441 0.44 2'267'661.1117
10'000 4'426 0.44 225'926.5133
100000 47'251 0.47 21'163.4093
1000000 469'428 0.47 2'130.2525
10000000 4'821'577 0.48 207.4010
100000000 48'427'968 0.48 20.6492
1000000000 490'691'955 0.49 2.0379
2147483643 1'040'876'939 0.48 0.9607

We can have up to Integer.MAX_VALUE - 4 = 2147483643 characters in a String.
The table shows that Double.parseDouble() takes 0.48 ns per digit.

The table is also interesting for assessing whether there is a need for imposing a maximal character length for numbers in Jackson.

Double.parseDouble() accepts up to Integer.MAX_VALUE - 4 = 2147483643 characters per value.
If we allow for that many input characters, we can guarantee a performance of 0.9607 values per second.

However, if we are able to limit number values to 24 characters, we can guarantee a performance of 5.5 million values per second.

@pjfanning
Copy link
Contributor Author

https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for double could be well over 1000. (link provided by @plokhotnyuk).

With floats, the limit is a lot lower (149 if I read the values correctly).

In the end, I think jackson-core is better off enforcing limits on the number of chars in a number than trying to eke out better performance for edge case scenarios.

@lemire
Copy link
Contributor

lemire commented Dec 21, 2022

If you have crazily long input strings, you can almost always determine exactly the right number with only the leading 19 digits... See section 11 of the paper at https://arxiv.org/pdf/2101.11408.pdf cited by @wrandelshofer above...

It involves adding at most 2-3 lines of code...

https://github.com/fastfloat/fast_float/blob/a4f8c86f08688b6be7d91233a4dfdbd32e1ca358/include/fast_float/parse_number.h#L173-L176

It is also conceptually trivial. Suppose that I have...

1.321342134212321321321...321312e10

and I want to parse it... suppose I pick a bunch of leading digits...

1.321342134212e10

now if I add one to these digits...

1.3213421342123e10

The exact number I need, is somewhat between these two, right? Well, if I parse both and they end up generating the same IEEE floating-point number, then it implies that I do not need to look further... I can discard the leftover digits.

@wrandelshofer
Copy link
Owner

https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for double could be well over 1000. (link provided by @plokhotnyuk).

No, that page does not suggest that.

The "Summary" section shows a table with the value 767 for row 'double / Max Fraction Digits Significant'. The paragraph below the table states: "It is the maximum number of digits in a fraction that determines the maximum number of digits for a given IEEE format."

This is consistent with the 768 digits in Lemire's paper.

@wrandelshofer
Copy link
Owner

wrandelshofer commented Dec 21, 2022

If you have crazily long input strings, you can almost always determine exactly the right number with only the leading 19 digits... See section 11 of the paper at https://arxiv.org/pdf/2101.11408.pdf cited by @wrandelshofer above...

It involves adding at most 2-3 lines of code...

This is awesome! 👍

However this is way more than 2-3 lines of code!

In an earlier iteration, I had ported all code for the 'slow path' from your fast_float project. But at that time, the code was a lot slower than the one in java.util.Double. So I removed it later on. Now I am waiting until JDK-8205592 is resolved. When this code becomes available, I will be able to port the 'slow path' with much less work to Java. And - hopefully - it will be faster than `java.util.Double.parseDouble(String)'.

This issue is about a malicious use of the parser. I expect that a malicious actor will use a number that can only be resolved by processing the maximal number of digits.

@lemire
Copy link
Contributor

lemire commented Dec 21, 2022

However this is way more than 2-3 lines of code!

Not really. It is much easier than you seem to imagine.

Here is the pseudocode:

 adjusted_mantissa am = compute_float(exponent, mantissa);
 if(too_many_digits) {
    if(am != compute_float(exponent, mantissa + 1)) {
      go to slow path;
    }
  }

If you really don't see it, I can try to do a PR.

This issue is about a malicious use of the parser. I expect that a malicious actor will use a number that can only be resloved after the maximal number of digits has been processed.

The trick that I describe will solve the issue that you encounter with the benchmarks and make it harder for malicious users to trick you.

But even then... you need to rescale the running time with respect to the size of the input.

Let us look at that...

DoubleParserBench.fastDouble1000         avgt    5       1371.018 ±       9.989  ns/op
DoubleParserBench.fastDouble1000000      avgt    5    1408092.544 ±    8572.876  ns/op

So you have 1000 more digits... and the code gets... well... 1000 times slower. This means that on a per-byte basis, you have constant time performance.

I mean... If I send you a JSON file that is 1000 x larger and you take 1000 x longer to process it, it is fine. It is not much of an attack opportunity.

At that point, you are probably more vulnerable for out-of-memory errors...

@wrandelshofer
Copy link
Owner

https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for double could be well over 1000. (link provided by @plokhotnyuk).

Okay. I understand now, why you wrote 'well over 1000' and I wrote 768. What I meant to say is that we need to perform the costly conversion from decimal to binary with up to 768 digits. Leading zero digits are cheap: we need to skip over leading zeroes that are before the decimal point, and we need to count leading zeros that are after the decimal point.

In the end, I think jackson-core is better off enforcing limits on the number of chars in a number than trying to eke out better performance for edge case scenarios.

Yes. If a 'number' in a JSON has more than 24 characters, it is probably not a double value.

@wrandelshofer
Copy link
Owner

wrandelshofer commented Dec 21, 2022

Not really. It is much easier than you seem to imagine.

I am sure you are right. 🤔

I imagined, that you proposed code hat progressively consumes additional digits until it is able to compute the proper value of the mantissa.

However, when I look at your code snippet again, then I believe now that it corresponds to code that I have already ported from fast_float to Java.

if (isSignificandTruncated) {
// We have too many digits. We may have to round up.
// To know whether rounding up is needed, we may have to examine up to 768 digits.
// There are cases, in which rounding has no effect.
if (DOUBLE_MIN_EXPONENT_POWER_OF_TEN <= exponentOfTruncatedSignificand
&& exponentOfTruncatedSignificand <= DOUBLE_MAX_EXPONENT_POWER_OF_TEN) {
double withoutRounding = tryDecFloatToDoubleWithFastAlgorithm(isNegative, significand, exponentOfTruncatedSignificand);
double roundedUp = tryDecFloatToDoubleWithFastAlgorithm(isNegative, significand + 1, exponentOfTruncatedSignificand);
if (!Double.isNaN(withoutRounding) && roundedUp == withoutRounding) {
return withoutRounding;
}
}
// We have to take a slow path.
result = Double.NaN;

Is this correct?

@lemire
Copy link
Contributor

lemire commented Dec 21, 2022

Is this correct?

Yes. And this code should (if it is correct) cover 99% of the cases if you randomly generate digits.

@wrandelshofer
Copy link
Owner

Great. 😀

Since I am malicious, I used a sequence of the character '7' though. 😈

@wrandelshofer
Copy link
Owner

wrandelshofer commented Dec 22, 2022

But what I was trying to get at was specifically performance differences between BigDecimal (which internally is 10-based, right?) and double (etc).

The significand of BigDecimal is 2-based, its exponent is 10-based. Therefore we still need to convert the significand from 10-base to 2-base.

For BigDecimal we can only drop leading zeroes of the significand; we have to convert all other digits of the significand. Because of this the conversion is very costly.

On my trusty Mac mini 2018, I get the following performance with new BigDecimal(String):

Digits ns/op ns/digit op/sec duration/op
24 117 5 8'562'670 0ms
1 12 12 80'173'174 0ms
10 21 2 48'132'461 0ms
100 481 5 2'079'793 0ms
1'000 14'833 15 67'419 0ms
10'000 1'200'633 120 833 1ms
100000 117'587'913 1'176 8.5043 118ms
1000000 12'027'995'151 12'028 0.0831 12s 28ms
10000000 1'281'686'999'490 128'169 0.0008 21m 21s 687ms
100000000 128'168'699'949'000 1'281'687 0.00000780 1d 11h 36m 8s 700ms
1000000000 12'816'869'994'900'000 12'816'870 0.00000008 21w 1d 8h 14m 29s 995ms
1292782621 21'420'666'987'609'800 16'569'427 0.00000005 35w 2d 22h 11m 6s 988ms

The values in italics are estimates.

@cowtowncoder
Copy link

Phew! Very interesting discussion, some of which I even understood. :)
(but more importantly, people that matter are talking and grok it)

From what I can gather, it does not sound like there would be any simple fixed (independent of textual representation) length to safely use to truncate, below something like 768 characters (varies b/w float, double, BigDecimal but that aside).

So in case of Jackson we do 2-phase processing:

  1. Tokenize valid number representations (as per JSON spec). At this point we are now adding basic configurable max length where there previously were none. I think initial limit is 1024 but I guess we could tune it to 768 safely. Regardless, at this point we do not know type at which user/app wants to access number as
  2. On accessing value as specific numeric type, we perform decoding. At this point we do know target type and could further limit length if we chose to -- but ideally we'd quietly drop whatever (if anything) we can.

But. From above numbers it also does seem like there was diminishing return -- the goal (for me & Jackson, I think) is not to try to optimize performance of sub-optimal cases for legit but misguided users, but to try to limit DoS attacks. If 1k characters would still give over 60k / sec, that does not look too worrisome in grand scheme of things.

Having said that, if there was a way to determine possible truncation, it seems to me that some basic heuristics ("only worry about truncation if charlen above 100; otherwise proceed as-is") could be useful.

@wrandelshofer
Copy link
Owner

wrandelshofer commented Dec 23, 2022

Do we need to make adjustmens in FastDoubleParser?

The current behavior is as follows:

float, double

  • asymptotic performance: O(n) where n is the number of input characters
  • throws NumberFormatException if n exceeds Integer.MAX_VALUE - 4
  • silently crops significand to fixed size of the binary mantissa
  • silently yields infinity if exponent is out of range

BigInteger

  • asymptotic performance: O(nc) where n is the number of input characters
  • throws NumberFormatException if n exceeds 1,292,782,622
  • throws NumberFormatException if the significand does not fit into 2^31 - 1 bits

BigDecimal

  • asymptotic performance: O(nc) where n is the number of input characters
  • throws NumberFormatException if n exceeds 1,292,782,635
  • throws NumberFormatException if the significand does not fit into 2^31 - 1 bits
  • throws NumberFormatException if the exponent is out of range

@wrandelshofer
Copy link
Owner

I am now integrating FFT multiplication from the project https://github.com/tbuktu/bigint/tree/floatfft

Using this multiplication algorithm, we can give very good guarantees for the maximal required computation time and memory usage.

The computation times are given for a Mac mini 2018 with Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz.

The memory usage seem large. But they are not really, considering that we have to perform multiplications of huge bit sequences for parsing a BigInteger/BigDecimal number.

Parser Result Type Maximal
input length
Memory usage
JVM -Xmx
Computation
Time
JavaDoubleParser java.lang.Double 2^31 - 5 10 gigabytes < 5 sec
JavaFloatParser java.lang.Float 2^31 - 5 10 gigabytes < 5 sec
JavaBigIntegerParser java.math.BigInteger 1,292,782,622 14 gigabytes < 7 min
JavaBigDecimalParser java.math.BigDecimal 1,292,782,635 14 gigabytes < 7 min

@wrandelshofer
Copy link
Owner

The BigInteger and BigDecimal parsers with improved worst-case performance are now available in release 0.6.0.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants