possible performance issue with very big doubles #32
I get the following results on my Mac mini 2018:
The FastDoubleParser is slower for long digit sequences because it falls back to Double.parseDouble(), and therefore scans the digits twice. This is not a useful vector for a denial-of-service attack, because it is only a linear overhead. For double numbers, we indeed stop parsing the significand after 768 digits¹. This is not visible in the code because of the fallback to Double.parseDouble(). What you can see in the code is that we scan all digits and do some multiplications that we then throw away. To demonstrate this, I have included the
¹ See Daniel Lemire, Number Parsing at a Gigabyte per Second, Software: Practice and Experience 51 (8), 2021, Chapter 11, "Processing Numbers Quickly": "it may be necessary to read tens or even hundreds of digits (up to 768 digits in the worst case)"
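A minimal sketch of the fallback pattern described here (hypothetical code, not the actual FastDoubleParser implementation):

```java
// Hypothetical sketch of the fallback: the fast path scans every digit,
// but when the significand has too many digits to resolve with 64-bit
// arithmetic, the partial result is thrown away and Double.parseDouble()
// scans the whole string a second time -- a linear overhead, not quadratic.
static double parse(String str) {
    long significand = 0;
    int digitCount = 0;
    for (int i = 0; i < str.length(); i++) {
        char c = str.charAt(i);
        if (c < '0' || c > '9') {
            continue; // sign, '.', and exponent handling elided in this sketch
        }
        significand = 10 * significand + (c - '0'); // work we may throw away
        digitCount++;
    }
    if (digitCount > 19) {
        return Double.parseDouble(str); // fallback: digits are scanned twice
    }
    return (double) significand; // real code applies sign, point, and exponent
}
```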
Shouldn't JSON numbers be limited to what a double can hold, then? (I think Double.toString(double) never produces more than 24 characters. I haven't tested this though.)
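For what it's worth, a quick probe with plain JDK calls (not an exhaustive proof) shows that 24 characters is at least attainable:

```java
// Probe the length of Double.toString output for an extreme value.
public class ToStringLength {
    public static void main(String[] args) {
        String s = Double.toString(-Double.MIN_NORMAL);
        System.out.println(s);          // -2.2250738585072014E-308
        System.out.println(s.length()); // 24
    }
}
```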
Thanks @wrandelshofer - with jackson-core, we are introducing checks on number sizes. From your detailed analysis, it looks like we should probably have a few different sizes that we allow. So far, the unreleased dev code has one limit that defaults to 1000 chars. This is fine for BigDecimal/BigInteger. For Floats and Doubles, we should have different and much smaller limits.
@wrandelshofer You are considering that JSON is limited to what Javascript supports, but I think this is not really the case -- the format is specified in terms of textual representation, and thereby languages/platforms can and do support higher precision too: for Java, think of BigDecimal. But at the same time we do want to impose limits based on performance characteristics.
Ok, another probably naive question: would it make sense, from the Jackson side, to truncate "too long" input Strings for doubles?
No, no, I just conjectured this from the topic of this issue 😅: this issue is about parsing double values with many digits, right? I believe it is sensible to assume that the producer and the consumer of data agree on the format of the data. Likewise, if someone needs to parse BigDecimals with hundreds of digits, then they have to provision their system for this workload. But they may not want to provision their system for BigDecimals with thousands or even millions of digits.
These are crafted string representations. Libraries (like the JDK) do not produce strings with that many digits from a double. Therefore, in my opinion, this should not be accepted by default.
No, because this is what the parsers already do. They have to scan through all digits though, because there can be an exponent after the digits.
Ok, thank you @wrandelshofer -- very good points. Now, as to maximums: one reason why I definitely would not want to add a failing-beyond-24-digits case is that we have no idea how many users would now get failures on previously working use cases -- that is, getting an exception where there used to be none. Never mind that the usage was sub-optimal and most of, say, a 60-digit value was ignored. However, I'd be quite interested in quietly truncating unused digits if there is a specific limit -- that sounds like a backwards-compatible approach. If we can quietly truncate the remainder of the useless tail for doubles... So, what kind of limit would be safe? Since JSON does not allow leading zeroes in the integral part, truncating to the beginning of the String would work. Although leading zeroes in the fractional part would count, so maybe it's not that trivial.
Yes, but they may produce them from a BigDecimal, for example. But what I was trying to get at was specifically the performance differences between FastDoubleParser and Double.parseDouble().
Truncating involves parsing the number. We have to truncate from the first non-zero digit, and that digit can be before or after the decimal point in the significand. So it is best done inside the parser; otherwise we would have to scan and parse the number twice. Of course, we would have to be faster than Double.parseDouble() for this to be worthwhile.
We can have up to 768 significant digits. The table is also interesting for assessing whether there is a need for imposing a maximal character length for numbers in Jackson.
However, if we are able to limit number values to 24 characters, we can guarantee a performance of 5.5 million values per second.
https://www.exploringbinary.com/maximum-number-of-decimal-digits-in-binary-floating-point-numbers/ suggests that the number of digits for a double could be well over 1000 (link provided by @plokhotnyuk). With floats, the limit is a lot lower (149, if I read the values correctly). In the end, I think jackson-core is better off enforcing limits on the number of chars in a number than trying to eke out better performance for edge-case scenarios.
If you have crazily long input strings, you can almost always determine exactly the right number with only the leading 19 digits... See section 11 of the paper at https://arxiv.org/pdf/2101.11408.pdf cited by @wrandelshofer above... It involves adding at most 2-3 lines of code... It is also conceptually trivial. Suppose that I have... 1.321342134212321321321...321312e10 and I want to parse it... suppose I pick a bunch of leading digits... 1.321342134212e10 now if I add one to these digits... 1.3213421342123e10 The exact number I need is somewhere between these two, right? Well, if I parse both and they end up generating the same IEEE floating-point number, then it implies that I do not need to look further... I can discard the leftover digits.
No, that page does not suggest that. The "Summary" section shows a table with the value 767 for row 'double / Max Fraction Digits Significant'. The paragraph below the table states: "It is the maximum number of digits in a fraction that determines the maximum number of digits for a given IEEE format." This is consistent with the 768 digits in Lemire's paper. |
This is awesome! 👍 However, this is way more than 2-3 lines of code! In an earlier iteration, I had ported all the code for the 'slow path' from your fast_float project. But at that time, that code was a lot slower than the fallback to Double.parseDouble(). This issue is about a malicious use of the parser. I expect that a malicious actor will use a number that can only be resolved by processing the maximal number of digits.
Not really. It is much easier than you seem to imagine. Here is the pseudocode:
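(A sketch of that idea in Java; the exact snippet may differ. It assumes a normalized digit string -- no sign, no decimal point, no leading zeros -- plus a separate decimal exponent, with Double.parseDouble standing in for the short-string fast path.)

```java
// Truncate-and-compare: parse only a prefix of the digits. The true value
// lies between prefix*10^e and (prefix+1)*10^e; if both bounds round to
// the same IEEE double, the remaining digits cannot change the result.
static double parseLongDecimal(String digits, int exponent) {
    final int PREFIX = 18; // 18 decimal digits always fit in a long
    if (digits.length() <= PREFIX) {
        return Double.parseDouble(digits + "e" + exponent);
    }
    String prefix = digits.substring(0, PREFIX);
    int prefixExponent = exponent + (digits.length() - PREFIX);
    double low = Double.parseDouble(prefix + "e" + prefixExponent);
    double high = Double.parseDouble(
            (Long.parseLong(prefix) + 1) + "e" + prefixExponent);
    if (low == high) {
        return low; // the leftover digits are irrelevant
    }
    // Rare slow path: the two bounds straddle a rounding boundary.
    return Double.parseDouble(digits + "e" + exponent);
}
```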
If you really don't see it, I can try to do a PR.
The trick that I describe will solve the issue that you encounter with the benchmarks and make it harder for malicious users to trick you. But even then... you need to rescale the running time with respect to the size of the input. Let us look at that...
So you have 1000x more digits... and the code gets... well... 1000 times slower. This means that on a per-byte basis, you have constant-time performance. I mean... if I send you a JSON file that is 1000x larger and you take 1000x longer to process it, it is fine. It is not much of an attack opportunity. At that point, you are probably more vulnerable to out-of-memory errors...
Okay. I understand now why you wrote 'well over 1000' and I wrote 768. What I meant to say is that we need to perform the costly conversion from decimal to binary for up to 768 digits. Leading zero digits are cheap: we need to skip over leading zeroes that are before the decimal point, and we need to count leading zeroes that are after the decimal point.
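A minimal sketch of that leading-zero handling (an assumed form, not the library's actual scanner):

```java
// Leading zeros are cheap: zeros before the decimal point are skipped,
// zeros after the decimal point are only counted, because each one merely
// shifts the decimal exponent. The costly decimal-to-binary conversion
// starts at the first significant digit.
static void locateSignificantDigits(String s) {
    int i = 0;
    while (i < s.length() && s.charAt(i) == '0') {
        i++;                 // skip: "000123" -> first significant index 3
    }
    int exponentShift = 0;
    if (i < s.length() && s.charAt(i) == '.') {
        i++;
        while (i < s.length() && s.charAt(i) == '0') {
            i++;
            exponentShift--; // count: three zeros in "0.0001" -> shift -3
        }
    }
    System.out.println("first significant digit at index " + i
            + ", exponent shift " + exponentShift);
}
```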
Yes. If a 'number' in a JSON document has more than 24 characters, it is probably not a double value.
I am sure you are right. 🤔 I imagined that you proposed code that progressively consumes additional digits until it is able to compute the proper value of the mantissa. However, when I look at your code snippet again, I believe now that it corresponds to code that I have already ported from fast_float to Java (lines 965 to 980 in 4c10752). Is this correct?
Yes. And this code should (if it is correct) cover 99% of the cases if you randomly generate digits.
Great. 😀 Since I am malicious, I used a sequence of the character '7' though. 😈
The significand of BigDecimal is 2-based; its exponent is 10-based. Therefore we still need to convert the significand from base 10 to base 2. For BigDecimal we can only drop leading zeroes of the significand; we have to convert all other digits of the significand. Because of this, the conversion is very costly. On my trusty Mac mini 2018, I get the following performance with
The values in italics are estimates.
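The point about the two bases can be seen directly in the plain JDK API:

```java
import java.math.BigDecimal;

// The unscaled value (significand) is a BigInteger stored in base 2,
// while the scale is a base-10 exponent. Parsing "123.45" therefore has
// to convert the decimal digits "12345" into binary; with hundreds of
// digits this decimal-to-binary conversion dominates the parsing cost.
public class BigDecimalLayout {
    public static void main(String[] args) {
        BigDecimal d = new BigDecimal("123.45");
        System.out.println(d.unscaledValue()); // 12345 (a binary BigInteger)
        System.out.println(d.scale());         // 2, i.e. 12345 * 10^-2
    }
}
```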
Phew! Very interesting discussion, some of which I even understood. :) From what I can gather, it does not sound like there would be any simple fixed length (independent of textual representation) that is safe to truncate to, below something like 768 characters (which varies between types). So in the case of Jackson we do 2-phase processing:
But: from the above numbers it also does seem like there are diminishing returns -- the goal (for me & Jackson, I think) is not to try to optimize performance of sub-optimal cases for legit but misguided users, but to limit DoS attacks. If 1k characters would still give over 60k / sec, that does not look too worrisome in the grand scheme of things. Having said that, if there was a way to determine possible truncation, it seems to me that some basic heuristics ("only worry about truncation if charlen is above 100; otherwise proceed as-is") could be useful.
Do we need to make adjustments in FastDoubleParser? The current behavior is as follows:
- float, double
- BigInteger
- BigDecimal
I am now integrating FFT multiplication from the project https://github.com/tbuktu/bigint/tree/floatfft. Using this multiplication algorithm, we can give very good guarantees for the maximal required computation time and memory usage. The computation times are given for a Mac mini 2018 with an Intel(R) Core(TM) i7-8700B CPU @ 3.20GHz. The memory usage seems large, but it is not really, considering that we have to perform multiplications of huge bit sequences when parsing a BigInteger/BigDecimal number.
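For intuition on why parsing turns into big multiplications, here is a minimal divide-and-conquer sketch using the JDK's built-in multiply (the release described here swaps in FFT multiplication for the large products):

```java
import java.math.BigInteger;

// Recursive decimal-to-binary conversion: split the digit string in half,
// convert both halves, and combine with high * 10^(length of low half) + low.
// The combining multiplications involve huge bit sequences, which is where
// a faster multiplication algorithm (such as FFT) improves the worst case.
static BigInteger parseDigits(String digits) {
    if (digits.length() <= 18) {
        return BigInteger.valueOf(Long.parseLong(digits)); // small base case
    }
    int mid = digits.length() / 2;
    BigInteger high = parseDigits(digits.substring(0, mid));
    BigInteger low = parseDigits(digits.substring(mid));
    return high.multiply(BigInteger.TEN.pow(digits.length() - mid)).add(low);
}
```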
The BigInteger and BigDecimal parsers with improved worst-case performance are now available in release 0.6.0. |
JavaDoubleParser seems to be slower than Double.parseDouble for very large numbers (thousands of digits).
Malicious actors often create input files with large numbers to try to cause denial of service issues.
I have a jmh benchmark at https://github.com/pjfanning/jackson-number-parse-bench
./gradlew jmh
It's worth checking the build.gradle file as I have a param that controls which benchmark to run.
I'm wondering if it would be possible to disregard the least significant digits. If there are 1000 digits, only the first 30 or 40 digits should really impact the double value - even if you were conservative and limited it to 100 or 200, this would limit the risk vector.