Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixed Issue #187 - Change parse to record floating point representation #201

Closed
wants to merge 4 commits into from

Conversation

twelsby
Copy link
Contributor

@twelsby twelsby commented Jan 30, 2016

This pull request incorporates the following changes:

  1. New function get_integer() that is used instead of strtoull()/strtoll(). This function implements custom integer parsing that saves the number to the correct basic_json type and returns the type. This also eliminates the need for attempt_cast() as overflow is detected internally. Additionally details about the floating point representation are collected during the parse (in the event it is not an integer) such as the number of significant figures, the capitalization of 'e' and the use of the optional positive exponential sign. For floating point numbers this function is faster then the method currently used in 2.0.0 and only slightly slower than the method used in 1.1.0. For integers this method is significantly faster than both 2.0.0 and 1.1.0.
  2. New type type_data_t that incorporates the value_t enum into a bitfield with the floating point representation information. This ensures that no additional memory is used in any packing scenario except byte level packing (unlikely and even then only a small amount extra is used). The type includes operator overloads so that it can be used as an almost drop in replacement for the straight value_t (switch statements need to be changed to use the get() method). Using bitfields results in a small performance overhead but as these variables are accessed only infrequently it is not significant.
  3. Modification to the dumping of floating point numbers, which now uses snprintf() for the conversion to enable application of the recorded floating point representation. This results in massive performance improvements simply from the change to snprintf().
  4. Addition of extra roundtrip unit tests that target the new behaviour.

@twelsby
Copy link
Contributor Author

twelsby commented Jan 30, 2016

@gregmarr, in relation to "0.0" my rationale was that since a floating point 0.0 was itself a special case it would be better not to complicate the code with yet another special case for -0.0, although it may be slightly more efficient to do so. I've had another look at it and the complication is not as bad as I thought so I have made the change.

In relation to the comment - the mention of '15' digits has been there for some time even though std::numeric_limits<number_float_t>::digits10() was used instead of a literal 15. This was not introduced or changed in this pull. That said, you have a good point so I have also amended the comment.

@twelsby
Copy link
Contributor Author

twelsby commented Jan 30, 2016

I also merged get_integer() into get_number(). The code coverage decreased with the original commit and it appeared that the test of endptr after str_to_float_t() was never testing true. I tried to introduce a new unit test but it proved impossible to provide any input that the lexer identified as a number that would also cause str_to_float_t() to set endptr to anything other than m_cursor. This is therefore a redundant test which I have removed.

With this test removed, get_number() was just calling get_integer() and then str_to_float_t() so there didn't seem to be much point in having two separate functions.

@nlohmann
Copy link
Owner

Could you please put the additional test cases to a different place (best would be directly into unit.cpp to the regression tests), because we would have clashes if miloyip/nativejson-benchmark adds new roundtrip tests?

@gregmarr
Copy link
Contributor

I didn't realize that it also covered -0.0, since the comment didn't mention it.
The numeric limits still has the mismatch with the comment using the typedef and the code using double.

@nlohmann
Copy link
Owner

I am currently not sure whether it is a good idea to preserve optional + and the case of the exponent. In the end, it is still the same number, and JSON allows for these variations. After all, we also don't care about whitespace or the order in which object keys are parsed.

@twelsby
Copy link
Contributor Author

twelsby commented Jan 31, 2016

@gregmarr, you are correct, the numeric limits should be number_float_t, not double.

@nlohmann, it would reduce the code complexity to drop e/E and + preservation but this would not meaningfully affect performance as these are not computationally intensive operations - saving the representation is two comparisons that only need to be done once per number, while applying them involves two comparisons and a very fast loop if the '+' needs to be removed. But I agree it shouldn't be a priority to preserve such things... but in my view the precision needs to be preserved.

Really what we need though is a custom function that doesn't depend on snprintf(). One thing that snprintf() doesn't do, that would be useful is to round non-parsed floating point numbers into the shortest decimal representation that ensures accurate roundtrips (i.e. double->string->double). This is what rapidJSON does, although its implementation causes string->double->string roundtrips to fail in certain circumstances. The best of both worlds is precision preservation for string->double->string and shortest decimal for double->string->double, but that could only be accomplished with a custom double dumping function.

The problem I see it with custom double parsing/dumping functions is that both are very challenging to do efficiently. Integer parsing/dumping can easily and efficiently be done with minimal code (no more than 10 lines each) but I would not be surprised if adding custom floating point parsing/dumping added thousands of lines - but maybe I am just being pessimistic.

@nlohmann nlohmann self-assigned this Mar 30, 2016
@nlohmann
Copy link
Owner

I shall finally merge this this weekend.

nlohmann added a commit that referenced this pull request Apr 3, 2016
@nlohmann nlohmann added this to the Release 2.0.0 milestone Apr 3, 2016
@nlohmann
Copy link
Owner

nlohmann commented Apr 3, 2016

Thanks a lot! I merged the code and made some minor adjustments. I hope the tests all succeed.

@nlohmann
Copy link
Owner

nlohmann commented Apr 3, 2016

All tests are fine. Thanks so much @twelsby!

// Remove '+' sign from the exponent if necessary
if (!m_type.bits.exp_plus)
{
if (len > static_cast<int>(sizeof(buf))) len = sizeof(buf);
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @twelsby!

I am currently trying to improve the test coverage of the code, but I fail to create a test case where len is larger than the size of buf. The previous snprintf calls never seem to return anything greater than 263 - and if I understand snprintf correctly, it would only return a value larger than sizeof(buf) if writing to the buffer failed. In that case, capping it to sizeof(buf) and continue to write to it would make no sense.

Before I jump to wrong conclusions, I would like to ask you what situation you had in mind that would require the if.

All the best
Niels

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants