Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Floating point precision and missing integer fields #90

Open
marklombardi opened this issue Aug 5, 2018 · 13 comments
Open

Floating point precision and missing integer fields #90

marklombardi opened this issue Aug 5, 2018 · 13 comments

Comments

@marklombardi
Copy link

Thank-you for creating a great library. I am seeing a performance improvement of approximately 10:1 over JsonCpp. File sizes tested up to 3 MB.

Q1:
I have noticed that tables with integer fields with 0 values are not printed. Is this a bug or have I missed something in the documentation?

Q2:
I would like to specify the floating point precision (significant figures) to use when writing floats or doubles? Ideally through a configuration option. In my application I'd like to reduce the number of significant figures to 5. This will reduce the size of my JSON files by a factor of 2.

This is the relevant piece of code. The precision is hard coded. However, it is easily parameterized for the standard library use case. For the grisu3 library function grisu3_print_double, its unclear as to how to modify the function to limit the precision. Could you perhaps provide a code snippet that includes the changes by adding a precision parameter to the grisu3_print_double call?

flatcc/include/flatcc/portable/pprintfp.h

#ifdef grisu3_print_double_is_defined
/* Currently there is not special support for floats. */
#define print_float(n, p) grisu3_print_double((float)(n), (p))
#define print_double(n, p) grisu3_print_double((double)(n), (p))
#else
#include <stdio.h>
#define print_float(n, p) sprintf(p, "%.9g", (float)(n))
#define print_double(n, p) sprintf(p, "%.17g", (double)(n))
#endif

Thank-you!

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 5, 2018

Hi Mark,

I'm glad you find it useful.

To your first question (Q1):

Ignoring JSON, FlatBuffers have the concept of default values which are not stored in the buffer. By default, the default integer value is 0. Regardless of whether the value is stored or not, it is read back by the reader because it knows the default.

The JSON printer uses runtime flags to achieve similar behavior:

if (p) { \
x = flatbuffers_ ## TN ## _read_from_pe(p); \
if (x == v && ctx->skip_default) { \
return; \
} \
} else { \
if (!ctx->force_default) { \
return; \
} \
x = v; \
} \

The flags are not set by the init function, but immediately after using the set flags call:

/*
* May be called instead of setting operational modes individually.
* Formatting is strict quoted json witout pretty printing by default.
*
* flags are:
*
* `unquote`,
* `noenum`,
* `skip_default`,
* `force_default`,
* `pretty`,
* `nonstrict`
*
* `pretty` flag sets indentation to 2.
* `nonstrict` implies: `noenum`, `unquote`, `pretty`.
*/
static inline void flatcc_json_printer_set_flags(flatcc_json_printer_t *ctx, int flags)

There are two cases where the 0 could be omitted: either the field is not present at all in the buffer (either value was never written, or a default value was written to the buffer without being forced, as is the normal mode), or the field is present but the value matches the default value. The latter case is controlled by skip_default and the former case is controlled by force_default - print the default even if absent.

Now, is there a bug? possibly - but, from memory, I believe the intention is to print a value in the buffer if it is stored in the buffer, default or not but also not print default values that are not stored - otherwise a lot of fields could be printed that are of no interest.

So if the buffer actually contains the value, it would probably be a bug, but more likely the buffer is omitting the default value and so is the printer. In this case you need to either force the value into the buffer, or use force_default on the runtime flag, but then lots of other values might be printed as well.

Q2 follow in separate post.

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 5, 2018

Q2:

This is also something that bothers me.

I have actually spent considerable time trying to fix this, several times, but it is very difficult if is to be done with grisu3 and remain fast, I also looked at googles grisu3 implementation in C++ but did not find it helpful.

I tried to print the full resolution to a buffer (this happens already), but then truncated the buffer using a rounding algorithm that is fairly trivial. This did work, but the time overhead was such that I did not want to release it. A proper solution need to stop the algorithm that produces new digits at a lower resolution. There is a computed error measure that controls this process, but it is a black art. Any modification could easily lead to incorrect results.

I would be really happy if someone would provide such a patch. As I recall the Google solution also works by rounding after the fact, which is slow.

For your case I'd recommend not using grisu3 and redefine the print methods so they use a different resolution. I can't provide a generic interface for configurable resolution because that would force the grisu3 implemtentation to support it if/when the fallback is replaced by an internal implementation.

While a bit hackish you could probably do it by pretending to be grisu3 implementation in the following:
https://github.com/dvidelabs/flatcc/blob/master/include/flatcc/portable/pprintfp.h

Other suggestions are welcome, but it is not an easy problem.

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 5, 2018

Q2 v2

Another option is to use hex floating points if that is acceptable to the recepient. I'm not sure I have actually tested this, but it definitely is the intention that this should be supported.

This would both be very fast and presumably also compact with the added benefit of perfect numeric precision.

Here is a config flag for hex floats:

/*
* Print float and double values with C99 hexadecimal floating point
* notation. This option is not valid JSON but it avoids precision
* loss, correctly handles NaN, +/-Infinity and is significantly faster
* to parse and print. Some JSON parsers rely on strtod which does
* support hexadecimal floating points when C99 compliant.
*/
#ifndef FLATCC_JSON_PRINT_HEX_FLOAT
#define FLATCC_JSON_PRINT_HEX_FLOAT 0
#endif

Several JSON parsers will accept hex float unintentionally because they use strtod in the core implementation without verifying that the numeric format is valid JSON. The C++ FlatBuffer JSON parser should be supporting hex floats officially, as I understand, at least we had a discussion about it.

Please let me know if you go down that path.

@marklombardi
Copy link
Author

Hi Mike,
Thank-you for the responses.

Q1:
In my use case, the 0 values are being set explicitly using the generated create functions. I assume this means the 0 value fields are present in the buffer. I was able to have these 0 value fields (also default values) printed by adding the following lines of code.

flatcc_json_printer_flags print_flags = flatcc_json_printer_f_force_default;
flatcc_json_printer_set_flags(&printer_ctx, print_flags);

This logic sounds correct to me too. "I believe the intention is to print a value in the buffer if it is stored in the buffer, default or not but also not print default values that are not stored" However I don't think it works this way at present. I guess skip_defaults is on by default. It should probably default to false. Then the printing of default fields could be disabled by turning skip_defaults on.

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 6, 2018

Hi Mark,
The flatbuffer API's compare the set value to the default value and do not store the value if they match. You have to use a specific "force" option, which depends on the language that you are using to create the buffer and might not be supported universally.

You could try to parse a manually created buffer with flatcc and set the force flag on the parser, then see how it prints.

/*
* If a value is stored, even if default, it is also printed.
* This option can also be flagged compile time for better performance.
*/
TEST_FLAGS(flatcc_json_parser_f_force_add, 0,
"{ \"name\": \"Monster\", \"color\": 8}",
"{\"name\":\"Monster\",\"color\":\"Blue\"}");
TEST_FLAGS(0, flatcc_json_printer_f_noenum,
"{ \"name\": \"Monster\", \"color\": \"Green\"}",
"{\"name\":\"Monster\",\"color\":2}");
TEST_FLAGS(flatcc_json_parser_f_force_add, flatcc_json_printer_f_skip_default,
"{ \"name\": \"Monster\", \"color\": 8}",
"{\"name\":\"Monster\"}");
TEST_FLAGS(0, flatcc_json_printer_f_force_default,
"{ \"name\": \"Monster\", \"testf\":3.0}",
"{\"mana\":150,\"hp\":100,\"name\":\"Monster\",\"color\":\"Blue\",\"testbool\":true,\"testhashs32_fnv1\":0,\"testhashu32_fnv1\":0,\"testhashs64_fnv1\":0,\"testhashu64_fnv1\":0,\"testhashs32_fnv1a\":0,\"testhashu32_fnv1a\":0,\"testhashs64_fnv1a\":0,\"testhashu64_fnv1a\":0,\"testf\":3,\"testf2\":3,\"testf3\":0}");

See also
https://google.github.io/flatbuffers/md__cpp_usage.html
search for "ForceDefaults"

and
https://github.com/dvidelabs/flatcc/blob/master/doc/builder.md
search for "force_add"

Please do not hestitate to raise further concerns - it might be a bug although I suspect that it is not.

@marklombardi
Copy link
Author

Q2:
Thanks for the responses again.

I tried using the standard library sprintf( ) method and its about twice as slow as the grisu3 implementation. These are the results for comparison purposes.

-- grisu3_print_double   61ms
-- sprintf(p, "%.5g",v)  106ms
-- sprintf(p, "%.5e, v)  119ms

Using hex floats won't work in our situation as we are using the JSON data in various different systems and languages. Transmitting the data in binary format is an option though. We could then post-process the binary data to JSON elsewhere.

I appreciate this is a difficult problem to solve. While looking for another solution I came across this 2018 ACM DL paper which I though may be of interest. It claims to improve on the Grisu3 method and is possibly simpler to implement (according to the authors).

Q1: Thanks for clarifying and pointing me to the location in the documentation. If I call the *_force_add( ) methods, the default values get added to the buffer as you have described.

Thanks,
Mark

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 6, 2018

Thanks for the numbers. sprintf performance is not too bad in truncated form.

I don't recall the performance numbers of my rouding experiment, but considering it needs to pass over 17 digits several times, I don't think it would be much of an improvement.

Interesting paper - if you (or anyone else reading this) come across a quality C implementation with test cases and performance tests against grisu3, I'd seriously consider adopting it, especially if it can efficiently control resolution.

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 6, 2018

There seems to be an Apache 2 licensed C implementation

https://github.com/ulfjack/ryu/tree/master/ryu

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 6, 2018

The implementation changes a lot atm and has not solved the resolution problem fully:

ulfjack/ryu#27

But it is worth keeping an eye on.

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 6, 2018

BTW: on performance: if you are measuring the overall performance of JSON processing and the difference is a factor 2, then of course sprintf is significantly more than twice slower than grisu3 and worth optimizing further. Floating point printing is truly a signficant factor in overall JSON processing.

@marklombardi
Copy link
Author

marklombardi commented Aug 7, 2018

Yes you are correct. The conversion from the flatbuffer to the memory buffer (JSON Printing) is 3 (91 vs 33 ms) times slower. The float to string conversion is the bottleneck.

json-float-cmp

@mikkelfj
Copy link
Contributor

mikkelfj commented Aug 7, 2018

sprintf is actually even slower because the factor 3 is on the overall JSON printing that also does other work. So it is likely closer to x4-x6. The new Ryu algorithm should be a factor 3 faster than grisu3 on average, and with fewer outliers. So that would make it roughly 10x faster than sprintf, is my guess.

The Ryu author is very responsive so it is likely that we will see solution in the not so distant future. The core also appears stable with most changes to the Ryu code base being architectural. This means that a given snapshot of the code will likely be correct since Ryu has been verified with exhaustive search in 32-bit floats.

I suggest you follow progress on Ryu and make a test implementation with flatcc once formatting is supported. We can then use those findings to evaluate if a port to Ryu makes sense.

For reference, grisu3 would still be used for parsing and it has a dependency on strtod for a few percentage of cases, as I recall.

@mikkelfj
Copy link
Contributor

The Ryu project does not have immediate plans on supporting custom printing precision. Keeping this issue open as a reminder to check status in the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

2 participants