Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

locale-independent num-to-str #378

Conversation

TurpentineDistillery
Copy link

This implements #362
In order to determine whether a floating-point string representation needs ".0" appended, we need to be able to examine the string content before dumping it into stream. One way of doing that is to reuse thread-local classic-locale-imbued stringstream, but that has additional overhead, and also clang on my mac does not seem to know about thread_local. So I went with a different approach - use snprintf and then fix the result to be POSIX-locale compliant (erase decimal grouping characters and replace decimal separator with '.').

Also, this gives 2x speedup to dump numbers/signed_ints.json benchmarks. Others are unchanged.

@coveralls
Copy link

Coverage Status

Coverage decreased (-0.1%) to 99.864% when pulling 509447b on TurpentineDistillery:feature/locale_independent_num_to_str into bc28942 on nlohmann:develop.

…ocales (localeconv) rather than std::locale
@coveralls
Copy link

Coverage Status

Coverage decreased (-0.3%) to 99.728% when pulling 738d462 on TurpentineDistillery:feature/locale_independent_num_to_str into bc28942 on nlohmann:develop.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 0193035 on TurpentineDistillery:feature/locale_independent_num_to_str into bc28942 on nlohmann:develop.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 65b9b0c on TurpentineDistillery:feature/locale_independent_num_to_str into bc28942 on nlohmann:develop.

@nlohmann
Copy link
Owner

nlohmann commented Jan 2, 2017

I shall have a look at the PR this week.

@nlohmann nlohmann self-assigned this Jan 2, 2017
@nlohmann nlohmann added this to the Release 2.1.0 milestone Jan 2, 2017
@nlohmann
Copy link
Owner

nlohmann commented Jan 4, 2017

Hey @TurpentineDistillery, thanks for the PR!

I can confirm some of your numbers - here is the output of the benchmark (mean of 100 runs):

Test develop this PR
dump jeopardy.json 508.225 ms 561.056 ms
dump jeopardy.json with indent 563.668 ms 599.749 ms
dump numbers/floats.json 729.318 ms 693.462 ms
dump numbers/signed_ints.json 255.533 ms 123.513 ms

There is a 2x speedup for integers, a slight speedup for float, but for some reason, the jeopardy-example is slower, which makes no real sense.


Edit: I reran the benchmarks and now have these times:

Test develop this PR
dump jeopardy.json 574.94 ms 569.415 ms

I think switching to Nonius is not a good idea...

@nlohmann nlohmann added the state: please discuss please discuss the issue or vote for your favorite option label Jan 4, 2017
@TurpentineDistillery
Copy link
Author

Can you modify benchmarking to use harmonic mean (as it should) instead of arithmetic mean? That should make benchmarking robust to outlier runs.

@coveralls
Copy link

Coverage Status

Coverage remained the same at 100.0% when pulling 9490610 on TurpentineDistillery:feature/locale_independent_num_to_str into 9f6c86f on nlohmann:develop.

@nlohmann
Copy link
Owner

nlohmann commented Jan 5, 2017

I ran another 100 repetitions. Here are the harmonic means:

Test develop this PR
dump jeopardy.json 569.3 ms 573.2 ms

@TurpentineDistillery
Copy link
Author

Based on the previous runs, this appears to be within the experimental error (maybe add CIs to the timing outputs?) . By the way, jeopardy.json contains no numbers, only strings.

@nlohmann
Copy link
Owner

nlohmann commented Jan 5, 2017

I know - that's why I am so puzzled that the numbers are so fragile...

@whackashoe
Copy link
Contributor

Try running perf stat -Bddd ./whatever and compare off a few runs of each?

On #337 I remember jeopardy-indented was consistently worse too but it was very slight.

@TurpentineDistillery
Copy link
Author

TurpentineDistillery commented Jan 6, 2017

Another strange observation:
Harmonic mean is biased toward the minimum of sample of values (arithmetic mean of a sample with non-zero variance is always larger). However, in the data above we see the opposite: the arithmetic mean for develop was 508.225 ms, and the harmonic mean was 569.3 ms, which is indicative of the latter execution being consistently slower - the computer was was doing something else during the latter benchmarking, causing context-switches or frequency-scaling, or IO-related slowdowns. Either that or a bug ; )

Somewhat related to benchmarking in general: https://stackoverflow.com/questions/9006596/is-the-unix-time-command-accurate-enough-for-benchmarks

@nlohmann
Copy link
Owner

nlohmann commented Jan 8, 2017

@TurpentineDistillery You are right - this all makes no sense. In the develop branch, I am using benchpress with some additional tweaks to avoid cache-related outliers. Then I had a look at nonius which has a nicer API and output options, but comes to these strange results.

I shall try Google Benchmark...

Edit: https://github.com/DigitalInBlue/Celero also sounds promising.

@nlohmann
Copy link
Owner

I added Google Benchmark to a feature branch and let the serialization benchmarks run (to is the code from the develop branch, bottom the code from this PR):

benchmarks

In this benchmark, the PR is consistently more than twice as fast as the code from develop! 👍


snprintf(m_buf.data(), m_buf.size(), fmt, x);

#if 0
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this #if 0 branch still relevant?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes - this is a here-be-dragons note to a future clever contributor (or code peruser) who'll think "let's use c++-flavor locales here!" Feel free to get rid of it.

@TurpentineDistillery
Copy link
Author

In this benchmark, the PR is consistently more than twice as fast as the code from develop!

Huh? How is that possible?? The only thing that's supposed to be faster is the writing of integer types that is done "by hand" - I'd expect others to be unaffected, or minimally affected.

@nlohmann
Copy link
Owner

I think the fact that the locale is not reset with every dump call may make a difference.

This code is not executed for every dump:

        // fix locale problems
        ss.imbue(std::locale::classic());

        // 6, 15 or 16 digits of precision allows round-trip IEEE 754
        // string->float->string, string->double->string or string->long
        // double->string; to be safe, we read this value from
        // std::numeric_limits<number_float_t>::digits10
        ss.precision(std::numeric_limits<double>::digits10);

@TurpentineDistillery
Copy link
Author

I have not observed similar speedup with running json_benchmarks, however.

@nlohmann
Copy link
Owner

I currently have no real way to produce reproducible benchmark results :(

@nlohmann
Copy link
Owner

Google Benchmark feels plausible, but I have no real way to verify the results.

@TurpentineDistillery
Copy link
Author

I ran my own 'ghetto' benchmarking, which is, as you might have expected by now, is in disagreement with all of the results obtained with various approaches above. It is about as simple as "hello world", measuring the time to dump json record in a loop until it totals 100000000 bytes (I didn't bother factoring out the bootstrap time to initially deserialize the record). It shows that PR is about the same as develop in all cases; maybe just marginally faster. I ran it a few times and got consistent numbers.

>>g++ --version
Configured with: --prefix=/Applications/Xcode.app/Contents/Developer/usr --with-gxx-include-dir=/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.10.sdk/usr/include/c++/4.2.1
Apple LLVM version 6.0 (clang-600.0.57) (based on LLVM 3.5svn)
Target: x86_64-apple-darwin13.4.0
Thread model: posix

>>cat main.cpp
#include "src/json.hpp"
#include <iostream>

int main()
{
    nlohmann::json j;
    std::cin >> j;
    while(true) std::cout << j;
    return 0;
}

>>git checkout develop
>>g++ -std=c++11 -O2 main.cpp -DNDEBUG -o a.out.develop

>>git checkout feature/locale_independent_num_to_str
>>g++ -std=c++11 -O2 main.cpp -DNDEBUG -o a.out.pr


>>for file in `find benchmarks/ -name *.json`; do echo $file; cat $file | time ./a.out.pr | head -c 100000000 >/dev/null;  done
benchmarks//files/jeopardy/jeopardy.json
Command terminated abnormally.
       13.02 real        12.83 user         0.18 sys
benchmarks//files/nativejson-benchmark/canada.json
Command terminated abnormally.
       11.97 real        11.94 user         0.02 sys
benchmarks//files/nativejson-benchmark/citm_catalog.json
Command terminated abnormally.
        8.78 real         8.76 user         0.02 sys
benchmarks//files/nativejson-benchmark/twitter.json
Command terminated abnormally.
        8.10 real         8.08 user         0.01 sys
benchmarks//files/numbers/floats.json
Command terminated abnormally.
       13.63 real        13.59 user         0.03 sys
benchmarks//files/numbers/signed_ints.json
Command terminated abnormally.
        9.44 real         9.41 user         0.02 sys
benchmarks//files/numbers/unsigned_ints.json
Command terminated abnormally.
        9.41 real         9.38 user         0.02 sys


>>for file in `find benchmarks/ -name *.json`; do echo $file; cat $file | time ./a.out.develop | head -c 100000000 >/dev/null;  done
benchmarks//files/jeopardy/jeopardy.json
Command terminated abnormally.
       13.05 real        12.85 user         0.18 sys
benchmarks//files/nativejson-benchmark/canada.json
Command terminated abnormally.
       14.25 real        14.22 user         0.02 sys
benchmarks//files/nativejson-benchmark/citm_catalog.json
Command terminated abnormally.
        9.48 real         9.46 user         0.02 sys
benchmarks//files/nativejson-benchmark/twitter.json
Command terminated abnormally.
        8.20 real         8.18 user         0.02 sys
benchmarks//files/numbers/floats.json
Command terminated abnormally.
       14.75 real        14.72 user         0.03 sys
benchmarks//files/numbers/signed_ints.json
Command terminated abnormally.
       10.59 real        10.56 user         0.03 sys
benchmarks//files/numbers/unsigned_ints.json
Command terminated abnormally.
       10.52 real        10.49 user         0.03 sys

@TurpentineDistillery
Copy link
Author

I think the fact that the locale is not reset with every dump call may make a difference.

The profiler did not identify these as hot-spots. The locale instantiation (prior to using classic) was super expensive, if you remember, but imbueing and setting precision once per dump was not expensive.

@nlohmann
Copy link
Owner

Sigh... I had another look at the benchmarks. Apart of changing the header file (develop/PR), the numbers stay the same

$ ./benchmark-develop --benchmark_filter="dump.*" --benchmark_repetitions=30 --benchmark_report_aggregates_only=true
Run on (8 X 2900 MHz CPU s)
2017-01-11 19:34:34
Benchmark                                          Time           CPU Iterations
--------------------------------------------------------------------------------
dump data/jeopardy/jeopardy.json_mean            271 ns        270 ns    2625706   14.1525MB/s normal
dump data/jeopardy/jeopardy.json_stddev           11 ns         10 ns          0   566.813kB/s normal
dump data/jeopardy/jeopardy.json_mean            267 ns        267 ns    2537979   14.3072MB/s pretty
dump data/jeopardy/jeopardy.json_stddev           10 ns         10 ns          0   566.589kB/s pretty
dump data/numbers/floats.json_mean               270 ns        270 ns    2643914   14.1578MB/s normal
dump data/numbers/floats.json_stddev              10 ns         10 ns          0   563.347kB/s normal
dump data/numbers/floats.json_mean               267 ns        266 ns    2761069   14.3434MB/s pretty
dump data/numbers/floats.json_stddev              12 ns         12 ns          0    667.08kB/s pretty
dump data/numbers/signed_ints.json_mean          264 ns        264 ns    2781078   14.4766MB/s normal
dump data/numbers/signed_ints.json_stddev         13 ns         12 ns          0   708.379kB/s normal
dump data/numbers/signed_ints.json_mean          268 ns        267 ns    2760285   14.3026MB/s pretty
dump data/numbers/signed_ints.json_stddev         12 ns         12 ns          0   665.562kB/s pretty

$ ./benchmark-pr --benchmark_filter="dump.*" --benchmark_repetitions=30 --benchmark_report_aggregates_only=true
Run on (8 X 2900 MHz CPU s)
2017-01-11 19:39:48
Benchmark                                          Time           CPU Iterations
--------------------------------------------------------------------------------
dump data/jeopardy/jeopardy.json_mean            116 ns        116 ns    5679375   33.0805MB/s normal
dump data/jeopardy/jeopardy.json_stddev            6 ns          6 ns          0   1.64533MB/s normal
dump data/jeopardy/jeopardy.json_mean            118 ns        118 ns    5818545   32.3685MB/s pretty
dump data/jeopardy/jeopardy.json_stddev            3 ns          3 ns          0   951.512kB/s pretty
dump data/numbers/floats.json_mean               120 ns        119 ns    5542447   31.9539MB/s normal
dump data/numbers/floats.json_stddev               2 ns          2 ns          0   496.115kB/s normal
dump data/numbers/floats.json_mean               119 ns        119 ns    6097083   32.1939MB/s pretty
dump data/numbers/floats.json_stddev               1 ns          1 ns          0   401.886kB/s pretty
dump data/numbers/signed_ints.json_mean          119 ns        118 ns    6104687   32.2889MB/s normal
dump data/numbers/signed_ints.json_stddev          2 ns          2 ns          0   610.253kB/s normal
dump data/numbers/signed_ints.json_mean          118 ns        118 ns    5893595   32.3966MB/s pretty
dump data/numbers/signed_ints.json_stddev          3 ns          3 ns          0   816.329kB/s pretty

@TurpentineDistillery
Copy link
Author

Well, at least we can conclusively say that PR is not worse performance-wise : )

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
state: please discuss please discuss the issue or vote for your favorite option
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants