Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

microbenchmarks for double<->string conversion, serialisation improvements #447

Merged
merged 13 commits into from
Mar 20, 2020
Merged

microbenchmarks for double<->string conversion, serialisation improvements #447

merged 13 commits into from
Mar 20, 2020

Conversation

Crzyrndm
Copy link
Collaborator

@Crzyrndm Crzyrndm commented Mar 1, 2020

This is a continuation of some of the work I did for #421 / #422 a few months ago

Adds a microbenchmark project (enabled by passing -DBENCHMARKS=ON -DXLNT_MICROBENCH_ENABLED=ON to cmake). It uses google banchmark. Currently used to compare a variety of ways of (de)serialising numbers.

  • gbenchmark dependency aquired using cmake FetchContent module (requires cmake 3.11 as indicated by the script). find_package is nearly useless on windows, so it was the most hassle free method for a "developer only" tool I could find. I'm open to suggestions on acquisition method

image
Note: the std::<to/from>_chars and _comma benchmarks are currently ifdef'd behind _MSC_VER. This is because

  • to_chars/from_chars only exists in full in MSVC (to my knowledge)
  • the locale names are different between OS's and the _comma benchmarks use the german locale "de-DE" which uses the ',' as the decimal seperator to benchmark the alternate path in the strtod/snprintf implementations. Not entirely sure how to deal with this across operating systems (I believe Ubuntu requires installing a locale package and the names are different)

The added benchmarks prove that the changes made in #421 around deserialising numbers were justified

  • stringstream is slow
  • replacing the comma has minmal impact on perf (<10%)
  • conversion ended up 3-4x faster using strtod
  • std::from_chars is currently only a "relatively" minor improvement (30-50% further improvement in throughput up to ~5x stringstream) dwarfed by other factors during file load

Additionally, I added some benchmarks for the serialisation of numbers which was not touched during the previous work. These show a roughly 2x improvement in using snprintf instead of stringstream and then a further 5-6x improvement using std::to_chars (>10x stringstream). As such I implemented the snprintf version in xlsx_producer along with a few other tweaks based on some basic profiling. These tweaks reduced time to save a spreadsheet to ~85-90% of current master.

I'm not certain the doubling of throughput from #421 can be recreated in the serialisation routines. Majority of the improvements would have to come from changes to how the xml serialiser is used, but unlike with deserialisation, there doesn't appear to be major inefficiency with the current serialisation method

relating to previous work #422
Results are matching what was observed at the time ^^ was being worked on
std::from_chars is included as the target to beat, but since only MSVC has it for floating point it's not
hugely useful yet

uniform real distribution is probably a horrible choice, and it might be good to randomise the number
of sf in each string also (currently the y all end up at max length)

Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
----------------------------------------------------------------------------------------------------------
Benchmark                                                                Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------
RandFloats/double_from_string_sstream                                  804 ns          820 ns       896000
RandFloats/double_from_string_strtod                                   163 ns          162 ns      5973333
RandFloats/double_from_string_strtod_fixed                             175 ns          172 ns      5352107
RandFloats/double_from_string_strtod_fixed_const_ref                   150 ns          152 ns      5352107
RandFloats/double_from_string_std_from_chars                          87.1 ns         88.3 ns      9557333
RandFloatsComma/double_from_string_strtod_fixed_comma_ref              172 ns          173 ns      5146257
RandFloatsComma/double_from_string_strtod_fixed_comma_const_ref        180 ns          175 ns      5352107
* input randomiser was feeding a constant value previously, now actually randomising
* start to_string with the current method (sstream), an faster more correct version (sstream_cached), snprintf, and std::to_chars
** NOTE: only std::to_chars and sstream_cached are correct in the face of locales

Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
RandFloatStrs/double_from_string_sstream                                 1012 ns         1001 ns       640000
RandFloatStrs/double_from_string_strtod                                   276 ns          276 ns      2488889
RandFloatStrs/double_from_string_strtod_fixed                             312 ns          308 ns      2133333
RandFloatStrs/double_from_string_strtod_fixed_const_ref                   307 ns          300 ns      2240000
RandFloatStrs/double_from_string_std_from_chars                           194 ns          188 ns      3733333
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_ref              315 ns          314 ns      2240000
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_const_ref        306 ns          305 ns      2357895
RandFloats/string_from_double_sstream                                    1372 ns         1381 ns       497778
RandFloats/string_from_double_sstream_cached                             1136 ns         1123 ns       640000
RandFloats/string_from_double_snprintf                                    536 ns          516 ns      1000000
RandFloats/string_from_double_std_to_chars                                116 ns          115 ns      6400000
Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
RandFloatStrs/double_from_string_sstream                                  969 ns          977 ns       640000
RandFloatStrs/double_from_string_strtod                                   274 ns          270 ns      2488889
RandFloatStrs/double_from_string_strtod_fixed                             273 ns          273 ns      2635294
RandFloatStrs/double_from_string_strtod_fixed_const_ref                   274 ns          276 ns      2488889
RandFloatStrs/double_from_string_std_from_chars                           193 ns          193 ns      3733333
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_ref              273 ns          267 ns      2635294
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_const_ref        273 ns          273 ns      2635294
RandFloats/string_from_double_sstream                                    1323 ns         1311 ns       560000
RandFloats/string_from_double_sstream_cached                             1074 ns         1074 ns       640000
RandFloats/string_from_double_snprintf                                    519 ns          516 ns      1000000
RandFloats/string_from_double_snprintf_fixed                              517 ns          516 ns      1000000
RandFloats/string_from_double_std_to_chars                                118 ns          117 ns      5600000
RandFloatsComma/string_from_double_snprintf_fixed_comma                   520 ns          516 ns      1120000
serialisation ends up roughly 2x improvement going from sstream to snprintf

Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                   Time             CPU   Iterations
-------------------------------------------------------------------------------------------------------------
RandFloatStrs/double_from_string_sstream                                  968 ns          977 ns       640000
RandFloatStrs/double_from_string_strtod                                   272 ns          270 ns      2488889
RandFloatStrs/double_from_string_strtod_fixed                             272 ns          270 ns      2488889
RandFloatStrs/double_from_string_strtod_fixed_const_ref                   273 ns          270 ns      2488889
RandFloatStrs/double_from_string_std_from_chars                           193 ns          195 ns      3446154
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_ref              272 ns          273 ns      2635294
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_const_ref        276 ns          273 ns      2635294
RandFloats/string_from_double_sstream                                    1311 ns         1318 ns       497778
RandFloats/string_from_double_sstream_cached                             1076 ns         1050 ns       640000
RandFloats/string_from_double_snprintf                                    601 ns          600 ns      1120000
RandFloats/string_from_double_snprintf_fixed                              600 ns          600 ns      1120000
RandFloats/string_from_double_std_to_chars                                117 ns          117 ns      5600000
RandFloatsComma/string_from_double_snprintf_fixed_comma                   600 ns          600 ns      1120000
… work

user defined copy operators suppress compiler creation of move operations, and not having all of copy/move/dtor
defined (rule of 0/5) is suspicious. Also happens to be very slightly slower
@Crzyrndm Crzyrndm requested a review from tfussell March 1, 2020 11:05
@coveralls
Copy link

coveralls commented Mar 1, 2020

Coverage Status

Coverage decreased (-0.04%) to 83.572% when pulling f4d00ac on Crzyrndm:feature/benchmark into ae6f9d2 on tfussell:master.

@Crzyrndm
Copy link
Collaborator Author

Crzyrndm commented Mar 2, 2020

Had a bit more of a play to see if I could identify any major slowdowns in the serialisation code (can be seen here).

Hard work would be the best summary. Ignoring correctness somewhat to make things a bit easier to write, the "very-large.xlsx" benchmark was down to ~5.4s (master ~7.0s, this PR ~6.3s) and there was still some low hanging items (lowest/highest column/row functions are definitely worth avoiding). It's hard work though because it's many relatively minor inefficiencies rather than one obvious issue (death by paper cut).

Going to leave this as is, it quickly gets quite invasive

@tfussell
Copy link
Owner

This is looking good. Thanks for all the effort. I will have to trace through the code a bit more before merging. My only real concern would be breaking changes from removing some of those constructors. I added them for convenience--not sure how many people actually use those.

@Crzyrndm
Copy link
Collaborator Author

Crzyrndm commented Mar 14, 2020

They're not actually removed, just left to the compiler / defaulted (they're all copy / assignment operations that do nothing special so can let the compiler generate them automatically)

By letting the compiler generate them, you gain slightly in three aspects

  • Generation of move operations (ctor/assignment) are suppressed if copy operations are user defined. This is because any compiler defined operation is assumed broken (for back compat reasons, the reverse is not true, but is deprecated)
  • The compiler / optimiser is now free to implement copy operations as it sees fit
  • It's easy to guarantee correctness of code that doesn't exist 👍

see rule of 5/0: https://en.cppreference.com/w/cpp/language/rule_of_three

  • was rule of 3/0 before C++11 added move ctors. The summary is that you either want to define all or none of the special member functions

tl;dr
No functional change, just a consistency improvement

PS
This doesn't fully resolve localisation issues during serialisation. The previous state would write numberic values in the global locale so it is an improvement, but numeric attributes need to be updated as well. I figure that should come as a second PR to minimise scope creep.

@tfussell
Copy link
Owner

Gotcha. Thanks for the info. I've been writing mostly Go lately so my C++ is getting rusty. I'll try to merge this later today or tomorrow.

@tfussell tfussell merged commit 2f5934f into tfussell:master Mar 20, 2020
@Crzyrndm Crzyrndm deleted the feature/benchmark branch April 24, 2020 22:58
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants