microbenchmarks for double<->string conversion, serialisation improvements #447
Conversation
Relating to the previous work in #422. Results match what was observed when that work was being done. std::from_chars is included as the target to beat, but since only MSVC has it for floating point it's not hugely useful yet. A uniform real distribution is probably a horrible choice, and it might be good to randomise the number of significant figures in each string as well (currently they all end up at max length).

Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
----------------------------------------------------------------------------------------------------------
Benchmark                                                             Time           CPU     Iterations
----------------------------------------------------------------------------------------------------------
RandFloats/double_from_string_sstream                               804 ns        820 ns         896000
RandFloats/double_from_string_strtod                                163 ns        162 ns        5973333
RandFloats/double_from_string_strtod_fixed                          175 ns        172 ns        5352107
RandFloats/double_from_string_strtod_fixed_const_ref                150 ns        152 ns        5352107
RandFloats/double_from_string_std_from_chars                       87.1 ns       88.3 ns        9557333
RandFloatsComma/double_from_string_strtod_fixed_comma_ref           172 ns        173 ns        5146257
RandFloatsComma/double_from_string_strtod_fixed_comma_const_ref     180 ns        175 ns        5352107
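For context, a minimal sketch of the two parsing approaches at the extremes of the table above, strtod and std::from_chars (illustrative code, not the actual benchmark bodies from this PR; the function names are made up):

```cpp
#include <charconv> // std::from_chars; floating-point overloads are MSVC-only at the time of writing
#include <cstdlib>  // std::strtod
#include <string>

// strtod is locale-dependent and needs a NUL-terminated buffer.
double parse_strtod(const std::string &s)
{
    char *end = nullptr;
    return std::strtod(s.c_str(), &end);
}

// std::from_chars is locale-independent and parses the buffer in place.
double parse_from_chars(const std::string &s)
{
    double value = 0.0;
    std::from_chars(s.data(), s.data() + s.size(), value);
    return value;
}
```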
* Input randomiser was feeding a constant value previously; it now actually randomises.
* Started the to_string benchmarks with the current method (sstream), a faster and more correct version (sstream_cached), snprintf, and std::to_chars.
  * NOTE: only std::to_chars and sstream_cached are correct in the face of locales.

Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                Time           CPU     Iterations
-------------------------------------------------------------------------------------------------------------
RandFloatStrs/double_from_string_sstream                              1012 ns       1001 ns         640000
RandFloatStrs/double_from_string_strtod                                276 ns        276 ns        2488889
RandFloatStrs/double_from_string_strtod_fixed                          312 ns        308 ns        2133333
RandFloatStrs/double_from_string_strtod_fixed_const_ref                307 ns        300 ns        2240000
RandFloatStrs/double_from_string_std_from_chars                        194 ns        188 ns        3733333
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_ref           315 ns        314 ns        2240000
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_const_ref     306 ns        305 ns        2357895
RandFloats/string_from_double_sstream                                 1372 ns       1381 ns         497778
RandFloats/string_from_double_sstream_cached                          1136 ns       1123 ns         640000
RandFloats/string_from_double_snprintf                                 536 ns        516 ns        1000000
RandFloats/string_from_double_std_to_chars                             116 ns        115 ns        6400000
Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                Time           CPU     Iterations
-------------------------------------------------------------------------------------------------------------
RandFloatStrs/double_from_string_sstream                               969 ns        977 ns         640000
RandFloatStrs/double_from_string_strtod                                274 ns        270 ns        2488889
RandFloatStrs/double_from_string_strtod_fixed                          273 ns        273 ns        2635294
RandFloatStrs/double_from_string_strtod_fixed_const_ref                274 ns        276 ns        2488889
RandFloatStrs/double_from_string_std_from_chars                        193 ns        193 ns        3733333
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_ref           273 ns        267 ns        2635294
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_const_ref     273 ns        273 ns        2635294
RandFloats/string_from_double_sstream                                 1323 ns       1311 ns         560000
RandFloats/string_from_double_sstream_cached                          1074 ns       1074 ns         640000
RandFloats/string_from_double_snprintf                                 519 ns        516 ns        1000000
RandFloats/string_from_double_snprintf_fixed                           517 ns        516 ns        1000000
RandFloats/string_from_double_std_to_chars                             118 ns        117 ns        5600000
RandFloatsComma/string_from_double_snprintf_fixed_comma                520 ns        516 ns        1120000
Serialisation ends up with roughly a 2x improvement going from sstream to snprintf.

Run on (4 X 3500 MHz CPU s)
CPU Caches:
  L1 Data 32K (x4)
  L1 Instruction 32K (x4)
  L2 Unified 262K (x4)
  L3 Unified 6291K (x1)
-------------------------------------------------------------------------------------------------------------
Benchmark                                                                Time           CPU     Iterations
-------------------------------------------------------------------------------------------------------------
RandFloatStrs/double_from_string_sstream                               968 ns        977 ns         640000
RandFloatStrs/double_from_string_strtod                                272 ns        270 ns        2488889
RandFloatStrs/double_from_string_strtod_fixed                          272 ns        270 ns        2488889
RandFloatStrs/double_from_string_strtod_fixed_const_ref                273 ns        270 ns        2488889
RandFloatStrs/double_from_string_std_from_chars                        193 ns        195 ns        3446154
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_ref           272 ns        273 ns        2635294
RandFloatCommaStrs/double_from_string_strtod_fixed_comma_const_ref     276 ns        273 ns        2635294
RandFloats/string_from_double_sstream                                 1311 ns       1318 ns         497778
RandFloats/string_from_double_sstream_cached                          1076 ns       1050 ns         640000
RandFloats/string_from_double_snprintf                                 601 ns        600 ns        1120000
RandFloats/string_from_double_snprintf_fixed                           600 ns        600 ns        1120000
RandFloats/string_from_double_std_to_chars                             117 ns        117 ns        5600000
RandFloatsComma/string_from_double_snprintf_fixed_comma                600 ns        600 ns        1120000
… work
User-defined copy operations suppress compiler creation of the move operations, and not having all of copy/move/dtor defined (rule of 0/5) is suspicious. It also happens to be very slightly slower.
Had a bit more of a play to see if I could identify any major slowdowns in the serialisation code (can be seen here). Hard work would be the best summary. Ignoring correctness somewhat to make things a bit easier to write, the "very-large.xlsx" benchmark was down to ~5.4s (master ~7.0s, this PR ~6.3s), and there were still some low-hanging items (the lowest/highest column/row functions are definitely worth avoiding). It's hard work though, because it's many relatively minor inefficiencies rather than one obvious issue (death by paper cut). Going to leave this as is; it quickly gets quite invasive.
This is looking good. Thanks for all the effort. I will have to trace through the code a bit more before merging. My only real concern would be breaking changes from removing some of those constructors. I added them for convenience--not sure how many people actually use those.
They're not actually removed, just left to the compiler / defaulted (they're all copy/assignment operations that do nothing special, so we can let the compiler generate them automatically). By letting the compiler generate them, you gain slightly in three aspects.
see rule of 5/0: https://en.cppreference.com/w/cpp/language/rule_of_three
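To illustrate the point (a sketch, not code from this PR; the type names are made up): a user-declared copy assignment suppresses the implicitly generated move operations, so would-be moves silently become copies, whereas defaulting everything (or declaring nothing at all) keeps them.

```cpp
#include <string>

// User-declared copy assignment: the compiler no longer generates the move
// constructor/assignment, so "moves" of this type quietly fall back to copies.
struct value_legacy
{
    std::string text;
    value_legacy &operator=(const value_legacy &other)
    {
        text = other.text;
        return *this;
    }
};

// Rule of 0/5: either declare none of the special members, or default all
// five. The compiler-generated moves stay available and the intent is clear.
struct value_defaulted
{
    std::string text;
    value_defaulted() = default;
    value_defaulted(const value_defaulted &) = default;
    value_defaulted(value_defaulted &&) = default;
    value_defaulted &operator=(const value_defaulted &) = default;
    value_defaulted &operator=(value_defaulted &&) = default;
    ~value_defaulted() = default;
};
```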
Gotcha. Thanks for the info. I've been writing mostly Go lately so my C++ is getting rusty. I'll try to merge this later today or tomorrow.
This is a continuation of some of the work I did for #421 / #422 a few months ago.
Adds a microbenchmark project (enabled by passing -DBENCHMARKS=ON -DXLNT_MICROBENCH_ENABLED=ON to CMake). It uses Google Benchmark. It is currently used to compare a variety of ways of (de)serialising numbers.
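For anyone unfamiliar with Google Benchmark, the harness for one of these cases looks roughly like the sketch below (assumed shape only, not the exact code added by this PR; the input generation and names are illustrative):

```cpp
#include <benchmark/benchmark.h>

#include <cstdlib>
#include <random>
#include <string>
#include <vector>

// Pre-generate random doubles as strings so only the parse itself is timed.
static std::vector<std::string> make_inputs(std::size_t count)
{
    std::mt19937_64 gen{42};
    std::uniform_real_distribution<double> dist{-1e6, 1e6};
    std::vector<std::string> inputs;
    inputs.reserve(count);
    for (std::size_t i = 0; i < count; ++i)
    {
        inputs.push_back(std::to_string(dist(gen)));
    }
    return inputs;
}

static void double_from_string_strtod(benchmark::State &state)
{
    const auto inputs = make_inputs(1000);
    std::size_t i = 0;
    for (auto _ : state)
    {
        char *end = nullptr;
        benchmark::DoNotOptimize(std::strtod(inputs[i].c_str(), &end));
        i = (i + 1) % inputs.size();
    }
}
BENCHMARK(double_from_string_strtod);

BENCHMARK_MAIN();
```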
Note: the std::<to/from>_chars and _comma benchmarks are currently ifdef'd behind _MSC_VER. This is because, at the time of writing, only MSVC's standard library ships the floating-point overloads of std::to_chars / std::from_chars.
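As a possible follow-up (not part of this PR), the guard could eventually move from _MSC_VER to the library feature-test macro, which is only supposed to be defined once the full to_chars/from_chars support, including floating point, is present:

```cpp
#include <charconv> // defines __cpp_lib_to_chars when elementary string conversions are fully implemented

// XLNT_BENCH_HAS_FP_CHARCONV is a hypothetical name for this sketch; the PR
// currently uses #ifdef _MSC_VER instead.
#if defined(__cpp_lib_to_chars)
#define XLNT_BENCH_HAS_FP_CHARCONV 1
#else
#define XLNT_BENCH_HAS_FP_CHARCONV 0
#endif
```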
The added benchmarks confirm that the changes made in #421 around deserialising numbers were justified.
Additionally, I added some benchmarks for the serialisation of numbers, which was not touched during the previous work. These show roughly a 2x improvement from using snprintf instead of stringstream, and then a further 5-6x improvement from using std::to_chars (>10x over stringstream). As such, I implemented the snprintf version in xlsx_producer, along with a few other tweaks based on some basic profiling. These tweaks reduced the time to save a spreadsheet to ~85-90% of current master.
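A minimal sketch of the snprintf-based serialisation idea (assumptions: the function name, buffer size, and %.17g precision are mine, not necessarily what xlsx_producer ends up using): format the double, then patch the decimal separator so the output stays correct if the C locale prints a comma.

```cpp
#include <algorithm>
#include <cstdio>
#include <string>

// Illustrative only. %.17g produces enough digits to round-trip a double;
// the replace keeps the output locale-independent, since xlsx requires '.'.
std::string string_from_double(double value)
{
    char buffer[32];
    const int length = std::snprintf(buffer, sizeof(buffer), "%.17g", value);
    std::string result(buffer, buffer + length);
    std::replace(result.begin(), result.end(), ',', '.');
    return result;
}
```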
I'm not certain the doubling of throughput from #421 can be recreated in the serialisation routines. The majority of the improvements would have to come from changes to how the xml serialiser is used, but unlike with deserialisation, there doesn't appear to be a major inefficiency in the current serialisation method.