Improve performance of expression evaluation #9210

Merged
merged 14 commits into from
Sep 16, 2021

Conversation

vyasr
Contributor

@vyasr vyasr commented Sep 9, 2021

This PR reworks some internals of expression evaluation to improve performance. The largest gains come from passing device data references down the call stack by reference rather than by value. The nullable kernel template experiences significantly higher register pressure, and these changes do not seem to be as effective at increasing occupancy on the benchmarks with null data. In general, though, we see performance improvements across the board for non-null data and in some cases for nullable data, with improvements ranging up to 40%.

This PR also does some minor cleanup: removing some unused functions, replacing __device__ with CUDA_DEVICE_CALLABLE to ensure compatibility with host compilers, and fixing the templating of various functions to ensure proper usage of CRTP. These changes are intended to facilitate a future redesign of the internals of the device_data_references to reduce the depth of these call stacks, simplify the code, and reduce register pressure.
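
To illustrate the three ideas in the description (pass by reference, a host-compatible device macro, and CRTP), here is a minimal host-compilable sketch. It is not the code from this PR: `SKETCH_DEVICE_CALLABLE`, `data_reference_sketch`, `evaluator_base`, and `add_offset_evaluator` are hypothetical names invented for the example, and the real `device_data_reference` and evaluator types differ.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a macro in the spirit of CUDA_DEVICE_CALLABLE:
// the device qualifier is only emitted when a CUDA compiler is in use, so the
// same header still builds with a plain host compiler.
#ifdef __CUDACC__
#define SKETCH_DEVICE_CALLABLE __device__ inline
#else
#define SKETCH_DEVICE_CALLABLE inline
#endif

// Hypothetical operand descriptor (assumption: two small fields are enough
// to show the idea; the real descriptor carries more information).
struct data_reference_sketch {
  std::size_t column_index;  // which input column to read
  int offset;                // constant added to the value that is read
};

// CRTP base: visit() forwards to the derived class without virtual dispatch.
// The descriptor is taken by const reference, so each level of the call stack
// receives a reference instead of its own copy of the struct.
template <typename Derived>
struct evaluator_base {
  SKETCH_DEVICE_CALLABLE int visit(data_reference_sketch const& ref,
                                   int const* const* columns,
                                   std::size_t row) const
  {
    return static_cast<Derived const&>(*this).resolve(ref, columns, row);
  }
};

// Derived evaluator: reads one element and applies the descriptor's offset.
struct add_offset_evaluator : evaluator_base<add_offset_evaluator> {
  SKETCH_DEVICE_CALLABLE int resolve(data_reference_sketch const& ref,
                                     int const* const* columns,
                                     std::size_t row) const
  {
    assert(columns != nullptr);
    return columns[ref.column_index][row] + ref.offset;
  }
};
```

Calling `add_offset_evaluator{}.visit(ref, columns, row)` resolves statically to `add_offset_evaluator::resolve`. Whether passing by reference actually lowers register usage in a given kernel depends on what the compiler inlines, so any gain should be confirmed with benchmarks like the ones in this PR.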

@vyasr vyasr requested a review from a team as a code owner September 9, 2021 22:39
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Sep 9, 2021
@vyasr vyasr self-assigned this Sep 9, 2021
@vyasr vyasr added 3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Performance Performance related issue labels Sep 9, 2021
@vyasr vyasr added this to the Conditional Joins milestone Sep 9, 2021
@vyasr
Contributor Author

vyasr commented Sep 9, 2021

Here are some benchmarks.

Conditional joins on a Tesla T4:

Benchmarks

Before:

----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time                   996 ms          996 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time                  3368 ms         3368 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time                 8386 ms         8385 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/100000/manual_time                   877 ms          877 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/400000/manual_time                  3505 ms         3505 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time                 8713 ms         8713 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/100000/manual_time            1591 ms         1591 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/400000/manual_time            6008 ms         6008 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time          14905 ms        14904 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/100000/manual_time            1644 ms         1644 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/400000/manual_time            6241 ms         6241 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time          15555 ms        15554 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/100000/manual_time                    876 ms          876 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/400000/manual_time                   3503 ms         3503 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/1000000/manual_time                  8662 ms         8662 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/100000/manual_time                    901 ms          901 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/400000/manual_time                   3604 ms         3604 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/1000000/manual_time                  8928 ms         8927 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/100000/manual_time             1621 ms         1621 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/400000/manual_time             6149 ms         6148 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/1000000/manual_time           15182 ms        15182 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/100000/manual_time             1654 ms         1654 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/400000/manual_time             6303 ms         6303 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/1000000/manual_time           15621 ms        15621 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/100000/manual_time                    877 ms          877 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/400000/manual_time                   3510 ms         3510 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/1000000/manual_time                  8661 ms         8661 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/100000/manual_time                    903 ms          903 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/400000/manual_time                   3599 ms         3599 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/1000000/manual_time                  8900 ms         8900 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/100000/manual_time             1617 ms         1617 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/400000/manual_time             6116 ms         6115 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/1000000/manual_time           15083 ms        15082 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/100000/manual_time             1648 ms         1648 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/400000/manual_time             6269 ms         6269 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/1000000/manual_time           15527 ms        15526 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time               872 ms          872 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time              3484 ms         3484 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time             8612 ms         8611 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time               897 ms          897 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time              3579 ms         3579 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time             8857 ms         8856 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time        1617 ms         1617 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time        6077 ms         6076 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time      14981 ms        14980 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time        1630 ms         1630 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time        6250 ms         6250 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time      15428 ms        15427 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time               869 ms          869 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time              3475 ms         3475 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time             8593 ms         8592 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time               895 ms          895 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time              3567 ms         3567 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time             8830 ms         8830 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time        1601 ms         1601 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time        6063 ms         6063 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time      14941 ms        14941 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time        1640 ms         1640 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time        6220 ms         6220 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time      15418 ms        15417 ms            1

After:

----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time                   818 ms          818 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time                  2465 ms         2465 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time                 6088 ms         6088 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/100000/manual_time                   677 ms          677 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/400000/manual_time                  2541 ms         2541 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time                 6305 ms         6305 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/100000/manual_time            1470 ms         1470 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/400000/manual_time            5584 ms         5584 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time          13852 ms        13851 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/100000/manual_time            1601 ms         1601 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/400000/manual_time            6034 ms         6033 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time          15085 ms        15085 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/100000/manual_time                    690 ms          690 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/400000/manual_time                   2607 ms         2607 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/1000000/manual_time                  6431 ms         6431 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/100000/manual_time                    709 ms          710 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/400000/manual_time                   2679 ms         2679 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/1000000/manual_time                  6629 ms         6629 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/100000/manual_time             1563 ms         1563 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/400000/manual_time             5856 ms         5856 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/1000000/manual_time           14455 ms        14454 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/100000/manual_time             1655 ms         1655 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/400000/manual_time             6258 ms         6257 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/1000000/manual_time           15513 ms        15512 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/100000/manual_time                    705 ms          705 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/400000/manual_time                   2687 ms         2687 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/1000000/manual_time                  6602 ms         6602 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/100000/manual_time                    723 ms          723 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/400000/manual_time                   2746 ms         2746 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/1000000/manual_time                  6779 ms         6779 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/100000/manual_time             1579 ms         1579 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/400000/manual_time             5968 ms         5967 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/1000000/manual_time           14658 ms        14657 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/100000/manual_time             1670 ms         1670 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/400000/manual_time             6338 ms         6338 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/1000000/manual_time           15621 ms        15621 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time               708 ms          708 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time              2691 ms         2691 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time             6623 ms         6623 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time               725 ms          725 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time              2752 ms         2752 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time             6781 ms         6781 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time        1575 ms         1575 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time        5944 ms         5944 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time      14639 ms        14638 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time        1667 ms         1667 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time        6322 ms         6322 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time      15561 ms        15558 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time               708 ms          708 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time              2683 ms         2683 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time             6596 ms         6595 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time               722 ms          722 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time              2739 ms         2739 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time             6755 ms         6755 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time        1569 ms         1569 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time        5944 ms         5943 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time      14550 ms        14550 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time        1659 ms         1659 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time        6273 ms         6273 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time      15508 ms        15508 ms            1
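
As a rough aid for reading the conditional join tables above, the relative change for any row can be computed from its before and after times. The helpers below are a hypothetical sketch (`percent_reduction` and `speedup` are names invented here, not part of the benchmark suite):

```cpp
#include <cassert>

// Percentage reduction in run time going from before_ms to after_ms.
// Positive means the "after" run is faster; negative means it is slower.
inline double percent_reduction(double before_ms, double after_ms)
{
  assert(before_ms > 0.0);
  return 100.0 * (before_ms - after_ms) / before_ms;
}

// Speedup factor: how many times faster the "after" run is than the "before" run.
inline double speedup(double before_ms, double after_ms)
{
  assert(after_ms > 0.0);
  return before_ms / after_ms;
}
```

For example, conditional_inner_join_32bit at 100000/1000000 goes from 8386 ms to 6088 ms, a reduction of roughly 27%, while the corresponding conditional_inner_join_32bit_nulls case goes from 14905 ms to 13852 ms, roughly 7%. Some nullable 64-bit cases, such as conditional_left_join_64bit_nulls at 100000/100000 (1654 ms to 1655 ms), are essentially unchanged.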

compute_column on a Tesla T4:

Benchmarks

Before:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/1/manual_time               0.024 ms        0.048 ms        29405 bytes_per_second=31.6867G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/5/manual_time               0.038 ms        0.063 ms        18501 bytes_per_second=59.0205G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/10/manual_time              0.057 ms        0.076 ms        12347 bytes_per_second=72.29G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/1/manual_time              0.073 ms        0.094 ms         9631 bytes_per_second=102.042G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/5/manual_time              0.211 ms        0.231 ms         3355 bytes_per_second=106.132G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/10/manual_time             0.387 ms        0.408 ms         1820 bytes_per_second=105.755G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/1/manual_time             0.595 ms        0.621 ms         1250 bytes_per_second=125.245G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/5/manual_time              2.00 ms         2.03 ms          356 bytes_per_second=111.526G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/10/manual_time             3.81 ms         3.83 ms          186 bytes_per_second=107.692G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/1/manual_time             5.96 ms         5.98 ms          128 bytes_per_second=125.078G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/5/manual_time             21.9 ms         22.0 ms           28 bytes_per_second=101.913G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/10/manual_time            43.2 ms         43.3 ms           14 bytes_per_second=94.7566G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/1/manual_time                 0.021 ms        0.047 ms        32567 bytes_per_second=34.6683G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/5/manual_time                 0.032 ms        0.057 ms        22250 bytes_per_second=70.8781G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/10/manual_time                0.044 ms        0.069 ms        15982 bytes_per_second=93.7077G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/1/manual_time                0.062 ms        0.083 ms        11531 bytes_per_second=120.126G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/5/manual_time                0.166 ms        0.189 ms         4279 bytes_per_second=134.252G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/10/manual_time               0.299 ms        0.321 ms         2366 bytes_per_second=137.043G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/1/manual_time               0.513 ms        0.541 ms         1000 bytes_per_second=145.206G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/5/manual_time                1.59 ms         1.61 ms          462 bytes_per_second=140.934G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/10/manual_time               2.86 ms         2.89 ms          247 bytes_per_second=143.212G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/1/manual_time               5.20 ms         5.23 ms          100 bytes_per_second=143.171G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/5/manual_time               16.1 ms         16.1 ms           45 bytes_per_second=139.148G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/10/manual_time              29.1 ms         29.1 ms           24 bytes_per_second=141.032G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/1/manual_time               0.028 ms        0.051 ms        25047 bytes_per_second=53.5262G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/5/manual_time               0.048 ms        0.068 ms        14671 bytes_per_second=93.6405G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/10/manual_time              0.069 ms        0.090 ms        10152 bytes_per_second=118.549G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/1/manual_time              0.113 ms        0.141 ms         6212 bytes_per_second=132.294G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/5/manual_time              0.257 ms        0.281 ms         2730 bytes_per_second=173.795G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/10/manual_time             0.460 ms        0.484 ms         1525 bytes_per_second=178.184G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/1/manual_time             0.950 ms        0.978 ms          733 bytes_per_second=156.803G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/5/manual_time              2.43 ms         2.45 ms          291 bytes_per_second=184.079G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/10/manual_time             4.50 ms         4.52 ms          156 bytes_per_second=182.24G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/1/manual_time             9.34 ms         9.37 ms           75 bytes_per_second=159.456G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/5/manual_time             26.0 ms         26.0 ms           25 bytes_per_second=172.051G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/10/manual_time            51.6 ms         51.6 ms           10 bytes_per_second=158.816G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/1/manual_time          0.151 ms        0.177 ms         4644 bytes_per_second=4.94385G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/5/manual_time          0.182 ms        0.208 ms         3841 bytes_per_second=12.2687G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/10/manual_time         0.217 ms        0.242 ms         3225 bytes_per_second=18.8557G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/1/manual_time          1.38 ms         1.41 ms          505 bytes_per_second=5.38255G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/5/manual_time          1.65 ms         1.68 ms          424 bytes_per_second=13.5302G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/10/manual_time         1.77 ms         1.80 ms          395 bytes_per_second=23.1168G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/1/manual_time         13.8 ms         13.8 ms           51 bytes_per_second=5.41347G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/5/manual_time         16.4 ms         16.4 ms           43 bytes_per_second=13.6399G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/10/manual_time        17.2 ms         17.2 ms           41 bytes_per_second=23.7978G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/1/manual_time         139 ms          139 ms            5 bytes_per_second=5.37103G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/5/manual_time         165 ms          165 ms            4 bytes_per_second=13.5175G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/10/manual_time        182 ms          182 ms            3 bytes_per_second=22.5238G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/1/manual_time            0.147 ms        0.174 ms         4745 bytes_per_second=5.0547G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/5/manual_time            0.169 ms        0.195 ms         4138 bytes_per_second=13.2076G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/10/manual_time           0.192 ms        0.218 ms         3653 bytes_per_second=21.3304G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/1/manual_time            1.35 ms         1.38 ms          517 bytes_per_second=5.50975G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/5/manual_time            1.48 ms         1.51 ms          472 bytes_per_second=15.0832G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/10/manual_time           1.52 ms         1.55 ms          462 bytes_per_second=26.9769G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/1/manual_time           13.4 ms         13.4 ms           52 bytes_per_second=5.55305G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/5/manual_time           14.6 ms         14.6 ms           48 bytes_per_second=15.3218G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/10/manual_time          14.7 ms         14.7 ms           48 bytes_per_second=27.8853G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/1/manual_time           135 ms          135 ms            5 bytes_per_second=5.52185G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/5/manual_time           147 ms          147 ms            5 bytes_per_second=15.2194G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/10/manual_time          153 ms          153 ms            4 bytes_per_second=26.8201G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/1/manual_time          0.157 ms        0.184 ms         4458 bytes_per_second=9.46888G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/5/manual_time          0.195 ms        0.221 ms         3587 bytes_per_second=22.8932G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/10/manual_time         0.236 ms        0.259 ms         2971 bytes_per_second=34.7536G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/1/manual_time          1.44 ms         1.47 ms          485 bytes_per_second=10.3147G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/5/manual_time          1.78 ms         1.80 ms          394 bytes_per_second=25.1771G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/10/manual_time         1.94 ms         1.96 ms          361 bytes_per_second=42.2891G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/1/manual_time         14.4 ms         14.4 ms           49 bytes_per_second=10.3804G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/5/manual_time         17.7 ms         17.7 ms           40 bytes_per_second=25.3129G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/10/manual_time        18.7 ms         18.7 ms           37 bytes_per_second=43.8417G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/1/manual_time         144 ms          144 ms            5 bytes_per_second=10.335G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/5/manual_time         179 ms          179 ms            4 bytes_per_second=24.9982G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/10/manual_time        204 ms          204 ms            3 bytes_per_second=40.1968G/s

After:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/1/manual_time               0.022 ms        0.047 ms        31868 bytes_per_second=33.6156G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/5/manual_time               0.035 ms        0.060 ms        20084 bytes_per_second=63.4536G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/10/manual_time              0.053 ms        0.072 ms        13094 bytes_per_second=76.6553G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/1/manual_time              0.065 ms        0.086 ms        10752 bytes_per_second=114.595G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/5/manual_time              0.181 ms        0.201 ms         3976 bytes_per_second=123.401G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/10/manual_time             0.339 ms        0.360 ms         2089 bytes_per_second=121.034G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/1/manual_time             0.491 ms        0.516 ms         1428 bytes_per_second=151.831G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/5/manual_time              1.73 ms         1.75 ms          419 bytes_per_second=129.351G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/10/manual_time             3.33 ms         3.35 ms          211 bytes_per_second=122.954G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/1/manual_time             4.80 ms         4.82 ms          146 bytes_per_second=155.305G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/5/manual_time             17.0 ms         17.0 ms           33 bytes_per_second=131.383G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/10/manual_time            37.5 ms         37.5 ms           17 bytes_per_second=109.232G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/1/manual_time                 0.020 ms        0.045 ms        34611 bytes_per_second=36.9208G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/5/manual_time                 0.030 ms        0.054 ms        23542 bytes_per_second=75.0012G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/10/manual_time                0.041 ms        0.066 ms        16841 bytes_per_second=98.8923G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/1/manual_time                0.053 ms        0.073 ms        13497 bytes_per_second=141.05G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/5/manual_time                0.145 ms        0.166 ms         4916 bytes_per_second=153.814G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/10/manual_time               0.264 ms        0.285 ms         2689 bytes_per_second=155.244G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/1/manual_time               0.426 ms        0.454 ms         1640 bytes_per_second=174.823G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/5/manual_time                1.40 ms         1.43 ms          516 bytes_per_second=159.786G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/10/manual_time               2.58 ms         2.61 ms          275 bytes_per_second=158.773G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/1/manual_time               4.33 ms         4.35 ms          163 bytes_per_second=172.247G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/5/manual_time               14.2 ms         14.2 ms           51 bytes_per_second=157.504G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/10/manual_time              26.2 ms         26.3 ms           27 bytes_per_second=156.204G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/1/manual_time               0.027 ms        0.050 ms        25665 bytes_per_second=55.2473G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/5/manual_time               0.044 ms        0.063 ms        15868 bytes_per_second=101.172G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/10/manual_time              0.067 ms        0.087 ms        10454 bytes_per_second=122.134G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/1/manual_time              0.110 ms        0.137 ms         6343 bytes_per_second=135.267G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/5/manual_time              0.242 ms        0.264 ms         2893 bytes_per_second=184.582G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/10/manual_time             0.420 ms        0.442 ms         1665 bytes_per_second=195.302G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/1/manual_time             0.954 ms        0.980 ms          731 bytes_per_second=156.198G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/5/manual_time              2.22 ms         2.25 ms          316 bytes_per_second=200.999G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/10/manual_time             4.07 ms         4.09 ms          173 bytes_per_second=201.413G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/1/manual_time             9.38 ms         9.41 ms           74 bytes_per_second=158.858G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/5/manual_time             23.8 ms         23.9 ms           29 bytes_per_second=187.562G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/10/manual_time            46.8 ms         46.8 ms           11 bytes_per_second=175.204G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/1/manual_time          0.124 ms        0.150 ms         5643 bytes_per_second=6.00484G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/5/manual_time          0.145 ms        0.171 ms         4830 bytes_per_second=15.439G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/10/manual_time         0.174 ms        0.199 ms         4044 bytes_per_second=23.6024G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/1/manual_time          1.12 ms         1.15 ms          623 bytes_per_second=6.62801G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/5/manual_time          1.32 ms         1.34 ms          531 bytes_per_second=16.9638G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/10/manual_time         1.44 ms         1.47 ms          485 bytes_per_second=28.395G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/1/manual_time         11.1 ms         11.2 ms           63 bytes_per_second=6.69174G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/5/manual_time         13.1 ms         13.1 ms           53 bytes_per_second=17.0651G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/10/manual_time        14.1 ms         14.1 ms           50 bytes_per_second=29.1228G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/1/manual_time         112 ms          112 ms            6 bytes_per_second=6.67213G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/5/manual_time         131 ms          131 ms            5 bytes_per_second=17.0685G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/10/manual_time        147 ms          147 ms            4 bytes_per_second=27.788G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/1/manual_time            0.123 ms        0.148 ms         5707 bytes_per_second=6.07466G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/5/manual_time            0.133 ms        0.159 ms         5280 bytes_per_second=16.7961G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/10/manual_time           0.149 ms        0.174 ms         4717 bytes_per_second=27.5107G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/1/manual_time            1.09 ms         1.12 ms          641 bytes_per_second=6.82037G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/5/manual_time            1.14 ms         1.17 ms          612 bytes_per_second=19.5383G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/10/manual_time           1.19 ms         1.22 ms          591 bytes_per_second=34.4335G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/1/manual_time           10.8 ms         10.8 ms           65 bytes_per_second=6.89978G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/5/manual_time           11.2 ms         11.3 ms           62 bytes_per_second=19.8729G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/10/manual_time          11.6 ms         11.6 ms           61 bytes_per_second=35.4668G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/1/manual_time           108 ms          108 ms            6 bytes_per_second=6.88419G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/5/manual_time           114 ms          114 ms            6 bytes_per_second=19.6802G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/10/manual_time          127 ms          127 ms            5 bytes_per_second=32.2368G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/1/manual_time          0.133 ms        0.159 ms         5281 bytes_per_second=11.2309G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/5/manual_time          0.159 ms        0.185 ms         4407 bytes_per_second=28.1315G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/10/manual_time         0.190 ms        0.215 ms         3687 bytes_per_second=43.0933G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/1/manual_time          1.20 ms         1.22 ms          585 bytes_per_second=12.4671G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/5/manual_time          1.42 ms         1.45 ms          491 bytes_per_second=31.4099G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/10/manual_time         1.59 ms         1.61 ms          440 bytes_per_second=51.5389G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/1/manual_time         11.8 ms         11.9 ms           59 bytes_per_second=12.5862G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/5/manual_time         14.1 ms         14.1 ms           50 bytes_per_second=31.7019G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/10/manual_time        15.5 ms         15.5 ms           45 bytes_per_second=53.0097G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/1/manual_time         118 ms          119 ms            6 bytes_per_second=12.5758G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/5/manual_time         141 ms          141 ms            5 bytes_per_second=31.7441G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/10/manual_time        159 ms          159 ms            4 bytes_per_second=51.5739G/s

Conditional joins on an RTX 8000:

Benchmarks

Before:

----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time                   435 ms          435 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time                  1409 ms         1409 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time                 3473 ms         3473 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/100000/manual_time                   468 ms          468 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/400000/manual_time                  1476 ms         1476 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time                 3627 ms         3626 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/100000/manual_time             743 ms          743 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/400000/manual_time            2546 ms         2546 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time           6195 ms         6195 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/100000/manual_time             778 ms          778 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/400000/manual_time            2624 ms         2624 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time           6475 ms         6475 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/100000/manual_time                    453 ms          453 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/400000/manual_time                   1467 ms         1467 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/1000000/manual_time                  3614 ms         3614 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/100000/manual_time                    483 ms          483 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/400000/manual_time                   1535 ms         1535 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/1000000/manual_time                  3766 ms         3766 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/100000/manual_time              774 ms          774 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/400000/manual_time             2613 ms         2613 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/1000000/manual_time            6387 ms         6386 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/100000/manual_time              785 ms          785 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/400000/manual_time             2688 ms         2688 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/1000000/manual_time            6625 ms         6625 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/100000/manual_time                    462 ms          462 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/400000/manual_time                   1497 ms         1497 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/1000000/manual_time                  3689 ms         3689 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/100000/manual_time                    491 ms          491 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/400000/manual_time                   1555 ms         1555 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/1000000/manual_time                  3813 ms         3813 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/100000/manual_time              779 ms          779 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/400000/manual_time             2653 ms         2653 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/1000000/manual_time            6462 ms         6462 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/100000/manual_time              783 ms          783 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/400000/manual_time             2717 ms         2715 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/1000000/manual_time            6671 ms         6665 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time               464 ms          464 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time              1503 ms         1503 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time             3702 ms         3702 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time               493 ms          493 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time              1560 ms         1560 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time             3826 ms         3826 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time         781 ms          781 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time        2650 ms         2649 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time       6483 ms         6483 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time         803 ms          803 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time        2723 ms         2723 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time       6696 ms         6696 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time               466 ms          466 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time              1506 ms         1506 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time             3710 ms         3710 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time               494 ms          494 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time              1563 ms         1563 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time             3834 ms         3834 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time         781 ms          781 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time        2656 ms         2656 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time       6515 ms         6515 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time         800 ms          800 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time        2729 ms         2729 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time       6694 ms         6694 ms            1

After:

----------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                    Time             CPU   Iterations
----------------------------------------------------------------------------------------------------------------------------------------------
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/100000/manual_time                   317 ms          317 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/400000/manual_time                  1053 ms         1053 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit/100000/1000000/manual_time                 2561 ms         2561 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/100000/manual_time                   320 ms          320 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/400000/manual_time                  1073 ms         1073 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit/100000/1000000/manual_time                 2610 ms         2610 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/100000/manual_time             741 ms          741 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/400000/manual_time            2487 ms         2487 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_inner_join_32bit_nulls/100000/1000000/manual_time           6100 ms         6100 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/100000/manual_time             780 ms          780 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/400000/manual_time            2662 ms         2662 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_inner_join_64bit_nulls/100000/1000000/manual_time           6548 ms         6548 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/100000/manual_time                    325 ms          325 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/400000/manual_time                   1090 ms         1090 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit/100000/1000000/manual_time                  2648 ms         2648 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/100000/manual_time                    329 ms          329 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/400000/manual_time                   1112 ms         1112 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit/100000/1000000/manual_time                  2703 ms         2703 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/100000/manual_time              760 ms          760 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/400000/manual_time             2561 ms         2561 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_join_32bit_nulls/100000/1000000/manual_time            6281 ms         6281 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/100000/manual_time              817 ms          817 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/400000/manual_time             2743 ms         2743 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_join_64bit_nulls/100000/1000000/manual_time            6729 ms         6729 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/100000/manual_time                    333 ms          333 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/400000/manual_time                   1116 ms         1116 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit/100000/1000000/manual_time                  2717 ms         2717 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/100000/manual_time                    337 ms          337 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/400000/manual_time                   1131 ms         1131 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit/100000/1000000/manual_time                  2761 ms         2761 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/100000/manual_time              763 ms          763 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/400000/manual_time             2604 ms         2604 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_full_join_32bit_nulls/100000/1000000/manual_time            6368 ms         6368 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/100000/manual_time              829 ms          829 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/400000/manual_time             2779 ms         2779 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_full_join_64bit_nulls/100000/1000000/manual_time            6802 ms         6802 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/100000/manual_time               335 ms          335 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/400000/manual_time              1127 ms         1128 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit/100000/1000000/manual_time             2737 ms         2737 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/100000/manual_time               339 ms          339 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/400000/manual_time              1141 ms         1141 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit/100000/1000000/manual_time             2775 ms         2775 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/100000/manual_time         776 ms          776 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/400000/manual_time        2616 ms         2616 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_anti_join_32bit_nulls/100000/1000000/manual_time       6406 ms         6406 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/100000/manual_time         820 ms          820 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/400000/manual_time        2796 ms         2796 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_anti_join_64bit_nulls/100000/1000000/manual_time       6836 ms         6836 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/100000/manual_time               337 ms          337 ms            2
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/400000/manual_time              1131 ms         1131 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit/100000/1000000/manual_time             2747 ms         2747 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/100000/manual_time               340 ms          340 ms            2
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/400000/manual_time              1144 ms         1144 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit/100000/1000000/manual_time             2783 ms         2783 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/100000/manual_time         784 ms          784 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/400000/manual_time        2634 ms         2634 ms            1
ConditionalJoin<int32_t, int32_t>/conditional_left_semi_join_32bit_nulls/100000/1000000/manual_time       6423 ms         6423 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/100000/manual_time         837 ms          837 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/400000/manual_time        2799 ms         2799 ms            1
ConditionalJoin<int64_t, int64_t>/conditional_left_semi_join_64bit_nulls/100000/1000000/manual_time       6837 ms         6837 ms            1

compute_column on an RTX 8000:

Benchmarks

Before:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/1/manual_time               0.017 ms        0.035 ms        40355 bytes_per_second=42.7788G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/5/manual_time               0.025 ms        0.042 ms        28088 bytes_per_second=89.7042G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/10/manual_time              0.035 ms        0.051 ms        20029 bytes_per_second=115.79G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/1/manual_time              0.042 ms        0.056 ms        16584 bytes_per_second=176.803G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/5/manual_time              0.104 ms        0.116 ms         6754 bytes_per_second=215.771G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/10/manual_time             0.180 ms        0.194 ms         3887 bytes_per_second=227.226G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/1/manual_time             0.289 ms        0.306 ms         2432 bytes_per_second=257.975G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/5/manual_time             0.890 ms        0.905 ms          800 bytes_per_second=251.245G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/10/manual_time             1.63 ms         1.65 ms          432 bytes_per_second=250.746G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/1/manual_time             2.79 ms         2.81 ms          251 bytes_per_second=266.873G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/5/manual_time             8.77 ms         8.79 ms           81 bytes_per_second=254.766G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/10/manual_time            16.9 ms         17.0 ms           39 bytes_per_second=241.8G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/1/manual_time                 0.016 ms        0.033 ms        42905 bytes_per_second=46.0117G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/5/manual_time                 0.022 ms        0.039 ms        32365 bytes_per_second=103.672G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/10/manual_time                0.028 ms        0.045 ms        24811 bytes_per_second=146.19G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/1/manual_time                0.037 ms        0.051 ms        18957 bytes_per_second=201.642G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/5/manual_time                0.084 ms        0.098 ms         8408 bytes_per_second=266.792G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/10/manual_time               0.144 ms        0.159 ms         4865 bytes_per_second=284.223G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/1/manual_time               0.246 ms        0.265 ms         2851 bytes_per_second=302.273G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/5/manual_time               0.706 ms        0.725 ms         1020 bytes_per_second=316.478G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/10/manual_time               1.29 ms         1.31 ms          548 bytes_per_second=317.688G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/1/manual_time               2.37 ms         2.39 ms          297 bytes_per_second=314.026G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/5/manual_time               6.96 ms         6.98 ms          105 bytes_per_second=321.037G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/10/manual_time              12.8 ms         12.8 ms           55 bytes_per_second=321.292G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/1/manual_time               0.019 ms        0.035 ms        36229 bytes_per_second=77.7191G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/5/manual_time               0.028 ms        0.043 ms        25348 bytes_per_second=161.856G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/10/manual_time              0.039 ms        0.052 ms        17834 bytes_per_second=211.466G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/1/manual_time              0.056 ms        0.074 ms        12416 bytes_per_second=264.937G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/5/manual_time              0.124 ms        0.139 ms         5646 bytes_per_second=361.033G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/10/manual_time             0.210 ms        0.224 ms         3343 bytes_per_second=390.933G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/1/manual_time             0.435 ms        0.453 ms         1611 bytes_per_second=342.872G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/5/manual_time              1.06 ms         1.08 ms          661 bytes_per_second=419.994G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/10/manual_time             1.91 ms         1.92 ms          368 bytes_per_second=429.351G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/1/manual_time             4.22 ms         4.24 ms          166 bytes_per_second=353.26G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/5/manual_time             10.6 ms         10.6 ms           59 bytes_per_second=422.222G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/10/manual_time            19.0 ms         19.0 ms           33 bytes_per_second=430.92G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/1/manual_time          0.071 ms        0.088 ms         9862 bytes_per_second=10.5656G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/5/manual_time          0.088 ms        0.105 ms         7968 bytes_per_second=25.5201G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/10/manual_time         0.107 ms        0.123 ms         6537 bytes_per_second=38.3573G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/1/manual_time         0.609 ms        0.627 ms         1148 bytes_per_second=12.2335G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/5/manual_time         0.740 ms        0.757 ms          945 bytes_per_second=30.1973G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/10/manual_time        0.816 ms        0.831 ms          858 bytes_per_second=50.2436G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/1/manual_time         6.01 ms         6.03 ms          117 bytes_per_second=12.392G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/5/manual_time         7.32 ms         7.34 ms           96 bytes_per_second=30.5408G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/10/manual_time        7.78 ms         7.80 ms           90 bytes_per_second=52.6683G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/1/manual_time        60.4 ms         60.4 ms           11 bytes_per_second=12.3333G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/5/manual_time        73.2 ms         73.2 ms            9 bytes_per_second=30.5271G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/10/manual_time       77.7 ms         77.8 ms            9 bytes_per_second=52.7103G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/1/manual_time            0.069 ms        0.086 ms        10084 bytes_per_second=10.8022G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/5/manual_time            0.082 ms        0.099 ms         8540 bytes_per_second=27.3647G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/10/manual_time           0.094 ms        0.112 ms         7399 bytes_per_second=43.4133G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/1/manual_time           0.596 ms        0.614 ms         1175 bytes_per_second=12.5019G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/5/manual_time           0.685 ms        0.703 ms         1022 bytes_per_second=32.6204G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/10/manual_time          0.709 ms        0.727 ms          987 bytes_per_second=57.7866G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/1/manual_time           5.87 ms         5.89 ms          119 bytes_per_second=12.7008G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/5/manual_time           6.69 ms         6.71 ms          105 bytes_per_second=33.4301G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/10/manual_time          6.76 ms         6.78 ms          103 bytes_per_second=60.5839G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/1/manual_time          58.8 ms         58.8 ms           12 bytes_per_second=12.6722G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/5/manual_time          67.0 ms         67.0 ms           10 bytes_per_second=33.3588G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/10/manual_time         67.7 ms         67.7 ms           10 bytes_per_second=60.5043G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/1/manual_time          0.073 ms        0.090 ms         9575 bytes_per_second=20.4972G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/5/manual_time          0.092 ms        0.109 ms         7552 bytes_per_second=48.3304G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/10/manual_time         0.115 ms        0.130 ms         6092 bytes_per_second=71.5001G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/1/manual_time         0.631 ms        0.650 ms         1109 bytes_per_second=23.602G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/5/manual_time         0.785 ms        0.800 ms          891 bytes_per_second=56.9726G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/10/manual_time        0.879 ms        0.895 ms          797 bytes_per_second=93.2723G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/1/manual_time         6.25 ms         6.26 ms          112 bytes_per_second=23.8598G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/5/manual_time         7.75 ms         7.77 ms           90 bytes_per_second=57.6699G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/10/manual_time        8.34 ms         8.35 ms           84 bytes_per_second=98.3146G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/1/manual_time        62.5 ms         62.5 ms           11 bytes_per_second=23.8524G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/5/manual_time        77.3 ms         77.3 ms            9 bytes_per_second=57.835G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/10/manual_time       83.2 ms         83.3 ms            8 bytes_per_second=98.4651G/s

After:

--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Benchmark                                                                                                                Time             CPU   Iterations UserCounters...
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/1/manual_time               0.017 ms        0.034 ms        41233 bytes_per_second=44.3096G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/5/manual_time               0.024 ms        0.041 ms        28830 bytes_per_second=92.1236G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000/10/manual_time              0.034 ms        0.050 ms        20594 bytes_per_second=120.613G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/1/manual_time              0.036 ms        0.050 ms        19608 bytes_per_second=208.437G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/5/manual_time              0.083 ms        0.096 ms         8400 bytes_per_second=268.161G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/1000000/10/manual_time             0.145 ms        0.158 ms         4857 bytes_per_second=283.234G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/1/manual_time             0.225 ms        0.242 ms         3111 bytes_per_second=331.708G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/5/manual_time             0.692 ms        0.707 ms         1031 bytes_per_second=323.039G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/10000000/10/manual_time             1.30 ms         1.32 ms          543 bytes_per_second=314.882G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/1/manual_time             2.11 ms         2.12 ms          332 bytes_per_second=353.792G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/5/manual_time             6.81 ms         6.83 ms          105 bytes_per_second=328.202G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, false>/ast_int32_imbalanced_unique/100000000/10/manual_time            13.2 ms         13.2 ms           47 bytes_per_second=310.972G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/1/manual_time                 0.016 ms        0.033 ms        43389 bytes_per_second=46.5727G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/5/manual_time                 0.022 ms        0.039 ms        32431 bytes_per_second=103.489G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000/10/manual_time                0.028 ms        0.045 ms        24640 bytes_per_second=147.201G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/1/manual_time                0.030 ms        0.044 ms        23022 bytes_per_second=245.339G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/5/manual_time                0.068 ms        0.083 ms         9338 bytes_per_second=327.076G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/1000000/10/manual_time               0.118 ms        0.133 ms         5951 bytes_per_second=346.315G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/1/manual_time               0.183 ms        0.201 ms         3838 bytes_per_second=406.631G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/5/manual_time               0.572 ms        0.591 ms         1284 bytes_per_second=390.62G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/10000000/10/manual_time               1.07 ms         1.09 ms          667 bytes_per_second=382.649G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/1/manual_time               1.75 ms         1.77 ms          405 bytes_per_second=425.752G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/5/manual_time               5.68 ms         5.70 ms          132 bytes_per_second=393.412G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, false>/ast_int32_imbalanced_reuse/100000000/10/manual_time              10.7 ms         10.7 ms           67 bytes_per_second=382.674G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/1/manual_time               0.020 ms        0.036 ms        35083 bytes_per_second=76.3483G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/5/manual_time               0.029 ms        0.044 ms        24377 bytes_per_second=156.563G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000/10/manual_time              0.041 ms        0.054 ms        17236 bytes_per_second=201.716G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/1/manual_time              0.057 ms        0.075 ms        12167 bytes_per_second=259.757G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/5/manual_time              0.118 ms        0.134 ms         5907 bytes_per_second=377.469G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/1000000/10/manual_time             0.192 ms        0.207 ms         3648 bytes_per_second=426.508G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/1/manual_time             0.437 ms        0.456 ms         1602 bytes_per_second=341.008G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/5/manual_time              1.01 ms         1.02 ms          694 bytes_per_second=443.266G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/10000000/10/manual_time             1.75 ms         1.76 ms          401 bytes_per_second=469.641G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/1/manual_time             4.22 ms         4.24 ms          166 bytes_per_second=352.73G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/5/manual_time             9.95 ms         9.97 ms           69 bytes_per_second=449.274G/s
AST<double, TreeType::IMBALANCED_LEFT, false, false>/ast_double_imbalanced_unique/100000000/10/manual_time            17.4 ms         17.4 ms           40 bytes_per_second=470.516G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/1/manual_time          0.061 ms        0.078 ms        11430 bytes_per_second=12.2366G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/5/manual_time          0.073 ms        0.091 ms         9590 bytes_per_second=30.6413G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000/10/manual_time         0.088 ms        0.106 ms         7928 bytes_per_second=46.494G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/1/manual_time         0.506 ms        0.524 ms         1380 bytes_per_second=14.7183G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/5/manual_time         0.589 ms        0.607 ms         1189 bytes_per_second=37.9447G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/1000000/10/manual_time        0.640 ms        0.657 ms         1093 bytes_per_second=63.994G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/1/manual_time         4.98 ms         5.00 ms          140 bytes_per_second=14.9566G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/5/manual_time         5.82 ms         5.84 ms          120 bytes_per_second=38.4111G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/10000000/10/manual_time        6.18 ms         6.20 ms          113 bytes_per_second=66.2789G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/1/manual_time        50.3 ms         50.3 ms           14 bytes_per_second=14.8223G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/5/manual_time        58.4 ms         58.4 ms           12 bytes_per_second=38.2726G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, false, true>/ast_int32_imbalanced_unique_nulls/100000000/10/manual_time       61.7 ms         61.7 ms           11 bytes_per_second=66.395G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/1/manual_time            0.060 ms        0.078 ms        11549 bytes_per_second=12.3777G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/5/manual_time            0.066 ms        0.083 ms        10578 bytes_per_second=34.0038G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000/10/manual_time           0.076 ms        0.093 ms         9167 bytes_per_second=53.9394G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/1/manual_time           0.496 ms        0.514 ms         1412 bytes_per_second=15.0317G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/5/manual_time           0.520 ms        0.538 ms         1346 bytes_per_second=43.0231G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/1000000/10/manual_time          0.535 ms        0.553 ms         1300 bytes_per_second=76.6226G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/1/manual_time           4.82 ms         4.84 ms          145 bytes_per_second=15.453G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/5/manual_time           5.05 ms         5.07 ms          139 bytes_per_second=44.2888G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/10000000/10/manual_time          5.12 ms         5.14 ms          137 bytes_per_second=80.0502G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/1/manual_time          48.5 ms         48.5 ms           14 bytes_per_second=15.3753G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/5/manual_time          50.7 ms         50.7 ms           14 bytes_per_second=44.1212G/s
AST<int32_t, TreeType::IMBALANCED_LEFT, true, true>/ast_int32_imbalanced_reuse_nulls/100000000/10/manual_time         51.3 ms         51.3 ms           13 bytes_per_second=79.884G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/1/manual_time          0.063 ms        0.080 ms        11031 bytes_per_second=23.6327G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/5/manual_time          0.078 ms        0.095 ms         8987 bytes_per_second=57.6675G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000/10/manual_time         0.095 ms        0.111 ms         7374 bytes_per_second=86.5906G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/1/manual_time         0.535 ms        0.556 ms         1309 bytes_per_second=27.8354G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/5/manual_time         0.627 ms        0.645 ms         1116 bytes_per_second=71.2564G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/1000000/10/manual_time        0.691 ms        0.708 ms         1013 bytes_per_second=118.617G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/1/manual_time         5.25 ms         5.27 ms          133 bytes_per_second=28.3849G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/5/manual_time         6.18 ms         6.20 ms          113 bytes_per_second=72.3473G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/10000000/10/manual_time        6.63 ms         6.65 ms          105 bytes_per_second=123.568G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/1/manual_time        52.7 ms         52.7 ms           13 bytes_per_second=28.2584G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/5/manual_time        61.7 ms         61.8 ms           11 bytes_per_second=72.4033G/s
AST<double, TreeType::IMBALANCED_LEFT, false, true>/ast_double_imbalanced_unique_nulls/100000000/10/manual_time       66.2 ms         66.2 ms           11 bytes_per_second=123.87G/s
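
To make the Before and After tables easier to compare, here is a small sketch that computes the relative reduction in `manual_time` for a few rows copied from the output above. The struct and function names are illustrative only and are not part of the cudf benchmark suite.

```cpp
#include <cstdio>

// One benchmark row; times are in milliseconds, copied from the tables above.
struct benchmark_row {
  const char* name;
  double before_ms;
  double after_ms;
};

// Percentage reduction in run time going from `before_ms` to `after_ms`.
inline double percent_reduction(double before_ms, double after_ms)
{
  return 100.0 * (before_ms - after_ms) / before_ms;
}

// Rows for the largest input size (100000000 rows, tree depth 10).
static const benchmark_row sample_rows[] = {
  {"ast_int32_imbalanced_unique/100000000/10", 16.9, 13.2},
  {"ast_int32_imbalanced_reuse/100000000/10", 12.8, 10.7},
  {"ast_double_imbalanced_unique/100000000/10", 19.0, 17.4},
  {"ast_int32_imbalanced_reuse_nulls/100000000/10", 67.7, 51.3},
};

// Print one line per sample row with its relative time reduction.
inline void print_summary()
{
  for (const benchmark_row& row : sample_rows) {
    std::printf("%-48s %5.1f%% less time\n",
                row.name,
                percent_reduction(row.before_ms, row.after_ms));
  }
}
```

Calling `print_summary()` from a `main` function prints the four reductions. The same helper can be applied to any other pair of rows from the two tables.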

- __device__ possibly_null_value_t<Element, has_nulls> resolve_input(
-   detail::device_data_reference device_data_reference,
+ CUDA_DEVICE_CALLABLE possibly_null_value_t<Element, has_nulls> resolve_input(
+   detail::device_data_reference const& input_reference,
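
The signature change above swaps a bare `__device__` qualifier for a macro and takes the reference argument by `const&`. A minimal sketch of how such a macro can stay compatible with host compilers follows. `SKETCH_DEVICE_CALLABLE`, `reference_sketch`, and `resolve_input_sketch` are hypothetical names, and the macro definition is an assumption about the general technique, not a copy of the cudf definition of `CUDA_DEVICE_CALLABLE`.

```cpp
#include <cstdint>

// Assumed pattern: emit the CUDA qualifier only when compiling with nvcc,
// and fall back to plain `inline` for a host-only compiler.
#ifdef __CUDACC__
#define SKETCH_DEVICE_CALLABLE __device__ inline
#else
#define SKETCH_DEVICE_CALLABLE inline
#endif

// Hypothetical, simplified stand-in for a device data reference.
struct reference_sketch {
  std::int32_t data_index;  // position of the value to read
};

// The reference is taken by const&, so no copy of the struct is made
// at the call site; the function only reads through it.
SKETCH_DEVICE_CALLABLE std::int32_t resolve_input_sketch(
  reference_sketch const& input_reference, std::int32_t const* column_data)
{
  return column_data[input_reference.data_index];
}
```

With a host compiler the macro expands to `inline`, so the header can be included from ordinary C++ translation units without the compiler seeing `__device__`.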
Contributor

Out of curiosity, did you try seeing if passing by const value had the same impact?

Contributor Author

I did test that, and it seemed to work in some cases but not others. In particular, the nullable template had different performance characteristics from the non-nullable one, with respect to both this change and a number of others. The obvious culprit is that the nullable template simply has more local state, so I need to reduce it even further to see measurable improvements in occupancy, but I also observed more subtle differences that I'd probably attribute to the compiler just not doing as good a job at some point. I have some ideas for redesigning the data references that might make these issues moot, but I'll need to play with those a bit, and I figured that this perf boost was a nice short-term win while I do that.
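
For readers comparing the two options discussed here, the sketch below shows the language-level difference between passing by `const` value and by `const&` through a two-level call stack. It is a host-side illustration with hypothetical names (`counted_reference`, `g_copy_count`, and the `middle_*`/`leaf_*` functions). It counts copy-constructor calls only and says nothing about what a GPU compiler does after inlining, which, as noted above, varied between the kernel templates.

```cpp
// Global counter incremented by every copy construction (illustrative only).
int g_copy_count = 0;

// Hypothetical reference type with an instrumented copy constructor.
struct counted_reference {
  int index;
  explicit counted_reference(int i) : index(i) {}
  counted_reference(counted_reference const& other) : index(other.index)
  {
    ++g_copy_count;
  }
};

// Pass by const value: each call level makes its own copy of the argument.
inline int leaf_by_const_value(counted_reference const ref) { return ref.index; }
inline int middle_by_const_value(counted_reference const ref)
{
  return leaf_by_const_value(ref);
}

// Pass by const reference: every level reads the caller's object directly.
inline int leaf_by_const_ref(counted_reference const& ref) { return ref.index; }
inline int middle_by_const_ref(counted_reference const& ref)
{
  return leaf_by_const_ref(ref);
}
```

Calling `middle_by_const_value` with a named object copies it once per call level, while `middle_by_const_ref` makes no copies. Whether that translates into fewer registers on the device depends on the compiler, which is consistent with the mixed results described in the comment.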

Contributor

@bdice bdice left a comment

Nice work, this is much cleaner and the performance boost is nice. I have a few minor suggestions.

@vyasr vyasr requested a review from bdice September 13, 2021 18:28
Contributor

@bdice bdice left a comment

Approving once comments are fixed.

cpp/include/cudf/ast/detail/expression_evaluator.cuh (outdated, resolved)

codecov bot commented Sep 13, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@8a78196).
The diff coverage is n/a.

❗ Current head 7f06065 differs from pull request most recent head 7e7012a. Consider uploading reports for the commit 7e7012a to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.10    #9210   +/-   ##
===============================================
  Coverage                ?   10.84%           
===============================================
  Files                   ?      115           
  Lines                   ?    19173           
  Branches                ?        0           
===============================================
  Hits                    ?     2080           
  Misses                  ?    17093           
  Partials                ?        0           

Contributor Author

vyasr commented Sep 13, 2021

rerun tests

@vyasr vyasr requested a review from davidwendt September 14, 2021 17:25
@vyasr vyasr requested a review from jrhemstad September 15, 2021 16:23
Contributor Author

vyasr commented Sep 15, 2021

rerun tests

Contributor Author

vyasr commented Sep 15, 2021

rerun tests

Contributor Author

vyasr commented Sep 16, 2021

@gpucibot merge
