
Add row bitmask as a detail::hash_join member #10248

Merged: 18 commits merged into rapidsai:branch-22.06 on May 2, 2022

Conversation

@PointKernel (Member) commented Feb 8, 2022

While working on #8934, we observed a performance regression when nulls are unequal. One major reason is that the new hash map uses a cooperative-group (CG) based double hashing algorithm, which is designed to handle hash collisions efficiently. The existing implementation sizes the hash map by the number of rows in the build table, regardless of how many of those rows are valid. When nulls are unequal, the actual map occupancy is therefore lower than the default 50%, resulting in fewer hash collisions. In that regime the old scalar linear probing is more efficient: it has no CG-related overhead, and most probes terminate at the first slot.
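For reference, here is a minimal conceptual sketch of the two probing schemes being compared. It is illustrative only, not cuco's actual implementation (which probes with a cooperative group per key); the function names and signatures are hypothetical.

```cpp
// Conceptual probe sequences for a table of `capacity` slots; illustrative only.
#include <cstddef>

// Linear probing: fixed stride of 1, so colliding keys form contiguous runs.
std::size_t linear_probe(std::size_t h1, std::size_t i, std::size_t capacity)
{
  return (h1 + i) % capacity;
}

// Double hashing: a key-dependent stride h2 (nonzero) spreads colliding keys apart,
// which pays off as occupancy (and hence collision frequency) grows.
std::size_t double_hash_probe(std::size_t h1, std::size_t h2, std::size_t i, std::size_t capacity)
{
  return (h1 + i * h2) % capacity;
}
```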

To improve this situation, the original idea of this PR was to construct the hash map based on the number of valid rows. There were expected to be two benefits (see the sketch after this list):

  1. Increases map occupancy, allowing us to benefit more from CG-based double hashing and thus improving runtime efficiency
  2. Reduces peak memory usage: for 1000 elements with 75% nulls, the new capacity would be 500 (1000 * 0.25 * 2) as opposed to 2000 (1000 * 2)
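A minimal sketch of the capacity choice described above, assuming the default 50% target occupancy (capacity = 2 * number of inserted keys). The helper names are hypothetical, not libcudf code:

```cpp
#include <cstddef>

// Current behavior: size the map by all build-table rows, valid or not.
std::size_t capacity_from_all_rows(std::size_t num_build_rows)
{
  return 2 * num_build_rows;  // e.g. 1000 rows -> capacity 2000
}

// Idea explored in this PR: size the map by valid (non-null) rows only.
std::size_t capacity_from_valid_rows(std::size_t num_build_rows, std::size_t num_null_rows)
{
  return 2 * (num_build_rows - num_null_rows);  // 1000 rows, 750 nulls -> capacity 500
}
```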

During this work, however, we noticed that the first assumption does not hold, since it ignores the performance degradation that comes with the reduced capacity (see #10248 (comment)). And although this effort would reduce peak memory usage, Python/Spark workflows would likely never benefit from it since they tend to drop nulls before any join operations.

Finally, all changes related to map size reduction were discarded. This PR only adds _composite_bitmask as a detail::hash_join member, which is a preparation step for #9151.
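For context, the composite row bitmask has one bit per row, set only when the row is valid (non-null) in every participating column, i.e. the bitwise AND of the columns' null masks. A minimal host-side sketch, assuming 32-bit bitmask words; this is illustrative, not the libcudf implementation:

```cpp
// Conceptual composite row bitmask: AND together per-column validity masks.
// Hypothetical standalone code, not the libcudf implementation.
#include <cstddef>
#include <cstdint>
#include <vector>

using bitmask_word = std::uint32_t;  // 32 rows per word

std::vector<bitmask_word> composite_bitmask(
  std::vector<std::vector<bitmask_word>> const& column_masks, std::size_t num_words)
{
  std::vector<bitmask_word> result(num_words, ~bitmask_word{0});  // start all-valid
  for (auto const& mask : column_masks) {
    for (std::size_t w = 0; w < num_words; ++w) { result[w] &= mask[w]; }
  }
  return result;
}
```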

@PointKernel PointKernel added the labels libcudf (Affects libcudf (C++/CUDA) code.), 0 - Blocked (Cannot progress due to external reasons), Performance (Performance related issue), 5 - Merge After Dependencies, improvement (Improvement / enhancement to an existing function), and non-breaking (Non-breaking change) on Feb 8, 2022
@PointKernel PointKernel self-assigned this Feb 8, 2022
codecov bot commented Feb 12, 2022

Codecov Report

Merging #10248 (7b2f5f6) into branch-22.06 (84f88ce) will increase coverage by 0.02%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.06   #10248      +/-   ##
================================================
+ Coverage         86.40%   86.43%   +0.02%     
================================================
  Files               143      143              
  Lines             22444    22444              
================================================
+ Hits              19393    19399       +6     
+ Misses             3051     3045       -6     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/comm/gpuarrow.py | 79.76% <ø> (ø) |
| python/cudf/cudf/core/column/string.py | 89.21% <ø> (+0.12%) ⬆️ |
| python/cudf/cudf/core/frame.py | 93.41% <ø> (ø) |
| python/cudf/cudf/core/series.py | 95.16% <ø> (ø) |
| python/cudf/cudf/core/dataframe.py | 93.74% <0.00%> (+0.04%) ⬆️ |
| python/cudf/cudf/core/groupby/groupby.py | 91.79% <0.00%> (+0.22%) ⬆️ |
| python/cudf/cudf/core/tools/datetimes.py | 84.49% <0.00%> (+0.30%) ⬆️ |
| python/cudf/cudf/core/column/lists.py | 92.91% <0.00%> (+0.83%) ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@PointKernel (Member, Author) commented Feb 18, 2022

@jrhemstad Here are benchmark results of cuco::static_multimap::pair_retrieve:

| Key | Value | Multiplicity | Scheme | NumInputs | Occupancy | Capacity | Memory (GB) | GPU Time (ms) |
|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 2 | Linear Probing | 100 M | 0.5 | 200 M | 1.6 | 53.88 |
| I32 | I32 | 2 | Double Hashing | 100 M | 0.5 | 200 M | 1.6 | 40.83 |
| I32 | I32 | 2 | Linear Probing | 100 M | 0.25 | 400 M | 3.2 | 38.40 |

If occupancy is fixed at 50%, double hashing outperforms linear probing by ~27%. When we increase the hash map capacity to have an actual occupancy of 25%, linear probing is 5% more efficient. This is consistent with the result we get with this PR.

In hash join benchmarks where nulls are unequal and 75% of rows are null, the actual occupancy is 25%. With this PR, using bitmask_and to determine the hash table capacity lets us ignore null elements, improving the occupancy to 50%. However, this will not make execution faster, since double hashing at 50% occupancy cannot beat linear probing at 25% occupancy.

@PointKernel PointKernel removed the labels 0 - Blocked (Cannot progress due to external reasons) and 5 - Merge After Dependencies on Feb 18, 2022
@PointKernel PointKernel added the 3 - Ready for Review (Ready for review by team) label on Feb 22, 2022
@PointKernel PointKernel marked this pull request as ready for review February 22, 2022 21:55
@PointKernel PointKernel requested a review from a team as a code owner February 22, 2022 21:55
@PointKernel PointKernel changed the title from "Improve hash join" to "Improve hash join when nulls are unequal" on Feb 22, 2022
@PointKernel (Member, Author) commented Mar 29, 2022

> benchmark shows that this effort makes the execution 5~10% slower by requiring 25% of the memory

> Can you elaborate a bit on the memory reduction? This just means the hash table is 25% of the size it would otherwise need to be, right? So in terms of total memory needed for the join operation, the reduction is less, right?

Here are inner_join_32bit_nulls benchmark results, with a peak memory usage measurement added:

Baseline:

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time | GPU Time | Peak Memory |
|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 1 | 100000000 | 100000000 | 12x | 45.032 ms | 45.027 ms | 1617404728 |
| I32 | I32 | 1 | 10000000 | 240000000 | 11x | 58.151 ms | 58.146 ms | 199587032 |
| I32 | I32 | 1 | 80000000 | 240000000 | 11x | 73.723 ms | 73.718 ms | 1319627976 |
| I32 | I32 | 1 | 100000000 | 240000000 | 11x | 78.108 ms | 78.103 ms | 1638403320 |

Optimization:

| Key Type | Payload Type | Nullable | Build Table Size | Probe Table Size | Samples | CPU Time | GPU Time | Peak Memory |
|---|---|---|---|---|---|---|---|---|
| I32 | I32 | 1 | 100000000 | 100000000 | 11x | 49.535 ms | 49.529 ms | 430194360 |
| I32 | I32 | 1 | 10000000 | 240000000 | 11x | 70.833 ms | 70.828 ms | 79186840 |
| I32 | I32 | 1 | 80000000 | 240000000 | 11x | 87.539 ms | 87.534 ms | 369024712 |
| I32 | I32 | 1 | 100000000 | 240000000 | 11x | 91.814 ms | 91.809 ms | 451192952 |

Peak memory usage is about 2.5x to 3.7x lower. Unlike inner_join_64bit_nulls, however, the optimized implementation can be 20% slower when dealing with large probe tables.

@jrhemstad (Contributor)

@PointKernel what is the percentage of nulls in that test?

@PointKernel (Member, Author)

> @PointKernel what is the percentage of nulls in that test?

@jrhemstad It's 75% nulls for both probe and build tables.

@vyasr (Contributor) left a comment

I think that I'm missing some context. What motivated this change? Is there a particular reason that we're willing to increase the runtime in order to decrease the memory footprint? The code looks good and the behavior aligns with expectations, but I don't know why we want this.

@jrhemstad (Contributor)

> Is there a particular reason that we're willing to increase the runtime in order to decrease the memory footprint?

See #9151

We'd actually hoped this would improve runtime. The fact that it increases runtime was an unfortunate surprise, so now we have to decide what we want to do with it.

@jrhemstad (Contributor)
@PointKernel what (if any) is the impact on runtime when very few nulls are present? What about if no nulls are present?

Is the overhead from constructing the composite row bitmask? Or is it from increasing the relative load factor on the hash map by using a smaller capacity? (or both?)

@PointKernel (Member, Author) commented Apr 13, 2022

> what (if any) is the impact on runtime when very few nulls are present? What about if no nulls are present?

The fewer nulls are present, the smaller the performance loss; there is no performance loss when no nulls are present.

> Is the overhead from constructing the composite row bitmask?

No, it's not. We construct the row bitmask anyway when nulls are present, since it is used by insert_if.

> Or is it from increasing the relative load factor on the hash map by using a smaller capacity?

Yes, that's the reason.

If we want to keep the current performance, I would suggest setting the hash map size based on the number of rows in the input table, regardless of whether nulls are present. That is, the only change in this PR would be to add a new row bitmask member to hash join so it can be used for insert_if, and eventually count_if and retrieve_if, in a PR resolving #9151.
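To make the intended use concrete, here is a sketch of how such a row bitmask can act as a stencil for conditional insertion (the insert_if pattern mentioned above). The row_is_valid helper and the loop are hypothetical stand-ins for the CG-based cuco insert path:

```cpp
// Hypothetical sketch: skip null-containing rows when building the hash map.
// This mirrors the insert_if idea; it is not the cuco/libcudf implementation.
#include <cstdint>

using bitmask_word = std::uint32_t;

// True if bit `row` is set in the packed composite bitmask (32 rows per word).
inline bool row_is_valid(bitmask_word const* composite_bitmask, int row)
{
  return (composite_bitmask[row / 32] >> (row % 32)) & 1u;
}

// Build-phase logic, in spirit (the real path is a CG-based cuco insert):
// for (int row = 0; row < num_build_rows; ++row) {
//   if (row_is_valid(composite_bitmask, row)) { map.insert({hash(row), row}); }
// }
```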

@PointKernel PointKernel changed the title from "Improve hash join when nulls are unequal" to "Add row bitmask as a detail::hash_join member" on Apr 29, 2022
@PointKernel PointKernel requested a review from vyasr April 29, 2022 19:55
@PointKernel PointKernel added the tech debt label and removed the Performance (Performance related issue) label on Apr 29, 2022
@PointKernel (Member, Author)
@gpucibot merge

@rapids-bot rapids-bot bot merged commit 0ddb3d9 into rapidsai:branch-22.06 May 2, 2022
@PointKernel PointKernel deleted the improve-hash-join branch May 2, 2022 21:12