Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Update bitmask_and to also compute count of valid bits #9176

Closed
jrhemstad opened this issue Sep 3, 2021 · 0 comments · Fixed by #9616
Closed

[FEA] Update bitmask_and to also compute count of valid bits #9176

jrhemstad opened this issue Sep 3, 2021 · 0 comments · Fixed by #9616
Assignees
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue

Comments

@jrhemstad
Copy link
Contributor

Is your feature request related to a problem? Please describe.

bitmask_and is used inside of the hash_join implementation to build a bitmask indicating the presence of one or more null values in a row.

This bitmask is used to filter out inserting rows that contain a null element. If we knew the count of rows without null elements, we could use that to more accurately set the capacity of the hash map we build. This could improve performance as well as reduce memory overhead.

Describe the solution you'd like

Update bitmask_and to return both the resulting device_buffer and the count of valid bits. This is similar to how valid_if works.

It should be as simple as adding a __popc to the computed bitmask word and doing a block reduction w/ global atomic for the final result.

@jrhemstad jrhemstad added feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue labels Sep 3, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 2, 2021
This PR refactors the existing hash join implementation by using cuCollections `static_multimap`. Related functions now invoke `cuco::static_multimap` host-bulk APIs thus join-specific kernels are no longer needed. Compared to the current cudf multimap, `cuco::static_multimap` applied a list of optimizations to improve overall performance:
- Using CUDA Cooperative Groups & parallel retrieval to leverage multiple threads for a single hash map operation
- Double hashing probe method instead of linear probing to reduce the chance of collisions
- Relaxed atomic operations instead of expensive sequential memory order atomic
- Using vector loads for "packable" pairs (no larger than 4B/4B) to coarsen the workload granularity for each thread

`inner_join` benchmarks are used to evaluate new hash join performance: 
- multiplicity = 1
- matching rate = 0.3
- occupancy = 0.5
- double hashing
- CG size = 2
- 75% nulls if nulls are present:
<details><summary>inner_join_32bit</summary>
<p>

## [0] Tesla V100-SXM2-32GB
|  Key Type  |  Payload Type  |  Nullable  |  Build Table Size  |  Probe Table Size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|------------|----------------|------------|--------------------|--------------------|------------|-------------|------------|-------------|------------|---------|----------|
|    I32     |      I32       |     0      |       100000       |       100000       | 135.391 us |      15.35% | 268.328 us |      18.26% | 132.937 us |  98.19% |   PASS   |
|    I32     |      I32       |     0      |       100000       |       400000       | 248.937 us |       5.55% | 364.678 us |      10.10% | 115.740 us |  46.49% |   PASS   |
|    I32     |      I32       |     0      |      10000000      |      10000000      |  12.635 ms |       0.10% |  13.580 ms |       0.13% | 945.092 us |   7.48% |   PASS   |
|    I32     |      I32       |     0      |      10000000      |      40000000      |  37.934 ms |       0.05% |  43.083 ms |       0.08% |   5.148 ms |  13.57% |   PASS   |
|    I32     |      I32       |     0      |      10000000      |     100000000      |  88.486 ms |       0.20% | 108.750 ms |       0.05% |  20.264 ms |  22.90% |   PASS   |
|    I32     |      I32       |     0      |      80000000      |     100000000      | 124.207 ms |       0.04% | 130.181 ms |       0.04% |   5.974 ms |   4.81% |   PASS   |
|    I32     |      I32       |     0      |     100000000      |     100000000      | 134.720 ms |       0.12% | 139.181 ms |       0.03% |   4.462 ms |   3.31% |   PASS   |
|    I32     |      I32       |     0      |      10000000      |     240000000      | 218.672 ms |       0.04% | 295.274 ms |       0.03% |  76.602 ms |  35.03% |   PASS   |
|    I32     |      I32       |     0      |      80000000      |     240000000      | 242.677 ms |       0.03% | 266.421 ms |       0.04% |  23.744 ms |   9.78% |   PASS   |
|    I32     |      I32       |     0      |     100000000      |     240000000      | 253.245 ms |       0.04% | 273.783 ms |       0.04% |  20.538 ms |   8.11% |   PASS   |
</p>
</details>

<details><summary>inner_join_64bit</summary>
<p>

## [0] Tesla V100-SXM2-32GB

|  Key Type  |  Payload Type  |  Nullable  |  Build Table Size  |  Probe Table Size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|------------|----------------|------------|--------------------|--------------------|------------|-------------|------------|-------------|-----------|---------|----------|
|    I64     |      I64       |     0      |      40000000      |      50000000      |  62.728 ms |       0.04% |  65.638 ms |       0.07% |  2.910 ms |   4.64% |   PASS   |
|    I64     |      I64       |     0      |      50000000      |      50000000      |  68.142 ms |       0.05% |  70.164 ms |       0.05% |  2.022 ms |   2.97% |   PASS   |
|    I64     |      I64       |     0      |      40000000      |     120000000      | 122.424 ms |       0.04% | 134.553 ms |       0.05% | 12.129 ms |   9.91% |   PASS   |
|    I64     |      I64       |     0      |      50000000      |     120000000      | 127.979 ms |       0.04% | 138.379 ms |       0.04% | 10.400 ms |   8.13% |   PASS   |
</p>
</details>

<details><summary>inner_join_32bit_nulls</summary>
<p>

## [0] Tesla V100-SXM2-32GB

|  Key Type  |  Payload Type  |  Nullable  |  Build Table Size  |  Probe Table Size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |          Diff |   %Diff |  Status  |
|------------|----------------|------------|--------------------|--------------------|------------|-------------|------------|-------------|---------------|---------|----------|
|    I32     |      I32       |     1      |       100000       |       100000       | 118.400 us |      10.77% | 237.164 us |      10.76% |    118.764 us | 100.31% |   PASS   |
|    I32     |      I32       |     1      |       100000       |       400000       | 167.385 us |       7.46% | 275.823 us |       6.00% |    108.438 us |  64.78% |   PASS   |
|    I32     |      I32       |     1      |      10000000      |      10000000      |   3.699 ms |       0.28% |   3.053 ms |       0.24% |   -646.276 us | -17.47% |   PASS   |
|    I32     |      I32       |     1      |      10000000      |      40000000      |  10.574 ms |       0.15% |   8.634 ms |       0.08% |  -1939.859 us | -18.35% |   PASS   |
|    I32     |      I32       |     1      |      10000000      |     100000000      |  24.461 ms |       0.11% |  19.744 ms |       0.05% |  -4716.875 us | -19.28% |   PASS   |
|    I32     |      I32       |     1      |      80000000      |     100000000      |  35.013 ms |       0.11% |  27.576 ms |       0.05% |  -7437.653 us | -21.24% |   PASS   |
|    I32     |      I32       |     1      |     100000000      |     100000000      |  37.870 ms |       0.42% |  29.800 ms |       0.06% |  -8069.778 us | -21.31% |   PASS   |
|    I32     |      I32       |     1      |      10000000      |     240000000      |  58.529 ms |       0.19% |  46.316 ms |       0.02% | -12213.277 us | -20.87% |   PASS   |
|    I32     |      I32       |     1      |      80000000      |     240000000      |  67.443 ms |       0.12% |  53.721 ms |       0.49% | -13722.214 us | -20.35% |   PASS   |
|    I32     |      I32       |     1      |     100000000      |     240000000      |  70.809 ms |       0.50% |  55.938 ms |       0.03% | -14870.980 us | -21.00% |   PASS   |
</p>
</details>

<details><summary>inner_join_64bit_nulls</summary>
<p>

## [0] Tesla V100-SXM2-32GB

|  Key Type  |  Payload Type  |  Nullable  |  Build Table Size  |  Probe Table Size  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |         Diff |   %Diff |  Status  |
|------------|----------------|------------|--------------------|--------------------|------------|-------------|------------|-------------|--------------|---------|----------|
|    I64     |      I64       |     1      |      40000000      |      50000000      |  17.560 ms |       0.21% |  13.926 ms |       0.09% | -3634.422 us | -20.70% |   PASS   |
|    I64     |      I64       |     1      |      50000000      |      50000000      |  19.070 ms |       0.14% |  15.082 ms |       0.07% | -3987.592 us | -20.91% |   PASS   |
|    I64     |      I64       |     1      |      40000000      |     120000000      |  34.253 ms |       0.13% |  27.170 ms |       0.05% | -7082.957 us | -20.68% |   PASS   |
|    I64     |      I64       |     1      |      50000000      |     120000000      |  35.807 ms |       0.19% |  28.340 ms |       0.03% | -7467.177 us | -20.85% |   PASS   |

</p>
</details>

When nulls are not present, the map has an actual occupancy of 0.5. The new hash join implementation (`Cmp`) outperforms the existing one (`Ref`) by 20% (can be achieved by tuning CG size) to 100%. When nulls are present and treated as unequal, however, the actual occupancy is 0.125 and `Cmp` is always about 20% slower than `Ref`. Using the number of valid rows (#9176) to build hash map can solve this performance issue. Note that the above results show the minimum speedups brought by the new hash join since multiplicity and load factor are relatively low in this case. 

Also, this PR simplifies and improves the current join benchmark by adding multiplicity control. It fixs bugs in join cpp tests and pytests where outputs were not sorted before compare.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Ashwin Srinath (https://github.com/shwina)
  - Robert Maynard (https://github.com/robertmaynard)
  - Devavret Makkar (https://github.com/devavret)

URL: #8934
@PointKernel PointKernel self-assigned this Nov 4, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 11, 2021
…ask and count of unset bits (#9616)

Closes #9176

- [x] Update `bitmask_and` and `bitmask_or` to return both resulting mask and count of unset bits
- [x] Refactor related implementations to use new `bitmask_and/or`
- [x] Update unit tests

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Bradley Dice (https://github.com/bdice)
  - Jason Lowe (https://github.com/jlowe)

URL: #9616
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Performance Performance related issue
Projects
None yet
2 participants