Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace cudf::detail::copy_if logic with thrust::copy_if and gather #17520

Merged
merged 6 commits into from
Dec 10, 2024

Conversation

davidwendt
Copy link
Contributor

Description

Replaces the custom kernels for cudf::detail::copy_if with a call to thrust::copy_if to build indices to call cudf::detail::gather.
This is easier to maintain and faster for some cases but slower in others.

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Dec 4, 2024
@davidwendt davidwendt self-assigned this Dec 4, 2024
Copy link

copy-pr-bot bot commented Dec 4, 2024

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@davidwendt
Copy link
Contributor Author

/ok to test

@davidwendt
Copy link
Contributor Author

/ok to test

@davidwendt
Copy link
Contributor Author

davidwendt commented Dec 6, 2024

I modified the apply_boolean_mask benchmarks to include more hit rates and column count. I also included a STRING column whenever there was more than 1 column to make the results a bit more realistic without hiding any slow-downs.
Breaking down the 216 rows of results as follows.
First the 100% case was universally slower. This is the case where the entire mask is true and the result is just a copy of the input. The explanation is that the original custom kernel does a count and a scan to determine the total hit count whereas this PR optimizes for the non-100% case and directly builds a list of indices. The input table copy logic is the same. I feel the 100% case would likely be uncommon and not worth the overhead to proactively check.

| type | cols |   rows   | hits_% |   Ref Time |   Cmp Time |        Diff |   %Diff |
|------|------|--------- |--------|------------|------------|-------------|---------|
|  F64 |  1   | 1000000  |  100   |  46.053 us |  50.713 us |    4.660 us |  10.12% |
|  F64 |  4   | 1000000  |  100   | 197.977 us | 201.649 us |    3.672 us |   1.85% |
|  F64 |  9   | 1000000  |  100   | 390.244 us | 395.312 us |    5.068 us |   1.30% |
|  F64 |  1   | 10000000 |  100   | 225.289 us | 269.711 us |   44.422 us |  19.72% |
|  F64 |  4   | 10000000 |  100   |   1.391 ms |   1.442 ms |   50.096 us |   3.60% |
|  F64 |  9   | 10000000 |  100   |   2.947 ms |   2.993 ms |   45.761 us |   1.55% |

The string-only columns were all faster (except for the 100% case). The original code performed a copy-if/gather on the individual columns where this change employs copy-if/gather on the entire table. I gather that gather handles multiple columns more efficiently and there is no reason duplicate that logic.

|  type  | cols |   rows   | hits% |   Ref Time |   Cmp Time |        Diff |   %Diff |
|--------|------|--------- |-------|------------|------------|-------------|---------|
| string |  1   |  100000  |   10  |  84.765 us |  63.940 us |  -20.825 us | -24.57% |
| string |  4   |  100000  |   10  | 268.172 us | 193.701 us |  -74.471 us | -27.77% |
| string |  9   |  100000  |   10  | 571.640 us | 413.998 us | -157.643 us | -27.58% |
| string |  1   | 1000000  |   10  | 109.353 us |  88.849 us |  -20.504 us | -18.75% |
| string |  4   | 1000000  |   10  | 358.014 us | 274.490 us |  -83.524 us | -23.33% |
| string |  9   | 1000000  |   10  | 770.853 us | 580.561 us | -190.292 us | -24.69% |
| string |  1   | 10000000 |   10  | 356.914 us | 318.078 us |  -38.836 us | -10.88% |
| string |  4   | 10000000 |   10  |   1.292 ms |   1.063 ms | -228.795 us | -17.71% |
| string |  9   | 10000000 |   10  |   2.840 ms |   2.299 ms | -541.374 us | -19.06% |
| string |  1   |  100000  |   20  |  85.627 us |  66.427 us |  -19.200 us | -22.42% |
| string |  4   |  100000  |   20  | 275.295 us | 201.190 us |  -74.105 us | -26.92% |
| string |  9   |  100000  |   20  | 586.805 us | 430.918 us | -155.887 us | -26.57% |
| string |  1   | 1000000  |   20  | 130.772 us | 110.380 us |  -20.392 us | -15.59% |
| string |  4   | 1000000  |   20  | 437.493 us | 353.476 us |  -84.017 us | -19.20% |
| string |  9   | 1000000  |   20  | 947.104 us | 756.615 us | -190.488 us | -20.11% |
| string |  1   | 10000000 |   20  | 557.012 us | 518.233 us |  -38.779 us |  -6.96% |
| string |  4   | 10000000 |   20  |   2.093 ms |   1.870 ms | -223.269 us | -10.67% |
| string |  9   | 10000000 |   20  |   4.658 ms |   4.124 ms | -533.912 us | -11.46% |
| string |  1   |  100000  |   50  |  91.498 us |  72.275 us |  -19.223 us | -21.01% |
| string |  4   |  100000  |   50  | 300.739 us | 227.699 us |  -73.040 us | -24.29% |
| string |  9   |  100000  |   50  | 646.304 us | 489.413 us | -156.891 us | -24.28% |
| string |  1   | 1000000  |   50  | 187.616 us | 167.027 us |  -20.589 us | -10.97% |
| string |  4   | 1000000  |   50  | 665.751 us | 581.723 us |  -84.028 us | -12.62% |
| string |  9   | 1000000  |   50  |   1.464 ms |   1.271 ms | -193.011 us | -13.18% |
| string |  1   | 10000000 |   50  |   1.174 ms |   1.137 ms |  -37.202 us |  -3.17% |
| string |  4   | 10000000 |   50  |   4.553 ms |   4.349 ms | -204.357 us |  -4.49% |
| string |  9   | 10000000 |   50  |  10.190 ms |   9.706 ms | -483.354 us |  -4.74% |
| string |  1   |  100000  |   80  |  96.317 us |  77.444 us |  -18.873 us | -19.60% |
| string |  4   |  100000  |   80  | 317.110 us | 245.229 us |  -71.881 us | -22.67% |
| string |  9   |  100000  |   80  | 684.223 us | 524.881 us | -159.341 us | -23.29% |
| string |  1   | 1000000  |   80  | 243.527 us | 222.878 us |  -20.649 us |  -8.48% |
| string |  4   | 1000000  |   80  | 892.492 us | 805.781 us |  -86.712 us |  -9.72% |
| string |  9   | 1000000  |   80  |   1.971 ms |   1.774 ms | -197.016 us |  -9.99% |
| string |  1   | 10000000 |   80  |   1.782 ms |   1.747 ms |  -34.227 us |  -1.92% |
| string |  4   | 10000000 |   80  |   7.001 ms |   6.793 ms | -207.835 us |  -2.97% |
| string |  9   | 10000000 |   80  |  15.683 ms |  15.195 ms | -488.097 us |  -3.11% |
| string |  1   |  100000  |   90  |  98.984 us |  79.725 us |  -19.260 us | -19.46% |
| string |  4   |  100000  |   90  | 330.567 us | 257.294 us |  -73.273 us | -22.17% |
| string |  9   |  100000  |   90  | 711.469 us | 553.336 us | -158.133 us | -22.23% |
| string |  1   | 1000000  |   90  | 266.019 us | 245.235 us |  -20.784 us |  -7.81% |
| string |  4   | 1000000  |   90  | 979.027 us | 893.633 us |  -85.394 us |  -8.72% |
| string |  9   | 1000000  |   90  |   2.168 ms |   1.972 ms | -195.729 us |  -9.03% |
| string |  1   | 10000000 |   90  |   1.985 ms |   1.947 ms |  -37.808 us |  -1.90% |
| string |  4   | 10000000 |   90  |   7.809 ms |   7.594 ms | -215.842 us |  -2.76% |
| string |  9   | 10000000 |   90  |  17.506 ms |  17.001 ms | -505.584 us |  -2.89% |
| string |  1   |  100000  |  100  |  44.368 us |  42.190 us |   -2.178 us |  -4.91% |
| string |  4   |  100000  |  100  | 101.728 us |  99.858 us |   -1.870 us |  -1.84% |
| string |  9   |  100000  |  100  | 194.977 us | 193.972 us |   -1.005 us |  -0.52% |
| string |  1   | 1000000  |  100  |  73.571 us |  80.542 us |    6.971 us |   9.47% |
| string |  4   | 1000000  |  100  | 228.605 us | 228.489 us |   -0.116 us |  -0.05% |
| string |  9   | 1000000  |  100  | 473.240 us | 482.245 us |    9.005 us |   1.90% |
| string |  1   | 10000000 |  100  | 379.812 us | 425.788 us |   45.976 us |  12.11% |
| string |  4   | 10000000 |  100  |   1.421 ms |   1.467 ms |   46.356 us |   3.26% |
| string |  9   | 10000000 |  100  |   3.152 ms |   3.201 ms |   49.440 us |   1.57% |

Finally, when the hits % is low (<50%), the new code runs faster overall.

| type | cols  |   rows   | hits_% |   Ref Time |   Cmp Time |        Diff |   %Diff |
|------|-------|----------|--------|------------|------------|-------------|---------|
|  I64 |  1    |  100000  |   10   |  29.671 us |  25.088 us |   -4.584 us | -15.45% |
|  I64 |  4    |  100000  |   10   | 167.588 us | 122.130 us |  -45.458 us | -27.12% |
|  I64 |  9    |  100000  |   10   | 324.006 us | 228.883 us |  -95.122 us | -29.36% |
|  I64 |  1    | 1000000  |   10   |  43.703 us |  33.026 us |  -10.677 us | -24.43% |
|  I64 |  4    | 1000000  |   10   | 254.378 us | 186.434 us |  -67.944 us | -26.71% |
|  I64 |  9    | 1000000  |   10   | 506.109 us | 348.492 us | -157.617 us | -31.14% |
|  I64 |  1    | 10000000 |   10   | 153.615 us |  95.589 us |  -58.026 us | -37.77% |
|  I64 |  4    | 10000000 |   10   |   1.067 ms | 794.854 us | -271.878 us | -25.49% |
|  I64 |  9    | 10000000 |   10   |   2.199 ms |   1.543 ms | -656.274 us | -29.85% |
|  I64 |  1    |  100000  |   20   |  29.959 us |  25.222 us |   -4.737 us | -15.81% |
|  I64 |  4    |  100000  |   20   | 172.093 us | 126.666 us |  -45.426 us | -26.40% |
|  I64 |  9    |  100000  |   20   | 333.055 us | 238.588 us |  -94.467 us | -28.36% |
|  I64 |  1    | 1000000  |   20   |  44.912 us |  36.287 us |   -8.625 us | -19.21% |
|  I64 |  4    | 1000000  |   20   | 314.320 us | 247.796 us |  -66.525 us | -21.16% |
|  I64 |  9    | 1000000  |   20   | 623.280 us | 471.589 us | -151.691 us | -24.34% |
|  I64 |  1    | 10000000 |   20   | 163.607 us | 118.397 us |  -45.210 us | -27.63% |
|  I64 |  4    | 10000000 |   20   |   1.655 ms |   1.419 ms | -236.726 us | -14.30% |
|  I64 |  9    | 10000000 |   20   |   3.388 ms |   2.819 ms | -569.138 us | -16.80% |
|  I64 |  1    |  100000  |   50   |  31.019 us |  26.506 us |   -4.513 us | -14.55% |
|  I64 |  4    |  100000  |   50   | 192.739 us | 147.554 us |  -45.186 us | -23.44% |
|  I64 |  9    |  100000  |   50   | 375.088 us | 279.854 us |  -95.233 us | -25.39% |
|  I64 |  1    | 1000000  |   50   |  48.798 us |  42.210 us |   -6.588 us | -13.50% |
|  I64 |  4    | 1000000  |   50   | 480.969 us | 421.960 us |  -59.009 us | -12.27% |
|  I64 |  9    | 1000000  |   50   | 962.334 us | 824.617 us | -137.717 us | -14.31% |
|  I64 |  1    | 10000000 |   50   | 192.644 us | 181.034 us |  -11.610 us |  -6.03% |
|  I64 |  4    | 10000000 |   50   |   3.440 ms |   3.299 ms | -140.991 us |  -4.10% |
|  I64 |  9    | 10000000 |   50   |   6.950 ms |   6.667 ms | -282.801 us |  -4.07% |

Above 50%, the performance improvement is not as significant and less consistent:

| type | cols |   rows   | hits_% |   Ref Time |   Cmp Time |        Diff |   %Diff |
|------|------|----------|--------|------------|------------|-------------|---------|
|  I32 |  1   |  100000  |   80   |  30.020 us |  26.221 us |   -3.799 us | -12.66% |
|  I32 |  4   |  100000  |   80   | 204.985 us | 160.009 us |  -44.976 us | -21.94% |
|  I32 |  9   |  100000  |   80   | 397.809 us | 302.387 us |  -95.422 us | -23.99% |
|  I32 |  1   | 1000000  |   80   |  46.585 us |  40.893 us |   -5.692 us | -12.22% |
|  I32 |  4   | 1000000  |   80   | 642.876 us | 578.936 us |  -63.940 us |  -9.95% |
|  I32 |  9   | 1000000  |   80   |   1.280 ms |   1.133 ms | -146.882 us | -11.47% |
|  I32 |  1   | 10000000 |   80   | 162.567 us | 158.709 us |   -3.859 us |  -2.37% |
|  I32 |  4   | 10000000 |   80   |   5.109 ms |   4.981 ms | -127.845 us |  -2.50% |
|  I32 |  9   | 10000000 |   80   |  10.302 ms |  10.034 ms | -268.434 us |  -2.61% |
|  I32 |  1   |  100000  |   90   |  30.016 us |  26.180 us |   -3.836 us | -12.78% |
|  I32 |  4   |  100000  |   90   | 213.033 us | 167.519 us |  -45.514 us | -21.36% |
|  I32 |  9   |  100000  |   90   | 412.672 us | 319.072 us |  -93.600 us | -22.68% |
|  I32 |  1   | 1000000  |   90   |  46.706 us |  41.227 us |   -5.479 us | -11.73% |
|  I32 |  4   | 1000000  |   90   | 704.705 us | 639.684 us |  -65.022 us |  -9.23% |
|  I32 |  9   | 1000000  |   90   |   1.404 ms |   1.258 ms | -145.294 us | -10.35% |
|  I32 |  1   | 10000000 |   90   | 165.283 us | 172.295 us |    7.012 us |   4.24% |
|  I32 |  4   | 10000000 |   90   |   5.677 ms |   5.582 ms |  -94.745 us |  -1.67% |
|  I32 |  9   | 10000000 |   90   |  11.442 ms |  11.234 ms | -207.914 us |  -1.82% |

And there are specific cases where fixed-width only input tables are not faster for hits >= 80% and 10M rows.
Samples of these can be seen from the columns==1 cases.

| type | cols |   rows   | hits_% |   Ref Time |   Cmp Time |        Diff |   %Diff |
|------|------|----------|--------|------------|------------|-------------|---------|
|  I64 |  1   | 10000000 |   80   | 230.003 us | 250.863 us |   20.860 us |   9.07% |
|  I64 |  1   | 10000000 |   90   | 246.956 us | 272.577 us |   25.621 us |  10.37% |
|  F64 |  1   | 10000000 |   80   | 229.953 us | 250.705 us |   20.752 us |   9.02% |
|  F64 |  1   | 10000000 |   90   | 246.915 us | 272.507 us |   25.592 us |  10.36% |

Note the percentages are high here but the diff time is only 20-25 microseconds.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Dec 6, 2024
@davidwendt
Copy link
Contributor Author

/ok to test

@davidwendt
Copy link
Contributor Author

/ok to test

@davidwendt davidwendt marked this pull request as ready for review December 9, 2024 18:58
@davidwendt davidwendt requested a review from a team as a code owner December 9, 2024 18:58
Copy link
Member

@PointKernel PointKernel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A small question otherwise looks great.

I’m not too concerned about the slight performance regression, as thrust will take care of it if it really matters.

cpp/include/cudf/detail/copy_if.cuh Show resolved Hide resolved
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh this is great indeed. The approach is much cleaner 👍

@davidwendt
Copy link
Contributor Author

/merge

@rapids-bot rapids-bot bot merged commit 3468e92 into rapidsai:branch-25.02 Dec 10, 2024
104 of 105 checks passed
@davidwendt davidwendt deleted the copy-if-use-thrust branch December 10, 2024 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants