Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update to Thrust 1.12.0 #7923

Merged
merged 2 commits into from
Apr 9, 2021
Merged

Conversation

cwharris
Copy link
Contributor

@cwharris cwharris commented Apr 9, 2021

Update to Thrust 1.12.0 to take advantage of radix sort performance improvements.

@cwharris cwharris added libcudf Affects libcudf (C++/CUDA) code. 4 - Needs Review Waiting for reviewer to review or respond improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 9, 2021
@cwharris cwharris requested review from a team as code owners April 9, 2021 03:19
@github-actions github-actions bot added the CMake CMake build issue label Apr 9, 2021
@cwharris cwharris requested a review from trxcllnt April 9, 2021 03:19
@davidwendt
Copy link
Contributor

Do you know if the thrust patch here https://github.com/rapidsai/cudf/blob/branch-0.19/cpp/cmake/thrust.patch for sort.h needs to be updated too?

@cwharris
Copy link
Contributor Author

cwharris commented Apr 9, 2021

@davidwendt tested locally, everything builds and tests pass. I'll add some benchmarks for sort here later.

@cwharris
Copy link
Contributor Author

cwharris commented Apr 9, 2021

We'll probably want to update rmm's thrust version too.

@codecov
Copy link

codecov bot commented Apr 9, 2021

Codecov Report

Merging #7923 (11f889e) into branch-0.20 (599f62d) will increase coverage by 0.41%.
The diff coverage is 88.73%.

❗ Current head 11f889e differs from pull request most recent head e545e63. Consider uploading reports for the commit e545e63 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.20    #7923      +/-   ##
===============================================
+ Coverage        82.30%   82.71%   +0.41%     
===============================================
  Files              101      103       +2     
  Lines            17053    17702     +649     
===============================================
+ Hits             14035    14643     +608     
- Misses            3018     3059      +41     
Impacted Files Coverage Δ
python/cudf/cudf/utils/utils.py 83.25% <ø> (-1.81%) ⬇️
python/cudf/cudf/utils/dtypes.py 83.44% <46.66%> (-6.45%) ⬇️
python/cudf/cudf/core/groupby/groupby.py 92.41% <78.57%> (-1.04%) ⬇️
python/cudf/cudf/core/column/lists.py 87.41% <80.00%> (+0.19%) ⬆️
python/dask_cudf/dask_cudf/backends.py 89.58% <85.71%> (-0.05%) ⬇️
python/cudf/cudf/core/column/struct.py 96.29% <86.66%> (-3.71%) ⬇️
python/cudf/cudf/core/index.py 93.04% <88.09%> (+0.01%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.92% <91.48%> (-0.92%) ⬇️
python/cudf/cudf/core/column/interval.py 91.11% <92.30%> (+0.48%) ⬆️
python/cudf/cudf/core/column/column.py 87.99% <92.59%> (+0.56%) ⬆️
... and 67 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update c026556...e545e63. Read the comment docs.

@davidwendt
Copy link
Contributor

davidwendt commented Apr 9, 2021

@davidwendt tested locally, everything builds and tests pass. I'll add some benchmarks for sort here later.

The patch command here is the following:

PATCH_COMMAND   patch --reject-file=- -p1 -N < ${CUDF_SOURCE_DIR}/cmake/thrust.patch || true )

So if the patch fails, the true here masks the error. The compile, gtests, and gbenchmarks will all succeed and the runtime will not show much change. However, without this patch, the compile time will be about 10x longer (for sort.cu and maybe others) and the libcudf.so will be significantly larger.

Could you verify the patch actually takes effect? And can you provide before and after compile times for this change?

@trxcllnt
Copy link
Contributor

trxcllnt commented Apr 9, 2021

@davidwendt I think with FetchContent we'd still see the error if the patch failed, but it wouldn't cause configuration to fail. I don't know if CPM hides FetchContent errors, but I don't see any when configuring.

Manually inspecting _deps/thrust-src/thrust/system/cuda/detail/sort.hsort.h shows the patch still succeeds:

image

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for checking on this Paul. I know thrust made improvements to sort in 1.11 and was concerned these changes modified this header file.

@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs Review Waiting for reviewer to review or respond labels Apr 9, 2021
@kkraus14
Copy link
Collaborator

kkraus14 commented Apr 9, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 80afff3 into rapidsai:branch-0.20 Apr 9, 2021
@trxcllnt
Copy link
Contributor

trxcllnt commented Apr 9, 2021

@davidwendt no problem, it was an entirely reasonable request 🙂. We're excited about the sort improvements that came in v0.11.0, but double checked last night that your work on the patch wasn't lost either.

Btw in case I haven't said this already thanks for all your investigations into build performance, it makes the constant RAPIDS rebuilds less painful than it otherwise would be.

@cwharris
Copy link
Contributor Author

0.20 with Thrust 1.10.0

real    32m42.994s
user    305m54.735s
sys     12m34.393s

(rapids) rapids@compose:~$ ./cudf/cpp/build/release/gbenchmarks/SORT_BENCH 
2021-04-09T13:01:54-05:00
Running ./cudf/cpp/build/release/gbenchmarks/SORT_BENCH
Run on (12 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 19712 KiB (x1)
Load Average: 6.80, 11.37, 10.65
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Sort<false>/unstable_no_nulls/1024/1/manual_time          0.049 ms        0.067 ms        14417
Sort<false>/unstable_no_nulls/4096/1/manual_time          0.051 ms        0.069 ms        13647
Sort<false>/unstable_no_nulls/32768/1/manual_time         0.181 ms        0.197 ms         3763
Sort<false>/unstable_no_nulls/262144/1/manual_time        0.194 ms        0.211 ms         3040
Sort<false>/unstable_no_nulls/2097152/1/manual_time       0.716 ms        0.735 ms          998
Sort<false>/unstable_no_nulls/16777216/1/manual_time       4.70 ms         4.73 ms          148
Sort<false>/unstable_no_nulls/67108864/1/manual_time       18.4 ms         18.4 ms           38
Sort<false>/unstable_no_nulls/1024/8/manual_time          0.576 ms        0.596 ms         1146
Sort<false>/unstable_no_nulls/4096/8/manual_time          0.689 ms        0.710 ms          984
Sort<false>/unstable_no_nulls/32768/8/manual_time         0.805 ms        0.824 ms          853
Sort<false>/unstable_no_nulls/262144/8/manual_time         1.08 ms         1.08 ms          662
Sort<false>/unstable_no_nulls/2097152/8/manual_time        24.5 ms         24.5 ms           28
Sort<false>/unstable_no_nulls/16777216/8/manual_time        358 ms          358 ms            2
Sort<false>/unstable_no_nulls/67108864/8/manual_time       1937 ms         1937 ms            1
Sort<true>/stable_no_nulls/1024/1/manual_time             0.049 ms        0.067 ms        12626
Sort<true>/stable_no_nulls/4096/1/manual_time             0.051 ms        0.069 ms        12643
Sort<true>/stable_no_nulls/32768/1/manual_time            0.185 ms        0.202 ms         3819
Sort<true>/stable_no_nulls/262144/1/manual_time           0.193 ms        0.210 ms         3668
Sort<true>/stable_no_nulls/2097152/1/manual_time          0.714 ms        0.736 ms          988
Sort<true>/stable_no_nulls/16777216/1/manual_time          4.74 ms         4.76 ms          147
Sort<true>/stable_no_nulls/67108864/1/manual_time          18.4 ms         18.5 ms           38
Sort<true>/stable_no_nulls/1024/8/manual_time             0.585 ms        0.604 ms         1119
Sort<true>/stable_no_nulls/4096/8/manual_time             0.706 ms        0.723 ms          964
Sort<true>/stable_no_nulls/32768/8/manual_time            0.811 ms        0.828 ms          834
Sort<true>/stable_no_nulls/262144/8/manual_time            1.06 ms         1.08 ms          661
Sort<true>/stable_no_nulls/2097152/8/manual_time           24.7 ms         24.8 ms           28
Sort<true>/stable_no_nulls/16777216/8/manual_time           362 ms          362 ms            2
Sort<true>/stable_no_nulls/67108864/8/manual_time          1963 ms         1963 ms            1
Sort<false>/unstable/1024/1/manual_time                   0.101 ms        0.121 ms         6910
Sort<false>/unstable/4096/1/manual_time                   0.160 ms        0.180 ms         4313
Sort<false>/unstable/32768/1/manual_time                  0.228 ms        0.248 ms         3046
Sort<false>/unstable/262144/1/manual_time                 0.349 ms        0.371 ms         2004
Sort<false>/unstable/2097152/1/manual_time                 3.95 ms         3.98 ms          179
Sort<false>/unstable/16777216/1/manual_time                33.2 ms         33.2 ms           21
Sort<false>/unstable/67108864/1/manual_time                 142 ms          142 ms            5
Sort<false>/unstable/1024/8/manual_time                   0.727 ms        0.751 ms          924
Sort<false>/unstable/4096/8/manual_time                   0.920 ms        0.942 ms          702
Sort<false>/unstable/32768/8/manual_time                   1.21 ms         1.24 ms          565
Sort<false>/unstable/262144/8/manual_time                  1.64 ms         1.66 ms          421
Sort<false>/unstable/2097152/8/manual_time                 27.9 ms         27.7 ms           25
Sort<false>/unstable/16777216/8/manual_time                 456 ms          456 ms            2
Sort<false>/unstable/67108864/8/manual_time                2753 ms         2753 ms            1
Sort<true>/stable/1024/1/manual_time                      0.103 ms        0.122 ms         6503
Sort<true>/stable/4096/1/manual_time                      0.159 ms        0.179 ms         4427
Sort<true>/stable/32768/1/manual_time                     0.229 ms        0.249 ms         3049
Sort<true>/stable/262144/1/manual_time                    0.347 ms        0.368 ms         2001
Sort<true>/stable/2097152/1/manual_time                    3.97 ms         4.01 ms          176
Sort<true>/stable/16777216/1/manual_time                   33.5 ms         33.5 ms           21
Sort<true>/stable/67108864/1/manual_time                    144 ms          144 ms            5
Sort<true>/stable/1024/8/manual_time                      0.745 ms        0.764 ms          912
Sort<true>/stable/4096/8/manual_time                      0.910 ms        0.933 ms          726
Sort<true>/stable/32768/8/manual_time                      1.24 ms         1.26 ms          562
Sort<true>/stable/262144/8/manual_time                     1.66 ms         1.69 ms          425
Sort<true>/stable/2097152/8/manual_time                    27.7 ms         27.8 ms           25
Sort<true>/stable/16777216/8/manual_time                    455 ms          455 ms            2
Sort<true>/stable/67108864/8/manual_time                   2757 ms         2757 ms            1
Sort/strings/1024/manual_time                              1.02 ms         1.04 ms          615
Sort/strings/4096/manual_time                              1.71 ms         1.73 ms          405
Sort/strings/32768/manual_time                             2.34 ms         2.36 ms          299
Sort/strings/262144/manual_time                            4.10 ms         4.12 ms          170
Sort/strings/2097152/manual_time                           53.7 ms         53.7 ms           13
Sort/strings/16777216/manual_time                           509 ms          509 ms            1
0.20 with Thrust 1.12.0

real    31m54.881s
user    303m57.892s
sys     11m47.145s

(rapids) rapids@compose:~$ ./cudf/cpp/build/release/gbenchmarks/SORT_BENCH 
2021-04-09T13:01:54-05:00
Running ./cudf/cpp/build/release/gbenchmarks/SORT_BENCH
Run on (12 X 3700 MHz CPU s)
CPU Caches:
  L1 Data 32 KiB (x6)
  L1 Instruction 32 KiB (x6)
  L2 Unified 1024 KiB (x6)
  L3 Unified 19712 KiB (x1)
Load Average: 0.95, 0.70, 0.36
***WARNING*** CPU scaling is enabled, the benchmark real time measurements may be noisy and will incur extra overhead.
-----------------------------------------------------------------------------------------------
Benchmark                                                     Time             CPU   Iterations
-----------------------------------------------------------------------------------------------
Sort<false>/unstable_no_nulls/1024/1/manual_time          0.049 ms        0.067 ms        12438
Sort<false>/unstable_no_nulls/4096/1/manual_time          0.050 ms        0.068 ms        13402
Sort<false>/unstable_no_nulls/32768/1/manual_time         0.112 ms        0.129 ms         6066
Sort<false>/unstable_no_nulls/262144/1/manual_time        0.119 ms        0.136 ms         5859
Sort<false>/unstable_no_nulls/2097152/1/manual_time       0.502 ms        0.526 ms         1000
Sort<false>/unstable_no_nulls/16777216/1/manual_time       3.25 ms         3.27 ms          214
Sort<false>/unstable_no_nulls/67108864/1/manual_time       12.7 ms         12.7 ms           55
Sort<false>/unstable_no_nulls/1024/8/manual_time          0.566 ms        0.579 ms         1214
Sort<false>/unstable_no_nulls/4096/8/manual_time          0.695 ms        0.714 ms          942
Sort<false>/unstable_no_nulls/32768/8/manual_time         0.790 ms        0.808 ms          885
Sort<false>/unstable_no_nulls/262144/8/manual_time         1.07 ms         1.06 ms          677
Sort<false>/unstable_no_nulls/2097152/8/manual_time        24.5 ms         24.5 ms           28
Sort<false>/unstable_no_nulls/16777216/8/manual_time        356 ms          356 ms            2
Sort<false>/unstable_no_nulls/67108864/8/manual_time       1942 ms         1942 ms            1
Sort<true>/stable_no_nulls/1024/1/manual_time             0.049 ms        0.067 ms        13938
Sort<true>/stable_no_nulls/4096/1/manual_time             0.050 ms        0.068 ms        13841
Sort<true>/stable_no_nulls/32768/1/manual_time            0.111 ms        0.128 ms         6236
Sort<true>/stable_no_nulls/262144/1/manual_time           0.121 ms        0.138 ms         5821
Sort<true>/stable_no_nulls/2097152/1/manual_time          0.499 ms        0.522 ms         1392
Sort<true>/stable_no_nulls/16777216/1/manual_time          3.25 ms         3.28 ms          217
Sort<true>/stable_no_nulls/67108864/1/manual_time          12.7 ms         12.7 ms           55
Sort<true>/stable_no_nulls/1024/8/manual_time             0.573 ms        0.589 ms         1213
Sort<true>/stable_no_nulls/4096/8/manual_time             0.694 ms        0.713 ms         1010
Sort<true>/stable_no_nulls/32768/8/manual_time            0.790 ms        0.803 ms          882
Sort<true>/stable_no_nulls/262144/8/manual_time            1.08 ms         1.07 ms          667
Sort<true>/stable_no_nulls/2097152/8/manual_time           24.7 ms         24.7 ms           28
Sort<true>/stable_no_nulls/16777216/8/manual_time           358 ms          358 ms            2
Sort<true>/stable_no_nulls/67108864/8/manual_time          1947 ms         1947 ms            1
Sort<false>/unstable/1024/1/manual_time                   0.101 ms        0.120 ms         6867
Sort<false>/unstable/4096/1/manual_time                   0.156 ms        0.177 ms         4443
Sort<false>/unstable/32768/1/manual_time                  0.226 ms        0.245 ms         3141
Sort<false>/unstable/262144/1/manual_time                 0.344 ms        0.364 ms         2031
Sort<false>/unstable/2097152/1/manual_time                 3.93 ms         3.97 ms          179
Sort<false>/unstable/16777216/1/manual_time                33.2 ms         33.3 ms           21
Sort<false>/unstable/67108864/1/manual_time                 144 ms          144 ms            5
Sort<false>/unstable/1024/8/manual_time                   0.720 ms        0.738 ms          962
Sort<false>/unstable/4096/8/manual_time                   0.903 ms        0.923 ms          780
Sort<false>/unstable/32768/8/manual_time                   1.21 ms         1.24 ms          573
Sort<false>/unstable/262144/8/manual_time                  1.62 ms         1.65 ms          432
Sort<false>/unstable/2097152/8/manual_time                 27.8 ms         27.8 ms           25
Sort<false>/unstable/16777216/8/manual_time                 450 ms          450 ms            2
Sort<false>/unstable/67108864/8/manual_time                2708 ms         2707 ms            1
Sort<true>/stable/1024/1/manual_time                      0.101 ms        0.120 ms         6957
Sort<true>/stable/4096/1/manual_time                      0.157 ms        0.176 ms         4469
Sort<true>/stable/32768/1/manual_time                     0.222 ms        0.239 ms         3143
Sort<true>/stable/262144/1/manual_time                    0.345 ms        0.365 ms         2041
Sort<true>/stable/2097152/1/manual_time                    3.94 ms         3.98 ms          173
Sort<true>/stable/16777216/1/manual_time                   33.2 ms         33.3 ms           21
Sort<true>/stable/67108864/1/manual_time                    144 ms          144 ms            5
Sort<true>/stable/1024/8/manual_time                      0.724 ms        0.741 ms          952
Sort<true>/stable/4096/8/manual_time                      0.908 ms        0.928 ms          732
Sort<true>/stable/32768/8/manual_time                      1.22 ms         1.24 ms          567
Sort<true>/stable/262144/8/manual_time                     1.63 ms         1.67 ms          427
Sort<true>/stable/2097152/8/manual_time                    27.8 ms         27.8 ms           25
Sort<true>/stable/16777216/8/manual_time                    452 ms          452 ms            2
Sort<true>/stable/67108864/8/manual_time                   2716 ms         2716 ms            1
Sort/strings/1024/manual_time                              1.00 ms         1.02 ms          692
Sort/strings/4096/manual_time                              1.69 ms         1.71 ms          408
Sort/strings/32768/manual_time                             2.30 ms         2.32 ms          305
Sort/strings/262144/manual_time                            4.05 ms         4.07 ms          173
Sort/strings/2097152/manual_time                           53.7 ms         53.7 ms           13
Sort/strings/16777216/manual_time                           506 ms          506 ms            2

rapids-bot bot pushed a commit to rapidsai/cuspatial that referenced this pull request Apr 12, 2021
[cuDF updated to Thrust v1.12.0 today](rapidsai/cudf#7923), which means we need to update too.

* Fixes GCC 9 RVO compile warning/error
* Removes our `make_zip_iterator` convenience fn because it's now part of Thrust

Thrust 1.11.0 included a new radix sort in CUB, so hopefully we see some nice speedups in our routines as well. Hooray!

Authors:
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - Keith Kraus (https://github.com/kkraus14)
  - Christopher Harris (https://github.com/cwharris)

URL: #379
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants