Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce compile time/size for scan.cu #7516

Merged
merged 4 commits into from
Mar 9, 2021

Conversation

davidwendt
Copy link
Contributor

This PR reduces the number of calls to inclusive_scan and exclusive_scan by using a null_replace_accessor that allows non-nullable columns. This reduces the compile time and size of scan.cu by half. This PR also includes a scan gbenchmark that shows no change in performance from the original implementation.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 4, 2021
@davidwendt davidwendt self-assigned this Mar 4, 2021
@davidwendt davidwendt requested review from a team as code owners March 4, 2021 20:56
@davidwendt davidwendt requested review from trxcllnt and jrhemstad March 4, 2021 20:56
@github-actions github-actions bot added the CMake CMake build issue label Mar 4, 2021
@davidwendt
Copy link
Contributor Author

baseline refactored
compile of scan.cu 194 s 113 s
scan.cu.o size 6.42MB 3.54MB
libcudf.so size 353MB 351MB
debug compile scan.cu 27.3 min 13.6 min
debug scan.cu.o size 361MB 188MB

@codecov
Copy link

codecov bot commented Mar 6, 2021

Codecov Report

Merging #7516 (8eabdc1) into branch-0.19 (7871e7a) will increase coverage by 0.43%.
The diff coverage is 93.75%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7516      +/-   ##
===============================================
+ Coverage        81.86%   82.29%   +0.43%     
===============================================
  Files              101      101              
  Lines            16884    17264     +380     
===============================================
+ Hits             13822    14208     +386     
+ Misses            3062     3056       -6     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/column.py 87.80% <75.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/decimal.py 95.83% <100.00%> (+0.96%) ⬆️
python/cudf/cudf/core/column/string.py 86.76% <100.00%> (+0.26%) ⬆️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_orc.py 100.00% <0.00%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update cc104ef...6bf0175. Read the comment docs.

@hyperbolic2346
Copy link
Contributor

This improvement is great and much needed. Do you know how this impacts the run-time performance of scan?

@davidwendt
Copy link
Contributor Author

For nullable columns the performance does not change -- the code is almost identical so that is expected. The main change is for non-nullable columns which did not require a null-replacement-iterator. We are adding an extra non-divergent runtime if-check when retrieving the element. I reran my benchmarks and found this approach does slow a bit with smaller rows (10K) but never more than 1-2 microseconds. For larger rows (1M or more) the slow down is in nanoseconds and less than 1.25%.

Here are some benchmark results for some non-null fixed-width columns. I used the gbenchmark included in this PR on my desktop which is Ubuntu 18.04 with CUDA 11.0 on a GV100.

type size us diff % change
int32 10K 1.12 3.76%
int32 100K 0.743 2.36%
int32 1M 0.498 1.21%
int32 10M 0.692 0.42%
int32 100M -0.482 -0.03%
uint64 10K 0.825 2.72%
uint64 100K 0.470 1.44%
uint64 1M 0.362 0.62%
uint64 10M 0.359 0.12%
uint64 100M -0.208 -0.01%
float 10K 1.061 3.55%
float 100K 0.533 1.68%
float 1M 0.269 0.65%
float 10M 0.701 0.43%
float 100M -0.102 -0.01%

The negative values means that the null-replace implementation in this PR was technically faster though within a reasonable measurement fluctuation error. All of the smaller row columns run in about 32us. Only the 100M row benchmarks are above 1ms.

Honestly, I was hoping it would be faster since we are generating half the code. But thrust's scan is very fast with fixed-width types and trivial predicates like the ones in scan.cu.

@kkraus14
Copy link
Collaborator

kkraus14 commented Mar 9, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 444d9f2 into rapidsai:branch-0.19 Mar 9, 2021
@davidwendt davidwendt deleted the compile-scan branch March 9, 2021 14:53
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
This PR reduces the number of calls to `inclusive_scan` and `exclusive_scan` by using a `null_replace_accessor` that allows non-nullable columns. This reduces the compile time and size of `scan.cu` by half. This PR also includes a scan gbenchmark that shows no change in performance from the original implementation.

Authors:
  - David (@davidwendt)

Approvers:
  - Paul Taylor (@trxcllnt)
  - Jake Hemstad (@jrhemstad)

URL: rapidsai#7516
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants