Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve string gather performance #7433

Merged
merged 5 commits into from
Mar 2, 2021

Conversation

jlowe
Copy link
Member

@jlowe jlowe commented Feb 24, 2021

This improves string gather performance by using an algorithm with character-level parallelism. A string gather benchmark with varying row counts and row lengths has also been added.

@jlowe jlowe added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Feb 24, 2021
@jlowe jlowe requested a review from davidwendt February 24, 2021 01:18
@jlowe jlowe self-assigned this Feb 24, 2021
@jlowe jlowe requested a review from a team as a code owner February 24, 2021 01:18
@jlowe jlowe requested a review from harrism February 24, 2021 01:18
@jlowe
Copy link
Member Author

jlowe commented Feb 24, 2021

Benchmark numbers.
Before:

StringCopy/gather/4096/32/manual_time           0.108 ms        0.128 ms         5990 bytes_per_second=639.892M/s
StringCopy/gather/4096/128/manual_time          0.155 ms        0.175 ms         4285 bytes_per_second=1.70927G/s
StringCopy/gather/4096/512/manual_time          0.378 ms        0.396 ms         1842 bytes_per_second=2.83254G/s
StringCopy/gather/4096/2048/manual_time          1.22 ms         1.24 ms          568 bytes_per_second=3.49513G/s
StringCopy/gather/4096/8192/manual_time          4.59 ms         4.61 ms          153 bytes_per_second=3.69724G/s
StringCopy/gather/32768/32/manual_time          0.121 ms        0.140 ms         5509 bytes_per_second=4.46532G/s
StringCopy/gather/32768/128/manual_time         0.203 ms        0.221 ms         3371 bytes_per_second=10.503G/s
StringCopy/gather/32768/512/manual_time         0.574 ms        0.597 ms         1162 bytes_per_second=14.8195G/s
StringCopy/gather/32768/2048/manual_time         1.97 ms         1.99 ms          355 bytes_per_second=17.3396G/s
StringCopy/gather/32768/8192/manual_time         7.66 ms         7.68 ms           91 bytes_per_second=17.7785G/s
StringCopy/gather/262144/32/manual_time         0.255 ms        0.273 ms         2727 bytes_per_second=16.9522G/s
StringCopy/gather/262144/128/manual_time         4.15 ms         4.17 ms          169 bytes_per_second=4.11681G/s
StringCopy/gather/262144/512/manual_time         36.9 ms         36.9 ms           19 bytes_per_second=1.84783G/s
StringCopy/gather/262144/2048/manual_time         188 ms          188 ms            4 bytes_per_second=1.44939G/s
StringCopy/gather/2097152/32/manual_time         7.21 ms         7.23 ms           97 bytes_per_second=4.78411G/s
StringCopy/gather/2097152/128/manual_time        43.9 ms         44.0 ms           16 bytes_per_second=3.10625G/s
StringCopy/gather/2097152/512/manual_time         309 ms          309 ms            2 bytes_per_second=1.76693G/s
StringCopy/gather/16777216/32/manual_time        76.2 ms         76.2 ms            9 bytes_per_second=3.62227G/s

After:

-----------------------------------------------------------------------------------------------------
Benchmark                                           Time             CPU   Iterations UserCounters...
-----------------------------------------------------------------------------------------------------
StringCopy/gather/4096/32/manual_time           0.096 ms        0.116 ms         6959 bytes_per_second=723.264M/s
StringCopy/gather/4096/128/manual_time          0.101 ms        0.120 ms         6198 bytes_per_second=2.62843G/s
StringCopy/gather/4096/512/manual_time          0.129 ms        0.148 ms         4996 bytes_per_second=8.27184G/s
StringCopy/gather/4096/2048/manual_time         0.234 ms        0.252 ms         2927 bytes_per_second=18.2411G/s
StringCopy/gather/4096/8192/manual_time         0.651 ms        0.671 ms         1040 bytes_per_second=26.0267G/s
StringCopy/gather/32768/32/manual_time          0.121 ms        0.140 ms         5523 bytes_per_second=4.44146G/s
StringCopy/gather/32768/128/manual_time         0.182 ms        0.200 ms         3737 bytes_per_second=11.7394G/s
StringCopy/gather/32768/512/manual_time         0.423 ms        0.442 ms         1650 bytes_per_second=20.1458G/s
StringCopy/gather/32768/2048/manual_time         1.40 ms         1.42 ms          496 bytes_per_second=24.3906G/s
StringCopy/gather/32768/8192/manual_time         5.28 ms         5.30 ms          131 bytes_per_second=25.7635G/s
StringCopy/gather/262144/32/manual_time         0.309 ms        0.328 ms         2245 bytes_per_second=13.9549G/s
StringCopy/gather/262144/128/manual_time        0.860 ms        0.879 ms          805 bytes_per_second=19.8374G/s
StringCopy/gather/262144/512/manual_time         3.07 ms         3.09 ms          228 bytes_per_second=22.194G/s
StringCopy/gather/262144/2048/manual_time        11.9 ms         12.0 ms           59 bytes_per_second=22.8614G/s
StringCopy/gather/2097152/32/manual_time         2.04 ms         2.05 ms          344 bytes_per_second=16.9527G/s
StringCopy/gather/2097152/128/manual_time        6.97 ms         6.99 ms          100 bytes_per_second=19.5739G/s
StringCopy/gather/2097152/512/manual_time        26.8 ms         26.8 ms           26 bytes_per_second=20.3467G/s
StringCopy/gather/16777216/32/manual_time        18.9 ms         18.9 ms           37 bytes_per_second=14.6127G/s

@jrhemstad
Copy link
Contributor

Very nice improvement. There's room for even further improvement by doing larger than 1B read/writes when possible as well, but that would be quite a bit more complex and I'm guessing the current speedup is satisfactory.

@kkraus14
Copy link
Collaborator

Very nice improvement. There's room for even further improvement by doing larger than 1B read/writes when possible as well, but that would be quite a bit more complex and I'm guessing the current speedup is satisfactory.

I think this is something that @gaohao95 is planning on looking into in the near future.

@codecov
Copy link

codecov bot commented Feb 24, 2021

Codecov Report

Merging #7433 (c75bb85) into branch-0.19 (43b44e1) will increase coverage by 0.43%.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7433      +/-   ##
===============================================
+ Coverage        81.80%   82.24%   +0.43%     
===============================================
  Files              101      101              
  Lines            16695    17069     +374     
===============================================
+ Hits             13658    14039     +381     
+ Misses            3037     3030       -7     
Impacted Files Coverage Δ
python/cudf/cudf/utils/gpu_utils.py 53.65% <0.00%> (-4.88%) ⬇️
python/cudf/cudf/core/abc.py 87.23% <0.00%> (-1.14%) ⬇️
python/cudf/cudf/io/feather.py 100.00% <0.00%> (ø)
python/cudf/cudf/comm/serialize.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/struct.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/hash_vocab_utils.py 100.00% <0.00%> (ø)
python/dask_cudf/dask_cudf/io/tests/test_csv.py 100.00% <0.00%> (ø)
... and 40 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 2234554...c75bb85. Read the comment docs.

Copy link
Contributor

@davidwendt davidwendt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work Jason.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 26, 2021

Great work. But since your work in mostly related to include/cudf/strings/*.* and src/strings/*.*, should we change cpp/tests/copying/gather_str_tests.cu into cpp/tests/strings/gather_str_tests.cu?

@jlowe
Copy link
Member Author

jlowe commented Feb 26, 2021

should we change cpp/tests/copying/gather_str_tests.cu into cpp/tests/strings/gather_str_tests.cu?

This test is organized by how the public API is organized. cudf::gather is in copying.hpp so IMO it makes sense that the tests for it are under copying. If we want to shuffle test files around, I think that's better left to a PR dedicated to that.

Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, thanks!

@harrism
Copy link
Member

harrism commented Mar 2, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit fb7f7c3 into rapidsai:branch-0.19 Mar 2, 2021
@jlowe jlowe deleted the string-gather-perf branch September 10, 2021 15:43
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants