Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimize cudf::make_strings_column for long strings #7576

Merged
merged 15 commits into from
Mar 18, 2021

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Mar 11, 2021

Reference #7571
This improves the performance of the cudf::make_strings_column for long strings. It uses a similar approach from cudf::strings::detail::gather and also use thresholding as in the optimized cudf::strings::replace.
This may not be the right solution for overall optimizing #7571 but may be helpful in other places where long strings are used for created a strings column in libcudf.
This PR also includes a gbenchmark to help measure the performance results of this factory function. The results of the benchmark are that longer strings (~ >64 bytes on average) showed about a 10x improvement. I can post benchmark results here if needed. The character-parallel algorithm was slower for shorter strings so the existing algorithm is used based on the a threshold calculation.
I also added an additional gtest with a mixture of nulls and empty strings to make sure the new algorithm handles these correctly.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Mar 11, 2021
@davidwendt davidwendt self-assigned this Mar 11, 2021
@davidwendt davidwendt requested review from a team as code owners March 11, 2021 20:59
@github-actions github-actions bot added the CMake CMake build issue label Mar 11, 2021
@davidwendt davidwendt added the Performance Performance related issue label Mar 11, 2021
@codecov
Copy link

codecov bot commented Mar 12, 2021

Codecov Report

Merging #7576 (6163360) into branch-0.19 (7871e7a) will increase coverage by 0.13%.
The diff coverage is 93.78%.

Impacted file tree graph

@@               Coverage Diff               @@
##           branch-0.19    #7576      +/-   ##
===============================================
+ Coverage        81.86%   82.00%   +0.13%     
===============================================
  Files              101      101              
  Lines            16884    16991     +107     
===============================================
+ Hits             13822    13933     +111     
+ Misses            3062     3058       -4     
Impacted Files Coverage Δ
python/cudf/cudf/core/index.py 92.86% <ø> (ø)
python/cudf/cudf/core/column/numerical.py 94.83% <87.50%> (-0.20%) ⬇️
python/cudf/cudf/core/frame.py 89.00% <89.47%> (-0.02%) ⬇️
python/cudf/cudf/core/column/column.py 87.77% <90.00%> (+0.01%) ⬆️
python/cudf/cudf/core/column/decimal.py 92.75% <90.32%> (-2.12%) ⬇️
python/cudf/cudf/core/dataframe.py 90.45% <95.65%> (-0.01%) ⬇️
python/cudf/cudf/core/series.py 91.25% <95.83%> (+0.47%) ⬆️
python/cudf/cudf/core/column/categorical.py 91.62% <100.00%> (+0.23%) ⬆️
python/cudf/cudf/core/column/datetime.py 89.09% <100.00%> (ø)
python/cudf/cudf/core/column/string.py 86.58% <100.00%> (+0.08%) ⬆️
... and 15 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0b766c5...6163360. Read the comment docs.

cpp/benchmarks/string/factory_benchmark.cu Outdated Show resolved Hide resolved
#include "string_bench_args.hpp"

namespace {
using string_pair = thrust::pair<char const*, cudf::size_type>;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"string_pair" should mean a pair of strings, so should we use some other name?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Most usages of 'pair' indicate two different types.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Well, I don't agree with that statement... but I also don't have a better name suggestion.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The pair contains data pointer and data size, similar to a span. How about string_span?

cpp/benchmarks/string/factory_benchmark.cu Show resolved Hide resolved
@davidwendt davidwendt requested a review from jlowe March 16, 2021 13:27
Copy link
Member

@jlowe jlowe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working on this, @davidwendt! I have a small comment suggestion, but otherwise this looks good to me. I ran this on the dataset from #7545 and make_strings_column is now 28.5 msec instead of 371.2 msec. Not too shabby! The cost is still a lot larger than what is spent decompressing and decoding the strings from Parquet for this dataset, but it's a huge improvement from where it was.

@davidwendt
Copy link
Contributor Author

... The cost is still a lot larger than what is spent decompressing and decoding the strings from Parquet for this dataset, but it's a huge improvement from where it was.

We discussed that maybe a better solution is for the Parquet reader to just build the chars and offsets directly instead of using this factory function. But this improvement will also be good for other places that use the factory.

@jrhemstad
Copy link
Contributor

We discussed that maybe a better solution is for the Parquet reader to just build the chars and offsets directly instead of using this factory function. But this improvement will also be good for other places that use the factory.

I discussed this with @devavret and there really isn't anything better they can do in the Parquet reader other then reinventing the implementation of this factory. The string data in the Parquet layout is stored as {string, size, string, size, ...}. So there's no better way to do it than coalescing a bunch of disparate string_views.

@davidwendt davidwendt requested a review from jrhemstad March 17, 2021 21:27
@kkraus14 kkraus14 added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Mar 17, 2021
@kkraus14
Copy link
Collaborator

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 9aa33ef into rapidsai:branch-0.19 Mar 18, 2021
@davidwendt davidwendt deleted the make-strings-column branch March 18, 2021 08:35
rapids-bot bot pushed a commit that referenced this pull request Mar 24, 2021
Reference #5696
Creates gbenchmarks for `nvtext::tokenize()`, `nvtext::count_tokens()` and `nvtext::ngrams_tokenize()` functions.
The benchmarks measures various string lengths and number of rows.

These functions use the `make_strings_column` factory optimized in #7576

Authors:
  - David (@davidwendt)

Approvers:
  - Conor Hoekstra (@codereport)
  - Nghia Truong (@ttnghia)
  - Mark Harris (@harrism)

URL: #7684
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Performance Performance related issue strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants