Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix cudf::strings:split logic for many columns #4922

Merged

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Apr 16, 2020

Closes #4885
The cudf::strings::split() and cudf::strings::rsplit() logic builds a column for each set of tokens by splitting a single string column vertically. That is, all of the first tokens for each string are placed in the first response column. The 2nd tokens in the second column, etc. The number of columns returned is based on the string with the most tokens.

The split()/rsplit() logic was creating each column individually to perhaps minimize memory usage. This means finding the strings for column 10 required parsing through 9 delimiters of each string and then finding the strings for column 11 repeated the same process to 10 delimiters.

As a baseline, I tested with 1000 rows of strings each with ~2KB of characters containing uniformly placed comma "," characters. Running split() with a ',' delimiter creates 100 columns. Pandas execution is about 50ms on my machine. The cudf::strings::split() call was 6.1s. Further details in #4885. An nsys profile showed that ~100ms was spent counting the tokens and the rest was iteratively building the columns. This included interleaved calls to make_strings_column.

The new approach for this PR is to create all the token positions for all string first and then just gather the appropriate token positions into a vector in order to call make_strings_column factory that accepts pointer/position pairs. This sped up the performance significantly since resolving each column did not need to restart searching from the beginning of each string.

Next, code was added to find delimiters by parallelizing the search over all the chars column bytes instead of just per string. Then, each delimiter was associated to its corresponding string by using upper_bound on the offsets column with the delimiter positions. Now it was a matter of computing the token positions purely from the delimiter positions. This reduced the split() time to about 150ms.

A further optimization was made by computing the number of tokens by just using the delimiters positions as well. This final solution executes in 15ms which is about 400x improvement overall. The speedup over Pandas increases with the number of strings.

rows\cols 100 200 300 400
1000 1.4x 1.4x 1.5x 1.6x
10000 12.9x 12.6x 14.3 15.3x
50000 51.8x 50.2x 54.5 63.5x
100000 73.9x 80.1x 82.9 95.1x

Additional changes in this PR

  • moved the contiguous_split_record functions to their own source file (split_record.cu); no code was common and there was no practical reason for this to be in the same source file as regular split()
  • fixed a bug in the column_utilities when printing strings with no validity mask
  • fixed compile warning in rolling.cu because it bugged me
  • added more split/rsplit tests

@davidwendt davidwendt self-assigned this Apr 16, 2020
@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. labels Apr 16, 2020
@codecov
Copy link

codecov bot commented Apr 17, 2020

Codecov Report

Merging #4922 into branch-0.14 will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@             Coverage Diff              @@
##           branch-0.14    #4922   +/-   ##
============================================
  Coverage        88.43%   88.43%           
============================================
  Files               54       54           
  Lines            10201    10201           
============================================
  Hits              9021     9021           
  Misses            1180     1180           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 5bc4ede...5bc4ede. Read the comment docs.

@@ -0,0 +1,605 @@
/*
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not new code. This code was simply moved out of split.cu in which it shared no common code.

@@ -368,7 +368,7 @@ struct rolling_window_launcher
// The rows that represent null elements will be having negative values in gather map,
// and that's why nullify_out_of_bounds/ignore_out_of_bounds is true.
auto output_table = detail::gather(table_view{{input}}, output->view(), false, true, false, mr, stream);
return std::make_unique<cudf::column>(std::move(output_table->get_column(0)));;
output = std::make_unique<cudf::column>(std::move(output_table->get_column(0)));;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This simply removes a compile warning.

@davidwendt davidwendt marked this pull request as ready for review April 20, 2020 16:16
@davidwendt davidwendt requested review from a team as code owners April 20, 2020 16:16
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Apr 20, 2020
@davidwendt davidwendt changed the title [WIP] Fix cudf::strings:split logic for many columns [REVIEW] Fix cudf::strings:split logic for many columns Apr 20, 2020
@kkraus14 kkraus14 added strings strings issues (C++ and Python) 4 - Needs Review Waiting for reviewer to review or respond labels Apr 20, 2020
Copy link
Collaborator

@kkraus14 kkraus14 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cmake is good

cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@trevorsm7 trevorsm7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments about comments, but mainly would like to see a test case for overlapping delimiters

cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split.cu Outdated Show resolved Hide resolved
cpp/tests/strings/split_tests.cpp Outdated Show resolved Hide resolved
@davidwendt davidwendt requested a review from trevorsm7 April 24, 2020 19:07
Copy link
Contributor

@trevorsm7 trevorsm7 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test for overlapping separators looks good

@harrism harrism merged commit 4063f4c into rapidsai:branch-0.14 Apr 27, 2020
@davidwendt davidwendt deleted the perf-strings-split-to-many-columns branch April 27, 2020 11:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cuDF str.split performance appears to be orders of magnitude slower than Pandas.
5 participants