[REVIEW] Fix cudf::strings:split logic for many columns #4922
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           branch-0.14   #4922   +/-  ##
============================================
  Coverage       88.43%   88.43%
============================================
  Files              54       54
  Lines           10201    10201
============================================
  Hits             9021     9021
  Misses           1180     1180
```

Continue to review the full report at Codecov.
```diff
@@ -0,0 +1,605 @@
/*
```
This is not new code. It was simply moved out of split.cu, with which it shared no common code.
cpp/src/rolling/rolling.cu
```diff
@@ -368,7 +368,7 @@ struct rolling_window_launcher
       // The rows that represent null elements will be having negative values in gather map,
       // and that's why nullify_out_of_bounds/ignore_out_of_bounds is true.
       auto output_table = detail::gather(table_view{{input}}, output->view(), false, true, false, mr, stream);
-      return std::make_unique<cudf::column>(std::move(output_table->get_column(0)));;
+      output = std::make_unique<cudf::column>(std::move(output_table->get_column(0)));;
```
This simply removes a compile warning.
cmake is good
Left some comments about comments, but mainly I would like to see a test case for overlapping delimiters.
Test for overlapping separators looks good
Closes #4885
The `cudf::strings::split()` and `cudf::strings::rsplit()` logic builds a column for each set of tokens by splitting a single strings column vertically. That is, all of the first tokens for each string are placed in the first output column, the second tokens in the second column, and so on. The number of columns returned is determined by the string with the most tokens.
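To make the layout concrete, here is a minimal host-side sketch of a vertical split in plain C++ (an illustration only, not the cudf device code; `vertical_split` is a hypothetical name, and where cudf would produce nulls this sketch pads with empty strings):

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustration of the vertical split: the k-th token of every row lands in
// output column k. cudf produces nulls for missing tokens; this sketch pads
// with empty strings instead.
std::vector<std::vector<std::string>> vertical_split(
  std::vector<std::string> const& rows, char delim)
{
  std::vector<std::vector<std::string>> row_tokens;
  std::size_t max_tokens = 0;
  for (auto const& row : rows) {
    std::vector<std::string> tokens;
    std::istringstream ss(row);
    for (std::string tok; std::getline(ss, tok, delim);) tokens.push_back(tok);
    max_tokens = std::max(max_tokens, tokens.size());
    row_tokens.push_back(std::move(tokens));
  }
  // One output column per token position; short rows get padded entries.
  std::vector<std::vector<std::string>> columns(max_tokens);
  for (std::size_t col = 0; col < max_tokens; ++col)
    for (auto const& tokens : row_tokens)
      columns[col].push_back(col < tokens.size() ? tokens[col] : "");
  return columns;
}

int main()
{
  // {"a,b,c", "d,e"} -> col 0: {"a","d"}, col 1: {"b","e"}, col 2: {"c",""}
  for (auto const& col : vertical_split({"a,b,c", "d,e"}, ',')) {
    for (auto const& s : col) std::cout << '"' << s << "\" ";
    std::cout << '\n';
  }
}
```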
The `split()`/`rsplit()` logic was creating each column individually, perhaps to minimize memory usage. This means that finding the strings for column 10 required parsing through 9 delimiters of each string, and then finding the strings for column 11 repeated the same process through 10 delimiters.

As a baseline, I tested with 1000 rows of strings, each with ~2KB of characters containing uniformly placed comma (",") characters. Running `split()` with a `','` delimiter creates 100 columns. Pandas execution is about 50ms on my machine. The `cudf::strings::split()` call was 6.1s. Further details are in #4885. An nsys profile showed that ~100ms was spent counting the tokens and the rest was spent iteratively building the columns, including interleaved calls to `make_strings_column`.
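For reference, input of that shape could be generated along these lines (a hypothetical sketch; the actual benchmark code is not shown here). With ~2KB rows and a comma every ~20 bytes, splitting on `','` yields roughly 100 columns:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical generator matching the benchmark description: each row is
// ~2KB of characters with a uniformly placed comma every token_width bytes,
// so splitting on ',' produces about row_bytes / token_width (~100) columns.
std::vector<std::string> make_benchmark_rows(std::size_t num_rows    = 1000,
                                             std::size_t row_bytes   = 2048,
                                             std::size_t token_width = 20)
{
  std::vector<std::string> rows(num_rows);
  for (auto& row : rows) {
    row.reserve(row_bytes);
    while (row.size() + token_width <= row_bytes) {
      row.append(token_width - 1, 'x');  // one token's worth of characters
      row.push_back(',');                // uniformly placed delimiter
    }
  }
  return rows;
}
```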
The new approach in this PR is to create all the token positions for all strings first, and then gather the appropriate token positions into a vector in order to call the `make_strings_column` factory that accepts pointer/position pairs. This sped up the performance significantly, since resolving each column did not need to restart searching from the beginning of each string.
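The gist of that gather step, as a host-side sketch (illustrative names; in the PR the positions live in device memory and the resulting pairs feed the `make_strings_column` factory):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One token's byte range inside the contiguous chars buffer.
struct token_pos {
  std::size_t begin;
  std::size_t end;
};

// With all token positions computed up front (tokens[row][k] is the k-th
// token of row), building output column `col` is a plain gather: O(rows)
// work per column, with no re-scan from the start of each string.
std::vector<std::pair<char const*, std::size_t>> gather_column(
  std::vector<std::vector<token_pos>> const& tokens,
  char const* chars,  // base pointer of the chars buffer
  std::size_t col)
{
  std::vector<std::pair<char const*, std::size_t>> pairs;
  pairs.reserve(tokens.size());
  for (auto const& row : tokens) {
    if (col < row.size())
      pairs.emplace_back(chars + row[col].begin, row[col].end - row[col].begin);
    else
      pairs.emplace_back(nullptr, 0);  // row has fewer tokens: null entry
  }
  return pairs;  // in cudf, a device vector of such pairs feeds make_strings_column
}
```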
Next, code was added to find delimiters by parallelizing the search over all of the chars column's bytes instead of per string. Then each delimiter was associated with its corresponding string by using `upper_bound` on the offsets column with the delimiter positions. From there it was a matter of computing the token positions purely from the delimiter positions. This reduced the `split()` time to about 150ms.
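Here is a host-side sketch of those two steps over the usual chars-buffer-plus-offsets layout (illustrative names; on the device these loops become parallel primitives, e.g. a copy_if over byte indices and a batched upper_bound):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Step 1: one pass over every byte of the chars buffer (parallel over bytes
// on the device) recording the position of each delimiter.
std::vector<std::size_t> find_delimiters(std::string const& chars, char delim)
{
  std::vector<std::size_t> positions;
  for (std::size_t i = 0; i < chars.size(); ++i)
    if (chars[i] == delim) positions.push_back(i);
  return positions;
}

// Step 2: map each delimiter position to the row containing it via a binary
// search on the offsets array -- the upper_bound step from the description.
std::vector<std::size_t> delimiter_rows(
  std::vector<std::size_t> const& offsets,   // num_rows + 1 entries
  std::vector<std::size_t> const& positions) // sorted delimiter positions
{
  std::vector<std::size_t> rows;
  rows.reserve(positions.size());
  for (auto pos : positions) {
    auto it = std::upper_bound(offsets.begin(), offsets.end(), pos);
    rows.push_back(static_cast<std::size_t>(it - offsets.begin()) - 1);
  }
  return rows;
}
```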
A further optimization computes the number of tokens directly from the delimiter positions as well. This final solution executes in 15ms, which is roughly a 400x improvement overall. The speedup over Pandas increases with the number of strings.
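Counting tokens falls out of the same delimiter data: each row has one more token than it has delimiters. A sketch, with `maxsplit` handling omitted:

```cpp
#include <cstddef>
#include <vector>

// Tokens per row = delimiters in that row + 1, computed directly from the
// per-delimiter row ids produced by the upper_bound step above.
std::vector<std::size_t> count_tokens(
  std::vector<std::size_t> const& delim_rows,  // row id of each delimiter
  std::size_t num_rows)
{
  std::vector<std::size_t> counts(num_rows, 1);  // every row has at least one token
  for (auto row : delim_rows) ++counts[row];
  return counts;
}
```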
Additional changes in this PR:
- Moved the `contiguous_split_record` functions to their own source file (`split_record.cu`); no code was common and there was no practical reason for this to be in the same source file as regular `split()`
- Fixed `column_utilities` when printing strings with no validity mask
- Fixed a double semicolon in `rolling.cu` because it bugged me