[REVIEW] Fix cudf::strings:split logic for many columns #4922
Conversation
Codecov Report
```
@@            Coverage Diff             @@
##           branch-0.14   #4922   +/-  ##
============================================
  Coverage       88.43%   88.43%
============================================
  Files              54       54
  Lines           10201    10201
============================================
  Hits             9021     9021
  Misses           1180     1180
```

Continue to review the full report at Codecov.
```diff
@@ -0,0 +1,605 @@
/*
```
This is not new code. It was simply moved out of split.cu, with which it shared no common code.
cpp/src/rolling/rolling.cu
```diff
@@ -368,7 +368,7 @@ struct rolling_window_launcher
       // The rows that represent null elements will be having negative values in gather map,
       // and that's why nullify_out_of_bounds/ignore_out_of_bounds is true.
       auto output_table = detail::gather(table_view{{input}}, output->view(), false, true, false, mr, stream);
-      return std::make_unique<cudf::column>(std::move(output_table->get_column(0)));;
+      output = std::make_unique<cudf::column>(std::move(output_table->get_column(0)));;
```
This simply removes a compile warning.
cmake is good
Left some comments about comments, but mainly I would like to see a test case for overlapping delimiters.
Test for overlapping separators looks good
Closes #4885
The `cudf::strings::split()` and `cudf::strings::rsplit()` logic builds a column for each set of tokens by splitting a single strings column vertically. That is, all of the first tokens for each string are placed in the first output column, the second tokens in the second column, and so on. The number of columns returned is determined by the string with the most tokens.
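To make the layout concrete, here is a minimal host-side sketch of a vertical split in plain C++ (an illustration only, not the cudf device code; `vertical_split` is a hypothetical name, and where cudf would produce nulls this sketch pads with empty strings):

```cpp
#include <algorithm>
#include <cstddef>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

// Illustration of the vertical split: the k-th token of every row lands in
// output column k. cudf produces nulls for missing tokens; this sketch pads
// with empty strings instead.
std::vector<std::vector<std::string>> vertical_split(
  std::vector<std::string> const& rows, char delim)
{
  std::vector<std::vector<std::string>> row_tokens;
  std::size_t max_tokens = 0;
  for (auto const& row : rows) {
    std::vector<std::string> tokens;
    std::istringstream ss(row);
    for (std::string tok; std::getline(ss, tok, delim);) tokens.push_back(tok);
    max_tokens = std::max(max_tokens, tokens.size());
    row_tokens.push_back(std::move(tokens));
  }
  // One output column per token position; short rows get padded entries.
  std::vector<std::vector<std::string>> columns(max_tokens);
  for (std::size_t col = 0; col < max_tokens; ++col)
    for (auto const& tokens : row_tokens)
      columns[col].push_back(col < tokens.size() ? tokens[col] : "");
  return columns;
}

int main()
{
  // {"a,b,c", "d,e"} -> col 0: {"a","d"}, col 1: {"b","e"}, col 2: {"c",""}
  for (auto const& col : vertical_split({"a,b,c", "d,e"}, ',')) {
    for (auto const& s : col) std::cout << '"' << s << "\" ";
    std::cout << '\n';
  }
}
```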
The `split()`/`rsplit()` logic was creating each column individually, perhaps to minimize memory usage. This means that finding the strings for column 10 required parsing through 9 delimiters of each string, and then finding the strings for column 11 repeated the same process through 10 delimiters.

As a baseline, I tested with 1000 rows of strings, each with ~2KB of characters containing uniformly placed comma (",") characters. Running `split()` with a `','` delimiter creates 100 columns. Pandas execution is about 50ms on my machine. The `cudf::strings::split()` call was 6.1s. Further details are in #4885. An nsys profile showed that ~100ms was spent counting the tokens and the rest was spent iteratively building the columns, including interleaved calls to `make_strings_column`.
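For reference, input of that shape could be generated along these lines (a hypothetical sketch; the actual benchmark code is not shown here). With ~2KB rows and a comma every ~20 bytes, splitting on `','` yields roughly 100 columns:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical generator matching the benchmark description: each row is
// ~2KB of characters with a uniformly placed comma every token_width bytes,
// so splitting on ',' produces about row_bytes / token_width (~100) columns.
std::vector<std::string> make_benchmark_rows(std::size_t num_rows    = 1000,
                                             std::size_t row_bytes   = 2048,
                                             std::size_t token_width = 20)
{
  std::vector<std::string> rows(num_rows);
  for (auto& row : rows) {
    row.reserve(row_bytes);
    while (row.size() + token_width <= row_bytes) {
      row.append(token_width - 1, 'x');  // one token's worth of characters
      row.push_back(',');                // uniformly placed delimiter
    }
  }
  return rows;
}
```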
The new approach in this PR is to create all the token positions for all strings first, and then gather the appropriate token positions into a vector in order to call the `make_strings_column` factory that accepts pointer/position pairs. This sped up the performance significantly, since resolving each column did not need to restart searching from the beginning of each string.
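The gist of that gather step, as a host-side sketch (illustrative names; in the PR the positions live in device memory and the resulting pairs feed the `make_strings_column` factory):

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// One token's byte range inside the contiguous chars buffer.
struct token_pos {
  std::size_t begin;
  std::size_t end;
};

// With all token positions computed up front (tokens[row][k] is the k-th
// token of row), building output column `col` is a plain gather: O(rows)
// work per column, with no re-scan from the start of each string.
std::vector<std::pair<char const*, std::size_t>> gather_column(
  std::vector<std::vector<token_pos>> const& tokens,
  char const* chars,  // base pointer of the chars buffer
  std::size_t col)
{
  std::vector<std::pair<char const*, std::size_t>> pairs;
  pairs.reserve(tokens.size());
  for (auto const& row : tokens) {
    if (col < row.size())
      pairs.emplace_back(chars + row[col].begin, row[col].end - row[col].begin);
    else
      pairs.emplace_back(nullptr, 0);  // row has fewer tokens: null entry
  }
  return pairs;  // in cudf, a device vector of such pairs feeds make_strings_column
}
```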
Next, code was added to find delimiters by parallelizing the search over all of the chars column's bytes instead of per string. Then each delimiter was associated with its corresponding string by using `upper_bound` on the offsets column with the delimiter positions. From there it was a matter of computing the token positions purely from the delimiter positions. This reduced the `split()` time to about 150ms.
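Here is a host-side sketch of those two steps over the usual chars-buffer-plus-offsets layout (illustrative names; on the device these loops become parallel primitives, e.g. a copy_if over byte indices and a batched upper_bound):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Step 1: one pass over every byte of the chars buffer (parallel over bytes
// on the device) recording the position of each delimiter.
std::vector<std::size_t> find_delimiters(std::string const& chars, char delim)
{
  std::vector<std::size_t> positions;
  for (std::size_t i = 0; i < chars.size(); ++i)
    if (chars[i] == delim) positions.push_back(i);
  return positions;
}

// Step 2: map each delimiter position to the row containing it via a binary
// search on the offsets array -- the upper_bound step from the description.
std::vector<std::size_t> delimiter_rows(
  std::vector<std::size_t> const& offsets,   // num_rows + 1 entries
  std::vector<std::size_t> const& positions) // sorted delimiter positions
{
  std::vector<std::size_t> rows;
  rows.reserve(positions.size());
  for (auto pos : positions) {
    auto it = std::upper_bound(offsets.begin(), offsets.end(), pos);
    rows.push_back(static_cast<std::size_t>(it - offsets.begin()) - 1);
  }
  return rows;
}
```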
A further optimization computes the number of tokens directly from the delimiter positions as well. This final solution executes in 15ms, which is roughly a 400x improvement overall. The speedup over Pandas increases with the number of strings.
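Counting tokens falls out of the same delimiter data: each row has one more token than it has delimiters. A sketch, with `maxsplit` handling omitted:

```cpp
#include <cstddef>
#include <vector>

// Tokens per row = delimiters in that row + 1, computed directly from the
// per-delimiter row ids produced by the upper_bound step above.
std::vector<std::size_t> count_tokens(
  std::vector<std::size_t> const& delim_rows,  // row id of each delimiter
  std::size_t num_rows)
{
  std::vector<std::size_t> counts(num_rows, 1);  // every row has at least one token
  for (auto row : delim_rows) ++counts[row];
  return counts;
}
```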
Additional changes in this PR:
- Moved the `contiguous_split_record` functions to their own source file (`split_record.cu`); no code was common and there was no practical reason for this to be in the same source file as regular `split()`
- Fixed `column_utilities` when printing strings with no validity mask
- Fixed a double semicolon in `rolling.cu` because it bugged me