
[REVIEW] compute substrings from beginning until delimiter or from a delimiter until end of string #5303

Merged: 20 commits, Jun 3, 2020

Conversation

sriramch (Contributor)

@sriramch added labels: feature request (New feature or request), 3 - Ready for Review (Ready for review by team), 4 - Needs Review (Waiting for reviewer to review or respond), Spark (Functionality that helps Spark RAPIDS) on May 27, 2020
@sriramch requested review from vuule and davidwendt May 27, 2020 22:27
@sriramch requested a review from a team as a code owner May 27, 2020 22:27
@GPUtester (Collaborator)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@davidwendt (Contributor)

This should not be against branch-0.14.

@davidwendt (Contributor)

It seems odd that a substring function uses a delimiter instead of an index. Maybe this is more like a strings split or partition function. From the description, it seems more like a multi-delimiter partition function.

@sriramch (Contributor, Author)

sriramch commented May 28, 2020

This should not be against branch-0.14.

[sc] do all new features from now on go against 0.15? the reason i'm asking is that i have been creating them against 0.14 thus far. is there a cut-off date for 0.14 (after which nothing other than bug fixes gets in)?

It seems odd that a substring function uses a delimiter instead of an index. Maybe this is more like a strings split or partition function. From the description, it seems more like a multi-delimiter partition function.

[sc] i have reused the function name from spark. should i rename this to split (analogous to the java api)? even though this tokenizes the string, it doesn't return all the tokens like split does (it returns only the leading or trailing token).
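For readers unfamiliar with the Spark API being reused here: substring_index(str, delim, count) returns the substring before the count-th occurrence of the delimiter when count is positive, and the substring after the count-th occurrence from the end when count is negative. A host-side sketch of those semantics on a single std::string (an illustration only, not the cudf device implementation in this PR):

```cpp
#include <cassert>
#include <string>

// Sketch of Spark-style substring_index semantics on one string.
// count > 0: everything before the count-th occurrence of delim (from the left).
// count < 0: everything after the |count|-th occurrence of delim (from the right).
// If delim occurs fewer than |count| times, the whole string is returned.
std::string substring_index(std::string const& str, std::string const& delim, int count)
{
  if (delim.empty() || count == 0) return std::string{};
  if (count > 0) {
    std::size_t pos = 0;
    for (int i = 0; i < count; ++i) {
      pos = str.find(delim, pos);               // forward search
      if (pos == std::string::npos) return str; // fewer delimiters than count
      pos += delim.size();
    }
    return str.substr(0, pos - delim.size());   // keep the leading piece
  }
  std::size_t pos = str.size();
  for (int i = 0; i < -count; ++i) {
    if (pos == 0) return str;                   // no more occurrences to the left
    pos = str.rfind(delim, pos - 1);            // reverse search
    if (pos == std::string::npos) return str;
  }
  return str.substr(pos + delim.size());        // keep the trailing piece
}
```

For example, substring_index("www.nvidia.com", ".", 1) yields the leading token "www", while a count of -1 yields the trailing token "com", which is the "leading or trailing token" behavior described above.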

@harrism (Member)

harrism commented May 28, 2020

https://docs.rapids.ai/releases/process/

Here are the current dates: https://docs.rapids.ai/maintainers

Once burndown starts, we generally don't accept new PRs unless they are urgent.

You just need to retarget this at 0.15 (click the edit button next to the PR title).

@davidwendt (Contributor)

This is very similar to cudf::strings::split

std::unique_ptr<table> split(strings_column_view const& strings_column,
                             string_scalar const& delimiter      = string_scalar(""),
                             size_type maxsplit                  = -1,
                             rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());	

and cudf::strings::partition

std::unique_ptr<table> partition(
  strings_column_view const& strings,
  string_scalar const& delimiter      = string_scalar(""),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

You may be able to use split()/rsplit() or partition()/rpartition() directly instead of

std::unique_ptr<column> substring_index(
  strings_column_view const& strings,
  string_scalar const& delimiter,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

Example:

  s = ['www.nvidia.com', null, 'www.google.com', '', 'foo' ]
  r = split(s, '.', 1)
  r[0] is ['www', null, 'www', '', 'foo']
  r[1] is ['nvidia.com', null, 'google.com', null, null]
 
  r = partition(s, '.')
  r[0] is ['www', null, 'www', '', 'foo']
  r[1] is ['.', null, '.', '', '']
  r[2] is ['nvidia.com', null, 'google.com', '', '']

The rsplit() and rpartition() do splitting starting from the end of each string.

Looking at the overall behavior, I think overloading partition()/rpartition() to accept a column of delimiters would make more sense. Also, I think we could add a count parameter to the existing partition()/rpartition() instead of adding a whole new API here.

@sriramch (Contributor, Author)

Looking at the overall behavior, I think overloading partition()/rpartition() to accept a column of delimiters would make more sense. Also, I think we could add a count parameter to the existing partition()/rpartition() instead of adding a whole new API here.

thanks for the references to split and [r]partition apis. i wasn't aware of them.

the [r]partition api returns a table of 3 columns, and for our use-cases a couple of them (the column containing the delimiter and the leading/trailing strings) aren't needed. hence, going that route may use more device memory to create 2 additional columns that are going to be thrown away, and could be less performant.

fwiw, i did a small test to create a million strings (from the last test in this pr) and forward searched for a string scalar using both the substring_index api with delimiter count of 1 and the partition api. the partition api was 2.5 times slower.

we could add more flags to ignore creating those additional columns if needed. but, wouldn't this clutter the api?

@davidwendt (Contributor)

fwiw, i did a small test to create a million strings (from the last test in this pr) and forward searched for a string scalar using both the substring_index api with delimiter count of 1 and the partition api. the partition api was 2.5 times slower.

Is it slower because it is building 2 extra columns? Or perhaps partition() needs a tune up.

we could add more flags to ignore creating those additional columns if needed. but, wouldn't this clutter the api?

Just seems there is a possibility of code re-use with some existing APIs. Also, I guess I'm having trouble with the name: the index in substring_index() does not match the existing substring function in C++ or Java or C#. And it performs more like a split/partition, so I think it will be confusing for anyone not familiar with Spark or SQL.

There is also a set of slice_strings() functions that may be useful to look at. These return only a single column and one takes a column of indices. It would be a matter of doing find()/rfind() to locate the slice points.
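The find()/rfind() pairing suggested here can be illustrated per string in plain host-side C++. This is only a sketch of the idea (locate the slice point, then slice), not the cudf column implementation; the helper names slice_before and slice_after are made up for the illustration:

```cpp
#include <cassert>
#include <string>

// Locate a cut point with a forward search, then keep everything before it.
std::string slice_before(std::string const& str, std::string const& delim)
{
  auto const pos = str.find(delim);                  // analogous to find()
  return pos == std::string::npos ? str : str.substr(0, pos);
}

// Locate a cut point with a reverse search, then keep everything after it.
std::string slice_after(std::string const& str, std::string const& delim)
{
  auto const pos = str.rfind(delim);                 // analogous to rfind()
  return pos == std::string::npos ? str : str.substr(pos + delim.size());
}
```

In cudf terms, the corresponding column-level approach would be to compute the slice points with the find APIs and feed them to the slice_strings() overload that takes a column of indices.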

@sriramch (Contributor, Author)

Is it slower because it is building 2 extra columns? Or perhaps partition() needs a tune up.

[sc] i did not profile it, but my suspicion was also more along the lines of building those extra columns for this use-case.

Also, I guess I'm having trouble with the name.

[sc] i wasn't happy with it either and simply reused what spark had, in the absence of a better name, to elicit discussion.

Just seems there is a possibility of code re-use with some existing APIs.
There is also a set of slice_strings() functions that may be useful to look at.

[sc] i'll look at the slice apis to see if i can reuse some of their implementation. would renaming these apis to slice_strings with a different set of parameters be more acceptable?

std::unique_ptr<column> slice_strings(
  strings_column_view const& strings,
  string_scalar const& delimiter,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> slice_strings(
  strings_column_view const& strings,
  strings_column_view const& delimiter_strings,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());
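A host-side sketch of what the second proposed overload (a column of delimiters) would mean for count = 1: each row is cut at its own delimiter and the leading token is kept. The function name and vector-based signature below are illustrative only, not cudf code:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustration of per-row delimiters with count = 1: row i is split on
// delimiters[i], keeping the piece before the first occurrence. Rows whose
// delimiter is absent are returned unchanged.
std::vector<std::string> slice_strings_per_row(std::vector<std::string> const& strings,
                                               std::vector<std::string> const& delimiters)
{
  std::vector<std::string> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    auto const pos = strings[i].find(delimiters[i]);
    out.push_back(pos == std::string::npos ? strings[i] : strings[i].substr(0, pos));
  }
  return out;
}
```

The real API would operate on a strings_column_view of delimiters and handle nulls, but the row-wise pairing of string and delimiter is the essential difference from the string_scalar overload above.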

@sriramch requested a review from a team as a code owner May 30, 2020 01:01
@sriramch requested review from a team as code owners May 30, 2020 01:01
@sriramch changed the base branch from branch-0.14 to branch-0.15 May 30, 2020 01:03
@sriramch removed request for a team May 30, 2020 01:09
@davidwendt (Contributor) left a comment


This looks great.

Review comments on: cpp/include/cudf/strings/substring.hpp, cpp/src/strings/substring.cu, cpp/tests/strings/substring_tests.cpp
@vuule (Contributor)

vuule commented Jun 1, 2020

rerun tests

@vuule (Contributor) left a comment


Great work! The test coverage looks perfect :)
Some (mostly minor) suggestions

Review comments on: cpp/include/cudf/strings/substring.hpp, cpp/src/strings/substring.cu
@cwharris self-requested a review June 2, 2020 17:09
@vuule self-requested a review June 2, 2020 21:40
@vuule merged commit 18c618d into rapidsai:branch-0.15 Jun 3, 2020
Labels: 3 - Ready for Review (Ready for review by team), 4 - Needs Review (Waiting for reviewer to review or respond), feature request (New feature or request), Spark (Functionality that helps Spark RAPIDS)
Development

Successfully merging this pull request may close these issues.

[FEA] substring_index
6 participants