
[REVIEW] compute substrings from beginning until delimiter or from a delimiter until end of string #5303

Merged: 20 commits, Jun 3, 2020

Conversation

sriramch (Contributor)

@sriramch added labels: feature request (New feature or request), 3 - Ready for Review (Ready for review by team), 4 - Needs Review (Waiting for reviewer to review or respond), Spark (Functionality that helps Spark RAPIDS) on May 27, 2020
@sriramch requested review from vuule and davidwendt May 27, 2020 22:27
@sriramch requested a review from a team as a code owner May 27, 2020 22:27
@GPUtester (Collaborator)

Please update the changelog in order to start CI tests.

View the gpuCI docs here.

@davidwendt (Contributor)

This should not be against branch-0.14.

@davidwendt (Contributor)

It seems odd that a substring function uses a delimiter instead of an index. Maybe this is more like a strings split or partition function. From the description, it seems more like a multi-delimiter partition function.

@sriramch (Contributor, Author)

sriramch commented May 28, 2020

This should not be against branch-0.14.

[sc] do all new features from now on go against 0.15? the reason i'm asking is that i have been creating them against 0.14 thus far. is there a cut-off date for 0.14 (after which nothing other than bug fixes gets in)?

It seems odd that a substring function uses a delimiter instead of an index. Maybe this is more like a strings split or partition function. From the description, it seems more like a multi-delimiter partition function.

[sc] i have reused the function name from spark. should i rename this to split (analogous to the java api)? even though this tokenizes the string, it doesn't return all the tokens like split does (it returns only the leading or trailing token).
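For readers unfamiliar with the Spark API being reused here: substring_index(str, delim, count) returns the substring before the count-th occurrence of the delimiter when count is positive, and the substring after the count-th occurrence from the end when count is negative. A host-side sketch of those semantics on a single std::string (an illustration only, not the cudf device implementation in this PR):

```cpp
#include <cassert>
#include <string>

// Sketch of Spark-style substring_index semantics on one string.
// count > 0: everything before the count-th occurrence of delim (from the left).
// count < 0: everything after the |count|-th occurrence of delim (from the right).
// If delim occurs fewer than |count| times, the whole string is returned.
std::string substring_index(std::string const& str, std::string const& delim, int count)
{
  if (delim.empty() || count == 0) return std::string{};
  if (count > 0) {
    std::size_t pos = 0;
    for (int i = 0; i < count; ++i) {
      pos = str.find(delim, pos);               // forward search
      if (pos == std::string::npos) return str; // fewer delimiters than count
      pos += delim.size();
    }
    return str.substr(0, pos - delim.size());   // keep the leading piece
  }
  std::size_t pos = str.size();
  for (int i = 0; i < -count; ++i) {
    if (pos == 0) return str;                   // no more occurrences to the left
    pos = str.rfind(delim, pos - 1);            // reverse search
    if (pos == std::string::npos) return str;
  }
  return str.substr(pos + delim.size());        // keep the trailing piece
}
```

For example, substring_index("www.nvidia.com", ".", 1) yields the leading token "www", while a count of -1 yields the trailing token "com", which is the "leading or trailing token" behavior described above.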

@harrism (Member)

harrism commented May 28, 2020

https://docs.rapids.ai/releases/process/

Here are the current dates: https://docs.rapids.ai/maintainers

Once burndown starts, we generally don't accept new PRs unless they are urgent.

You just need to retarget this at 0.15 (click the edit button next to the PR title).

@davidwendt (Contributor)

This is very similar to cudf::strings::split

std::unique_ptr<table> split(strings_column_view const& strings_column,
                             string_scalar const& delimiter      = string_scalar(""),
                             size_type maxsplit                  = -1,
                             rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());	

and cudf::strings::partition

std::unique_ptr<table> partition(
  strings_column_view const& strings,
  string_scalar const& delimiter      = string_scalar(""),
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

You may be able to use split()/rsplit() or partition()/rpartition() directly instead of

std::unique_ptr<column> substring_index(
  strings_column_view const& strings,
  string_scalar const& delimiter,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

Example:

  s = ['www.nvidia.com', null, 'www.google.com', '', 'foo' ]
  r = split(s, '.', 1)
  r[0] is ['www', null, 'www', '', 'foo']
  r[1] is ['nvidia.com', null, 'google.com', null, null]
 
  r = partition(s, '.')
  r[0] is ['www', null, 'www', '', 'foo']
  r[1] is ['.', null, '.', '', '']
  r[2] is ['nvidia.com', null, 'google.com', '', '']

The rsplit() and rpartition() do splitting starting from the end of each string.

Looking at the overall behavior, I think overloading partition()/rpartition() to accept a column of delimiters would make more sense. Also, I think we could add a count parameter to the existing partition()/rpartition() instead of adding a whole new API here.

@sriramch (Contributor, Author)

Looking at the overall behavior, I think overloading partition()/rpartition() to accept a column of delimiters would make more sense. Also, I think we could add a count parameter to the existing partition()/rpartition() instead of adding a whole new API here.

thanks for the references to split and [r]partition apis. i wasn't aware of them.

the [r]partition api returns a table of 3 columns, and for our use-cases a couple of them (the column containing the delimiter and the leading/trailing strings) aren't needed. hence, going that route may use more device memory to create 2 additional columns that are going to be thrown away, and could be less performant.

fwiw, i did a small test to create a million strings (from the last test in this pr) and forward searched for a string scalar using both the substring_index api with delimiter count of 1 and the partition api. the partition api was 2.5 times slower.

we could add more flags to ignore creating those additional columns if needed. but, wouldn't this clutter the api?

@davidwendt (Contributor)

fwiw, i did a small test to create a million strings (from the last test in this pr) and forward searched for a string scalar using both the substring_index api with delimiter count of 1 and the partition api. the partition api was 2.5 times slower.

Is it slower because it is building 2 extra columns? Or perhaps partition() needs a tune up.

we could add more flags to ignore creating those additional columns if needed. but, wouldn't this clutter the api?

Just seems there is a possibility of code re-use with some existing APIs. Also, I guess I'm having trouble with the name: the index in substring_index() does not match the existing substring function in C++ or Java or C#. And it performs more like a split/partition, so I think it will be confusing for anyone not familiar with Spark or SQL.

There is also a set of slice_strings() functions that may be useful to look at. These return only a single column and one takes a column of indices. It would be a matter of doing find()/rfind() to locate the slice points.
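The find()/rfind() pairing suggested here can be illustrated per string in plain host-side C++. This is only a sketch of the idea (locate the slice point, then slice), not the cudf column implementation; the helper names slice_before and slice_after are made up for the illustration:

```cpp
#include <cassert>
#include <string>

// Locate a cut point with a forward search, then keep everything before it.
std::string slice_before(std::string const& str, std::string const& delim)
{
  auto const pos = str.find(delim);                  // analogous to find()
  return pos == std::string::npos ? str : str.substr(0, pos);
}

// Locate a cut point with a reverse search, then keep everything after it.
std::string slice_after(std::string const& str, std::string const& delim)
{
  auto const pos = str.rfind(delim);                 // analogous to rfind()
  return pos == std::string::npos ? str : str.substr(pos + delim.size());
}
```

In cudf terms, the corresponding column-level approach would be to compute the slice points with the find APIs and feed them to the slice_strings() overload that takes a column of indices.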

@sriramch (Contributor, Author)

Is it slower because it is building 2 extra columns? Or perhaps partition() needs a tune up.

[sc] i did not profile it, but my suspicion was also more along the lines of building those extra columns for this use-case.

Also, I guess I'm having trouble with the name.

[sc] i wasn't happy with it either and simply reused what spark had, in the absence of a better name, to elicit discussion.

Just seems there is a possibility of code re-use with some existing APIs.
There is also a set of slice_strings() functions that may be useful to look at.

[sc] i'll look at the slice apis to see if i can reuse some of their implementation. would renaming these apis to slice_strings with a different set of parameters be more acceptable?

std::unique_ptr<column> slice_strings(
  strings_column_view const& strings,
  string_scalar const& delimiter,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> slice_strings(
  strings_column_view const& strings,
  strings_column_view const& delimiter_strings,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());
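A host-side sketch of what the second proposed overload (a column of delimiters) would mean for count = 1: each row is cut at its own delimiter and the leading token is kept. The function name and vector-based signature below are illustrative only, not cudf code:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Illustration of per-row delimiters with count = 1: row i is split on
// delimiters[i], keeping the piece before the first occurrence. Rows whose
// delimiter is absent are returned unchanged.
std::vector<std::string> slice_strings_per_row(std::vector<std::string> const& strings,
                                               std::vector<std::string> const& delimiters)
{
  std::vector<std::string> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    auto const pos = strings[i].find(delimiters[i]);
    out.push_back(pos == std::string::npos ? strings[i] : strings[i].substr(0, pos));
  }
  return out;
}
```

The real API would operate on a strings_column_view of delimiters and handle nulls, but the row-wise pairing of string and delimiter is the essential difference from the string_scalar overload above.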

@sriramch requested a review from a team as a code owner May 30, 2020 01:01
@sriramch requested review from a team as code owners May 30, 2020 01:01
@sriramch changed the base branch from branch-0.14 to branch-0.15 May 30, 2020 01:03
@sriramch removed request for a team May 30, 2020 01:09
@davidwendt (Contributor) left a comment


This looks great.

Review comments on: cpp/include/cudf/strings/substring.hpp, cpp/src/strings/substring.cu, cpp/tests/strings/substring_tests.cpp
@vuule (Contributor)

vuule commented Jun 1, 2020

rerun tests

@vuule (Contributor) left a comment


Great work! The test coverage looks perfect :)
Some (mostly minor) suggestions

Review comments on: cpp/include/cudf/strings/substring.hpp, cpp/src/strings/substring.cu
@cwharris self-requested a review June 2, 2020 17:09
@vuule self-requested a review June 2, 2020 21:40
@vuule merged commit 18c618d into rapidsai:branch-0.15 Jun 3, 2020
Labels: 3 - Ready for Review (Ready for review by team), 4 - Needs Review (Waiting for reviewer to review or respond), feature request (New feature or request), Spark (Functionality that helps Spark RAPIDS)
Development

Successfully merging this pull request may close these issues.

[FEA] substring_index
6 participants