[FEA] substring_index #5158

Closed
revans2 opened this issue May 11, 2020 · 1 comment · Fixed by #5303
Labels
feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@revans2
Contributor

revans2 commented May 11, 2020

Is your feature request related to a problem? Please describe.
I would love to have an API that acts like the substring_index SQL function.

The following is from the official Spark documentation:

substring_index

substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.

Examples:

SELECT substring_index('www.apache.org', '.', 2);
www.apache
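
To make the quoted semantics concrete, here is a minimal host-side reference sketch in plain C++. It is not a libcudf API: the free function `substring_index` over `std::string` and the helper `check_spark_example` are hypothetical names used only for illustration. Two behaviors are assumptions not spelled out in the quoted description: an empty delimiter or a count of 0 returns an empty string, and a string with fewer than `count` delimiters is returned unchanged.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical host-side reference for substring_index (not a libcudf API).
std::string substring_index(std::string const& str, std::string const& delim, int count)
{
  // Assumption: an empty delimiter or a count of 0 yields an empty string.
  if (delim.empty() || count == 0) { return std::string{}; }

  if (count > 0) {
    // Walk left to right until the count-th delimiter is found.
    std::size_t pos   = 0;
    std::size_t found = std::string::npos;
    for (int i = 0; i < count; ++i) {
      found = str.find(delim, pos);
      if (found == std::string::npos) { return str; }  // fewer than count delimiters
      pos = found + 1;
    }
    return str.substr(0, found);  // everything left of that delimiter
  }

  // Negative count: walk right to left until the |count|-th delimiter is found.
  std::size_t pos   = std::string::npos;  // npos means "start from the end"
  std::size_t found = std::string::npos;
  for (long long remaining = -static_cast<long long>(count); remaining > 0; --remaining) {
    found = str.rfind(delim, pos);
    if (found == std::string::npos) { return str; }  // fewer than |count| delimiters
    if (remaining > 1) {
      if (found == 0) { return str; }  // nothing left to search
      pos = found - 1;
    }
  }
  return str.substr(found + delim.size());  // everything right of that delimiter
}

// Self-check against the Spark example quoted above.
inline void check_spark_example()
{
  assert(substring_index("www.apache.org", ".", 2) == "www.apache");
}
```

With this sketch, a count of -1 on the same input gives `org`, and a count of 3 returns the whole string because only two delimiters are present.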

Describe the solution you'd like
I would like a function that takes three parameters: the original string, a substring to look for, and a count of how many matches to make. Ideally we would provide versions that take scalars as well as columns for each parameter, but I am willing to take one that uses only columns, since I can create a column from a scalar if I need to.

Describe alternatives you've considered
I tried to do this with extract and got most of the way there for some very special cases, but it is nowhere near complete and is likely to be a lot slower than a purpose-built solution.

Additional context
substring_index is a standard SQL operator, so I suspect that others will be interested in it too.

@revans2 revans2 added feature request New feature or request Needs Triage Need team to review and classify labels May 11, 2020
@harrism harrism added strings strings issues (C++ and Python) Spark Functionality that helps Spark RAPIDS libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels May 12, 2020
@sriramch
Contributor

@revans2 - would you need the number of delimiter occurrences to search for to be different for every row, or would a single global count work?

would something like this work?

std::unique_ptr<column> substring_index(
  strings_column_view const& strings,
  string_scalar const& delimiter,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> substring_index(
  strings_column_view const& strings,
  strings_column_view const& delimiter_strings,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

or, would you require the 3rd argument for the 2nd function to be a column as well?
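
To illustrate the difference between a global count and a per-row count, here is a host-side sketch that models columns as `std::vector` and scalars as plain values. It is only an analogy for the signatures proposed above, not libcudf code: the helper `substring_index_row`, the `std::vector` overloads, and the edge-case handling (empty delimiter or count of 0 gives an empty string, too few delimiters returns the row unchanged) are all assumptions. The scalar overloads simply broadcast into the all-columns version.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical per-row helper: applies substring_index to a single string.
std::string substring_index_row(std::string const& str, std::string const& delim, int count)
{
  if (delim.empty() || count == 0) { return {}; }  // assumed edge-case behavior
  bool const from_left = count > 0;
  long long remaining  = from_left ? count : -static_cast<long long>(count);
  std::size_t pos      = from_left ? 0 : std::string::npos;
  std::size_t found    = std::string::npos;
  while (remaining-- > 0) {
    found = from_left ? str.find(delim, pos) : str.rfind(delim, pos);
    if (found == std::string::npos) { return str; }  // too few delimiters
    if (remaining > 0) {
      if (!from_left && found == 0) { return str; }  // nothing left to search
      pos = from_left ? found + 1 : found - 1;
    }
  }
  return from_left ? str.substr(0, found) : str.substr(found + delim.size());
}

// All-columns version: a delimiter and a count for every row.
std::vector<std::string> substring_index(std::vector<std::string> const& strings,
                                         std::vector<std::string> const& delimiters,
                                         std::vector<int> const& counts)
{
  assert(strings.size() == delimiters.size() && strings.size() == counts.size());
  std::vector<std::string> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    out.push_back(substring_index_row(strings[i], delimiters[i], counts[i]));
  }
  return out;
}

// Per-row delimiter, global count: broadcast the count to a column.
std::vector<std::string> substring_index(std::vector<std::string> const& strings,
                                         std::vector<std::string> const& delimiters,
                                         int count)
{
  return substring_index(strings, delimiters, std::vector<int>(strings.size(), count));
}

// Scalar delimiter, global count: broadcast the delimiter to a column.
std::vector<std::string> substring_index(std::vector<std::string> const& strings,
                                         std::string const& delimiter,
                                         int count)
{
  return substring_index(strings, std::vector<std::string>(strings.size(), delimiter), count);
}
```

For the rows `www.apache.org` and `a-b-c` with delimiters `.` and `-`, a global count of -1 gives `org` and `c`, while per-row counts of 1 and -2 give `www` and `b-c`. On a GPU the broadcast would cost extra memory, which is one reason dedicated scalar overloads may still be worth having.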
