[FEA] substring_index #5158

revans2 · 2020-05-11T17:45:12Z

Is your feature request related to a problem? Please describe.
I would love to have an API that acts like the substring_index SQL function.

The following is from the official Spark docs:

substring_index

substring_index(str, delim, count) - Returns the substring from str before count occurrences of the delimiter delim. If count is positive, everything to the left of the final delimiter (counting from the left) is returned. If count is negative, everything to the right of the final delimiter (counting from the right) is returned. The function substring_index performs a case-sensitive match when searching for delim.

Examples:

SELECT substring_index('www.apache.org', '.', 2);
www.apache
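For reference, the semantics quoted above can be sketched on the host with plain `std::string`. This is only an illustration, not libcudf or Spark code. The signature is hypothetical, and the edge-case handling (empty delimiter, `count == 0`, fewer than `count` delimiters present, overlapping matches) is an assumption that the docs excerpt does not spell out.

```cpp
#include <cstddef>
#include <string>

// Host-side sketch of substring_index on a single string (hypothetical helper).
// Assumptions: an empty delimiter or count == 0 yields an empty string, and a
// string with fewer than |count| delimiters is returned unchanged.
std::string substring_index(const std::string& str, const std::string& delim, int count) {
  if (delim.empty() || count == 0) return "";

  if (count > 0) {
    // Walk left to right until the count-th delimiter is found.
    std::size_t search_from = 0;
    std::size_t found = std::string::npos;
    for (int i = 0; i < count; ++i) {
      found = str.find(delim, search_from);
      if (found == std::string::npos) return str;  // not enough delimiters
      search_from = found + 1;  // assumption: overlapping matches are allowed
    }
    return str.substr(0, found);  // everything left of the final delimiter
  }

  // Negative count: walk right to left until the |count|-th delimiter is found.
  std::size_t search_from = std::string::npos;  // npos means "start at the end"
  std::size_t found = std::string::npos;
  for (long long remaining = -static_cast<long long>(count); remaining > 0; --remaining) {
    found = str.rfind(delim, search_from);
    if (found == std::string::npos) return str;  // not enough delimiters
    if (found == 0) {
      if (remaining > 1) return str;  // nothing left to search on the left
      break;
    }
    search_from = found - 1;
  }
  return str.substr(found + delim.size());  // everything right of the final delimiter
}
```

With this sketch, `substring_index("www.apache.org", ".", 2)` gives `www.apache`, matching the Spark example, and a count of `-2` gives `apache.org`.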

Describe the solution you'd like
I would like a function that takes three parameters: the original string, a substring to look for, and a count of how many matches to make. Ideally we would provide versions that take scalars as well as columns for each parameter, but I am willing to take one that only accepts columns, since I can create a column from a scalar if I need to.

Describe alternatives you've considered
I tried to do this with extract and got most of the way there for some very special cases, but it is nowhere near complete and is likely to be a lot slower than a purpose-built solution.

Additional context
substring_index is a standard SQL operator, so I suspect that others will be interested in it too.

sriramch · 2020-05-20T21:52:45Z

@revans2 - would you need the delimiter search count to be specified separately for every row, or would a single global count work?

would something like this work?

std::unique_ptr<column> substring_index(
  strings_column_view const& strings,
  string_scalar const& delimiter,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

std::unique_ptr<column> substring_index(
  strings_column_view const& strings,
  strings_column_view const& delimiter_strings,
  size_type count,
  rmm::mr::device_memory_resource* mr = rmm::mr::get_default_resource());

or, would you require the 3rd argument for the 2nd function to be a column as well?
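The relationship between the scalar and all-column variants can be sketched on the host as follows. This is not libcudf code: `std::vector` stands in for a column, and `RowOp`, `substring_index_columns`, and `substring_index_scalar` are hypothetical names. The point is only that a global delimiter and global count are a special case of the per-row form, obtained by repeating the scalar for every row.

```cpp
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical per-row operation: (string, delimiter, count) -> substring.
using RowOp = std::function<std::string(const std::string&, const std::string&, int)>;

// Most general form: strings, delimiters, and counts are all columns.
std::vector<std::string> substring_index_columns(const std::vector<std::string>& strings,
                                                 const std::vector<std::string>& delimiters,
                                                 const std::vector<int>& counts,
                                                 const RowOp& row_op) {
  if (strings.size() != delimiters.size() || strings.size() != counts.size()) {
    throw std::invalid_argument("all columns must have the same number of rows");
  }
  std::vector<std::string> out;
  out.reserve(strings.size());
  for (std::size_t i = 0; i < strings.size(); ++i) {
    out.push_back(row_op(strings[i], delimiters[i], counts[i]));
  }
  return out;
}

// Scalar delimiter and global count, expressed by broadcasting both to columns.
std::vector<std::string> substring_index_scalar(const std::vector<std::string>& strings,
                                                const std::string& delimiter,
                                                int count,
                                                const RowOp& row_op) {
  const std::vector<std::string> delimiters(strings.size(), delimiter);
  const std::vector<int> counts(strings.size(), count);
  return substring_index_columns(strings, delimiters, counts, row_op);
}
```

Broadcasting like this costs extra memory for the repeated values, which is one reason dedicated scalar overloads can still be worth having.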


revans2 added feature request New feature or request Needs Triage Need team to review and classify labels May 11, 2020

harrism added strings strings issues (C++ and Python) Spark Functionality that helps Spark RAPIDS libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels May 12, 2020

sriramch self-assigned this May 20, 2020

sriramch added a commit to sriramch/cudf that referenced this issue May 27, 2020

- compute substrings from beginning until delimiter or from a delimit…

7178642

…er until end of string - this Closes rapidsai#5158 - this emulates spark's `substring_index` function

sriramch mentioned this issue May 27, 2020

[REVIEW] compute substrings from beginning until delimiter or from a delimiter until end of string #5303

Merged

vuule closed this as completed in #5303 Jun 3, 2020

abellina mentioned this issue Jul 19, 2023

[FEA] Rework GpuSubstringIndex to use cudf::slice_strings NVIDIA/spark-rapids#8750

Closed
