[FEA] libcudf string split returning list of strings #5667

Closed
jlowe opened this issue Jul 9, 2020 · 7 comments
Labels
libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python)

Comments

@jlowe
Member

jlowe commented Jul 9, 2020

Is your feature request related to a problem? Please describe.
Spark supports a string split function that takes a string and delimiter and returns a list of strings. We would like to support this operation in the RAPIDS Accelerator for Apache Spark.

Describe the solution you'd like
A form of libcudf's cudf::strings::split that, instead of returning a separate column for each field, returns a single list-of-strings column in which each row contains a list of all the fields produced by the split. Ideally the split function would accept a regular expression to identify the delimiter, but we can still support many common queries with a function that only allows an exact-match, scalar delimiter string.
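To make the requested semantics concrete, here is a minimal pure-Python sketch that contrasts the two output shapes. It is only an illustration of the behavior described above, not libcudf code: the helper names `split_to_lists` and `split_to_columns` are hypothetical, and treating a null input row as a null output row is an assumption.

```python
from typing import List, Optional


def split_to_lists(rows: List[Optional[str]], delimiter: str) -> List[Optional[List[str]]]:
    """One list of fields per input row; a null row stays null (assumption)."""
    return [None if s is None else s.split(delimiter) for s in rows]


def split_to_columns(rows: List[Optional[str]], delimiter: str) -> List[List[Optional[str]]]:
    """Column-per-field layout: one output column per field position,
    padded with None where a row has fewer fields than the widest row."""
    lists = split_to_lists(rows, delimiter)
    width = max((len(fields) for fields in lists if fields is not None), default=0)
    return [
        [fields[i] if fields is not None and i < len(fields) else None for fields in lists]
        for i in range(width)
    ]


rows = ["a_b_c", "d_e", None, "f"]

# Requested shape: a single column whose rows are lists of strings.
as_lists = split_to_lists(rows, "_")

# Existing shape: several columns, with padding for short rows.
as_columns = split_to_columns(rows, "_")
```

The list form keeps each row's field count intact, which is what maps directly onto an array-typed column; the column form has to pad short rows with nulls.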

Describe alternatives you've considered
Theoretically we could work with the existing cudf::strings::split, but mapping multiple columns to Spark's ArrayType is messy in practice and would be specific to this operation. It is not in line with the straightforward mapping of ArrayType to the new list type currently being added to libcudf.

@jlowe jlowe added libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS strings strings issues (C++ and Python) labels Jul 9, 2020
@kkraus14
Collaborator

kkraus14 commented Jul 9, 2020

Pandas supports the same behavior when expand=False (https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.split.html), so we'd want this for cuDF Python as well.

@kkraus14 kkraus14 added the Python Affects Python cuDF API. label Jul 9, 2020
@davidwendt
Contributor

Take a look at cudf::strings::contiguous_split_record, which is intended to support the expand=False scenario.
It returns a column_view of split fields for each input string. Instead of allocating individual columns, it holds the entire split result in a single memory block, and the column_views identify the results for each row.
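As a rough illustration of the "one block, per-row views" idea, the sketch below stores every field in one flat list and uses offsets to mark each row's slice. This is the general offsets-plus-child layout used for list columns, written in plain Python; it is an assumption-laden sketch and not the actual memory layout or API of contiguous_split_record. The names `to_offsets_layout` and `row_fields` are hypothetical.

```python
from typing import List, Optional, Tuple


def to_offsets_layout(
    rows: List[Optional[str]], delimiter: str
) -> Tuple[List[int], List[str], List[bool]]:
    """Flatten all split fields into one child list.

    offsets[i]:offsets[i + 1] is the slice of `child` that belongs to row i.
    validity[i] is False for null input rows (assumed to produce null output).
    """
    offsets = [0]
    child: List[str] = []
    validity: List[bool] = []
    for s in rows:
        if s is None:
            validity.append(False)
        else:
            validity.append(True)
            child.extend(s.split(delimiter))
        # A null row contributes no fields, so its slice is empty.
        offsets.append(len(child))
    return offsets, child, validity


def row_fields(
    offsets: List[int], child: List[str], validity: List[bool], i: int
) -> Optional[List[str]]:
    """Recover the list of fields for row i from the flat layout."""
    if not validity[i]:
        return None
    return child[offsets[i]:offsets[i + 1]]


offsets, child, validity = to_offsets_layout(["a_b_c", "d_e", None, "f"], "_")
```

All fields live in one allocation (`child`), and a row is just a pair of offsets into it, so no per-row or per-field column needs to be allocated.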

@beckernick
Member

cuDF Python may also want to add list support to the output of a few other string methods: find_multiple (#4569), extract, and findall.

@kkraus14
Collaborator

extract / extractall both always return a DataFrame with separate columns per capture group, so no need for list support there.

findall will need list support
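A small sketch of why findall fits a list column: the number of matches varies per row, so the output cannot be a fixed set of columns. This uses Python's standard `re` module purely to illustrate the shape of the result; `findall_rows` is a hypothetical helper, and mapping null inputs to null outputs is an assumption.

```python
import re
from typing import List, Optional


def findall_rows(rows: List[Optional[str]], pattern: str) -> List[Optional[List[str]]]:
    """Return every non-overlapping match of `pattern` for each row.

    The pattern is assumed to contain no capture groups, so each match
    is returned as a plain string.
    """
    compiled = re.compile(pattern)
    return [None if s is None else compiled.findall(s) for s in rows]


# Rows yield two matches, zero matches, null, and one match respectively.
matches = findall_rows(["a1b22", "no digits", None, "7"], r"\d+")
```

By contrast, extract has one output per capture group, so its width is fixed by the pattern and separate columns work fine.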

@galipremsagar I remember we did something in porting the nvstrings find_multiple to merge it into another API, do you remember?

@galipremsagar
Contributor

> @galipremsagar I remember we did something in porting the nvstrings find_multiple to merge it into another API, do you remember?

For find_multiple we only have the Cython plumbing, cudf._lib.strings.find_multiple.find_multiple: https://github.com/rapidsai/cudf/blob/branch-0.15/python/cudf/cudf/_lib/strings/find_multiple.pyx#L14

The API merging we did was for replace, which supports both scalar and list-like inputs: https://docs.rapids.ai/api/cudf/nightly/api.html#cudf.core.column.string.StringMethods.replace

@davidwendt
Contributor

The libcudf part of this was completed in PR #5687.
I've left this issue open for any required cuDF Python/Cython plumbing.

@kkraus14
Collaborator

This was handled in Python / Cython as well. Closing.
