Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add libcudf strings split API that accepts regex pattern #10128

Merged
merged 25 commits into from
Feb 11, 2022

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Jan 26, 2022

Reference #3584

This PR adds 4 new libcudf strings APIs for split.

  • cudf::strings::split_re - split using regex to locate delimiters with table output like cudf::strings::split.
  • cudf::strings::rsplit_re - same as split_re but delimiter search starts from the end of each string
  • cudf::strings::split_record_re - same as split_re but returns a list column like split_record does
  • cudf::strings::rsplit_record_re - same as split_record_re but delimiter search starts from the end of each string

Like split/rsplit the results try to match Pandas behavior for these. The record results are similar to specifying expand=False in the Pandas split/rsplit APIs. Python/Cython updates for cuDF will be in a follow-on PR.
Currently, Pandas does not support regex for its rsplit even though it has been documented and there is an issue here.

New gtests have been added for these along with some additional tests that were missing for the non-regex versions of these APIs.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Jan 26, 2022
@davidwendt davidwendt self-assigned this Jan 26, 2022
@github-actions github-actions bot added CMake CMake build issue conda labels Jan 26, 2022
@codecov
Copy link

codecov bot commented Jan 26, 2022

Codecov Report

Merging #10128 (63ec0ac) into branch-22.04 (a7d88cd) will decrease coverage by 0.42%.
The diff coverage is 0.00%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.04   #10128      +/-   ##
================================================
- Coverage         10.42%   10.00%   -0.43%     
================================================
  Files               119      122       +3     
  Lines             20603    21470     +867     
================================================
- Hits               2148     2147       -1     
- Misses            18455    19323     +868     
Impacted Files Coverage Δ
python/cudf/cudf/_fuzz_testing/fuzzer.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/main.py 0.00% <0.00%> (ø)
python/cudf/cudf/_version.py 0.00% <0.00%> (ø)
python/cudf/cudf/comm/gpuarrow.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/categorical.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/datetime.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/methods.py 0.00% <0.00%> (ø)
... and 51 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 19dc46f...63ec0ac. Read the comment docs.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Feb 1, 2022
@davidwendt davidwendt marked this pull request as ready for review February 1, 2022 17:37
@ttnghia ttnghia added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Feb 7, 2022
@ttnghia
Copy link
Contributor

ttnghia commented Feb 7, 2022

Just add DO_NOT_MERGE label as we are testing it and potentially uncover some issues. I'll report later.

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice overall. Comments attached.

cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Outdated Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Show resolved Hide resolved
cpp/tests/strings/split_tests.cpp Show resolved Hide resolved
@davidwendt davidwendt requested a review from bdice February 8, 2022 13:16
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/split/split_re.hpp Outdated Show resolved Hide resolved
cpp/src/strings/split/split_re.cu Show resolved Hide resolved
@davidwendt davidwendt requested a review from nvdbaranec February 8, 2022 17:43
@davidwendt
Copy link
Contributor Author

rerun tests

@davidwendt
Copy link
Contributor Author

This is ready to merge. It has a sufficient number of reviews and all comments have been addressed.
Anyone see a reason to keep the DO NOT MERGE label?
Any follow up issues can be addressed in follow up PRs.
There is draft #10185 which is waiting for this to be merged.

@ttnghia
Copy link
Contributor

ttnghia commented Feb 10, 2022

This is ready to merge. It has a sufficient number of reviews and all comments have been addressed. Anyone see a reason to keep the DO NOT MERGE label?

We are testing it with some corner cases. @andygrove are you done with the testing? Can this be merged now?

@andygrove
Copy link
Contributor

We are testing it with some corner cases. @andygrove are you done with the testing? Can this be merged now?

Yes, this LGTM. I have now approved.

@ttnghia ttnghia removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Feb 10, 2022
@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 8cc84c6 into rapidsai:branch-22.04 Feb 11, 2022
@davidwendt davidwendt deleted the fea-split-with-regex branch February 11, 2022 13:06
rapids-bot bot pushed a commit that referenced this pull request Feb 14, 2022
This PR adds Java binding for the new strings API `strings::split_re` and `strings::split_record_re`, which allows splitting strings by regular expression delimiters.

In addition, the Java string split overloads with default split pattern (an empty string) are removed in this PR. That is because with default empty pattern the Java's split API produces different results than cudf.

Finally, some cleanup has been perform automatically thanks to IntelliJ IDE.

Depends on #10128.

This is breaking change which is fixed by NVIDIA/spark-rapids#4714. Thus, it should be merged at the same time with NVIDIA/spark-rapids#4714.

Authors:
  - Nghia Truong (https://github.com/ttnghia)
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert (Bobby) Evans (https://github.com/revans2)
  - Andy Grove (https://github.com/andygrove)

URL: #10139
rapids-bot bot pushed a commit that referenced this pull request Feb 19, 2022
Closes #3584 

This depends on libcudf changes in PR #10128 

This adds the regex parameter to the cudf strings `split()` function similar to the 1.4.0 Pandas one documented [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html). 

The main difference is that the `pat` parameter will only be interpreted as regex if the `pat` string has more than 1 character and the `regex` parameter is set to `True`. This is to help with consistency and migration from the previous implementation.

The 1.3.x Pandas version does not have a `regex` parameter for `split()` but instead will try to interpret the intention of the `pat` parameter without it. This seems a bit dangerous since regex would be much slower for us here. Therefore, the `regex` parameter is required to be set to `True` in the cudf implementation in order to use the regex logic path.

Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue here.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #10185
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants