Add libcudf strings split API that accepts regex pattern #10128
Codecov Report
@@ Coverage Diff @@
## branch-22.04 #10128 +/- ##
================================================
- Coverage 10.42% 10.00% -0.43%
================================================
Files 119 122 +3
Lines 20603 21470 +867
================================================
- Hits 2148 2147 -1
- Misses 18455 19323 +868
Continue to review full report at Codecov.
Just added the DO_NOT_MERGE label because we are testing this and may uncover some issues. I'll report back later.
Very nice overall. Comments attached.
rerun tests
This is ready to merge. It has a sufficient number of reviews and all comments have been addressed.
We are testing it with some corner cases. @andygrove are you done with the testing? Can this be merged now?
Yes, this LGTM. I have now approved.
@gpucibot merge
This PR adds Java bindings for the new strings APIs `strings::split_re` and `strings::split_record_re`, which allow splitting strings by regular expression delimiters. In addition, the Java string split overloads with a default split pattern (an empty string) are removed in this PR, because with the default empty pattern Java's split API produces different results than cudf. Finally, some cleanup has been performed automatically by the IntelliJ IDE.

Depends on #10128. This is a breaking change, which is fixed by NVIDIA/spark-rapids#4714. Thus, it should be merged at the same time as NVIDIA/spark-rapids#4714.

Authors:
- Nghia Truong (https://github.com/ttnghia)
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert (Bobby) Evans (https://github.com/revans2)
- Andy Grove (https://github.com/andygrove)

URL: #10139
Closes #3584

This depends on the libcudf changes in PR #10128.

This adds the `regex` parameter to the cudf strings `split()` function, similar to the 1.4.0 Pandas one documented [here](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.split.html). The main difference is that the `pat` parameter will only be interpreted as regex if the `pat` string has more than 1 character and the `regex` parameter is set to `True`. This is to help with consistency and migration from the previous implementation. The 1.3.x Pandas version does not have a `regex` parameter for `split()` but instead tries to infer the intent of the `pat` parameter without it. This seems a bit dangerous since regex would be much slower for us here. Therefore, the `regex` parameter must be set to `True` in the cudf implementation in order to use the regex logic path. Pandas does not support regex for its `rsplit` even though it has been documented, and there is an issue tracking this.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #10185
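The rule above for when `pat` is treated as a regex can be sketched in plain Python. This is only an illustration of the described dispatch using the standard library on a single string; `split_like_cudf` is a hypothetical helper name, not part of cudf, and treating `n <= 0` as "no limit" is an assumption borrowed from the Pandas documentation.

```python
import re


def split_like_cudf(s, pat=None, n=-1, regex=False):
    """Hypothetical helper: split one string following the dispatch rule
    described in the commit message (not the actual cudf implementation)."""
    if pat is None:
        # No pattern given: split on runs of whitespace.
        return s.split(None, n if n > 0 else -1)
    if regex and len(pat) > 1:
        # Regex path: only taken when regex=True AND pat has more than
        # one character. re.split uses maxsplit=0 to mean "no limit".
        return re.split(pat, s, maxsplit=n if n > 0 else 0)
    # Literal path: single-character patterns, or regex=False.
    return s.split(pat, n if n > 0 else -1)


print(split_like_cudf("a1b22c", r"\d+", regex=True))  # regex delimiter
print(split_like_cudf("a.b.c", ".", regex=True))      # single char stays literal
print(split_like_cudf("a1b22c", r"\d+"))              # regex=False stays literal
```

Note that a one-character pattern such as `.` is matched literally even with `regex=True`, which is the migration-friendly behavior the message describes.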
Reference #3584
This PR adds 4 new libcudf strings APIs for split.
- `cudf::strings::split_re` - split using regex to locate delimiters with table output like `cudf::strings::split`.
- `cudf::strings::rsplit_re` - same as `split_re` but delimiter search starts from the end of each string
- `cudf::strings::split_record_re` - same as `split_re` but returns a list column like `split_record` does
- `cudf::strings::rsplit_record_re` - same as `split_record_re` but delimiter search starts from the end of each string

Like `split/rsplit` the results try to match Pandas behavior for these. The `record` results are similar to specifying `expand=False` in the Pandas `split/rsplit` APIs. Python/Cython updates for cuDF will be in a follow-on PR.

Currently, Pandas does not support regex for its `rsplit` even though it has been documented and there is an issue here.

New gtests have been added for these along with some additional tests that were missing for the non-regex versions of these APIs.
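To make the difference between the table and record outputs concrete, here is a rough pure-Python model of the semantics described in this PR. It does not call libcudf: the function names merely mirror the new APIs, Python lists stand in for columns, `None` stands in for a null entry, and the handling of `maxsplit` (values `<= 0` meaning "no limit", and the reverse variant keeping the last `maxsplit` delimiter matches) is an assumption made for illustration.

```python
import re


def split_record_re(strings, pattern, maxsplit=-1):
    """Record output: one list of tokens per input row (like a list column)."""
    limit = max(maxsplit, 0)  # re.split treats 0 as "no limit"
    return [None if s is None else re.split(pattern, s, maxsplit=limit)
            for s in strings]


def rsplit_record_re(strings, pattern, maxsplit=-1):
    """Same as split_record_re, but a limited split keeps the LAST matches."""
    rows = []
    for s in strings:
        if s is None:
            rows.append(None)
            continue
        matches = list(re.finditer(pattern, s))
        if maxsplit > 0:
            matches = matches[-maxsplit:]  # delimiters closest to the end
        tokens, begin = [], 0
        for m in matches:
            tokens.append(s[begin:m.start()])
            begin = m.end()
        tokens.append(s[begin:])
        rows.append(tokens)
    return rows


def split_re(strings, pattern, maxsplit=-1):
    """Table output: column i holds token i of every row, padded with None."""
    rows = split_record_re(strings, pattern, maxsplit)
    width = max((len(r) for r in rows if r is not None), default=0)
    return [[r[i] if r is not None and i < len(r) else None for r in rows]
            for i in range(width)]


data = ["a_b-c", "d-e", None]
print(split_record_re(data, r"[_-]"))      # list of token lists
print(split_re(data, r"[_-]"))             # list of columns
print(rsplit_record_re(data, r"[_-]", 1))  # one split, taken from the end
```

The record form keeps each row's tokens together, which corresponds to `expand=False` in Pandas, while the table form transposes the tokens into columns and pads shorter rows with nulls.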