-
Notifications
You must be signed in to change notification settings - Fork 915
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add cudf::strings::extract_all API #9909
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-22.02 #9909 +/- ##
================================================
- Coverage 10.49% 10.41% -0.08%
================================================
Files 119 119
Lines 20305 20480 +175
================================================
+ Hits 2130 2134 +4
- Misses 18175 18346 +171
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few minor questions/suggestions but overall this is really nicely written.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One minor suggestion. Otherwise LGTM!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great. Tiny nit is all I have and it's up for interpretation.
rerun tests |
@gpucibot merge |
Reference #9856 specifically #9856 (comment) Adds `cudf::strings::findall_record` which was initially implemented in nvstrings but not ported over since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow using the current python function `cudf.str.findall()` with the `expand=False` parameter more effectively. A follow-on PR will address these python changes. This PR reorganizes the libcudf strings _find_ source files into the `cpp/src/strings/search` subdirectory as well. Also, `findall()` has only a regex version so the `_re` suffix is dropped from the name in the libcudf implementation. The python changes in this PR address only the name change and the addition of the new API in the cython interface. Depends on #9909 -- shares the `cudf::strings::detail::count_matches()` utility function. Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Robert Maynard (https://github.com/robertmaynard) - Vyas Ramasubramani (https://github.com/vyasr) URL: #9911
Closes #9856
Adds a new
cudf::strings::extract_all
API that returns a LIST column of extracted strings given a regex pattern.This is similar to nvstrings version of
extract
calledextract_record
but returns groups from all matches in each string instead of just the first match. Here is pseudo code of it's behavior on various strings input:Each match results in two groups as specified in the regex pattern.
Also reorganized the extract source code into
src/strings/extract
directory.The match-counting has been factored out into new
count_matches.cuh
since it will become common code used withfindall_record
in a follow on PR.