Add cudf::strings::extract_all API #9909

Merged
merged 11 commits into from
Jan 5, 2022

Conversation

davidwendt
Contributor

Closes #9856

Adds a new cudf::strings::extract_all API that returns a LIST column of extracted strings given a regex pattern.

This is similar to the nvstrings version of extract, called extract_record, but it returns the groups from all matches in each string instead of just the first match. Here is pseudocode of its behavior on various input strings:

s = [ "ABC-200 DEF-400", "GHI-60", "JK-800", "900", NULL ]
r = extract_all( s, "(\w+)-(\d+)" )
r is a LIST column of strings that looks like this:

[ [ "ABC", "200", "DEF", "400" ], // 2 matches
  [ "GHI", "60" ], // 1 match
  [ "JK", "800" ], // 1 match
  NULL,            // no match
  NULL             // null input
]

Each match results in two groups as specified in the regex pattern.
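
To make the expected output concrete, here is a minimal CPU-only sketch of the same semantics using `std::regex`. It is a reference model for the behavior described above, not the libcudf implementation: `extract_all_model`, `Row`, and `check_pr_example` are hypothetical names, and `std::optional` stands in for NULL entries.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// One output row; std::nullopt models a NULL list entry.
using Row = std::optional<std::vector<std::string>>;

// Hypothetical CPU reference model of the behavior described in this PR.
// For each non-null input string, collect the capture groups of every match.
std::vector<Row> extract_all_model(const std::vector<std::optional<std::string>>& input,
                                   const std::string& pattern)
{
  const std::regex re(pattern);
  std::vector<Row> result;
  result.reserve(input.size());
  for (const auto& s : input) {
    if (!s) {  // null input row -> null output row
      result.push_back(std::nullopt);
      continue;
    }
    std::vector<std::string> groups;
    for (std::sregex_iterator it(s->begin(), s->end(), re), end; it != end; ++it) {
      // Index 0 is the whole match; capture groups start at index 1.
      for (std::size_t g = 1; g < it->size(); ++g) { groups.push_back((*it)[g].str()); }
    }
    if (groups.empty()) {
      result.push_back(std::nullopt);  // no match -> null output row
    } else {
      result.push_back(std::move(groups));
    }
  }
  return result;
}

// Replays the example from the PR description against the model.
void check_pr_example()
{
  const std::vector<std::optional<std::string>> s{std::string("ABC-200 DEF-400"),
                                                  std::string("GHI-60"),
                                                  std::string("JK-800"),
                                                  std::string("900"),
                                                  std::nullopt};
  const auto r = extract_all_model(s, R"((\w+)-(\d+))");
  assert(r.size() == 5);
  assert((*r[0] == std::vector<std::string>{"ABC", "200", "DEF", "400"}));  // 2 matches
  assert((*r[1] == std::vector<std::string>{"GHI", "60"}));                 // 1 match
  assert((*r[2] == std::vector<std::string>{"JK", "800"}));                 // 1 match
  assert(!r[3].has_value());                                                // no match
  assert(!r[4].has_value());                                                // null input
}
```

Calling `check_pr_example()` from a `main` replays the example above. Note that `std::regex` uses ECMAScript syntax by default, which may differ from the libcudf regex engine on patterns outside this simple case.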

Also reorganized the extract source code into src/strings/extract directory.
The match-counting logic has been factored out into a new count_matches.cuh, since it will become common code shared with findall_record in a follow-on PR.
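
As a rough illustration of why a separate counting pass is useful, the sketch below counts matches per row and then turns those counts into list offsets. This is a CPU-only sketch under assumptions: `count_matches_model` and `build_offsets` are hypothetical names, nulls are ignored for brevity, and the idea that the counts feed an offsets computation is my reading of the design, not a description of the actual `count_matches.cuh` code.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical CPU stand-in for a per-row match-counting pass.
std::vector<std::size_t> count_matches_model(const std::vector<std::string>& input,
                                             const std::regex& re)
{
  std::vector<std::size_t> counts;
  counts.reserve(input.size());
  for (const auto& s : input) {
    const auto n = std::distance(std::sregex_iterator(s.begin(), s.end(), re),
                                 std::sregex_iterator());
    counts.push_back(static_cast<std::size_t>(n));
  }
  return counts;
}

// Convert per-row match counts into list offsets.
// Each match contributes `groups` strings, so row i holds counts[i] * groups entries.
std::vector<std::size_t> build_offsets(const std::vector<std::size_t>& counts,
                                       std::size_t groups)
{
  std::vector<std::size_t> offsets(counts.size() + 1, 0);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i + 1] = offsets[i] + counts[i] * groups;
  }
  return offsets;
}
```

With the pattern `(\w+)-(\d+)` (two groups, available from `std::regex::mark_count()`) and the four non-null strings from the description, the counts are 2, 1, 1, 0 and the offsets are 0, 4, 6, 8, 8. A caller such as a findall-style function would pass `groups = 1`, which is why the counting step can be shared.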

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Dec 15, 2021
@davidwendt davidwendt self-assigned this Dec 15, 2021
@github-actions github-actions bot added the CMake CMake build issue label Dec 15, 2021
@codecov

codecov bot commented Dec 15, 2021

Codecov Report

Merging #9909 (d5b31d7) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9909      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20480     +175     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18346     +171     
| Impacted Files | Coverage Δ |
|---|---|
| python/dask_cudf/dask_cudf/sorting.py | 92.30% <0.00%> (-0.61%) ⬇️ |
| python/cudf/cudf/__init__.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/frame.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/index.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/io/parquet.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/series.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/utils/utils.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/utils/dtypes.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/utils/ioutils.py | 0.00% <0.00%> (ø) |
| python/cudf/cudf/core/dataframe.py | 0.00% <0.00%> (ø) |

... and 14 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce02856...d5b31d7.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Dec 16, 2021
@davidwendt davidwendt marked this pull request as ready for review December 16, 2021 15:54
@davidwendt davidwendt requested review from a team as code owners December 16, 2021 15:54
Contributor

@bdice bdice left a comment

A few minor questions/suggestions but overall this is really nicely written.

cpp/include/cudf/strings/extract.hpp (outdated, resolved)
cpp/src/strings/count_matches.cuh (outdated, resolved)
cpp/src/strings/extract/extract_all.cu (resolved)
cpp/src/strings/extract/extract_all.cu (outdated, resolved)
Contributor

@robertmaynard robertmaynard left a comment

CMake changes LGTM

Contributor

@bdice bdice left a comment

One minor suggestion. Otherwise LGTM!

cpp/src/strings/extract/extract_all.cu (outdated, resolved)
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Looks great. Tiny nit is all I have and it's up for interpretation.

@davidwendt
Contributor Author

rerun tests

@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit eba4f03 into rapidsai:branch-22.02 Jan 5, 2022
@davidwendt davidwendt deleted the fea-extract-all branch January 5, 2022 17:58
rapids-bot bot pushed a commit that referenced this pull request Jan 27, 2022
Reference #9856 specifically #9856 (comment)

Adds `cudf::strings::findall_record`, which was initially implemented in nvstrings but not ported over, since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow the current Python function `cudf.str.findall()` to be used more effectively with the `expand=False` parameter. A follow-on PR will address these Python changes.

This PR also reorganizes the libcudf strings _find_ source files into the `cpp/src/strings/search` subdirectory. In addition, `findall()` has only a regex version, so the `_re` suffix is dropped from the name in the libcudf implementation. The Python changes in this PR address only the name change and the addition of the new API to the Cython interface.

Depends on #9909 -- shares the `cudf::strings::detail::count_matches()` utility function.
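
For comparison with `extract_all`, a findall-style API returns the whole matches rather than the capture groups. The following CPU-only sketch illustrates that difference with `std::regex`. `findall_model` is a hypothetical name, null handling is omitted, and returning an empty list for a row with no matches is an assumption of this sketch, not a statement about what `findall_record` does.

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Hypothetical CPU reference model: one list of whole-match strings per input row.
// Assumption: rows without a match yield an empty list in this sketch.
std::vector<std::vector<std::string>> findall_model(const std::vector<std::string>& input,
                                                    const std::string& pattern)
{
  const std::regex re(pattern);
  std::vector<std::vector<std::string>> result;
  result.reserve(input.size());
  for (const auto& s : input) {
    std::vector<std::string> matches;
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it) {
      matches.push_back(it->str());  // str() with no argument is the whole match
    }
    result.push_back(std::move(matches));
  }
  return result;
}
```

For example, with the pattern `\d+`, the row `"ABC-200 DEF-400"` yields the list `["200", "400"]` and the row `"none"` yields an empty list in this model.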

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)

URL: #9911
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Development

Successfully merging this pull request may close these issues.

[FEA] Implement extract_all_re function
4 participants