Add cudf::strings::extract_all API #9909

davidwendt · 2021-12-15T14:06:37Z

Adds a new cudf::strings::extract_all API that returns a LIST column of extracted strings given a regex pattern.

This is similar to the nvstrings version of extract, called extract_record, but returns groups from all matches in each string instead of just the first match. Here is pseudocode of its behavior on various string inputs:

s = [ "ABC-200 DEF-400", "GHI-60", "JK-800", "900", NULL ]
r = extract_all( s, "(\w+)-(\d+)" )
r is a LIST column of strings that looks like this:

[ [ "ABC", "200", "DEF", "400" ], // 2 matches
  [ "GHI", "60" ], // 1 match
  [ "JK", "800" ], // 1 match
  NULL,            // no match
  NULL
]

Each match results in two groups as specified in the regex pattern.
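The behavior described above can be modelled on the host with `std::regex`. This is only a sketch of the semantics, not the libcudf implementation: `extract_all_reference`, `maybe_string`, and `maybe_list` are hypothetical names, `std::nullopt` stands in for a NULL row, and the ECMAScript regex flavor of `std::regex` is assumed to be close enough to the cudf regex engine for this pattern.

```cpp
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model; std::nullopt plays the role of a NULL row.
using maybe_string = std::optional<std::string>;
using maybe_list   = std::optional<std::vector<std::string>>;

// For each row, collect the capture groups of every match, in match order.
// Rows that are NULL or have no match produce a NULL list entry.
std::vector<maybe_list> extract_all_reference(std::vector<maybe_string> const& input,
                                              std::string const& pattern)
{
  std::regex const re(pattern);
  std::vector<maybe_list> result;
  result.reserve(input.size());
  for (auto const& row : input) {
    if (!row) {
      result.push_back(std::nullopt);  // NULL in, NULL out
      continue;
    }
    std::vector<std::string> groups;
    for (std::sregex_iterator it(row->begin(), row->end(), re), end; it != end; ++it) {
      // Index 0 is the whole match; capture groups start at index 1.
      for (std::size_t g = 1; g < it->size(); ++g) { groups.push_back((*it)[g].str()); }
    }
    if (groups.empty()) {
      result.push_back(std::nullopt);  // no match
    } else {
      result.push_back(std::move(groups));
    }
  }
  return result;
}
```

Calling it with the five example rows and the pattern `(\w+)-(\d+)` yields four strings for the first row, two each for the second and third, and NULL for the last two.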

Also reorganized the extract source code into the src/strings/extract directory.
The match-counting has been factored out into a new count_matches.cuh since it will become common code used with findall_record in a follow-on PR.
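One plausible reason a shared match-counting step is useful is that a LIST column needs its offsets before the child strings can be written. The sketch below illustrates that idea on the host. It is an assumption about the general approach, not the contents of count_matches.cuh: `count_matches_reference` and `make_list_offsets` are hypothetical names, and the real utility's signature is not shown in this PR description.

```cpp
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Hypothetical host-side stand-in: number of regex matches in each string.
std::vector<int> count_matches_reference(std::vector<std::string> const& input,
                                         std::regex const& re)
{
  std::vector<int> counts;
  counts.reserve(input.size());
  for (auto const& s : input) {
    auto const n =
      std::distance(std::sregex_iterator(s.begin(), s.end(), re), std::sregex_iterator());
    counts.push_back(static_cast<int>(n));
  }
  return counts;
}

// Running sum of (matches * groups) per row: offsets[i]..offsets[i+1] is the
// range of child strings that belongs to row i of the list column.
std::vector<int> make_list_offsets(std::vector<int> const& counts, int groups)
{
  std::vector<int> offsets(counts.size() + 1, 0);
  for (std::size_t i = 0; i < counts.size(); ++i) {
    offsets[i + 1] = offsets[i] + counts[i] * groups;
  }
  return offsets;
}
```

With the pattern `(\w+)-(\d+)`, `re.mark_count()` reports two groups, so a row with two matches occupies four child entries.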

codecov · 2021-12-15T15:38:05Z

Codecov Report

Merging #9909 (d5b31d7) into branch-22.02 (967a333) will decrease coverage by 0.07%.
The diff coverage is n/a.

@@               Coverage Diff                @@
##           branch-22.02    #9909      +/-   ##
================================================
- Coverage         10.49%   10.41%   -0.08%     
================================================
  Files               119      119              
  Lines             20305    20480     +175     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18346     +171

Impacted Files                           Coverage Δ
python/dask_cudf/dask_cudf/sorting.py    92.30% <0.00%> (-0.61%)
python/cudf/cudf/__init__.py             0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py           0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py           0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py           0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py          0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py          0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py         0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py        0.00% <0.00%> (ø)
python/cudf/cudf/core/dataframe.py       0.00% <0.00%> (ø)
... and 14 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update ce02856...d5b31d7. Read the comment docs.

bdice

A few minor questions/suggestions but overall this is really nicely written.

cpp/include/cudf/strings/extract.hpp

cpp/src/strings/count_matches.cuh

cpp/src/strings/extract/extract_all.cu

robertmaynard

CMake changes LGTM

bdice

One minor suggestion. Otherwise LGTM!

cpp/src/strings/extract/extract_all.cu

hyperbolic2346

Looks great. Tiny nit is all I have and it's up for interpretation.

davidwendt · 2022-01-05T13:32:29Z

rerun tests

davidwendt · 2022-01-05T17:58:08Z

@gpucibot merge

Reference #9856 specifically #9856 (comment)

Adds `cudf::strings::findall_record` which was initially implemented in nvstrings but not ported over since LIST column types did not exist at the time and returning a vector of small columns was very inefficient. This API should also allow using the current python function `cudf.str.findall()` with the `expand=False` parameter more effectively. A follow-on PR will address these python changes.

This PR reorganizes the libcudf strings _find_ source files into the `cpp/src/strings/search` subdirectory as well. Also, `findall()` has only a regex version so the `_re` suffix is dropped from the name in the libcudf implementation. The python changes in this PR address only the name change and the addition of the new API in the cython interface.

Depends on #9909 -- shares the `cudf::strings::detail::count_matches()` utility function.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vyas Ramasubramani (https://github.com/vyasr)

URL: #9911
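To contrast with extract_all, the record-style findall described above can be modelled as returning the whole matched substrings, one list per row. This is a hedged host-side sketch using `std::regex`, not the libcudf code: `findall_reference` is a hypothetical name, and returning an empty list for a row with no match is an assumption, since the description does not say how such rows are represented.

```cpp
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical host-side model: one list of whole-match strings per input row.
// Assumption: a row with no match yields an empty list.
std::vector<std::vector<std::string>> findall_reference(std::vector<std::string> const& input,
                                                        std::string const& pattern)
{
  std::regex const re(pattern);
  std::vector<std::vector<std::string>> result;
  result.reserve(input.size());
  for (auto const& s : input) {
    std::vector<std::string> matches;
    for (std::sregex_iterator it(s.begin(), s.end(), re), end; it != end; ++it) {
      matches.push_back(it->str());  // whole match, not the capture groups
    }
    result.push_back(std::move(matches));
  }
  return result;
}
```

Keeping all matches of a row in one list is what avoids the "vector of small columns" layout mentioned above.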

davidwendt added 5 commits December 10, 2021 17:14

Add cudf::strings::extract_all API

ccff618

Merge branch 'branch-22.02' into fea-extract-record

6e28845

fix merge conflict

63f7c7f

Merge branch 'branch-22.02' into fea-extract-all

c3664a4

Add libcudf API cudf::strings::extract_all

cdd3ab7

davidwendt added the feature request, 2 - In Progress, libcudf, strings, and non-breaking labels Dec 15, 2021

davidwendt self-assigned this Dec 15, 2021

github-actions bot added the CMake label Dec 15, 2021

Merge branch 'branch-22.02' into fea-extract-all

569c307

remove duplicate include

c8afe3f

davidwendt mentioned this pull request Dec 15, 2021

Add cudf::strings::findall_record API #9911

Merged

Merge branch 'branch-22.02' into fea-extract-all

fcc88cd

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Dec 16, 2021

davidwendt marked this pull request as ready for review December 16, 2021 15:54

davidwendt requested review from a team as code owners December 16, 2021 15:54

davidwendt requested review from hyperbolic2346 and bdice December 16, 2021 15:55

bdice requested changes Dec 17, 2021

cpp/include/cudf/strings/extract.hpp Outdated

cpp/src/strings/count_matches.cuh Outdated

cpp/src/strings/extract/extract_all.cu

cpp/src/strings/extract/extract_all.cu Outdated

robertmaynard approved these changes Dec 17, 2021

davidwendt added 2 commits December 17, 2021 18:51

fix header include position

2fbc19b

Merge branch 'branch-22.02' into fea-extract-all

153abaa

bdice approved these changes Dec 20, 2021

cpp/src/strings/extract/extract_all.cu Outdated

hyperbolic2346 reviewed Dec 21, 2021

cpp/src/strings/extract/extract_all.cu Outdated

hyperbolic2346 approved these changes Dec 21, 2021

simplify index-pair creation logic

d5b31d7

hyperbolic2346 approved these changes Jan 4, 2022

rapids-bot bot merged commit eba4f03 into rapidsai:branch-22.02 Jan 5, 2022

davidwendt deleted the fea-extract-all branch January 5, 2022 17:58
