Add regex_program class for use with all regex APIs #11927

Merged 27 commits on Nov 9, 2022
Changes from 15 commits
0b0b1f6
Add regex_program class for use with all regex APIs
davidwendt Oct 14, 2022
f6c8b6b
fix missing doxygen
davidwendt Oct 14, 2022
c6a9b14
Merge branch 'branch-22.12' into regex-program-class
davidwendt Oct 17, 2022
87783ac
add regex_program dtor decl/def
davidwendt Oct 17, 2022
1ebfc3b
fix doxygen for compute-working-memory-size function
davidwendt Oct 17, 2022
52d86fd
add gtests with regex_program
davidwendt Oct 19, 2022
885c51e
Merge branch 'branch-22.12' into regex-program-class
davidwendt Oct 19, 2022
bd9cd9c
add gtests for split_re functions
davidwendt Oct 20, 2022
999c622
Merge branch 'branch-22.12' into regex-program-class
davidwendt Oct 20, 2022
c9cce60
fix merge conflict
davidwendt Oct 21, 2022
9f7ba6e
Merge branch 'branch-22.12' into regex-program-class
davidwendt Oct 24, 2022
aa3e021
Merge branch 'branch-22.12' into regex-program-class
davidwendt Oct 25, 2022
8c904f6
update doxygen for future work
davidwendt Oct 25, 2022
7b1457f
Merge branch 'branch-22.12' into regex-program-class
davidwendt Oct 26, 2022
3587ee5
fix merge conflict
davidwendt Oct 31, 2022
6fb423c
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 1, 2022
0d8aaaa
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 1, 2022
1783af9
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 2, 2022
93d3ba2
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 2, 2022
9759277
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 2, 2022
915df87
delete def ctor; fix parameter order
davidwendt Nov 3, 2022
c4c686b
replace get_impl() with friend access class
davidwendt Nov 3, 2022
6324240
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 3, 2022
f85897f
add ctors for the impl class
davidwendt Nov 3, 2022
7421135
Merge branch 'branch-22.12' into regex-program-class
davidwendt Nov 4, 2022
e90c4d0
fix merge conflicts
davidwendt Nov 4, 2022
d2d997a
fix wording in doxygen comment for regex_program
davidwendt Nov 7, 2022
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -234,6 +234,7 @@ outputs:
- test -f $PREFIX/include/cudf/strings/json.hpp
- test -f $PREFIX/include/cudf/strings/padding.hpp
- test -f $PREFIX/include/cudf/strings/regex/flags.hpp
- test -f $PREFIX/include/cudf/strings/regex/regex_program.hpp
- test -f $PREFIX/include/cudf/strings/repeat_strings.hpp
- test -f $PREFIX/include/cudf/strings/replace.hpp
- test -f $PREFIX/include/cudf/strings/replace_re.hpp
3 changes: 2 additions & 1 deletion cpp/CMakeLists.txt
@@ -498,7 +498,8 @@ add_library(
src/strings/padding.cu
src/strings/json/json_path.cu
src/strings/regex/regcomp.cpp
- src/strings/regex/regexec.cu
+ src/strings/regex/regexec.cpp
+ src/strings/regex/regex_program.cpp
src/strings/repeat_strings.cu
src/strings/replace/backref_re.cu
src/strings/replace/multi_re.cu
Expand Down
81 changes: 81 additions & 0 deletions cpp/include/cudf/strings/contains.hpp
@@ -24,6 +24,9 @@

namespace cudf {
namespace strings {

struct regex_program;
Contributor

We may want to include a note in the developer guide about when we prefer forward declarations over includes (if you can come up with a pithy summary).
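For readers of this thread, the trade-off can be sketched in plain C++. This is only an illustration with hypothetical names (`demo::program`, `uses_pattern`, `make_program`), not libcudf code: a header that mentions a type only by reference or pointer can forward declare it, and the full definition is supplied elsewhere.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// --- what a public header might contain (all names here are hypothetical) ---
namespace demo {

struct program;  // forward declaration: the full definition is not needed yet

// Declarations that only use references or smart pointers to `program`
// compile against the incomplete type.
bool uses_pattern(program const& prog, std::string const& pattern);
std::unique_ptr<program> make_program(std::string pattern);

}  // namespace demo

// --- what the implementation file (or the type's own header) provides ---
namespace demo {

struct program {
  std::string pattern;
};

bool uses_pattern(program const& prog, std::string const& pattern)
{
  return prog.pattern == pattern;
}

std::unique_ptr<program> make_program(std::string pattern)
{
  return std::make_unique<program>(program{std::move(pattern)});
}

}  // namespace demo
```

The forward declaration keeps the header light, at the cost that callers who actually construct or inspect the type must include its defining header themselves.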


/**
* @addtogroup strings_contains
* @{
@@ -58,6 +61,32 @@ std::unique_ptr<column> contains_re(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a boolean column identifying rows which
* match the given regex_program object
*
* @code{.pseudo}
* Example:
* s = ["abc", "123", "def456"]
* p = regex_program::create("\\d+")
* r = contains_re(s, p)
* r is now [false, true, true]
* @endcode
*
* Any null string entries return corresponding null output column entries.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column of boolean results for each string
*/
std::unique_ptr<column> contains_re(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
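The row-wise semantics documented above can be mimicked on the host with `std::regex`. This is a rough sketch only: `contains_re_host` is a hypothetical name, `std::regex` stands in for the libcudf GPU regex engine (whose supported syntax differs), and `std::optional` models null rows.

```cpp
#include <cassert>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Host-only illustration; `contains_re_host` is a hypothetical name, not a libcudf function.
// A std::nullopt input models a null row and yields a null (std::nullopt) output row.
std::vector<std::optional<bool>> contains_re_host(
  std::vector<std::optional<std::string>> const& rows, std::string const& pattern)
{
  std::regex const re(pattern);  // compile once, reuse for every row
  std::vector<std::optional<bool>> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    if (!row) {
      result.push_back(std::nullopt);
    } else {
      // regex_search succeeds if the pattern matches anywhere in the string
      result.push_back(std::regex_search(*row, re));
    }
  }
  return result;
}
```

Compiling the pattern once and reusing it across rows is the same idea that motivates passing a prebuilt program object instead of a pattern string.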

/**
* @brief Returns a boolean column identifying rows which
* match the given regex pattern but only at the beginning of the string.
Expand Down Expand Up @@ -85,6 +114,32 @@ std::unique_ptr<column> matches_re(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a boolean column identifying rows which
* match the given regex_program object but only at the beginning of the string.
*
* @code{.pseudo}
* Example:
* s = ["abc", "123", "def456"]
* p = regex_program::create("\\d+")
* r = matches_re(s, p)
* r is now [false, true, false]
* @endcode
*
* Any null string entries return corresponding null output column entries.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column of boolean results for each string
*/
std::unique_ptr<column> matches_re(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
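The difference from `contains_re` is that the match must start at the first character. A host-side approximation, assuming `std::regex` as a stand-in engine and using the hypothetical name `matches_re_host`, anchors the search with `std::regex_constants::match_continuous`:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Host-only illustration; `matches_re_host` is a hypothetical name, not a libcudf function.
std::vector<bool> matches_re_host(std::vector<std::string> const& rows,
                                  std::string const& pattern)
{
  std::regex const re(pattern);
  std::vector<bool> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    // match_continuous requires the match to begin at the first character,
    // but (unlike std::regex_match) it does not have to consume the whole string
    result.push_back(std::regex_search(row, re, std::regex_constants::match_continuous));
  }
  return result;
}
```

With the doxygen example input, `"def456"` is rejected because its digits do not start at position 0.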

/**
* @brief Returns the number of times the given regex pattern
* matches in each string.
Expand Down Expand Up @@ -112,6 +167,32 @@ std::unique_ptr<column> count_re(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the number of times the given regex_program's pattern
* matches in each string
*
* @code{.pseudo}
* Example:
* s = ["abc", "123", "def45"]
* p = regex_program::create("\\d")
* r = count_re(s, p)
* r is now [0, 3, 2]
* @endcode
*
* Any null string entries return corresponding null output column entries.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New INT32 column with counts for each string
*/
std::unique_ptr<column> count_re(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
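Counting non-overlapping matches per row can be approximated on the host by walking a `std::sregex_iterator`. This is a sketch under the assumption that `std::regex` is an acceptable stand-in; `count_re_host` is a hypothetical name.

```cpp
#include <cassert>
#include <iterator>
#include <regex>
#include <string>
#include <vector>

// Host-only illustration; `count_re_host` is a hypothetical name, not a libcudf function.
std::vector<int> count_re_host(std::vector<std::string> const& rows,
                               std::string const& pattern)
{
  std::regex const re(pattern);
  std::vector<int> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    // Each step of the iterator is one non-overlapping match in the row
    auto const first = std::sregex_iterator(row.begin(), row.end(), re);
    auto const last  = std::sregex_iterator();
    result.push_back(static_cast<int>(std::distance(first, last)));
  }
  return result;
}
```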

/**
* @brief Returns a boolean column identifying rows which
* match the given like pattern.
68 changes: 68 additions & 0 deletions cpp/include/cudf/strings/extract.hpp
@@ -23,6 +23,9 @@

namespace cudf {
namespace strings {

struct regex_program;

/**
* @addtogroup strings_substring
* @{
Expand Down Expand Up @@ -61,6 +64,37 @@ std::unique_ptr<table> extract(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a table of strings columns where each column corresponds to the matching
* group specified in the given regex_program object
*
* All the strings for the first group will go in the first output column; the second group
* goes in the second column, and so on. Null entries are added to the columns in row `i` if
* the string at row `i` does not match.
*
* Any null string entries return corresponding null output column entries.
*
* @code{.pseudo}
* Example:
* s = ["a1", "b2", "c3"]
* p = regex_program::create("([ab])(\\d)")
* r = extract(s, p)
* r is now [ ["a", "b", null],
* ["1", "2", null] ]
* @endcode
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned table's device memory
* @return Columns of strings extracted from the input column
*/
std::unique_ptr<table> extract(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
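The column-per-group layout described above can be sketched on the host as follows. The names `extract_host` and `column` are hypothetical, `std::regex` stands in for the libcudf engine, and the sketch assumes the first match found anywhere in each string is the one extracted.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Host-only illustration; these names are hypothetical, not libcudf types.
using column = std::vector<std::optional<std::string>>;

std::vector<column> extract_host(std::vector<std::string> const& rows,
                                 std::string const& pattern)
{
  std::regex const re(pattern);
  unsigned const num_groups = re.mark_count();  // number of capture groups
  // One output column per group; every entry starts out null
  std::vector<column> table(num_groups, column(rows.size()));
  for (std::size_t i = 0; i < rows.size(); ++i) {
    std::smatch m;
    if (std::regex_search(rows[i], m, re)) {
      for (unsigned g = 0; g < num_groups; ++g) {
        // m[0] is the whole match, so group g lives at index g + 1
        if (m[g + 1].matched) { table[g][i] = m[g + 1].str(); }
      }
    }
  }
  return table;
}
```

Rows that do not match are left null in every column, which mirrors the `["a", "b", null]` and `["1", "2", null]` result in the doxygen example.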

/**
* @brief Returns a lists column of strings where each string column row corresponds to the
* matching group specified in the given regular expression pattern.
Expand Down Expand Up @@ -96,6 +130,40 @@ std::unique_ptr<column> extract_all_record(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a lists column of strings where each string column row corresponds to the
* matching group specified in the given regex_program object
*
* All the matching groups for the first row will go in the first row output column; the second
* row results will go into the second row output column and so on.
*
* A null output row will result if the corresponding input string row does not match or
* that input row is null.
*
* @code{.pseudo}
* Example:
* s = ["a1 b4", "b2", "c3 a5", "b", null]
* p = regex_program::create("([ab])(\\d)")
* r = extract_all_record(s, p)
* r is now [ ["a", "1", "b", "4"],
* ["b", "2"],
* ["a", "5"],
* null,
* null ]
* @endcode
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate any returned device memory
* @return Lists column containing strings extracted from the input column
*/
std::unique_ptr<column> extract_all_record(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
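The record-style output, one list per input row holding the groups of every match, can be approximated on the host like this. `extract_all_record_host` and `row_list` are hypothetical names, `std::regex` is a stand-in engine, and `std::optional` models null rows.

```cpp
#include <cassert>
#include <cstddef>
#include <optional>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Host-only illustration; these names are hypothetical, not libcudf types.
using row_list = std::optional<std::vector<std::string>>;

std::vector<row_list> extract_all_record_host(
  std::vector<std::optional<std::string>> const& rows, std::string const& pattern)
{
  std::regex const re(pattern);
  std::vector<row_list> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    if (!row) {  // null input row gives a null output row
      result.push_back(std::nullopt);
      continue;
    }
    std::vector<std::string> groups;
    bool matched = false;
    for (auto it = std::sregex_iterator(row->begin(), row->end(), re);
         it != std::sregex_iterator();
         ++it) {
      matched = true;
      // Index 0 is the whole match; append only the capture groups
      for (std::size_t g = 1; g < it->size(); ++g) { groups.push_back((*it)[g].str()); }
    }
    if (matched) {
      result.push_back(std::move(groups));
    } else {
      result.push_back(std::nullopt);  // no match also gives a null output row
    }
  }
  return result;
}
```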

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
36 changes: 36 additions & 0 deletions cpp/include/cudf/strings/findall.hpp
@@ -23,6 +23,9 @@

namespace cudf {
namespace strings {

struct regex_program;

/**
* @addtogroup strings_contains
* @{
Expand Down Expand Up @@ -63,6 +66,39 @@ std::unique_ptr<column> findall(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a lists column of strings for each matching occurrence using
* the regex_program pattern within each string
*
* Each output row includes all the substrings within the corresponding input row
* that match the given pattern. If no matches are found, the output row is empty.
*
* @code{.pseudo}
* Example:
* s = ["bunny", "rabbit", "hare", "dog"]
* p = regex_program::create("[ab]")
* r = findall(s, p)
* r is now a lists column like:
* [ ["b"]
* ["a","b","b"]
* ["a"]
* [] ]
* @endcode
*
* A null output row occurs if the corresponding input row is null.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param input Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New lists column of strings
*/
std::unique_ptr<column> findall(
strings_column_view const& input,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());
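A host-side approximation of the list-per-row result, again assuming `std::regex` as a stand-in engine and using the hypothetical name `findall_host`, collects the full text of every match in each row:

```cpp
#include <cassert>
#include <regex>
#include <string>
#include <vector>

// Host-only illustration; `findall_host` is a hypothetical name, not a libcudf function.
std::vector<std::vector<std::string>> findall_host(std::vector<std::string> const& rows,
                                                   std::string const& pattern)
{
  std::regex const re(pattern);
  std::vector<std::vector<std::string>> result;
  result.reserve(rows.size());
  for (auto const& row : rows) {
    std::vector<std::string> matches;  // stays empty when nothing matches
    for (auto it = std::sregex_iterator(row.begin(), row.end(), re);
         it != std::sregex_iterator();
         ++it) {
      matches.push_back(it->str());  // whole-match text
    }
    result.push_back(matches);
  }
  return result;
}
```

Unlike the extract functions, a row with no matches produces an empty list here, not a null entry.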

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
2 changes: 1 addition & 1 deletion cpp/include/cudf/strings/regex/flags.hpp
@@ -21,7 +21,7 @@ namespace cudf {
namespace strings {

/**
- * @addtogroup strings_contains
+ * @addtogroup strings_regex
* @{
*/
