Add regex_program class for use with all regex APIs (#11927)
Adds a new `regex_program` class to encapsulate a regex pattern and parameters used for executing regex calls on strings columns in libcudf. This provides a single object to hold the regex settings rather than adding or updating parameters to every call. Given a pattern (and other settings), it will _compile_ and validate the pattern and build the set of instructions/commands needed to execute the regex on a strings column. Converting the pattern is done in CPU code. The object contains no state data and can be reused on the same API or other similar calls as appropriate (per the settings).
The object can also be queried to help with resource allocation/expectations.
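
The idea described above (compile and validate the pattern once on the host, then reuse the same stateless object across calls) can be illustrated without a GPU. The following is a minimal host-only sketch built on `std::regex`; `program_sketch`, `row`, and these versions of `contains_re` and `count_re` are hypothetical stand-ins rather than the libcudf API, and `std::regex` does not necessarily accept the same syntax as libcudf's regex engine.

```cpp
#include <cstddef>
#include <iterator>
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical stand-in for a nullable string row (std::nullopt plays the role of null).
using row = std::optional<std::string>;

// Hypothetical host-only analogue of the idea behind regex_program:
// the pattern and its settings are compiled and validated once, and the
// resulting immutable object is passed to any number of calls.
class program_sketch {
 public:
  // Throws std::regex_error for an invalid pattern, so validation happens
  // at creation time instead of inside each call.
  static program_sketch create(std::string const& pattern,
                               std::regex::flag_type flags = std::regex::ECMAScript)
  {
    return program_sketch(pattern, flags);
  }

  std::string const& pattern() const { return pattern_; }
  std::regex const& compiled() const { return compiled_; }

  // Number of capture groups: the kind of query a caller could use to size outputs.
  std::size_t groups_count() const { return compiled_.mark_count(); }

 private:
  program_sketch(std::string const& pattern, std::regex::flag_type flags)
    : pattern_(pattern), compiled_(pattern, flags)
  {
  }

  std::string pattern_;
  std::regex compiled_;
};

// True where the pattern matches anywhere in the row; null rows stay null.
std::vector<std::optional<bool>> contains_re(std::vector<row> const& rows,
                                             program_sketch const& prog)
{
  std::vector<std::optional<bool>> out;
  out.reserve(rows.size());
  for (auto const& r : rows) {
    if (!r) {
      out.push_back(std::nullopt);
      continue;
    }
    out.push_back(std::regex_search(*r, prog.compiled()));
  }
  return out;
}

// Number of non-overlapping matches in each row; null rows stay null.
std::vector<std::optional<int>> count_re(std::vector<row> const& rows,
                                         program_sketch const& prog)
{
  std::vector<std::optional<int>> out;
  out.reserve(rows.size());
  for (auto const& r : rows) {
    if (!r) {
      out.push_back(std::nullopt);
      continue;
    }
    auto const first = std::sregex_iterator(r->begin(), r->end(), prog.compiled());
    auto const last  = std::sregex_iterator();
    out.push_back(static_cast<int>(std::distance(first, last)));
  }
  return out;
}
```

In this sketch, a single object from `program_sketch::create("\\d+")` can be handed to `contains_re` and `count_re` in turn without being recompiled, which is the reuse property the commit message attributes to `regex_program`.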

The main files to review are the new `regex_program*` source files plus the corresponding changes in `regexec.cpp` (renamed from `regexec.cu`). The remaining changes are side effects that follow a common pattern for adopting the new object.
No function or behavior has changed; rather, a new interface has been added over existing functionality. Additional tests have been added to exercise it through the companion APIs.

Currently, all regex APIs are duplicated -- the original API plus a new one accepting a `regex_program` object. Once accepted, we may consider deprecating the non-object APIs and then removing them in a future release.
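
One way to picture the duplicated APIs is as an overload pair in which the original pattern-and-flags signature builds a program and forwards to the new one. This is only a sketch under that assumption: the commit message does not state that the original libcudf overloads forward this way, and `compiled_program`, `maybe_string`, and this host-only `matches_re` are hypothetical names built on `std::regex`.

```cpp
#include <optional>
#include <regex>
#include <string>
#include <vector>

// Hypothetical nullable string row.
using maybe_string = std::optional<std::string>;

// Hypothetical stand-in for a compiled pattern plus its settings.
struct compiled_program {
  std::regex compiled;

  static compiled_program create(std::string const& pattern,
                                 std::regex::flag_type flags = std::regex::ECMAScript)
  {
    return compiled_program{std::regex(pattern, flags)};
  }
};

// New-style overload: takes the program object.
// A row is true only if the pattern matches at the beginning of the string.
std::vector<std::optional<bool>> matches_re(std::vector<maybe_string> const& rows,
                                            compiled_program const& prog)
{
  std::vector<std::optional<bool>> out;
  out.reserve(rows.size());
  for (auto const& r : rows) {
    if (!r) {
      out.push_back(std::nullopt);
      continue;
    }
    // match_continuous anchors the search at the first character.
    out.push_back(
      std::regex_search(*r, prog.compiled, std::regex_constants::match_continuous));
  }
  return out;
}

// Original-style overload: takes the pattern and flags directly and
// (by assumption, in this sketch) delegates to the program overload.
std::vector<std::optional<bool>> matches_re(std::vector<maybe_string> const& rows,
                                            std::string const& pattern,
                                            std::regex::flag_type flags = std::regex::ECMAScript)
{
  return matches_re(rows, compiled_program::create(pattern, flags));
}
```

Keeping a single implementation behind both signatures is what would make a later deprecation of the non-object overload a small, mechanical change.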

This will help with changes needed for #10852

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Bradley Dice (https://github.com/bdice)
  - Ray Douglass (https://github.com/raydouglass)

URL: #11927
davidwendt authored Nov 9, 2022
1 parent 628cd4f commit 74053f4
Showing 29 changed files with 1,508 additions and 334 deletions.
1 change: 1 addition & 0 deletions conda/recipes/libcudf/meta.yaml
@@ -236,6 +236,7 @@ outputs:
- test -f $PREFIX/include/cudf/strings/json.hpp
- test -f $PREFIX/include/cudf/strings/padding.hpp
- test -f $PREFIX/include/cudf/strings/regex/flags.hpp
- test -f $PREFIX/include/cudf/strings/regex/regex_program.hpp
- test -f $PREFIX/include/cudf/strings/repeat_strings.hpp
- test -f $PREFIX/include/cudf/strings/replace.hpp
- test -f $PREFIX/include/cudf/strings/replace_re.hpp
3 changes: 2 additions & 1 deletion cpp/CMakeLists.txt
@@ -501,7 +501,8 @@ add_library(
src/strings/padding.cu
src/strings/json/json_path.cu
src/strings/regex/regcomp.cpp
- src/strings/regex/regexec.cu
+ src/strings/regex/regexec.cpp
+ src/strings/regex/regex_program.cpp
src/strings/repeat_strings.cu
src/strings/replace/backref_re.cu
src/strings/replace/multi_re.cu
81 changes: 81 additions & 0 deletions cpp/include/cudf/strings/contains.hpp
@@ -24,6 +24,9 @@

namespace cudf {
namespace strings {

struct regex_program;

/**
* @addtogroup strings_contains
* @{
@@ -58,6 +61,32 @@ std::unique_ptr<column> contains_re(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a boolean column identifying rows which
* match the given regex_program object
*
* @code{.pseudo}
* Example:
* s = ["abc", "123", "def456"]
* p = regex_program::create("\\d+")
* r = contains_re(s, p)
* r is now [false, true, true]
* @endcode
*
* Any null string entries return corresponding null output column entries.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column of boolean results for each string
*/
std::unique_ptr<column> contains_re(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a boolean column identifying rows which
* match the given regex pattern but only at the beginning of the string.
@@ -85,6 +114,32 @@ std::unique_ptr<column> matches_re(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a boolean column identifying rows which
* match the given regex_program object but only at the beginning of the string.
*
* @code{.pseudo}
* Example:
* s = ["abc", "123", "def456"]
* p = regex_program::create("\\d+")
* r = matches_re(s, p)
* r is now [false, true, false]
* @endcode
*
* Any null string entries return corresponding null output column entries.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New column of boolean results for each string
*/
std::unique_ptr<column> matches_re(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the number of times the given regex pattern
* matches in each string.
@@ -112,6 +167,32 @@ std::unique_ptr<column> count_re(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns the number of times the given regex_program's pattern
* matches in each string
*
* @code{.pseudo}
* Example:
* s = ["abc", "123", "def45"]
* p = regex_program::create("\\d")
* r = count_re(s, p)
* r is now [0, 3, 2]
* @endcode
*
* Any null string entries return corresponding null output column entries.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New INT32 column with counts for each string
*/
std::unique_ptr<column> count_re(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a boolean column identifying rows which
* match the given like pattern.
68 changes: 68 additions & 0 deletions cpp/include/cudf/strings/extract.hpp
@@ -23,6 +23,9 @@

namespace cudf {
namespace strings {

struct regex_program;

/**
* @addtogroup strings_substring
* @{
@@ -61,6 +64,37 @@ std::unique_ptr<table> extract(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a table of strings columns where each column corresponds to the matching
* group specified in the given regex_program object
*
* All the strings for the first group will go in the first output column; the second group
* go in the second column and so on. Null entries are added to the columns in row `i` if
* the string at row `i` does not match.
*
* Any null string entries return corresponding null output column entries.
*
* @code{.pseudo}
* Example:
* s = ["a1", "b2", "c3"]
* p = regex_program::create("([ab])(\\d)")
* r = extract(s, p)
* r is now [ ["a", "b", null],
* ["1", "2", null] ]
* @endcode
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned table's device memory
* @return Columns of strings extracted from the input column
*/
std::unique_ptr<table> extract(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a lists column of strings where each string column row corresponds to the
* matching group specified in the given regular expression pattern.
@@ -96,6 +130,40 @@ std::unique_ptr<column> extract_all_record(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a lists column of strings where each string column row corresponds to the
* matching group specified in the given regex_program object
*
* All the matching groups for the first row will go in the first row output column; the second
* row results will go into the second row output column and so on.
*
* A null output row will result if the corresponding input string row does not match or
* that input row is null.
*
* @code{.pseudo}
* Example:
* s = ["a1 b4", "b2", "c3 a5", "b", null]
* p = regex_program::create("([ab])(\\d)")
* r = extract_all_record(s, p)
* r is now [ ["a", "1", "b", "4"],
* ["b", "2"],
* ["a", "5"],
* null,
* null ]
* @endcode
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param strings Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate any returned device memory
* @return Lists column containing strings extracted from the input column
*/
std::unique_ptr<column> extract_all_record(
strings_column_view const& strings,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
36 changes: 36 additions & 0 deletions cpp/include/cudf/strings/findall.hpp
@@ -23,6 +23,9 @@

namespace cudf {
namespace strings {

struct regex_program;

/**
* @addtogroup strings_contains
* @{
@@ -63,6 +66,39 @@ std::unique_ptr<column> findall(
regex_flags const flags = regex_flags::DEFAULT,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/**
* @brief Returns a lists column of strings for each matching occurrence using
* the regex_program pattern within each string
*
* Each output row includes all the substrings within the corresponding input row
* that match the given pattern. If no matches are found, the output row is empty.
*
* @code{.pseudo}
* Example:
* s = ["bunny", "rabbit", "hare", "dog"]
* p = regex_program::create("[ab]")
* r = findall(s, p)
* r is now a lists column like:
* [ ["b"]
* ["a","b","b"]
* ["a"]
* [] ]
* @endcode
*
* A null output row occurs if the corresponding input row is null.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param input Strings instance for this operation
* @param prog Regex program instance
* @param mr Device memory resource used to allocate the returned column's device memory
* @return New lists column of strings
*/
std::unique_ptr<column> findall(
strings_column_view const& input,
regex_program const& prog,
rmm::mr::device_memory_resource* mr = rmm::mr::get_current_device_resource());

/** @} */ // end of doxygen group
} // namespace strings
} // namespace cudf
2 changes: 1 addition & 1 deletion cpp/include/cudf/strings/regex/flags.hpp
@@ -21,7 +21,7 @@ namespace cudf {
namespace strings {

/**
- * @addtogroup strings_contains
+ * @addtogroup strings_regex
* @{
*/
