Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add regex_program class for use with all regex APIs #11927

Merged
merged 27 commits into from
Nov 9, 2022

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Oct 14, 2022

Description

Adds a new regex_program class to encapsulate a regex pattern and parameters used for executing regex calls on strings columns in libcudf. This provides a single object to hold the regex settings rather than adding or updating parameters to every call. Given a pattern (and other settings), it will compile and validate the pattern and build the set of instructions/commands needed to execute the regex on a strings column. Converting the pattern is done in CPU code. The object contains no state data and can be reused on the same API or other similar calls as appropriate (per the settings).
The object can also be queried to help with resource allocation/expectations.

The main files to review are the new regex_program* source files plus the corresponding changes in regexec.cpp (renamed from .cu). The remainder are simply side-effects and have common patterns to use the new object.
No function or behavior has changed but rather an new interface has been added over existing function but additional tests have been added to exercise through the companion APIs.

Currently, all regex APIs are duplicated -- the original API plus a new one accepting a regex_progam object. Once accepted we may consider deprecating the non-object APIs and then removing them in a future release.

This will help with changes needed for #10852

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 14, 2022
@davidwendt davidwendt self-assigned this Oct 14, 2022
@github-actions github-actions bot added the CMake CMake build issue label Oct 14, 2022
@github-actions github-actions bot added the conda label Oct 14, 2022
@codecov
Copy link

codecov bot commented Oct 14, 2022

Codecov Report

Base: 87.47% // Head: 88.07% // Increases project coverage by +0.59% 🎉

Coverage data is based on head (d2d997a) compared to base (f817d96).
Patch has no changes to coverable lines.

Additional details and impacted files
@@               Coverage Diff                @@
##           branch-22.12   #11927      +/-   ##
================================================
+ Coverage         87.47%   88.07%   +0.59%     
================================================
  Files               133      135       +2     
  Lines             21826    22025     +199     
================================================
+ Hits              19093    19398     +305     
+ Misses             2733     2627     -106     
Impacted Files Coverage Δ
python/cudf/cudf/core/column/interval.py 85.45% <0.00%> (-9.10%) ⬇️
python/cudf/cudf/io/text.py 91.66% <0.00%> (-8.34%) ⬇️
python/cudf/cudf/core/_base_index.py 81.28% <0.00%> (-4.27%) ⬇️
python/cudf/cudf/io/json.py 92.06% <0.00%> (-2.68%) ⬇️
python/cudf/cudf/utils/utils.py 89.91% <0.00%> (-0.69%) ⬇️
python/cudf/cudf/core/column/timedelta.py 90.17% <0.00%> (-0.58%) ⬇️
python/cudf/cudf/core/column/datetime.py 89.21% <0.00%> (-0.51%) ⬇️
python/cudf/cudf/core/column/column.py 87.96% <0.00%> (-0.46%) ⬇️
python/dask_cudf/dask_cudf/core.py 73.72% <0.00%> (-0.41%) ⬇️
python/cudf/cudf/io/parquet.py 90.45% <0.00%> (-0.39%) ⬇️
... and 39 more

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 25, 2022
@davidwendt davidwendt marked this pull request as ready for review October 25, 2022 19:01
@davidwendt davidwendt requested review from a team as code owners October 25, 2022 19:01
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the refactor. I mostly have questions about the current encapsulation of the program impl class.

Also, should we be concerned about overloading the "prog" name given that we already use this on device for the reprog_device class? Are the two similar enough that it makes sense to have similar names? Should we consider an alternative?

@@ -24,6 +24,9 @@

namespace cudf {
namespace strings {

struct regex_program;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to include a note in the developer guide about our preferred uses of forward declarations of includes (if you can come up with a pithy summary).

cpp/include/cudf/strings/regex/regex_program.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/regex/regex_program.hpp Outdated Show resolved Hide resolved
cpp/src/strings/regex/regex_program.cpp Show resolved Hide resolved
cpp/src/strings/regex/regcomp.h Show resolved Hide resolved
cpp/src/strings/regex/regex.cuh Show resolved Hide resolved
cpp/src/strings/regex/regex_program_impl.h Outdated Show resolved Hide resolved
cpp/src/strings/regex/regexec.cpp Show resolved Hide resolved
@davidwendt davidwendt requested a review from vyasr November 3, 2022 12:50
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@bdice bdice left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Encapsulating information in the regex_program will help keep the public APIs constrained while adding new features that can be controlled via the program.

cpp/include/cudf/strings/regex/regex_program.hpp Outdated Show resolved Hide resolved
cpp/include/cudf/strings/regex/regex_program.hpp Outdated Show resolved Hide resolved
@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 74053f4 into rapidsai:branch-22.12 Nov 9, 2022
@davidwendt davidwendt deleted the regex-program-class branch November 9, 2022 18:52
@jlowe
Copy link
Member

jlowe commented Nov 10, 2022

FYI, surprisingly this PR has broken a large number of RAPIDS Accelerator tests related to regex. Tests pass before this change and fail afterwards. In at least one of the tests, the nature of the failure appears to be solely related to the data column. There are null characters (or uninitialized memory?) appearing in blocks of rows periodically in the output. I'll try to narrow this down to a repro case in C++.

@davidwendt
Copy link
Contributor Author

I just noticed this too. I'm working on a fix.

rapids-bot bot pushed a commit that referenced this pull request Nov 10, 2022
rapids-bot bot pushed a commit that referenced this pull request Mar 9, 2023
Deprecates the libcudf regex APIs that do not accept `cudf::strings::regex_program` objects.
The libcudf regex functions were converted to accept `regex_program` objects instead of a pattern string and regex-flags directly in PR #11927
Most of this is removing calls to the deprecated functions in the gtests.
The declarations in the headers had to be reordered to make the reference links work in the doxygen output.

These deprecated functions will be removed in a future release.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Nghia Truong (https://github.com/ttnghia)

URL: #12891
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants