Regex cleanup internal reclass and reclass_device classes #11045

Merged 16 commits on Jun 27, 2022

@davidwendt davidwendt commented Jun 3, 2022

This cleans up the awkward range literals that support the CCLASS and NCCLASS regex instructions. The range values were always paired (first, last) but stored consecutively in a flat vector, so [idx] and [idx+1] formed a range pair whenever idx was even. This PR introduces a reclass_range class that holds each pair so we can use normal algorithms to manipulate them.

There is some overlap with code changes in PR #10975

Reference #3582
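
To illustrate the change described above, here is a minimal sketch of moving from the flat (first, last, first, last, ...) layout to explicit pairs. The reclass_range struct below is a hypothetical stand-in with assumed member types, and to_ranges is an illustrative helper name that does not come from the PR; the real libcudf definitions may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the PR's reclass_range; member types are assumed.
struct reclass_range {
  char32_t first;
  char32_t last;
};

// Illustrative helper (not from the PR): convert the flat layout
// {first0, last0, first1, last1, ...} into explicit (first, last) pairs.
// A trailing unpaired value, if any, is ignored.
std::vector<reclass_range> to_ranges(std::vector<char32_t> const& literals)
{
  std::vector<reclass_range> ranges;
  ranges.reserve(literals.size() / 2);
  for (std::size_t idx = 0; idx + 1 < literals.size(); idx += 2) {
    ranges.push_back(reclass_range{literals[idx], literals[idx + 1]});
  }
  return ranges;
}
```

With the pairs held in one object, standard algorithms such as std::sort can operate on whole ranges instead of on index arithmetic over even and odd positions.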

@davidwendt davidwendt added labels 2 - In Progress, libcudf, strings, improvement, non-breaking on Jun 3, 2022
@davidwendt davidwendt self-assigned this Jun 3, 2022

codecov bot commented Jun 3, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.08@6362fbe).
The diff coverage is n/a.

❗ Current head 2bcbccc differs from pull request most recent head 7d3539b. Consider uploading reports for the commit 7d3539b to get more accurate results

@@               Coverage Diff               @@
##             branch-22.08   #11045   +/-   ##
===============================================
  Coverage                ?   86.34%           
===============================================
  Files                   ?      144           
  Lines                   ?    22729           
  Branches                ?        0           
===============================================
  Hits                    ?    19625           
  Misses                  ?     3104           
  Partials                ?        0           

@davidwendt davidwendt added label 3 - Ready for Review and removed label 2 - In Progress on Jun 16, 2022
@davidwendt davidwendt marked this pull request as ready for review June 16, 2022 20:14
@davidwendt davidwendt requested a review from a team as a code owner June 16, 2022 20:14

@hyperbolic2346 hyperbolic2346 left a comment

Couple of questions, but I like this cleanup overall.

cpp/src/strings/regex/regcomp.cpp: 4 resolved review threads (3 outdated)

@hyperbolic2346 hyperbolic2346 left a comment

Thanks for the clarification and changes.

Comment on lines +281 to +297
// transform pairs of literals to ranges
std::vector<reclass_range> ranges(literals.size() / 2);
auto const counter = thrust::make_counting_iterator(0);
std::transform(counter, counter + ranges.size(), ranges.begin(), [&literals](auto idx) {
  return reclass_range{literals[idx * 2], literals[idx * 2 + 1]};
});
// sort the ranges to help with detecting overlapping entries
std::sort(ranges.begin(), ranges.end(), [](auto l, auto r) {
  return l.first == r.first ? l.last < r.last : l.first < r.first;
});
// combine overlapping entries: [a-f][c-g] => [a-g]
if (ranges.size() > 1) {
  for (auto itr = ranges.begin() + 1; itr < ranges.end(); ++itr) {
    auto const prev = *(itr - 1);
    if (itr->first <= prev.last + 1) {
      // if these 2 ranges intersect, expand the current one
      *itr = reclass_range{prev.first, std::max(prev.last, itr->last)};

This is impressive. So much more readable.
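
The excerpt above is cut off by the review view, so the following is only a sketch of the same sort-then-merge idea, not the PR's actual code. It swaps the in-place update for building a new output vector, and both reclass_range (with assumed member types) and merge_ranges are hypothetical names used for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical stand-in for the PR's reclass_range; member types are assumed.
struct reclass_range {
  char32_t first;
  char32_t last;
};

// Sort the ranges, then fold each one into the previous output entry when the
// two overlap or are adjacent, e.g. [a-f][c-g] => [a-g] and [a-g][h-i] => [a-i].
// Assumes no range ends at the maximum char32_t value (last + 1 would wrap).
std::vector<reclass_range> merge_ranges(std::vector<reclass_range> ranges)
{
  std::sort(ranges.begin(), ranges.end(),
            [](reclass_range const& l, reclass_range const& r) {
              return l.first == r.first ? l.last < r.last : l.first < r.first;
            });
  std::vector<reclass_range> merged;
  for (reclass_range const& current : ranges) {
    if (!merged.empty() && current.first <= merged.back().last + 1) {
      // overlapping or adjacent: extend the previous entry
      merged.back().last = std::max(merged.back().last, current.last);
    } else {
      merged.push_back(current);
    }
  }
  return merged;
}
```

Sorting by first (then last) guarantees that any range able to merge with an earlier one appears directly after it, so a single pass is enough.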

@@ -104,14 +109,14 @@ std::unique_ptr<reprog_device, std::function<void(reprog_device*)>> reprog_devic
   auto d_end = d_ptr + (classes_count * sizeof(reclass_device));
   // place each class and append the variable length data
   for (int32_t idx = 0; idx < classes_count; ++idx) {
-    reclass& h_class = h_prog.class_at(idx);
+    auto h_class = h_prog.class_at(idx);

To avoid copying the literals (the previous code bound a reference and so did not copy), should this not be auto const& h_class = ...;? Or is the copy affordable?

Contributor Author

You are correct. Good catch.
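
The point raised in this thread can be shown with a small sketch. It assumes, as the removed line reclass& h_class = h_prog.class_at(idx); implies, that class_at returns a reference. The reclass and program types below are simplified, hypothetical stand-ins rather than the libcudf classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Simplified, hypothetical stand-ins for the host-side classes.
struct reclass {
  std::vector<char32_t> literals;
};

struct program {
  std::vector<reclass> classes;
  // Assumed to return a reference, as the removed `reclass&` binding implies.
  reclass const& class_at(std::size_t idx) const { return classes[idx]; }
};

// Plain `auto` drops the reference, so this copy-constructs the reclass
// and allocates a new buffer for its literals vector.
bool binds_by_copy(program const& prog)
{
  auto h_class = prog.class_at(0);
  return h_class.literals.data() != prog.classes[0].literals.data();
}

// `auto const&` binds to the existing object; nothing is copied.
bool binds_by_reference(program const& prog)
{
  auto const& h_class = prog.class_at(0);
  return &h_class == &prog.classes[0];
}
```

Inside a loop over every class, the by-value form would repeat that allocation and copy once per iteration, which is the cost the reviewer was asking about.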


@mythrocks mythrocks left a comment

Excellent! I really appreciate how the right abstraction makes the code so much more readable.

@davidwendt

@gpucibot merge

@rapids-bot rapids-bot bot merged commit e0003a0 into rapidsai:branch-22.08 Jun 27, 2022
@davidwendt davidwendt deleted the regex-cleanup-reclass branch June 27, 2022 12:37
Labels
3 - Ready for Review, improvement, libcudf, non-breaking, strings
3 participants