Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix bug in replace_with_backrefs when group has greedy quantifier #8575

Merged
merged 4 commits into from
Jun 25, 2021

Conversation

davidwendt
Copy link
Contributor

Closes #8569

This essentially undoes the performance improvement made in #8460 since the logic mishandles a greedy quantifier pattern when it occurs inside an extract group. The internal regex logic is only able to track a single extract group when such a quantifier is specified.

This PR does improve the interface for the internal extract call and adds some gtests for this issue.

@davidwendt davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change labels Jun 21, 2021
@davidwendt davidwendt self-assigned this Jun 21, 2021
@davidwendt davidwendt requested a review from a team as a code owner June 21, 2021 19:25
@codecov
Copy link

codecov bot commented Jun 21, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@0c64d0d). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8575   +/-   ##
===============================================
  Coverage                ?   83.01%           
===============================================
  Files                   ?      109           
  Lines                   ?    18225           
  Branches                ?        0           
===============================================
  Hits                    ?    15129           
  Misses                  ?     3096           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 0c64d0d...47298ec. Read the comment docs.

cpp/src/strings/replace/backref_re.cuh Outdated Show resolved Hide resolved
cpp/tests/strings/replace_regex_tests.cpp Show resolved Hide resolved
@davidwendt davidwendt added the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jun 24, 2021
@davidwendt davidwendt removed the 5 - DO NOT MERGE Hold off on merging; see PR for details label Jun 24, 2021
@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5adedee into rapidsai:branch-21.08 Jun 25, 2021
@davidwendt davidwendt deleted the bug-replace-backrefs branch June 25, 2021 22:24
rapids-bot bot pushed a commit that referenced this pull request Oct 13, 2021
)

This is a less ambitious version of #8460 which had to be reverted in #8575 because it did not work with greedy quantifiers. The change here involves calling the underlying `reprog_device::extract` to retrieve each group result within a single kernel rather than launching a kernel for each group. The output is placed contiguously in a 2d span (wrapped uvector) and a permutation iterator is used to build the output columns (one column per group).

Like it's predecessor, the performance improvement is mostly when specifying more than 1 group in the regex pattern. The benchmark results showed no change for single groups but was 2x faster for multiple groups over long (8K) strings and up to 4x faster for multiple groups over many (16M) strings.

The benchmark test for extract was also updated to better report the number of groups being used when measuring results.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #9358
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] cudf::strings::replace_with_backrefs produces incorrect result
3 participants