Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Modify reprog_device::extract to return groups in a single pass #8460

Merged
merged 11 commits into from
Jun 18, 2021

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Jun 8, 2021

This PR modifies the internal regex reprog_device::extract function to return all matching groups in a single call. Previously, retrieving each group range required individual calls to this extract function resulted in re-matching the entire given pattern for each group. The code logic would identify each group but only return the range for the specified group.

The code change here passes a pre-allocated global memory array to capture each group range in a single pass. The extract is an all-or-nothing process. In fact, a find function must first be executed to retrieve the bounds of the given pattern. So if any of the groups are missing or do not match, no groups are returned for that row. Retrieving the last group would always require processing the previous groups and the code logic now records those positions in the global memory array. The memory array can then be used directly to build the output columns.

This simplifies the code around extract and also improves performance especially for long strings or patterns with many groups. For small strings and a small number of groups, the gbenchmark showed equivalent performance to the previous implementation. For larger strings and more groups, the gbenchmark showed a 2-3x improvement.

@davidwendt davidwendt added 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Jun 8, 2021
@davidwendt davidwendt self-assigned this Jun 8, 2021
@github-actions github-actions bot added the CMake CMake build issue label Jun 8, 2021
@codecov
Copy link

codecov bot commented Jun 14, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@7c8d847). Click here to learn what that means.
The diff coverage is n/a.

❗ Current head f2221eb differs from pull request most recent head 085b110. Consider uploading reports for the commit 085b110 to get more accurate results
Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8460   +/-   ##
===============================================
  Coverage                ?   82.95%           
===============================================
  Files                   ?      109           
  Lines                   ?    18226           
  Branches                ?        0           
===============================================
  Hits                    ?    15120           
  Misses                  ?     3106           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7c8d847...085b110. Read the comment docs.

@davidwendt davidwendt changed the title Modify strings::extract to return groups in a single pass Modify reprog_device::extract to return groups in a single pass Jun 15, 2021
@davidwendt davidwendt added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Jun 15, 2021
@davidwendt davidwendt marked this pull request as ready for review June 15, 2021 15:35
@davidwendt davidwendt requested review from a team as code owners June 15, 2021 15:35
@davidwendt davidwendt requested review from cwharris and vuule June 15, 2021 15:35
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, got a few mostly stylistic suggestions.
Didn't dig too deep into the algorithmic aspect of the changes, let me know if it's needed.

cpp/src/strings/regex/regex.cuh Outdated Show resolved Hide resolved
cpp/src/strings/replace/backref_re.cuh Outdated Show resolved Hide resolved
cpp/src/strings/extract.cu Outdated Show resolved Hide resolved
@davidwendt davidwendt requested a review from vuule June 16, 2021 15:40
Copy link
Contributor

@robertmaynard robertmaynard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CMake changes LGTM

@vuule
Copy link
Contributor

vuule commented Jun 18, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit d183d50 into rapidsai:branch-21.08 Jun 18, 2021
@davidwendt davidwendt deleted the extract-all-groups branch June 18, 2021 21:29
rapids-bot bot pushed a commit that referenced this pull request Jun 25, 2021
)

Closes #8569 

This essentially undoes the performance improvement made in #8460 since the logic mishandles a greedy quantifier pattern when it occurs inside an extract group. The internal regex logic is only able to track a single extract group when such a quantifier is specified. 

This PR does improve the interface for the internal extract call and adds some gtests for this issue.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - MithunR (https://github.com/mythrocks)

URL: #8575
rapids-bot bot pushed a commit that referenced this pull request Oct 13, 2021
)

This is a less ambitious version of #8460 which had to be reverted in #8575 because it did not work with greedy quantifiers. The change here involves calling the underlying `reprog_device::extract` to retrieve each group result within a single kernel rather than launching a kernel for each group. The output is placed contiguously in a 2d span (wrapped uvector) and a permutation iterator is used to build the output columns (one column per group).

Like it's predecessor, the performance improvement is mostly when specifying more than 1 group in the regex pattern. The benchmark results showed no change for single groups but was 2x faster for multiple groups over long (8K) strings and up to 4x faster for multiple groups over many (16M) strings.

The benchmark test for extract was also updated to better report the number of groups being used when measuring results.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Mark Harris (https://github.com/harrism)
  - Nghia Truong (https://github.com/ttnghia)

URL: #9358
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team CMake CMake build issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants