Modify reprog_device::extract to return groups in a single pass #8460

davidwendt · 2021-06-08T17:41:29Z

This PR modifies the internal regex reprog_device::extract function to return all matching groups in a single call. Previously, retrieving each group range required a separate call to this extract function, which re-matched the entire given pattern for each group. The code logic would identify every group but return only the range for the specified group.

The code change here passes a pre-allocated global memory array to extract, which records each group range in a single pass. The extract is an all-or-nothing process: a find function must first be executed to retrieve the bounds of the given pattern, so if any of the groups are missing or do not match, no groups are returned for that row. Retrieving the last group always required processing the previous groups anyway, and the code logic now records those positions in the global memory array as it goes. The memory array can then be used directly to build the output columns.
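The single-pass, all-or-nothing capture described above can be sketched on the host as follows. This is only an illustration under stated assumptions: it uses `std::regex` in place of the device-side regex engine, and `range`, `extract_all_groups`, the row-major layout, and the `{-1, -1}` sentinel are hypothetical choices made for the example, not libcudf code.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <utility>
#include <vector>

// Hypothetical (begin, end) character range; {-1, -1} marks "nothing returned".
using range = std::pair<int, int>;

// Fills a caller-allocated, row-major buffer of rows x groups ranges,
// matching each row only once. Returns one flag per row telling whether
// the row produced groups at all.
std::vector<bool> extract_all_groups(std::vector<std::string> const& rows,
                                     std::regex const& pattern,
                                     std::vector<range>& out)
{
  std::size_t const groups = pattern.mark_count();
  assert(out.size() == rows.size() * groups);  // buffer must be pre-allocated

  std::vector<bool> matched(rows.size(), false);
  for (std::size_t r = 0; r < rows.size(); ++r) {
    std::smatch m;
    // Step 1: find the overall pattern match for this row.
    bool ok = std::regex_search(rows[r], m, pattern);
    // Step 2: all-or-nothing, every group must have participated in the match.
    for (std::size_t g = 1; ok && g <= groups; ++g) { ok = m[g].matched; }
    // Step 3: record every group range (or the sentinel) from the same match.
    for (std::size_t g = 0; g < groups; ++g) {
      if (ok) {
        int const begin = static_cast<int>(m.position(g + 1));
        int const end   = begin + static_cast<int>(m.length(g + 1));
        out[r * groups + g] = range(begin, end);
      } else {
        out[r * groups + g] = range(-1, -1);
      }
    }
    matched[r] = ok;
  }
  return matched;
}
```

The caller sizes the buffer once as `rows.size() * pattern.mark_count()` and reads group `g` of row `r` at index `r * groups + g`, so no row is matched more than once regardless of the number of groups.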

This simplifies the code around extract and also improves performance, especially for long strings or patterns with many groups. For small strings and a small number of groups, the gbenchmark showed performance equivalent to the previous implementation. For larger strings and more groups, the gbenchmark showed a 2-3x improvement.

codecov · 2021-06-14T18:16:48Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@7c8d847).
The diff coverage is n/a.

❗ Current head f2221eb differs from pull request most recent head 085b110. Consider uploading reports for the commit 085b110 to get more accurate results

@@               Coverage Diff               @@
##             branch-21.08    #8460   +/-   ##
===============================================
  Coverage                ?   82.95%           
===============================================
  Files                   ?      109           
  Lines                   ?    18226           
  Branches                ?        0           
===============================================
  Hits                    ?    15120           
  Misses                  ?     3106           
  Partials                ?        0

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7c8d847...085b110.

vuule

Looks good, got a few mostly stylistic suggestions.
Didn't dig too deep into the algorithmic aspect of the changes, let me know if it's needed.

cpp/src/strings/regex/regex.cuh

cpp/src/strings/replace/backref_re.cuh

cpp/src/strings/extract.cu

robertmaynard

CMake changes LGTM

vuule · 2021-06-18T18:58:33Z

@gpucibot merge

Closes #8569

This essentially undoes the performance improvement made in #8460 since the logic mishandles a greedy quantifier pattern when it occurs inside an extract group. The internal regex logic is only able to track a single extract group when such a quantifier is specified. This PR does improve the interface for the internal extract call and adds some gtests for this issue.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
- MithunR (https://github.com/mythrocks)

URL: #8575
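To make the failing case concrete, here is a small host-side reference for what a pattern with greedy quantifiers inside its groups is expected to return. It is a sketch only: `reference_groups` is a hypothetical helper built on `std::regex`, and it does not reproduce the libcudf bug or the gtests added in #8575; it just shows the expected group contents for this kind of pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <regex>
#include <string>
#include <vector>

// Hypothetical reference helper: returns the text of every capture group
// from the first match of `pattern` in `input`, or an empty vector if the
// pattern does not match.
std::vector<std::string> reference_groups(std::string const& input,
                                          std::string const& pattern)
{
  std::regex const re(pattern);
  std::smatch m;
  std::vector<std::string> groups;
  if (!std::regex_search(input, m, re)) { return groups; }
  // m[0] is the whole match; m[1..] are the capture groups.
  assert(m.size() == re.mark_count() + 1);
  for (std::size_t g = 1; g < m.size(); ++g) { groups.push_back(m[g].str()); }
  return groups;
}
```

With the pattern `(a+)(b+)`, each group holds a greedy `+` quantifier, so both group ranges depend on how far the quantifier consumed the input. That is the situation the commit message says the single-pass logic could not track for more than one group.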

This is a less ambitious version of #8460 which had to be reverted in #8575 because it did not work with greedy quantifiers. The change here involves calling the underlying `reprog_device::extract` to retrieve each group result within a single kernel rather than launching a kernel for each group. The output is placed contiguously in a 2d span (wrapped uvector) and a permutation iterator is used to build the output columns (one column per group). Like its predecessor, the performance improvement is mostly seen when specifying more than 1 group in the regex pattern. The benchmark results showed no change for single groups but were 2x faster for multiple groups over long (8K) strings and up to 4x faster for multiple groups over many (16M) strings. The benchmark test for extract was also updated to better report the number of groups being used when measuring results.

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Mark Harris (https://github.com/harrism)
- Nghia Truong (https://github.com/ttnghia)

URL: #9358
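The idea of writing all group results contiguously and then building one column per group can be sketched on the host like this. It is a minimal illustration, assuming a row-major rows x groups layout (the commit message does not spell out the layout); `gather_group_column` is a hypothetical helper, and the strided index `r * groups + group` plays the role that the permutation iterator plays in the device code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: gathers the values of one group across all rows from
// a flat, row-major buffer holding `groups` entries per row.
std::vector<std::string> gather_group_column(std::vector<std::string> const& flat,
                                             std::size_t groups,
                                             std::size_t group)
{
  assert(groups > 0 && group < groups && flat.size() % groups == 0);
  std::size_t const rows = flat.size() / groups;

  std::vector<std::string> column;
  column.reserve(rows);
  for (std::size_t r = 0; r < rows; ++r) {
    // Strided access: element `group` of row `r` in the flat buffer.
    column.push_back(flat[r * groups + group]);
  }
  return column;
}
```

Because every group of every row is already in the flat buffer after one pass, building each output column is a pure gather and needs no further regex work.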

davidwendt added 3 commits June 8, 2021 10:10

Modify strings::extract to return groups in a single pass

8538236

fix merge conflicts

ff848b4

recombine backref_re.cu

15c2592

davidwendt added the 2 - In Progress, libcudf, strings, improvement, and non-breaking labels Jun 8, 2021

davidwendt self-assigned this Jun 8, 2021

github-actions bot added the CMake label Jun 8, 2021

davidwendt added 3 commits June 8, 2021 13:47

resurrect removed comment

54e5534

fix backref extract to use group index

be6ec61

Merge branch 'branch-21.08' into extract-all-groups

ae9e3bf

Merge branch 'branch-21.08' into extract-all-groups

f7896a7

davidwendt changed the title ~~Modify strings::extract to return groups in a single pass~~ Modify reprog_device::extract to return groups in a single pass Jun 15, 2021

replace children-pair with struct bindings

00c152a

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label Jun 15, 2021

davidwendt marked this pull request as ready for review June 15, 2021 15:35

davidwendt requested review from a team as code owners June 15, 2021 15:35

davidwendt requested review from cwharris and vuule June 15, 2021 15:35

davidwendt added 2 commits June 15, 2021 18:04

remove unneeded header include

632fe23

Merge branch 'branch-21.08' into extract-all-groups

930bbe4

vuule requested changes Jun 15, 2021

cpp/src/strings/regex/regex.cuh (outdated, resolved)

cpp/src/strings/replace/backref_re.cuh (outdated, resolved)

cpp/src/strings/extract.cu (outdated, resolved)

use device_span and device_2dspan instead for extract interface

085b110

davidwendt requested a review from vuule June 16, 2021 15:40

vuule approved these changes Jun 16, 2021

robertmaynard approved these changes Jun 17, 2021

cwharris approved these changes Jun 18, 2021

rapids-bot bot merged commit d183d50 into rapidsai:branch-21.08 Jun 18, 2021

davidwendt deleted the extract-all-groups branch June 18, 2021 21:29

This was referenced Jun 21, 2021

[BUG] nightly test failed with lists: testStringReplaceWithBackrefs NVIDIA/spark-rapids#2750

Closed

[BUG] cudf::strings::replace_with_backrefs produces incorrect result #8569

Closed

davidwendt mentioned this pull request Jun 21, 2021

Fix bug in replace_with_backrefs when group has greedy quantifier #8575

Merged

davidwendt mentioned this pull request Oct 4, 2021

Use single kernel to extract all groups in cudf::strings::extract #9358

Merged
