Add strings 'like' function #11558

Merged: 36 commits, Aug 26, 2022

davidwendt
Contributor

@davidwendt davidwendt commented Aug 17, 2022

Description

Adds a new strings like function to cudf. This is a wildcard-based string matching function modeled on SQL's LIKE operator.
https://www.sqltutorial.org/sql-like/
Though some SQL implementations provide regex-like capabilities in the LIKE pattern, the implementation here is strictly limited to the % (multi-character placeholder) and _ (single-character placeholder) wildcards. It also accepts an optional escape character for matching strings that contain a literal % or _.
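
To make the matching rules above concrete, here is a minimal pure-Python sketch of LIKE-style matching. It is not the libcudf implementation; the function name `like`, the `escape` parameter name, and the handling of a trailing escape character (treated as a literal) are assumptions made for illustration only.

```python
def like(text: str, pattern: str, escape: str = "") -> bool:
    """Return True if all of `text` matches the LIKE-style `pattern`.

    Hypothetical reference sketch: '%' matches any run of characters
    (including none), '_' matches exactly one character, and `escape`
    (if given) makes the following pattern character a literal.
    """
    # Tokenize so that escaped wildcards become plain literals.
    tokens = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape and ch == escape and i + 1 < len(pattern):
            tokens.append(("lit", pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            tokens.append(("any", ch))
        elif ch == "_":
            tokens.append(("one", ch))
        else:
            tokens.append(("lit", ch))
        i += 1

    # Match left to right, backtracking to the most recent '%' on failure.
    t = p = 0
    last_any_p = last_any_t = -1
    while t < len(text):
        if p < len(tokens) and tokens[p][0] == "any":
            last_any_p, last_any_t = p, t
            p += 1
        elif p < len(tokens) and (
            tokens[p][0] == "one" or tokens[p] == ("lit", text[t])
        ):
            t += 1
            p += 1
        elif last_any_p >= 0:
            # Let the last '%' absorb one more character and retry.
            last_any_t += 1
            t = last_any_t
            p = last_any_p + 1
        else:
            return False

    # Any tokens left over must all be '%' for the match to succeed.
    while p < len(tokens) and tokens[p][0] == "any":
        p += 1
    return p == len(tokens)


names = ["David", "Daniel", "Darcy"]
print([like(n, "_a_i%") for n in names])
print(like("50%", "50\\%", escape="\\"), like("500", "50\\%", escape="\\"))
```

The escape handling is the part most likely to differ between implementations, so treat that branch as a placeholder for whatever the real API specifies.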

This is an easier (and faster) alternative to using the regex-based contains function.
Example usage:

import cudf

s = cudf.Series(["David", "Daniel", "Darcy"])
s.str.like('Da%')    # [True, True, True]: starts with 'Da'
s.str.like('_a_i%')  # [True, True, False]: 2nd character is 'a' and 4th character is 'i'
s.str.like('_____')  # [True, False, True]: matches any 5 characters
s.str.like('%y')     # [False, False, True]: ends with 'y'

This PR includes gtests, pytests, and an nvbench benchmark.

Reference #10797

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@davidwendt davidwendt added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Aug 17, 2022
@davidwendt davidwendt self-assigned this Aug 17, 2022
@github-actions github-actions bot added CMake CMake build issue Python Affects Python cuDF API. labels Aug 17, 2022
@davidwendt
Contributor Author

This was benchmarked against cudf::strings::contains_re with equivalent patterns over the same dataset. The benchmark was set up so that both functions had to read the entire matching string to resolve to true. This provided the fairest comparison between implementations. The results are pictured here for 4K to 16M rows with hit rates (matches) of 1-100%.

[Image: like-benchmark]

The speedup factor ranged from 2x to ~12x.
These were compared by temporarily implementing contains_re with nvbench. A follow-on PR will migrate and consolidate the contains gbenchmarks with the new like nvbench benchmark.
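
As a rough illustration of what "equivalent patterns" means in the comparison above, the following sketch translates a LIKE pattern into a regular expression using Python's standard `re` module. The helper name `like_to_regex` is hypothetical, and the actual patterns used in the benchmark are not shown in this PR. Because a contains-style regex search matches anywhere in the string, an equivalent regex for such an API would additionally need to be anchored at both ends; here `re.fullmatch` plays that role.

```python
import re


def like_to_regex(pattern: str, escape: str = "") -> str:
    """Translate a LIKE-style pattern into an unanchored regex body.

    Hypothetical helper: '%' becomes '.*', '_' becomes '.', and every
    other character (including escaped wildcards) is matched literally.
    """
    parts = []
    i = 0
    while i < len(pattern):
        ch = pattern[i]
        if escape and ch == escape and i + 1 < len(pattern):
            # Escaped character: always a literal.
            parts.append(re.escape(pattern[i + 1]))
            i += 2
            continue
        if ch == "%":
            parts.append(".*")
        elif ch == "_":
            parts.append(".")
        else:
            parts.append(re.escape(ch))
        i += 1
    return "".join(parts)


def regex_like(text: str, pattern: str, escape: str = "") -> bool:
    # DOTALL lets '.' also match newlines, so wildcards cover any character.
    return re.fullmatch(like_to_regex(pattern, escape), text, re.DOTALL) is not None


names = ["David", "Daniel", "Darcy"]
for pat in ["Da%", "_a_i%", "_____", "%y"]:
    print(pat, like_to_regex(pat), [regex_like(n, pat) for n in names])
```

The translation also hints at why a dedicated wildcard matcher can be cheaper: it needs no regex compilation and only two wildcard forms.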

@codecov

codecov bot commented Aug 17, 2022

Codecov Report

❗ No coverage uploaded for pull request base (branch-22.10@096bbc4).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-22.10   #11558   +/-   ##
===============================================
  Coverage                ?   86.41%           
===============================================
  Files                   ?      145           
  Lines                   ?    22992           
  Branches                ?        0           
===============================================
  Hits                    ?    19869           
  Misses                  ?     3123           
  Partials                ?        0           

Contributor

@bdice bdice left a comment

Looks great. I have a couple small issues with naming, otherwise approved.

Contributor

@bdice bdice left a comment

Thanks @davidwendt!

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Aug 25, 2022
@bdice
Contributor

bdice commented Aug 26, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit ccd72f2 into rapidsai:branch-22.10 Aug 26, 2022
@davidwendt davidwendt deleted the fea-str-like branch August 26, 2022 12:30
rapids-bot bot pushed a commit that referenced this pull request Nov 3, 2022
[#11558](#11558) added strings `like` function to cudf, which is a wildcard-based string matching function based on SQL's LIKE statement.

This adds a `like` JNI binding and native method that call the `like` function from #11558, along with corresponding Java unit tests. This is part of the solution for issue [NVIDIA/spark-rapids#6430](NVIDIA/spark-rapids#6430).

Authors:
  - Yuan Jiang (https://github.com/cindyyuanjiang)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Gera Shegalov (https://github.com/gerashegalov)
  - Jason Lowe (https://github.com/jlowe)

URL: #12032
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API. strings strings issues (C++ and Python)
5 participants