Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add check for regex instructions causing an infinite-loop #10095

Merged
merged 6 commits into from
Jan 27, 2022

Conversation

davidwendt
Copy link
Contributor

@davidwendt davidwendt commented Jan 20, 2022

Closes #10006

Fixes a use case where the regex pattern creates a set of instructions that can cause the regex evaluation process to go into an infinite loop. For example, the pattern (x?)+ creates the following instructions:

Instructions:
  0:    CHAR c='x', next=2
  1:      OR right=0, left=2, next=2
  2:    RBRA id=1, next=4
  3:    LBRA id=1, next=1
  4:      OR right=3, left=5, next=5
  5:     END
startinst_id=3
startinst_ids: [ 3 -1]

This causes in an infinite loop at instruction 4 where the path may go like: 4->3->1->2->4 ... forever.
Supporting this pattern does not look possible. The + quantifier is applied to capture group symbol ) inside of which x? means 0 or more repeating the character x. This means it could match x or nothing and so applying the + to nothing would be invalid. That said, the pattern x?+ currently already throws an error because of the invalid usage of + quantifier.

Therefore, the fix here adds a checking step after the instruction set is created to check for a possible infinite-loop case. If one is detected, an exception is thrown indicating the pattern is not supported.

@davidwendt davidwendt added bug Something isn't working 3 - Ready for Review Ready for review by team libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) non-breaking Non-breaking change labels Jan 20, 2022
@davidwendt davidwendt self-assigned this Jan 20, 2022
@davidwendt davidwendt requested a review from a team as a code owner January 20, 2022 20:18
@davidwendt davidwendt requested review from vyasr and devavret January 20, 2022 20:18
@codecov
Copy link

codecov bot commented Jan 24, 2022

Codecov Report

Merging #10095 (9fc3469) into branch-22.04 (e24fa8f) will increase coverage by 0.00%.
The diff coverage is 2.95%.

❗ Current head 9fc3469 differs from pull request most recent head 36bf19f. Consider uploading reports for the commit 36bf19f to get more accurate results
Impacted file tree graph

@@              Coverage Diff              @@
##           branch-22.04   #10095   +/-   ##
=============================================
  Coverage         10.37%   10.38%           
=============================================
  Files               119      119           
  Lines             20149    20220   +71     
=============================================
+ Hits               2091     2099    +8     
- Misses            18058    18121   +63     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/io.py 0.00% <0.00%> (ø)
python/cudf/cudf/_fuzz_testing/orc.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/parquet.py 0.00% <ø> (ø)
python/cudf/cudf/_fuzz_testing/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/__init__.py 0.00% <ø> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/decimal.py 0.00% <0.00%> (ø)
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update fd968f3...36bf19f. Read the comment docs.

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A couple questions (only one really important one) otherwise this looks good.

cpp/tests/strings/contains_tests.cpp Show resolved Hide resolved
cpp/src/strings/regex/regcomp.cpp Show resolved Hide resolved
@vyasr vyasr self-requested a review January 26, 2022 19:16
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@davidwendt
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit b2d5874 into rapidsai:branch-22.04 Jan 27, 2022
@davidwendt davidwendt deleted the bug-contains-re-hang branch January 27, 2022 23:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] contains_re hangs with some regexp patterns
3 participants