Cleanup regex compiler fixed quantifiers source #10843

davidwendt · 2022-05-12T21:04:52Z

Cleans up the source for handling fixed quantifiers {n,m}, which repeat a pattern over a range of counts rather than just zero, one, or infinitely many times. This should make this part of the regex parser/compiler easier to follow and maintain. There are many other items to clean up (reference #3582); this change concentrates mainly on the fixed quantifier handling.

No function or behavior has changed, but new gtests have been added to cover quantifier combinations that were not previously tested.

codecov · 2022-05-12T23:09:52Z

Codecov Report

Merging #10843 (168464a) into branch-22.06 (54789ee) will increase coverage by 0.02%.
The diff coverage is 93.75%.

❗ Current head 168464a differs from pull request most recent head 043f195. Consider uploading reports for the commit 043f195 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10843      +/-   ##
================================================
+ Coverage         86.30%   86.32%   +0.02%     
================================================
  Files               144      144              
  Lines             22665    22668       +3     
================================================
+ Hits              19560    19569       +9     
+ Misses             3105     3099       -6

Impacted Files	Coverage Δ
python/cudf/cudf/core/indexed_frame.py	`91.70% <ø> (ø)`
python/cudf/cudf/utils/ioutils.py	`79.47% <87.50%> (-0.13%)`	⬇️
python/cudf/cudf/io/avro.py	`78.57% <100.00%> (ø)`
python/cudf/cudf/io/csv.py	`91.80% <100.00%> (ø)`
python/cudf/cudf/io/json.py	`97.56% <100.00%> (ø)`
python/cudf/cudf/io/orc.py	`92.77% <100.00%> (ø)`
python/cudf/cudf/io/parquet.py	`90.83% <100.00%> (ø)`
python/cudf/cudf/io/text.py	`100.00% <100.00%> (ø)`
python/cudf/cudf/core/dataframe.py	`93.78% <0.00%> (+0.04%)`	⬆️
python/cudf/cudf/core/column/string.py	`88.78% <0.00%> (+0.12%)`	⬆️
... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6acf226...043f195.

cpp/src/strings/regex/regcomp.cpp

ttnghia

Approve with some comments.

vyasr

Whoops sorry I completed this review but I guess I forgot to actually submit it.

cpp/doxygen/regex.md

vyasr · 2022-05-18T17:03:53Z

cpp/src/strings/regex/regcomp.cpp

+        // get left-side (n) value => min_count
+        exprp += transform_until(exprp, exprp + max_read, buffer.data(), "},");
+        auto count = std::atoi(buffer.data());
+        if ((*exprp != '}' && *exprp != ',') || (count > max_value)) {


Since you're only reading at most 3 characters it is impossible to find count > max_value, right?

That is kind of a weak link. If the max_value changes to a smaller 3-digit value, the check would need to be re-added. This way, this line should never need to change.

But would there ever be a reason to use a number that isn't the largest number representable by max_read digits? My understanding of the code was that the limitation was solely in place to control the width of the buffer reads.

Yes, I'd like to consider limiting max_value to something like 255 in the future which would not change max_read but require the count check.

That's fine with me. Just for my edification, what is the benefit of that choice? It doesn't have any performance implications unless someone actually requests a number that large at runtime, right?

I am asking whether there is a difference between 255 and 999 if a user never actually requests a number > 255. Based on

The number does contribute to the size of the working memory so may affect runtime performance.

it sounds like the answer is yes? You allocate memory based on the maximum number of repetitions somewhere? In that case, I assume that the amount of memory increases stepwise as the maximum crosses powers of two?

Nothing that sophisticated. Just maybe store the value in a smaller variable (e.g. uint8).

If it doesn’t affect runtime performance, why arbitrarily limit at 999 and not the max value for the size of the repetitions variable? There’s some awkwardness in explaining this arbitrary limit. If it were INT_MAX, I think the docs might not even need to mention it. After all, string columns also have a length limit, right? It might be impossible to reach a repeat limit of INT_MAX.

It is just an arbitrary string of digits entered by a human that is being converted to an integer, so some error checking is needed since the string of decimal digits could be any length. Is there some reason not to limit it?

Discussed offline with @davidwendt. I don't think it's worth blocking on this point so I'm fine with accepting a limit of 999. We can change it later if users need more.
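
The point debated in this thread can be sketched in Python. This is an illustration of the logic only, not the libcudf C++ code: the names `MAX_READ` and `read_count` are hypothetical, and rejecting an empty count is an assumption of the sketch. Reading at most three digits already caps the count at 999, so the explicit range check is redundant only while the limit stays at 999. Lowering the limit to 255 makes the check do real work without changing the read width.

```python
MAX_READ = 3  # assumed: maximum number of digits read for a repeat count


def read_count(text: str, pos: int, max_value: int = 999) -> tuple:
    """Read a decimal repeat count starting at text[pos].

    Returns (count, new_pos). Raises ValueError if the count has no digits,
    is not followed by ',' or '}', or exceeds max_value.
    """
    end = pos
    # Consume at most MAX_READ digits.
    while end < len(text) and end - pos < MAX_READ and text[end].isdigit():
        end += 1
    if end == pos:
        raise ValueError("missing repeat count")
    # A fourth digit leaves the cursor on a digit, not on ',' or '}'.
    if end >= len(text) or text[end] not in ",}":
        raise ValueError("invalid repeat count")
    count = int(text[pos:end])
    # Redundant while max_value is 999, required for any smaller limit.
    if count > max_value:
        raise ValueError(f"repeat count {count} exceeds {max_value}")
    return count, end
```

With the default limit, `read_count("{1000}", 1)` fails on the delimiter test before the range check is reached. With `max_value=255`, `read_count("{300}", 1, max_value=255)` passes the delimiter test and is rejected only by the range check.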

cpp/src/strings/regex/regcomp.cpp

cpp/tests/strings/contains_tests.cpp

cpp/src/strings/regex/regcomp.cpp

cpp/tests/strings/contains_tests.cpp

cpp/src/strings/regex/regcomp.cpp

bdice

@davidwendt I apologize sincerely -- I just noticed that I also started a PR review that I did not complete and submit. Here are a small number of comments from that partial review.

It looks like the review from @vyasr is much more complete, so I will excuse myself from additional review unless you need another look at anything.

cpp/doxygen/regex.md

bdice · 2022-05-18T13:09:14Z

cpp/doxygen/regex.md

+| Greedy quantifier | `{n,m}` where `n` and `m` are integers: `0 ≤ n ≤ 999` and `n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Greedy, so repeating `m` times is tried before reducing the repetition to `n` times. | `a{2,4}` matches `aaaa`, `aaa` or `aa` |
+| Greedy quantifier | `{n,}` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item at least `n` times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only `n` times. | `a{2,}` matches `aaaaa` in `aaaaa` |
+| Lazy quantifier | `{n,m}?` where `n` and `m` are integers `0 ≤ n ≤ 999` and `n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Lazy, so repeating `n` times is tried before increasing the repetition to `m` times. | `a{2,4}?` matches `aa`, `aaa` or `aaaa` |
+| Lazy quantifier | `{n,}?` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item `n` or more times. Lazy, so the engine first matches the previous item `n` times, before trying permutations with ever increasing matches of the preceding item. | `a{2,}?` matches `aa` in `aaaaa` |


General regex question: if this is lazy, how does its behavior differ from matching exactly n repetitions? What would force it to match more repetitions?

Honestly, I don't know. I think it depends on the previous character pattern. Here is an example with '.' as the repeat item:

>>> re.search('.{2,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{3,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{2,}?b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3,}?b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>

Maybe there are better examples.

I think Bradley is asking how {n} is different from {n,}?, not how {n,} is different from {n,}?. Here are the extra two cases that need to be added to your examples:

>>> re.search('.{2}b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3}b', 'aabcdefb')
<re.Match object; span=(4, 8), match='defb'>

The differences have to do with backtracking behavior and whether matching the entire regex requires that the lazy quantifier accept more characters. For example:

>>> re.search('a+b{2}a+', 'aaaabbbaaa')
>>> re.search('a+b{2,}?a+', 'aaaabbbaaa')
<re.Match object; span=(0, 10), match='aaaabbbaaa'>

In this case, an exact requirement of b{2} won't match, because there are three. But the lazy quantifier says "OK, in that case I'll take some extra b characters and see if I can get it to match".
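
The behaviors discussed above can be put side by side with Python's `re` module, which the examples in this thread already use. Whether every regex engine, including libcudf's, agrees in all cases is not verified here, and the helper name `matched` is hypothetical.

```python
import re


def matched(pattern: str, text: str):
    """Return the matched substring, or None if the pattern does not match."""
    m = re.search(pattern, text)
    return m.group(0) if m else None


text = "aaaabbbaaa"

# Exact count: exactly two b's must sit between the runs of a's, so no match.
exact = matched(r"a+b{2}a+", text)

# Lazy open-ended count: starts with two b's, then takes a third because the
# rest of the pattern cannot match otherwise.
lazy = matched(r"a+b{2,}?a+", text)

# With nothing after the quantifier, the lazy form stops at the minimum count.
lazy_alone = matched(r"b{2,}?", text)

# The greedy form takes as many b's as it can.
greedy_alone = matched(r"b{2,}", text)

print(exact, lazy, lazy_alone, greedy_alone)
```

This prints `None aaaabbbaaa bb bbb`. The lazy form behaves like `{n}` only when the remainder of the pattern can match immediately after `n` repetitions.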

cpp/src/strings/regex/regcomp.cpp

vyasr

Once we resolve the question of the curly braces with >999 this gets a green light from me.

davidwendt · 2022-05-24T20:56:24Z

Once we resolve the question of the curly braces with >999 this gets a green light from me.

If the count value is greater than max_value then an error is thrown now. I think that is all that is left.

vyasr

LGTM!

davidwendt · 2022-05-25T02:50:45Z

@gpucibot merge

Cleans up the `regcomp.cpp` source to fix class names, comments, and simplify logic around processing operators and operands returned by the parser. Several class member variables used for state are moved or eliminated. Some member functions and variables are renamed. Cleanup of the parser logic will be in a follow-on PR.

Reference #3582
Follow on to #10843

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Yunsong Wang (https://github.com/PointKernel)

URL: #10879

Cleanup regex fixed quantifiers code

f439900

davidwendt added the 2 - In Progress, libcudf, improvement, and non-breaking labels May 12, 2022

davidwendt self-assigned this May 12, 2022

davidwendt added the tech debt label May 12, 2022

fix merge conflict

5d79332

davidwendt added 6 commits May 13, 2022 08:09

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

8b7dad8

add documentation quantifier range limits

26be3d0

add additional error checks

1c865e9

improve comments and variable names in expand_counted fn

dc55200

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

9de0b60

update restrictions in regex doc page

962625a

davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label May 16, 2022

davidwendt marked this pull request as ready for review May 16, 2022 15:39

davidwendt requested a review from a team as a code owner May 16, 2022 15:39

davidwendt requested review from mythrocks and ttnghia May 16, 2022 15:39

davidwendt added 2 commits May 16, 2022 17:57

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

5b18daf

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

de217de

davidwendt mentioned this pull request May 17, 2022

Cleanup regex compiler operators and operands source #10879

Merged

ttnghia reviewed May 17, 2022

cpp/src/strings/regex/regcomp.cpp

ttnghia reviewed May 17, 2022

cpp/src/strings/regex/regcomp.cpp

ttnghia reviewed May 17, 2022

cpp/src/strings/regex/regcomp.cpp

ttnghia approved these changes May 17, 2022

davidwendt added 5 commits May 17, 2022 16:46

add assert(n>=0) check

303829a

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

1808665

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

6d12fc3

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

593cf0e

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

d8cea51

vyasr requested changes May 19, 2022

add some consts; change vector to array

cfc0ced

davidwendt requested a review from vyasr May 19, 2022 18:30

vyasr reviewed May 19, 2022

cpp/src/strings/regex/regcomp.cpp

cpp/tests/strings/contains_tests.cpp Outdated

cpp/tests/strings/contains_tests.cpp

cpp/src/strings/regex/regcomp.cpp Outdated

Merge branch 'branch-22.06' into cleanup-fixed-quantifiers

84e0e86

bdice reviewed May 20, 2022

vyasr reviewed May 20, 2022

davidwendt added 2 commits May 24, 2022 10:54

fix regex doc page wording for quantifiers

fe25cb7

throw error if invalid repeat count

91f7c34

davidwendt requested a review from vyasr May 24, 2022 15:22

davidwendt added 2 commits May 24, 2022 16:58

remove commented out line

581da1f

add range test for m

043f195

vyasr approved these changes May 24, 2022

davidwendt added the breaking label and removed the non-breaking label May 24, 2022

rapids-bot bot merged commit 6a64ce1 into rapidsai:branch-22.06 May 25, 2022

davidwendt deleted the cleanup-fixed-quantifiers branch May 25, 2022 02:50
