Cleanup regex compiler fixed quantifiers source #10843

Merged
21 commits merged into rapidsai:branch-22.06 from the cleanup-fixed-quantifiers branch
May 25, 2022

Conversation

davidwendt
Contributor

Cleans up the source that handles fixed quantifiers {n,m}, which repeat a pattern over a range of counts instead of just zero, one, or infinitely many times. Hopefully this will make this part of the regex parser/compiler easier to follow and maintain. There are many other items to clean up (reference #3582); this change concentrates mainly on the fixed quantifier handling.

No function or behavior has changed, but new gtests have been added to cover quantifier combinations that were not previously tested.
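
For readers unfamiliar with fixed quantifiers, the following is a minimal sketch of what `{n,m}` means, written against Python's `re` module rather than libcudf. The helper `expand_fixed_quantifier` is hypothetical and exists only for illustration; it is not a claim about how `regcomp.cpp` handles these quantifiers. It assumes the repeated item is a single literal character.

```python
import re


def expand_fixed_quantifier(item: str, n: int, m: int) -> str:
    """Rewrite item{n,m} as n required copies plus (m - n) optional copies.

    Hypothetical helper: `item` is assumed to be one literal character.
    """
    if not 0 <= n <= m:
        raise ValueError("require 0 <= n <= m")
    return item * n + (item + "?") * (m - n)


expanded = expand_fixed_quantifier("a", 2, 4)
print(expanded)  # aaa?a?

# Both forms accept exactly the strings of 2, 3, or 4 'a' characters.
for length in range(7):
    text = "a" * length
    same = bool(re.fullmatch("a{2,4}", text)) == bool(re.fullmatch(expanded, text))
    assert same, length
```

The equivalence shows why a range quantifier is more general than `?`, `*`, and `+`: it needs both a lower and an upper bound to be parsed and validated.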

@davidwendt added the labels 2 - In Progress, libcudf, improvement, and non-breaking on May 12, 2022
@davidwendt davidwendt self-assigned this May 12, 2022
codecov bot commented May 12, 2022

Codecov Report

Merging #10843 (168464a) into branch-22.06 (54789ee) will increase coverage by 0.02%.
The diff coverage is 93.75%.

❗ Current head 168464a differs from pull request most recent head 043f195. Consider uploading reports for the commit 043f195 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10843      +/-   ##
================================================
+ Coverage         86.30%   86.32%   +0.02%     
================================================
  Files               144      144              
  Lines             22665    22668       +3     
================================================
+ Hits              19560    19569       +9     
+ Misses             3105     3099       -6     
| Impacted Files | Coverage Δ |
|---|---|
| python/cudf/cudf/core/indexed_frame.py | 91.70% <ø> (ø) |
| python/cudf/cudf/utils/ioutils.py | 79.47% <87.50%> (-0.13%) ⬇️ |
| python/cudf/cudf/io/avro.py | 78.57% <100.00%> (ø) |
| python/cudf/cudf/io/csv.py | 91.80% <100.00%> (ø) |
| python/cudf/cudf/io/json.py | 97.56% <100.00%> (ø) |
| python/cudf/cudf/io/orc.py | 92.77% <100.00%> (ø) |
| python/cudf/cudf/io/parquet.py | 90.83% <100.00%> (ø) |
| python/cudf/cudf/io/text.py | 100.00% <100.00%> (ø) |
| python/cudf/cudf/core/dataframe.py | 93.78% <0.00%> (+0.04%) ⬆️ |
| python/cudf/cudf/core/column/string.py | 88.78% <0.00%> (+0.12%) ⬆️ |

... and 4 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 6acf226...043f195.

@davidwendt added the 3 - Ready for Review label and removed the 2 - In Progress label on May 16, 2022
@davidwendt davidwendt marked this pull request as ready for review May 16, 2022 15:39
@davidwendt davidwendt requested a review from a team as a code owner May 16, 2022 15:39
@davidwendt davidwendt requested review from mythrocks and ttnghia May 16, 2022 15:39
Contributor

@ttnghia left a comment

Approve with some comments.

Contributor

@vyasr left a comment

Whoops sorry I completed this review but I guess I forgot to actually submit it.

cpp/doxygen/regex.md Show resolved Hide resolved
// get left-side (n) value => min_count
exprp += transform_until(exprp, exprp + max_read, buffer.data(), "},");
auto count = std::atoi(buffer.data());
if ((*exprp != '}' && *exprp != ',') || (count > max_value)) {
Contributor

Since you're only reading at most 3 characters it is impossible to find count > max_value, right?

Contributor Author

That is kind of a weak link. If the max_value changes to a smaller 3-digit value, the check would need to be re-added. This way, this line should never need to change.

Contributor

But would there ever be a reason to use a number that isn't the largest number representable by max_read digits? My understanding of the code was that the limitation was solely in place to control the width of the buffer reads.

Contributor Author

Yes, I'd like to consider limiting max_value to something like 255 in the future which would not change max_read but require the count check.
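
The distinction between the digit limit and the value limit can be sketched as follows. This is a Python illustration under stated assumptions, not the libcudf code: `read_count`, `MAX_READ`, and `MAX_VALUE` are hypothetical names that mirror the `max_read` and `max_value` identifiers in the reviewed snippet, and the digit-scanning loop stands in for `transform_until` plus `std::atoi`.

```python
from typing import Tuple

MAX_READ = 3     # most digits read per bound (assumed from the "3 characters" remark)
MAX_VALUE = 999  # largest accepted count; lowering it need not change MAX_READ


def read_count(expr: str, pos: int, max_value: int = MAX_VALUE) -> Tuple[int, int]:
    """Read up to MAX_READ digits at pos; return (count, index of the terminator)."""
    end = pos
    while end < len(expr) and end - pos < MAX_READ and expr[end].isdigit():
        end += 1
    count = int(expr[pos:end]) if end > pos else 0
    # Same shape as the reviewed check: wrong terminator, or count too large.
    if end >= len(expr) or expr[end] not in "}," or count > max_value:
        raise ValueError(f"invalid quantifier count at position {pos}")
    return count, end


print(read_count("{300,}", 1))  # (300, 4)
```

With `max_value=999` the value check is redundant, because a fourth digit already fails the terminator test (`read_count("{1000}", 1)` raises). With `max_value=255`, a pattern such as `{300,}` still has three digits and a valid terminator, so only the `count > max_value` comparison rejects it.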

Contributor

That's fine with me. Just for my edification, what is the benefit of that choice? It doesn't have any performance implications unless someone actually requests a number that large at runtime, right?

Contributor

@vyasr May 24, 2022

I am asking whether there is a difference between 255 and 999 if a user never actually requests a number > 255. Based on

> The number does contribute to the size of the working memory so may affect runtime performance.

it sounds like the answer is yes? You allocate memory based on the maximum number of repetitions somewhere? In that case, I assume that the amount of memory increases stepwise as the maximum crosses powers of two?

Contributor Author

Nothing that sophisticated. Just maybe store the value in a smaller variable (e.g. uint8).
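
As a rough illustration of the storage point, the sketch below uses Python's `struct` module to show that 255 is the largest count that fits in one unsigned byte, while 999 needs a wider type. The helper `fits_uint8` is hypothetical; nothing here reflects how libcudf actually stores the count.

```python
import struct


def fits_uint8(count: int) -> bool:
    """Return True if count can be packed into a single unsigned byte."""
    try:
        struct.pack("B", count)
        return True
    except struct.error:
        return False


print(fits_uint8(255))  # True
print(fits_uint8(999))  # False
print(struct.calcsize("B"), struct.calcsize("<H"))  # 1 2
```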

Contributor

If it doesn’t affect runtime performance, why arbitrarily limit at 999 and not the max value for the size of the repetitions variable? There’s some awkwardness in explaining this arbitrary limit. If it were INT_MAX, I think the docs might not even need to mention it. After all, string columns also have a length limit, right? It might be impossible to reach a repeat limit of INT_MAX.

Contributor Author

It is just an arbitrary string of digits entered by a human that is being converted to an integer, so some error checking will need to be done since the string of decimal digits could be any length. Is there some reason not to limit it?

Contributor

@bdice May 24, 2022

Discussed offline with @davidwendt. I don't think it's worth blocking on this point so I'm fine with accepting a limit of 999. We can change it later if users need more.

@davidwendt davidwendt requested a review from vyasr May 19, 2022 18:30
Contributor

@bdice left a comment

@davidwendt I apologize sincerely -- I just noticed that I also started a PR review that I did not complete and submit. Here are a small number of comments from that partial review.

It looks like the review from @vyasr is much more complete, so I will excuse myself from additional review unless you need another look at anything.

cpp/doxygen/regex.md Outdated
| Greedy quantifier | `{n,m}` where `n` and `m` are integers: `0 ≤ n ≤ 999` and `n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Greedy, so repeating `m` times is tried before reducing the repetition to `n` times. | `a{2,4}` matches `aaaa`, `aaa` or `aa` |
| Greedy quantifier | `{n,}` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item at least `n` times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only `n` times. | `a{2,}` matches `aaaaa` in `aaaaa` |
| Lazy quantifier | `{n,m}?` where `n` and `m` are integers `0 ≤ n ≤ 999` and `n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Lazy, so repeating `n` times is tried before increasing the repetition to `m` times. | `a{2,4}?` matches `aa`, `aaa` or `aaaa` |
| Lazy quantifier | `{n,}?` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item `n` or more times. Lazy, so the engine first matches the previous item `n` times, before trying permutations with ever increasing matches of the preceding item. | `a{2,}?` matches `aa` in `aaaaa` |
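
The four rows above can be checked quickly with Python's `re` module, used here only as a stand-in with the same greedy and lazy semantics; it is an assumption that libcudf returns the same first match for these patterns. The dictionary name `first_matches` is illustrative.

```python
import re

text = "aaaaa"
patterns = ["a{2,4}", "a{2,}", "a{2,4}?", "a{2,}?"]

# First match of each pattern against the same five-character input.
first_matches = {p: re.search(p, text).group(0) for p in patterns}

for pattern, matched in first_matches.items():
    print(f"{pattern:8} -> {matched}")
# a{2,4}   -> aaaa
# a{2,}    -> aaaaa
# a{2,4}?  -> aa
# a{2,}?   -> aa
```

The greedy forms take as many repetitions as the bounds allow, while the lazy forms stop at the lower bound because nothing after the quantifier forces them to take more.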
Contributor

General regex question: if this is lazy, how does its behavior differ from matching exactly n repetitions? What would force it to match more repetitions?

Contributor Author

Honestly, I don't know. I think it depends on the previous character pattern. Here is an example with '.' as the repeat item:

>>> re.search('.{2,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{3,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{2,}?b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3,}?b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>

Maybe there are better examples.

Contributor

I think Bradley is asking how {n} is different from {n,}?, not how {n,} is different from {n,}?. Here are the extra two cases that need to be added to your examples:

>>> re.search('.{2}b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3}b', 'aabcdefb')
<re.Match object; span=(4, 8), match='defb'>

The differences have to do with backtracking behavior and whether matching the entire regex requires that the lazy quantifier accept more characters. For example:

>>> re.search('a+b{2}a+', 'aaaabbbaaa')
>>> re.search('a+b{2,}?a+', 'aaaabbbaaa')
<re.Match object; span=(0, 10), match='aaaabbbaaa'>

In this case, an exact requirement of b{2} won't match, because there are three. But the lazy quantifier says "OK, in that case I'll take some extra b characters and see if I can get it to match".
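
To make the backtracking behavior visible, the sketch below captures how many `b` characters the lazy quantifier ends up consuming for several inputs. It uses Python's `re` module like the examples above; the helper name `lazy_b_run` is hypothetical.

```python
import re


def lazy_b_run(text):
    """Return the run of b's taken by the lazy quantifier, or None if no match."""
    m = re.search(r"a+(b{2,}?)a+", text)
    return m.group(1) if m else None


print(lazy_b_run("aabbaa"))      # bb
print(lazy_b_run("aaaabbbaaa"))  # bbb
print(lazy_b_run("aabbbbba"))    # bbbbb
print(lazy_b_run("aaba"))        # None
```

The lazy quantifier starts at the minimum of two repetitions and grows only as far as needed for the trailing `a+` to match, which is what separates `{n,}?` from an exact `{n}`.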

Contributor

@vyasr left a comment

Once we resolve the question of the curly braces with >999 this gets a green light from me.

@davidwendt davidwendt requested a review from vyasr May 24, 2022 15:22
@davidwendt
Contributor Author

> Once we resolve the question of the curly braces with >999 this gets a green light from me.

If the count value is greater than max_value then an error is thrown now. I think that is all that is left.

Contributor

@vyasr left a comment

LGTM!

@davidwendt added the breaking label and removed the non-breaking label on May 24, 2022
@davidwendt
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 6a64ce1 into rapidsai:branch-22.06 May 25, 2022
@davidwendt davidwendt deleted the cleanup-fixed-quantifiers branch May 25, 2022 02:50
rapids-bot bot pushed a commit that referenced this pull request May 27, 2022
Cleans up the `regcomp.cpp` source to fix class names, comments, and simplify logic around processing operators and operands returned by the parser. Several class member variables used for state are moved or eliminated. Some member functions and variables are renamed. Cleanup of the parser logic will be in a follow-on PR.

Reference #3582
Follow on to #10843

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #10879