Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cleanup regex compiler fixed quantifiers source #10843

Merged
merged 21 commits into from
May 25, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
21 commits
Select commit Hold shift + click to select a range
f439900
Cleanup regex fixed quantifiers code
davidwendt May 12, 2022
5d79332
fix merge conflict
davidwendt May 12, 2022
8b7dad8
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 13, 2022
26be3d0
add documentation quantifier range limits
davidwendt May 13, 2022
1c865e9
add additional error checks
davidwendt May 13, 2022
dc55200
improve comments and variable names in expand_counted fn
davidwendt May 13, 2022
9de0b60
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 16, 2022
962625a
update restrictions in regex doc page
davidwendt May 16, 2022
5b18daf
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 16, 2022
de217de
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 17, 2022
303829a
add assert(n>=0) check
davidwendt May 17, 2022
1808665
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 17, 2022
6d12fc3
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 18, 2022
593cf0e
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 19, 2022
d8cea51
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 19, 2022
cfc0ced
add some consts; change vector to array
davidwendt May 19, 2022
84e0e86
Merge branch 'branch-22.06' into cleanup-fixed-quantifiers
davidwendt May 19, 2022
fe25cb7
fix regex doc page wording for quantifiers
davidwendt May 24, 2022
91f7c34
throw error if invalid repeat count
davidwendt May 24, 2022
581da1f
remove commented out line
davidwendt May 24, 2022
043f195
add range test for m
davidwendt May 24, 2022
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
14 changes: 7 additions & 7 deletions cpp/doxygen/regex.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,16 +98,16 @@ The details are based on features documented at https://www.regular-expressions.
| Feature | Syntax | Description | Example |
| ---------- | ------------- | ------------- | ------------- |
| Greedy quantifier | `?` (question mark) | Makes the preceding item optional. Greedy, so the optional item is included in the match if possible. | `abc?` matches `abc` or `ab` |
| Greedy quantifier | `*` (star) | Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is not matched at all. | `".*"` matches `"def"` and `"ghi"` in `abc "def" "ghi" jkl` |
| Greedy quantifier | `+` (plus) | Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only once. | `".+"` matches `"def"` and `"ghi"` in `abc "def" "ghi" jkl` |
| Greedy quantifier | `*` (star) | Repeats the previous item zero or more times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is not matched at all. | `".*"` matches `"def"` and `"ghi"` in `abc "def" "ghi" jkl` |
| Greedy quantifier | `+` (plus) | Repeats the previous item once or more. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only once. | `".+"` matches `"def"` and `"ghi"` in `abc "def" "ghi" jkl` |
| Lazy quantifier | `??` | Makes the preceding item optional. Lazy, so the optional item is excluded in the match if possible. | `abc??` matches `ab` or `abc` |
| Lazy quantifier | `*?` | Repeats the previous item zero or more times. Lazy, so the engine first attempts to skip the previous item, before trying permutations with ever increasing matches of the preceding item. | `".*?"` matches `"def"` and `"ghi"` in `abc "def" "ghi" jkl` |
| Lazy quantifier | `+?` | Repeats the previous item once or more. Lazy, so the engine first matches the previous item only once, before trying permutations with ever increasing matches of the preceding item. | `".+?"` matches `"def"` and `"ghi"` in `abc "def" "ghi" jkl` |
| Fixed quantifier | `{n}` where `n is an integer >= 1` | Repeats the previous item exactly `n` times. | `a{5}` matches `aaaaa` |
| Greedy quantifier | `{n,m}` where `n >= 0` and `m >= n` | Repeats the previous item between `n` and `m` times. Greedy, so repeating `m` times is tried before reducing the repetition to `n` times. | `a{2,4}` matches `aaaa`, `aaa` or `aa` |
| Greedy quantifier | `{n,}` where `n >= 0` | Repeats the previous item at least `n` times. Greedy, so as many items as possible will be matched before trying permutations with less matches of the preceding item, up to the point where the preceding item is matched only `n` times. | `a{2,}` matches `aaaaa` in `aaaaa` |
| Lazy quantifier | `{n,m}?` where `n >= 0` and `m >= n` | Repeats the previous item between `n` and `m` times. Lazy, so repeating `n` times is tried before increasing the repetition to `m` times. | `a{2,4}?` matches `aa`, `aaa` or `aaaa` |
| Lazy quantifier | `{n,}?` where `n >= 0` | Repeats the previous item `n` or more times. Lazy, so the engine first matches the previous item `n` times, before trying permutations with ever increasing matches of the preceding item. | `a{2,}?` matches `aa` in `aaaaa` |
| Fixed quantifier | `{n}` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item exactly `n` times. | `a{5}` matches `aaaaa` |
vyasr marked this conversation as resolved.
Show resolved Hide resolved
| Greedy quantifier | `{n,m}` where `n` and `m` are integers: `0 ≤ n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Greedy, so repeating `m` times is tried before reducing the repetition to `n` times. | `a{2,4}` matches `aaaa`, `aaa` or `aa` |
| Greedy quantifier | `{n,}` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item at least `n` times. Greedy, so as many items as possible will be matched before trying permutations with fewer matches of the preceding item, up to the point where the preceding item is matched only `n` times. | `a{2,}` matches `aaaaa` in `aaaaa` |
| Lazy quantifier | `{n,m}?` where `n` and `m` are integers `0 ≤ n ≤ m ≤ 999` | Repeats the previous item between `n` and `m` times. Lazy, so repeating `n` times is tried before increasing the repetition to `m` times. | `a{2,4}?` matches `aa`, `aaa`, or `aaaa` |
| Lazy quantifier | `{n,}?` where `n` is an integer: `0 ≤ n ≤ 999` | Repeats the previous item `n` or more times. Lazy, so the engine first matches the previous item `n` times, before trying permutations with ever increasing matches of the preceding item. | `a{2,}?` matches `aa` in `aaaaa` |
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

General regex question: if this is lazy, how does its behavior differ from matching exactly n repetitions? What would force it to match more repetitions?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Honestly, I don't know. I think it depends on the previous character pattern. Here is an example with '.' as the repeat item:

>>> re.search('.{2,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{3,}b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>
>>> re.search('.{2,}?b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3,}?b', 'aabcdefb')
<re.Match object; span=(0, 8), match='aabcdefb'>

Maybe there are better examples.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Bradley is asking how {n} is different from {n,}?, not how {n,} is different from {n,}?. Here are the extra two cases that need to be added to your examples:

>>> re.search('.{2}b', 'aabcdefb')
<re.Match object; span=(0, 3), match='aab'>
>>> re.search('.{3}b', 'aabcdefb')
<re.Match object; span=(4, 8), match='defb'>

The differences have to do with backtracking behavior and whether matching the entire regex requires that the lazy quantifier accept more characters. For example:

>>> re.search('a+b{2}a+', 'aaaabbbaaa')
>>> re.search('a+b{2,}?a+', 'aaaabbbaaa')
<re.Match object; span=(0, 10), match='aaaabbbaaa'>

In this case, an exact requirement of b{2} won't match, because there are three. But the lazy quantifier says "OK, in that case I'll take some extra b characters and see if I can get it to match".


### Groups

Expand Down
255 changes: 128 additions & 127 deletions cpp/src/strings/regex/regcomp.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,11 @@

#include <algorithm>
#include <array>
#include <cctype>
#include <numeric>
#include <stack>
#include <string>
#include <vector>

namespace cudf {
namespace strings {
Expand All @@ -45,6 +47,7 @@ enum OperatorType {
COUNTED_LAZY = 0215,
NOP = 0302, // No operation, internal use only
};
#define ITEM_MASK 0300

static reclass ccls_w(CCLASS_W); // \w
static reclass ccls_s(CCLASS_S); // \s
Expand Down Expand Up @@ -152,10 +155,10 @@ class regex_parser {
int id_ccls_d = -1; // digit
int id_ccls_D = -1; // not digit

char32_t yy; /* last lex'd Char */
int yyclass_id; /* last lex'd class */
short yy_min_count;
short yy_max_count;
char32_t yy{}; /* last lex'd Char */
int yyclass_id{}; /* last lex'd class */
int16_t yy_min_count{};
int16_t yy_max_count{};

bool nextc(char32_t& c) // return "quoted" == backslash-escape prefix
{
Expand Down Expand Up @@ -454,41 +457,69 @@ class regex_parser {
return PLUS_LAZY;
}
return PLUS;
case '{': // counted repetition
case '{': // counted repetition: {n,m}
{
if (*exprp < '0' || *exprp > '9') break;
const char32_t* exprp_backup = exprp; // in case '}' is not found
char buff[8] = {0};
for (int i = 0; i < 7 && *exprp != '}' && *exprp != ',' && *exprp != 0; i++, exprp++) {
buff[i] = *exprp;
buff[i + 1] = 0;
}
if (*exprp != '}' && *exprp != ',') {
exprp = exprp_backup;
break;
}
sscanf(buff, "%hd", &yy_min_count);
if (*exprp != ',')
yy_max_count = yy_min_count;
else {
yy_max_count = -1;
exprp++;
buff[0] = 0;
for (int i = 0; i < 7 && *exprp != '}' && *exprp != 0; i++, exprp++) {
buff[i] = *exprp;
buff[i + 1] = 0;
if (!std::isdigit(*exprp)) { break; }

// transform char32 to char until null, delimiter, non-digit or end is reached;
// returns the number of chars read/transformed
auto transform_until = [](char32_t const* input,
char32_t const* end,
char* output,
vyasr marked this conversation as resolved.
Show resolved Hide resolved
std::string_view const delimiters) -> int32_t {
int32_t count = 0;
while (*input != 0 && input < end) {
auto const ch = static_cast<char>(*input++);
davidwendt marked this conversation as resolved.
Show resolved Hide resolved
// if ch not a digit or ch is a delimiter, we are done
if (!std::isdigit(ch) || delimiters.find(ch) != delimiters.npos) { break; }
output[count] = ch;
++count;
}
if (*exprp != '}') {
exprp = exprp_backup;
break;
output[count] = 0; // null-terminate (for the atoi call)
return count;
};

constexpr auto max_read = 4; // 3 digits plus the delimiter
constexpr auto max_value = 999; // support only 3 digits
std::array<char, max_read + 1> buffer = {0}; //(max_read + 1);

// get left-side (n) value => min_count
auto bytes_read = transform_until(exprp, exprp + max_read, buffer.data(), "},");
if (exprp[bytes_read] != '}' && exprp[bytes_read] != ',') {
break; // re-interpret as CHAR
}
auto count = std::atoi(buffer.data());
CUDF_EXPECTS(count <= max_value,
"unsupported repeat value at " + std::to_string(exprp - pattern - 1));
yy_min_count = static_cast<int16_t>(count);

auto const exprp_backup = exprp; // save in case ending '}' is not found
exprp += bytes_read;

// get optional right-side (m) value => max_count
yy_max_count = yy_min_count;
if (*exprp++ == ',') {
bytes_read = transform_until(exprp, exprp + max_read, buffer.data(), "}");
if (exprp[bytes_read] != '}') {
exprp = exprp_backup; // abort, rollback and
break; // re-interpret as CHAR
}
if (buff[0] != 0) sscanf(buff, "%hd", &yy_max_count);

count = std::atoi(buffer.data());
CUDF_EXPECTS(count <= max_value,
"unsupported repeat value at " + std::to_string(exprp - pattern - 1));

// {n,m} and {n,} are both valid
yy_max_count = buffer[0] == 0 ? -1 : static_cast<int16_t>(count);
exprp += bytes_read + 1;
}
exprp++;

// {n,m}? pattern is lazy counted quantifier
if (*exprp == '?') {
exprp++;
return COUNTED_LAZY;
}
// otherwise, fixed counted quantifier
return COUNTED;
}
case '|': return OR;
Expand Down Expand Up @@ -562,6 +593,9 @@ class regex_compiler {

regex_flags flags;

char32_t yy;
vyasr marked this conversation as resolved.
Show resolved Hide resolved
int yyclass_id;

inline void pushand(int f, int l) { andstack.push_back({f, l}); }

inline Node popand(int op)
Expand Down Expand Up @@ -714,97 +748,70 @@ class regex_compiler {
lastwasand = true;
}

char32_t yy;
int yyclass_id;

void expand_counted(const std::vector<regex_parser::Item>& in,
std::vector<regex_parser::Item>& out)
std::vector<regex_parser::Item> expand_counted(std::vector<regex_parser::Item> const& in)
{
std::vector<int> lbra_stack;
int rep_start = -1;

out.clear();
for (std::size_t i = 0; i < in.size(); i++) {
if (in[i].t != COUNTED && in[i].t != COUNTED_LAZY) {
out.push_back(in[i]);
if (in[i].t == LBRA || in[i].t == LBRA_NC) {
lbra_stack.push_back(i);
rep_start = -1;
} else if (in[i].t == RBRA) {
rep_start = lbra_stack[lbra_stack.size() - 1];
lbra_stack.pop_back();
} else if ((in[i].t & 0300) != OPERATOR_MASK) {
rep_start = i;
std::vector<regex_parser::Item> out;
std::stack<int> lbra_stack;
auto repeat_start_index = -1;

for (std::size_t index = 0; index < in.size(); index++) {
auto const item = in[index];

if (item.t != COUNTED && item.t != COUNTED_LAZY) {
out.push_back(item);
if (item.t == LBRA || item.t == LBRA_NC) {
lbra_stack.push(index);
repeat_start_index = -1;
} else if (item.t == RBRA) {
repeat_start_index = lbra_stack.top();
lbra_stack.pop();
} else if ((item.t & ITEM_MASK) != OPERATOR_MASK) {
repeat_start_index = index;
}
vyasr marked this conversation as resolved.
Show resolved Hide resolved
} else {
if (rep_start < 0) // broken regex
return;
// item is of type COUNTED or COUNTED_LAZY
// here we repeat the previous item(s) based on the count range in item

regex_parser::Item item = in[i];
if (item.d.yycount.n <= 0) {
// need to erase
for (std::size_t j = 0; j < i - rep_start; j++)
out.pop_back();
} else {
// repeat
for (int j = 1; j < item.d.yycount.n; j++)
for (std::size_t k = rep_start; k < i; k++)
out.push_back(in[k]);
CUDF_EXPECTS(repeat_start_index >= 0, "regex: invalid counted quantifier location");

// range of affected item(s) to repeat
auto const begin = in.begin() + repeat_start_index;
auto const end = in.begin() + index;
// count range values
auto const n = item.d.yycount.n; // minimum count
auto const m = item.d.yycount.m; // maximum count

assert(n >= 0 && "invalid repeat count value n");
// zero-repeat edge-case: need to erase the previous items
if (n == 0) { out.erase(out.end() - (index - repeat_start_index), out.end()); }

// minimum repeats (n)
for (int j = 1; j < n; j++) {
out.insert(out.end(), begin, end);
davidwendt marked this conversation as resolved.
Show resolved Hide resolved
}

// optional repeats
if (item.d.yycount.m >= 0) {
for (int j = item.d.yycount.n; j < item.d.yycount.m; j++) {
regex_parser::Item o_item;
o_item.t = LBRA_NC;
o_item.d.yy = 0;
out.push_back(o_item);
for (std::size_t k = rep_start; k < i; k++)
out.push_back(in[k]);
// optional maximum repeats (m)
if (m >= 0) {
for (int j = n; j < m; j++) {
out.push_back(regex_parser::Item{LBRA_NC, 0});
out.insert(out.end(), begin, end);
davidwendt marked this conversation as resolved.
Show resolved Hide resolved
}
for (int j = item.d.yycount.n; j < item.d.yycount.m; j++) {
regex_parser::Item o_item;
o_item.t = RBRA;
o_item.d.yy = 0;
out.push_back(o_item);
if (item.t == COUNTED) {
o_item.t = QUEST;
out.push_back(o_item);
} else {
o_item.t = QUEST_LAZY;
out.push_back(o_item);
}
for (int j = n; j < m; j++) {
out.push_back(regex_parser::Item{RBRA, 0});
out.push_back(regex_parser::Item{item.t == COUNTED ? QUEST : QUEST_LAZY, 0});
}
} else // infinite repeat
{
regex_parser::Item o_item;
o_item.d.yy = 0;

if (item.d.yycount.n > 0) // put '+' after last repetition
{
if (item.t == COUNTED) {
o_item.t = PLUS;
out.push_back(o_item);
} else {
o_item.t = PLUS_LAZY;
out.push_back(o_item);
}
} else // copy it once then put '*'
{
for (std::size_t k = rep_start; k < i; k++)
out.push_back(in[k]);

if (item.t == COUNTED) {
o_item.t = STAR;
out.push_back(o_item);
} else {
o_item.t = STAR_LAZY;
out.push_back(o_item);
}
} else {
// infinite repeats
if (n > 0) { // append '+' after last repetition
out.push_back(regex_parser::Item{item.t == COUNTED ? PLUS : PLUS_LAZY, 0});
} else { // copy it once then append '*'
out.insert(out.end(), begin, end);
out.push_back(regex_parser::Item{item.t == COUNTED ? STAR : STAR_LAZY, 0});
}
}
}
}
return out;
}

public:
Expand All @@ -819,23 +826,17 @@ class regex_compiler {
yyclass_id(0)
{
// Parse
std::vector<regex_parser::Item> items;
{
std::vector<regex_parser::Item> const items = [&] {
regex_parser parser(pattern, is_dotall(flags) ? ANYNL : ANY, m_prog);

// Expand counted repetitions
if (parser.m_has_counted)
expand_counted(parser.m_items, items);
else
items = parser.m_items;
}
return parser.m_has_counted ? expand_counted(parser.m_items) : parser.m_items;
}();

/* Start with a low priority operator to prime parser */
pushator(START - 1);

for (int i = 0; i < static_cast<int>(items.size()); i++) {
regex_parser::Item item = items[i];
int token = item.t;
auto const item = items[i];
int token = item.t;
if (token == CCLASS || token == NCCLASS)
yyclass_id = item.d.yyclass_id;
else
Expand Down Expand Up @@ -1109,12 +1110,12 @@ void reprog::print(regex_flags const flags)
if (cls.builtins) {
int mask = cls.builtins;
printf(" builtins(x%02X):", static_cast<unsigned>(mask));
if (mask & 1) printf(" \\w");
if (mask & 2) printf(" \\s");
if (mask & 4) printf(" \\d");
if (mask & 8) printf(" \\W");
if (mask & 16) printf(" \\S");
if (mask & 32) printf(" \\D");
if (mask & CCLASS_W) printf(" \\w");
if (mask & CCLASS_S) printf(" \\s");
if (mask & CCLASS_D) printf(" \\d");
if (mask & NCCLASS_W) printf(" \\W");
if (mask & NCCLASS_S) printf(" \\S");
if (mask & NCCLASS_D) printf(" \\D");
}
printf("\n");
}
Expand Down
Loading