diff --git a/cpp/doxygen/regex.md b/cpp/doxygen/regex.md
index 68a446846ce..bfa5745e269 100644
--- a/cpp/doxygen/regex.md
+++ b/cpp/doxygen/regex.md
@@ -11,6 +11,8 @@ This page specifies which regular expression (regex) features are currently supp
 - cudf::strings::findall_record()
 - cudf::strings::replace_re()
 - cudf::strings::replace_with_backrefs()
+- cudf::strings::split_re()
+- cudf::strings::split_record_re()
 The details are based on features documented at https://www.regular-expressions.info/reference.html
@@ -43,7 +45,7 @@ The details are based on features documented at https://www.regular-expressions.
 | Feature | Syntax | Description | Example |
 | ---------- | ------------- | ------------- | ------------- |
-| Dot | . (dot) | Matches any single character except line break characters. Optionally match line break characters. | . matches x or (almost) any other character |
+| Dot | . (dot) | Matches any single character except line break characters. Optionally match line break characters. The behavior of the dot when encountering a `\n` character can be controlled by cudf::strings::regex_flags for some regex APIs. | . matches x or (almost) any other character |
 | Alternation | `⎮` (pipe) | Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of alternations. | `abc⎮def⎮xyz` matches `abc`, `def` or `xyz` |
@@ -79,8 +81,8 @@ The details are based on features documented at https://www.regular-expressions.
 | ---------- | ------------- | ------------- | ------------- |
 | String anchor | `^` (caret) | Matches at the start of the string | `^.` matches `a` in `abcdef` |
 | String anchor | `$` (dollar) | Matches at the end of the string | `.$` matches `f` in `abcdef` |
-| Line anchor | `^` (caret) | Matches after each line break in addition to matching at the start of the string, thus matching at the start of each line in the string. | `^.` matches `a` and `d` in `abc\ndef` |
-| Line anchor | `$` (dollar) | Matches before each line break in addition to matching at the end of the string, thus matching at the end of each line in the string. | `.$` matches `c` and `f` in `abc\ndef` |
+| Line anchor | `^` (caret) | Matches after each line break in addition to matching at the start of the string, thus matching at the start of each line in the string. The behavior of this anchor can be controlled by cudf::strings::regex_flags for some regex APIs. | `^.` matches `a` and `d` in `abc\ndef` |
+| Line anchor | `$` (dollar) | Matches before each line break in addition to matching at the end of the string, thus matching at the end of each line in the string. The behavior of this anchor can be controlled by cudf::strings::regex_flags for some regex APIs. | `.$` matches `c` and `f` in `abc\ndef` |
 | String anchor | `\A` | Matches at the start of the string | `\A\w` matches only `a` in `abc` |
 | String anchor | `\Z` | Matches at the end of the string | `\w\Z` matches `f` in `abc\ndef` but fails to match `abc\ndef\n` or `abc\ndef\n\n` |
@@ -111,5 +113,13 @@ The details are based on features documented at https://www.regular-expressions.
 | Feature | Syntax | Description | Example |
 | ---------- | ------------- | ------------- | ------------- |
-| Capturing group | `(`regex`)` | Parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group. They allow you to apply regex operators to the entire grouped regex. | `(abc⎮def)ghi` matches `abcghi` or `defghi` |
-| Non-capturing group | `(?:`regex`)` | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything. | `(?:abc⎮def)ghi` matches `abcghi` or `defghi` |
+| Capturing group | `(regex)` | Parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group. They allow you to apply regex operators to the entire grouped regex. | `(abc⎮def)ghi` matches `abcghi` or `defghi` |
+| Non-capturing group | `(?:regex)` | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything. | `(?:abc⎮def)ghi` matches `abcghi` or `defghi` |
+
+### Replacement Backreferences
+
+| Feature | Syntax | Description | Example |
+| ---------- | ------------- | ------------- | ------------- |
+| Backreference | `\1` through `\99` | Insert the text matched by capturing groups 1 through 99 | Replacing `(a)(b)(c)` with `\3\3\1` in `abc` yields `cca` |
+| Backreference | `${1}` through `${99}` | Insert the text matched by capturing groups 1 through 99 | Replacing `(a)(b)(c)` with `${2}.${2}:${3}` in `abc` yields `b.b:c` |
+| Whole match | `${0}` | Insert the whole regex match | Replacing `(\d)(a)` with `[${0}]:-${2}_${1};` in `123abc` yields `12[3a]:-a_3;bc` |
diff --git a/cpp/include/cudf/strings/split/split_re.hpp b/cpp/include/cudf/strings/split/split_re.hpp
index 320d1bdc9b4..9f40956722d 100644
--- a/cpp/include/cudf/strings/split/split_re.hpp
+++ b/cpp/include/cudf/strings/split/split_re.hpp
@@ -162,6 +162,8 @@ std::unique_ptr