Add Replace Backreferences section to Regex Features page #10560

Merged
20 changes: 15 additions & 5 deletions cpp/doxygen/regex.md
@@ -11,6 +11,8 @@ This page specifies which regular expression (regex) features are currently supp
- cudf::strings::findall_record()
- cudf::strings::replace_re()
- cudf::strings::replace_with_backrefs()
- cudf::strings::split_re()
- cudf::strings::split_record_re()

The details are based on features documented at https://www.regular-expressions.info/reference.html

@@ -43,7 +45,7 @@ The details are based on features documented at https://www.regular-expressions.

| Feature | Syntax | Description | Example |
| ---------- | ------------- | ------------- | ------------- |
| Dot | . (dot) | Matches any single character except line break characters. Optionally match line break characters. | . matches x or (almost) any other character |
| Dot | . (dot) | Matches any single character except line break characters, and can optionally match line break characters as well. For some regex APIs, whether the dot matches a `\n` character is controlled by cudf::strings::regex_flags. | . matches x or (almost) any other character |
| Alternation | `⎮` (pipe) | Causes the regex engine to match either the part on the left side, or the part on the right side. Can be strung together into a series of alternations. | `abc⎮def⎮xyz` matches `abc`, `def` or `xyz` |


@@ -79,8 +81,8 @@ The details are based on features documented at https://www.regular-expressions.
| ---------- | ------------- | ------------- | ------------- |
| String anchor | `^` (caret) | Matches at the start of the string | `^.` matches `a` in `abcdef` |
| String anchor | `$` (dollar) | Matches at the end of the string | `.$` matches `f` in `abcdef` |
| Line anchor | `^` (caret) | Matches after each line break in addition to matching at the start of the string, thus matching at the start of each line in the string. | `^.` matches `a` and `d` in `abc\ndef` |
| Line anchor | `$` (dollar) | Matches before each line break in addition to matching at the end of the string, thus matching at the end of each line in the string. | `.$` matches `c` and `f` in `abc\ndef` |
| Line anchor | `^` (caret) | Matches after each line break in addition to matching at the start of the string, thus matching at the start of each line in the string. The behavior of this anchor can be controlled by cudf::strings::regex_flags for some regex APIs. | `^.` matches `a` and `d` in `abc\ndef` |
| Line anchor | `$` (dollar) | Matches before each line break in addition to matching at the end of the string, thus matching at the end of each line in the string. The behavior of this anchor can be controlled by cudf::strings::regex_flags for some regex APIs. | `.$` matches `c` and `f` in `abc\ndef` |
| String anchor | `\A` | Matches at the start of the string | `\A\w` matches only `a` in `abc` |
| String anchor | `\Z` | Matches at the end of the string | `\w\Z` matches `f` in `abc\ndef` but fails to match `abc\ndef\n` or `abc\ndef\n\n` |

@@ -111,5 +113,13 @@ The details are based on features documented at https://www.regular-expressions.

| Feature | Syntax | Description | Example |
| ---------- | ------------- | ------------- | ------------- |
| Capturing group | `(`regex`)` | Parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group. They allow you to apply regex operators to the entire grouped regex. | `(abc⎮def)ghi` matches `abcghi` or `defghi` |
| Non-capturing group | `(?:`regex`)` | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything. | `(?:abc⎮def)ghi` matches `abcghi` or `defghi` |
| Capturing group | `(regex)` | Parentheses group the regex between them. They capture the text matched by the regex inside them into a numbered group. They allow you to apply regex operators to the entire grouped regex. | `(abc⎮def)ghi` matches `abcghi` or `defghi` |
| Non-capturing group | `(?:regex)` | Non-capturing parentheses group the regex so you can apply regex operators, but do not capture anything. | `(?:abc⎮def)ghi` matches `abcghi` or `defghi` |

### Replacement Backreferences

| Feature | Syntax | Description | Example |
| ---------- | ------------- | ------------- | ------------- |
| Backreference | `\1` through `\99` | Insert the text matched by capturing groups 1 through 99 | Replacing `(a)(b)(c)` with `\3\3\1` in `abc` yields `cca` |
| Backreference | `${1}` through `${99}` | Insert the text matched by capturing groups 1 through 99 | Replacing `(a)(b)(c)` with `${2}.${2}:${3}` in `abc` yields `b.b:c` |
| Whole match | `${0}` | Insert the whole regex match | Replacing `(\d)(a)` with `[${0}]:-${2}_${1};` in `123abc` yields `12[3a]:-a_3;bc` |
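
The three examples in the Replacement Backreferences table can be sanity-checked with a minimal sketch that uses `std::regex` from the C++ standard library rather than libcudf. This is an illustration under stated assumptions, not the libcudf API: `std::regex_replace` writes group references as `$N` and the whole match as `$&` instead of `\N`, `${N}` and `${0}`, and the helper names `replace_groups` and `check_table_examples` are hypothetical.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Illustration only: std::regex (ECMAScript grammar), not the libcudf regex engine.
// std::regex_replace spells group N as $N and the whole match as $&,
// where the table uses \N or ${N} for groups and ${0} for the whole match.
// `replace_groups` is a hypothetical helper name.
std::string replace_groups(const std::string& input,
                           const std::string& pattern,
                           const std::string& replacement)
{
  return std::regex_replace(input, std::regex(pattern), replacement);
}

// Re-checks the three table examples using the equivalent $-style templates.
void check_table_examples()
{
  // \3\3\1 in the table corresponds to $3$3$1 here.
  assert(replace_groups("abc", "(a)(b)(c)", "$3$3$1") == "cca");
  // ${2}.${2}:${3} corresponds to $2.$2:$3 here.
  assert(replace_groups("abc", "(a)(b)(c)", "$2.$2:$3") == "b.b:c");
  // ${0} (whole match) corresponds to $& here; unmatched text is copied through.
  assert(replace_groups("123abc", "(\\d)(a)", "[$&]:-$2_$1;") == "12[3a]:-a_3;bc");
}
```

Calling `check_table_examples()` from a `main` function should complete without tripping an assertion. Only the replacement-template spelling differs from the table; the group numbering and the treatment of unmatched text are the same in all three cases.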
4 changes: 4 additions & 0 deletions cpp/include/cudf/strings/split/split_re.hpp
@@ -162,6 +162,8 @@ std::unique_ptr<table> rsplit_re(
*
* @throw cudf::logic_error if `pattern` is empty.
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @param input A column of string elements to be split.
* @param pattern The regex pattern for delimiting characters within each string.
* @param maxsplit Maximum number of splits to perform.
@@ -212,6 +214,8 @@ std::unique_ptr<column> split_record_re(
* ["ab_cd", ""] ]
* @endcode
*
* See the @ref md_regex "Regex Features" page for details on patterns supported by this API.
*
* @throw cudf::logic_error if `pattern` is empty.
*
* @param input A column of string elements to be split.