Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG] Regex: $ does not match \n at end of string #9620

Closed
andygrove opened this issue Nov 5, 2021 · 3 comments · Fixed by #9715
Closed

[BUG] Regex: $ does not match \n at end of string #9620

andygrove opened this issue Nov 5, 2021 · 3 comments · Fixed by #9715
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@andygrove
Copy link
Contributor

Describe the bug
There is an edge case with multiline / EOL matching that seems to be incorrect in cuDF.

Given the pattern 2$ I would expect it to match the inputs 2 and 2\n as seen in Python:

>>> print(re.compile('2$').search("2"))
<re.Match object; span=(0, 1), match='2'>

>>> print(re.compile('2$').search("2\n"))
<re.Match object; span=(0, 1), match='2'>

However, cuDF does not match in the 2\n case.

Steps/Code to reproduce bug

I don't know how to use cuDF from the Python repl with the latest code so I have not actually tested this, but this should be the repro case.

cudf.Series(['2\n']).str.contains('2$', regex=True)

Expected behavior
$ should match EOL even if the input ends with a line terminator.

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context
None

@andygrove andygrove added bug Something isn't working Needs Triage Need team to review and classify labels Nov 5, 2021
@davidwendt davidwendt self-assigned this Nov 5, 2021
@davidwendt
Copy link
Contributor

The default behavior of contains changed in #9482 in regards to ^ and $ and the new-line character as explained in that PR. To make ^ and $ match with new-line characters instead of the whole string requires passing the MULTILINE flags parameter.

>>> import cudf
>>> import re
>>> cudf.Series(['2\n']).str.contains('2$', flags=re.MULTILINE, regex=True)
0    True
dtype: bool

The flags are defined here

enum regex_flags : uint32_t {
DEFAULT = 0, /// default
MULTILINE = 8, /// the '^' and '$' honor new-line characters
DOTALL = 16 /// the '.' matching includes new-line characters
};

@andygrove
Copy link
Contributor Author

andygrove commented Nov 8, 2021

I think I could have done a better job explaining this issue. What I am looking for is to have the default cuDF behavior (MULTILINE disabled) match Python's default behavior (also MULTILINE disabled), as described here. The key point is around how $ behaves when the string ends with \n:

By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

Does that make sense?

@davidwendt
Copy link
Contributor

Ok, got it, thanks. This is some wacky, yet documented edge case.

... By default, '^' matches only at the beginning of the string, and '$' only at the end of the string and immediately before the newline (if any) at the end of the string.

@beckernick beckernick added libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python) and removed Needs Triage Need team to review and classify labels Nov 12, 2021
rapids-bot bot pushed a commit that referenced this issue Nov 19, 2021
…#9715)

Closes #9620 

Fixes an edge case described in https://docs.python.org/3/library/re.html#re.MULTILINE
where the '$' EOL regex pattern character (without `MULTILINE` set) should match at the very end of a string and also just before the end of the string if the end of that string contains a new-line.

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Christopher Harris (https://github.com/cwharris)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Sheilah Kirui (https://github.com/skirui-source)

URL: #9715
rapids-bot bot pushed a commit that referenced this issue Dec 10, 2021
Closes #9764 

This reverts the change made in #9620 for the reasons given in #9764 (comment)

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Mike Wilson (https://github.com/hyperbolic2346)
  - Andy Grove (https://github.com/andygrove)

URL: #9774
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants