[BUG] [Regexp] Line anchor '$' incorrect matching of unicode line terminators #7585
Comments
The handling of line terminators was documented in the compatibility guide in PR #7211.
To add more context here, cuDF has the correct behavior when passed Possible solutions to resolve this issue:
If we remove the transcoding, then the test fails on a different input. In this case, Java
Here are some examples of mismatches when we remove the transcoding and use the same pattern
One option is to do a replace of these with a single
And then apply
Signed-off-by: Suraj Aralihalli <[email protected]>
Resolved by #11663
Describe the bug
Line anchor `$` will incorrectly match any of the unicode characters `\u0085`, `\u2028`, or `\u2029` followed by another line-terminator, such as `\n`. For example, the pattern `TEST$` will match `TEST\u0085\n` on the GPU but not on the CPU.
Steps/Code to reproduce bug
See new unit tests added in #7211
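For a quick look at the CPU half of the comparison, here is a minimal sketch that assumes the CPU path uses plain `java.util.regex` with default flags (no `MULTILINE`, no `UNIX_LINES`). The class name `DollarAnchorDemo` and the helper `found` are hypothetical and not taken from the project's tests. It does not exercise the GPU path at all.

```java
import java.util.regex.Pattern;

public class DollarAnchorDemo {
    // Hypothetical helper: true if `regex` is found anywhere in `input`,
    // using java.util.regex with default flags (an assumption about the CPU path).
    public static boolean found(String regex, String input) {
        return Pattern.compile(regex).matcher(input).find();
    }

    public static void main(String[] args) {
        // Inputs built from the line terminators named in this issue.
        String[] inputs = {
            "TEST\n",          // a single trailing line terminator
            "TEST\u0085\n",    // the case reported in this issue
            "TEST\u2028\n",
            "TEST\u2029\n"
        };
        for (String input : inputs) {
            // Print the code points after "TEST" so invisible characters are readable.
            StringBuilder tail = new StringBuilder();
            for (int i = 4; i < input.length(); i++) {
                tail.append(String.format(" U+%04X", (int) input.charAt(i)));
            }
            System.out.println("TEST +" + tail + " -> " + found("TEST$", input));
        }
    }
}
```

On the CPU, the reported input `TEST\u0085\n` should print `false`, which is the behavior the GPU is expected to match.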
Expected behavior
GPU and CPU should match
Environment details (please complete the following information)
N/A
Additional context