Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex #11663
Signed-off-by: Suraj Aralihalli <[email protected]>
Can we confirm some of the behavior described in compatibility.md and update accordingly?
Thank you for pointing that out. I found another issue that is resolved by this PR, and I've updated the guide and tests to reflect this.
Build
Build
LGTM. pending pre-merge.
Resolves #11554, #7585
In cuDF, support for multiple newline characters was expanded from NEW_LINE (`\n`) to include the following:

- `\u0085`
- `\u2028`
- `\u2029`
- `\r`
- `\n`

PR #17139 introduced this change to cuDF JNI with `RegexFlag::EXT_LINE`. This PR simplifies the transpilation of `$` by changing the pattern from `(?:\r|\u0085|\u2028|\u2029|\r\n)?$` to the simpler `(?:\r\n)?$`, and updates all functions to use `RegexFlag::EXT_LINE` wherever this transpilation occurs.

This PR also drops support for `$\z` because `\z` is not supported by cuDF. Alternatively, we could transpile `$\z` to `$(?![\r\n\u0085\u2028\u2029])`. However, cuDF doesn't support negative lookahead.

This PR also drops support for regex patterns with the end-of-line anchors `$` and `\Z` when they are followed by an escape sequence such as `\W`, `\B`, or `\b`, as these produce different results on CPU and GPU.
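For reference, the CPU-side behavior that the transpiled `$` has to reproduce can be checked directly against `java.util.regex`. The sketch below is only an illustration and does not call cuDF or the plugin's transpiler. The class name `DollarAnchorDemo` and the helpers `matchesAtEnd` and `escape` are hypothetical names introduced here. It assumes the default flags (no `MULTILINE`, no `UNIX_LINES`), under which `$` matches at the end of input or just before a single final line terminator.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: shows how Java's default `$` treats trailing line terminators.
class DollarAnchorDemo {

    // True when the pattern a$ is found in the input with default flags.
    static boolean matchesAtEnd(String input) {
        return Pattern.compile("a$").matcher(input).find();
    }

    // Renders control and non-ASCII characters as visible escapes for printing.
    static String escape(String s) {
        StringBuilder sb = new StringBuilder();
        for (char c : s.toCharArray()) {
            if (c < 0x20 || c > 0x7e) {
                sb.append(String.format("\\u%04x", (int) c));
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] inputs = {
            "a", "a\n", "a\r", "a\r\n", "a\u0085", "a\u2028", "a\u2029", // one trailing terminator
            "a\n\n", "ab"                                                // not expected to match
        };
        for (String input : inputs) {
            System.out.println(escape(input) + " -> " + matchesAtEnd(input));
        }
    }
}
```

Of the terminators accepted here, only `\r\n` spans two characters, which is consistent with the simplified pattern keeping just the optional `(?:\r\n)` group once the single-character terminators are handled by `RegexFlag::EXT_LINE`.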