Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Simplify Transpilation of $ with Extended Line Separator Support in cuDF Regex #11663

Merged

Conversation

SurajAralihalli
Copy link
Collaborator

@SurajAralihalli SurajAralihalli commented Oct 25, 2024

Resolves #11554, #7585

In cuDF, support for multiple newline characters was expanded from NEW_LINE (\n) to include the following:

  • NEXT_LINE (\u0085)
  • LINE_SEPARATOR (\u2028)
  • PARAGRAPH_SEPARATOR (\u2029)
  • CARRIAGE_RETURN (\r)
  • NEW_LINE (\n)

PR #17139 introduced this change to cuDf JNI with RegexFlag::EXT_LINE. This PR simplifies the transpilation of $ by changing the pattern from (?:\r|\u0085|\u2028|\u2029|\r\n)?$ to the simpler (?:\r\n)?$ and updates all functions to use RegexFlag::EXT_LINE wherever this transpilation occurs.

This PR also drops support for $\z because \z is not supported by cuDf. Alternatively, we could transpile $\z to $(?![\r\n\u0085\u2028\u2029]). However, cuDf doesn't support negative look ahead.

This PR also drops support for regex patterns with end-of-line anchors $ and \Z when followed by any escape sequences like \W, \B,\b etc, as they produce different results on CPU and GPU.

Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
@NVnavkumar
Copy link
Collaborator

Can we confirm some of the behavior described in compatibility.md and update accordingly?

Signed-off-by: Suraj Aralihalli <[email protected]>
Signed-off-by: Suraj Aralihalli <[email protected]>
@SurajAralihalli
Copy link
Collaborator Author

Can we confirm some of the behavior described in compatibility.md and update accordingly?

Thank you for pointing it, I found another issue that is resolved by this PR. I've updated the guide and tests to reflect this.
As part of the process we also reviewed the feasibility of solving #10641 and #10764 in this PR. Updated these issues with the status.

@SurajAralihalli SurajAralihalli marked this pull request as ready for review October 28, 2024 22:31
@SurajAralihalli
Copy link
Collaborator Author

Build

@SurajAralihalli
Copy link
Collaborator Author

Build

Copy link
Collaborator

@NVnavkumar NVnavkumar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. pending pre-merge.

@SurajAralihalli SurajAralihalli merged commit f0ae2ba into NVIDIA:branch-24.12 Oct 30, 2024
47 checks passed
@sameerz sameerz added the bug Something isn't working label Nov 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Reimplement $ transpilation using cuDF new line terminator support
3 participants