[FEA] Support replacing word boundaries in regexp replace in way that is compatible with Python/Java #9950

Closed
andygrove opened this issue Dec 22, 2021 · 0 comments · Fixed by #9997
Assignees
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. strings strings issues (C++ and Python)

Comments

@andygrove
Contributor

Is your feature request related to a problem? Please describe.
When replacing the pattern `\\b` (word boundary) with `X` in the string `a\nb`, Python and Java produce `XaX\nXbX`, whereas cuDF produces `XXXa\nb`.

>>> import re
>>> import cudf
>>> re.sub('\\b', 'X', 'a\nb')
'XaX\nXbX'
>>> cudf.Series(['a\nb']).str.replace('\\b', 'X', regex=True)
0    XXXa\nb
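
To see where the four `X` characters in the Python result come from, it can help to list the positions at which `\b` matches. The snippet below is a small illustration using only the standard `re` module (Python 3.7 or later is assumed, since handling of empty matches changed in that release); it does not involve cuDF.

```python
import re

text = "a\nb"

# Each \b match is zero-width, so start() == end() for every match.
positions = [m.span() for m in re.finditer(r"\b", text)]
print(positions)

# Substituting at those four positions yields the result quoted above.
replaced = re.sub(r"\b", "X", text)
print(repr(replaced))
```

Every match is an empty span, which is the case a replace routine has to handle to reproduce this output.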

Describe the solution you'd like
I would like the ability to match Python/Java behavior in this case.

Describe alternatives you've considered
None

Additional context
None

@andygrove added the feature request and Needs Triage labels on Dec 22, 2021
@andygrove changed the title from "[BUG] Support replacing word boundaries in regexp replace in way that is compatible with Python/Java" to "[FEA] Support replacing word boundaries in regexp replace in way that is compatible with Python/Java" on Dec 22, 2021
@davidwendt self-assigned this on Jan 4, 2022
@davidwendt added the bug, libcudf, and strings labels and removed the feature request label on Jan 5, 2022
rapids-bot bot pushed a commit that referenced this issue Jan 20, 2022
Closes #9950 

Fixes matching a single word-boundary (BOW) regex pattern. This pattern matches word boundaries rather than any actual characters, so the `(begin,end)` position values are equal. The replace code always expected a `begin < end` character range to replace. The logic has been updated to allow for this case.

Additional gtests have been added that include a single `\b` pattern character.
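
The zero-width case described above can be sketched in plain Python. This is only an illustration of the idea, not the libcudf implementation: `replace_spans` is a hypothetical helper, and the match positions are taken from Python's `re` module rather than from the cuDF regex engine.

```python
import re


def replace_spans(text, spans, repl):
    """Rebuild text, substituting repl for each (begin, end) span.

    Hypothetical helper for illustration. Spans are assumed to be sorted
    and non-overlapping. A span with begin == end consumes no characters,
    so repl is simply inserted at that position.
    """
    out = []
    last = 0
    for begin, end in spans:
        out.append(text[last:begin])  # copy the unmatched gap before the span
        out.append(repl)              # emit the replacement, even if begin == end
        last = end
    out.append(text[last:])           # copy whatever follows the final span
    return "".join(out)


text = "a\nb"
spans = [m.span() for m in re.finditer(r"\b", text)]
result = replace_spans(text, spans, "X")
print(repr(result))
```

A loop that skipped or mishandled spans where `begin == end` would not reproduce the Python/Java output for a lone `\b` pattern.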

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - MithunR (https://github.com/mythrocks)

URL: #9997
@bdice removed the Needs Triage label on Mar 4, 2024