[BUG] extract API returns empty string instead of None in a specific example #5234

Closed
galipremsagar opened this issue May 20, 2020 · 2 comments
Labels
bug Something isn't working

Comments

@galipremsagar
Contributor

Describe the bug
The extract API returns an empty string for an optional capture group that does not participate in the match, where it should return None.

Steps/Code to reproduce bug

>>> import cudf
>>> s = cudf.Series(['a1', 'b2', 'c3'])
>>> s.str.extract(r'([ab])(\d)')
      0     1
0     a     1
1     b     2
2  None  None


>>> s.str.extract(r'([ab])?(\d)')
   0  1
0  a  1
1  b  2
2     3
>>> s.str.extract(r'([ab])?(\d)')[0][2]
''


>>> s.to_pandas().str.extract(r'([ab])?(\d)')
     0  1
0    a  1
1    b  2
2  NaN  3

Expected behavior
The expected behavior is to return None instead of an empty string ''.
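As a rough reference for the expected per-row result, here is a minimal sketch using only Python's standard `re` module. The helper name `expected_extract` is hypothetical and is not part of cudf or pandas; it assumes the first match per string is what `extract` reports, and it uses None where pandas shows NaN.

```python
import re

def expected_extract(values, pattern):
    """Hypothetical reference helper: one tuple of groups per input string."""
    regex = re.compile(pattern)
    rows = []
    for value in values:
        match = regex.search(value)
        if match is None:
            # No match at all: every group is missing.
            rows.append((None,) * regex.groups)
        else:
            # Groups that did not participate in the match come back as None.
            rows.append(match.groups())
    return rows

rows = expected_extract(['a1', 'b2', 'c3'], r'([ab])?(\d)')
print(rows)  # [('a', '1'), ('b', '2'), (None, '3')]
```

The last row carries None, not '', for the first group, which is the value this issue expects from cudf.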

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: from source (branch-0.14)

Additional context
https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.str.extract.html

@galipremsagar galipremsagar added bug Something isn't working Needs Triage Need team to review and classify labels May 20, 2020
@galipremsagar galipremsagar changed the title [BUG] Empty string returned instead of None in a specific example [BUG] extract API returns empty string returned instead of None in a specific example May 20, 2020
@davidwendt
Contributor

Adding reference to similar issue #5157
The problem is the placement of the ? after the group (...)

>>> import pandas as pd
>>> ps = pd.Series(['a1', 'b2', 'c3'])
>>> ps.str.extract(r'([ab])(\d)')
     0    1
0    a    1
1    b    2
2  NaN  NaN
>>> ps.str.extract(r'([ab])?(\d)')
     0  1
0    a  1
1    b  2
2  NaN  3
>>> ps.str.extract(r'([ab]?)(\d)')
   0  1
0  a  1
1  b  2
2     3

The greedy quantifier ? is not supported on a capturing group (...) in the cudf regex implementation. That is, I don't see how the internal code can report that the group was skipped rather than matched as an empty string.
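The difference between the two placements can be seen with Python's standard `re` module alone. This is only an illustrative sketch of the regex semantics that pandas follows; it says nothing about how the cudf engine works internally, and the variable names are made up for the example.

```python
import re

# Quantifier outside the group: the whole group is optional and may be skipped.
optional_group = re.compile(r'([ab])?(\d)')
# Quantifier inside the group: the group always participates, possibly matching ''.
optional_char = re.compile(r'([ab]?)(\d)')

skipped = optional_group.search('c3').groups()
empty = optional_char.search('c3').groups()

print(skipped)  # (None, '3')
print(empty)    # ('', '3')
```

A skipped group is reported as None, while a group that matched zero characters is reported as an empty string. The cudf output in this issue corresponds to the second case even when the first pattern is used.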

@galipremsagar
Contributor Author

Thanks, @davidwendt! Closing this issue.

@galipremsagar galipremsagar removed the Needs Triage Need team to review and classify label May 28, 2020
rapids-bot bot pushed a commit that referenced this issue Oct 29, 2021
Closes #9463 
Closes #9434 

This adds a small section to the [Regex Features](https://docs.rapids.ai/api/libcudf/stable/md_regex.html) page describing how invalid regex patterns may result in undefined behavior. The list here includes current issues as well as ones opened in the past:
#3732, #8832, #5234, #4746, #3725

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Karthikeyan (https://github.com/karthikeyann)
  - Paul Taylor (https://github.com/trxcllnt)
  - MithunR (https://github.com/mythrocks)

URL: #9473