Make string methods return a Series with a useful Index #12814

Merged
merged 6 commits into from
Mar 9, 2023

Conversation

shwina
Contributor

@shwina shwina commented Feb 21, 2023

Description

Closes #12806

Many string methods, such as character_ngrams, currently return a Series with the default index (RangeIndex). This PR makes the index of the result correspond to the index of the input.

More specifically, this PR changes the index of the result of the following string methods:

  • character_ngrams
  • tokenize
  • character_tokenize
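The index change described above can be illustrated with a small plain-Python model. This is only a sketch: `tokenize_model` is a hypothetical helper written for illustration (it is not the cuDF implementation), it assumes whitespace-delimited tokens, and it uses plain lists in place of a Series and its index.

```python
def tokenize_model(index, values, keep_index=True):
    """Plain-Python model of tokenizing labelled rows on whitespace.

    Hypothetical helper for illustration only; not cuDF code.
    Returns (result_index, tokens).
    """
    result_index, tokens = [], []
    for label, text in zip(index, values):
        for token in text.split():
            # Each token is labelled with the index of the row it came from.
            result_index.append(label)
            tokens.append(token)
    if not keep_index:
        # Models the behaviour before this PR: a default 0..n-1 index.
        result_index = list(range(len(tokens)))
    return result_index, tokens


index = [10, 20]
values = ["hello world", "goodbye"]

old_index, tokens = tokenize_model(index, values, keep_index=False)
new_index, _ = tokenize_model(index, values, keep_index=True)

print(tokens)     # ['hello', 'world', 'goodbye']
print(old_index)  # [0, 1, 2]
print(new_index)  # [10, 10, 20]
```

With the input index preserved, each output row can be traced back to the input row that produced it, which the default index does not allow.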

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@shwina shwina requested a review from a team as a code owner February 21, 2023 18:06
@github-actions github-actions bot added the Python (Affects Python cuDF API) label Feb 21, 2023
@shwina shwina marked this pull request as draft February 21, 2023 18:06
@vyasr
Contributor

vyasr commented Feb 22, 2023

@shwina was this PR intentionally marked as draft?

@shwina shwina marked this pull request as ready for review March 7, 2023 15:57
@shwina shwina added the bug (Something isn't working) and non-breaking (Non-breaking change) labels Mar 7, 2023
@shwina
Contributor Author

shwina commented Mar 7, 2023

@vyasr - should be ready for review now

Contributor

@vyasr vyasr left a comment

One question about null behavior but I'm approving and will let you resolve accordingly.

         dtype: object
         """
         result_col = libstrings.character_tokenize(self._column)
         if isinstance(self._parent, cudf.Series):
-            return cudf.Series(result_col, name=self._parent.name)
+            lengths = self.len().fillna(0)
Contributor

Does the fillna(0) match pandas behavior? The test appears to construct the expected output manually, so it's not clear whether the result matches pandas or the test simply mirrors what this code outputs.

Contributor Author

There is no Pandas behaviour (Pandas doesn't have a tokenize).

The fillna(0) is there because nulls are ignored during tokenization, corresponding to 0 rows in the result (and thus, length = 0).
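A rough plain-Python model of the null handling described in this reply is sketched below. It assumes that the filled lengths are used to repeat each input index label once per output character; that step is not shown in the snippet under review, so treat it as an assumption. `character_tokenize_model` is a hypothetical name, with lists and `None` standing in for a Series and nulls.

```python
def character_tokenize_model(index, values):
    """Plain-Python model of character tokenization with index propagation.

    Hypothetical helper for illustration only; not cuDF code.
    Returns (result_index, chars).
    """
    # Per-row character counts; a null row has a null length (like .len()).
    lengths = [None if s is None else len(s) for s in values]
    # Null lengths become 0, mirroring the fillna(0) in the reviewed line.
    filled = [0 if n is None else n for n in lengths]
    # Assumption: each index label is repeated once per character in its row.
    result_index = [label for label, n in zip(index, filled) for _ in range(n)]
    # Null rows are skipped, so they contribute no output rows.
    chars = [ch for s in values if s is not None for ch in s]
    return result_index, chars


idx, chars = character_tokenize_model(["x", "y", "z"], ["ab", None, "c"])
print(idx)    # ['x', 'x', 'z']
print(chars)  # ['a', 'b', 'c']
```

Filling null lengths with 0 keeps the repeated index the same length as the tokenized output; the label "y" never appears because its row yields no characters.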

@shwina
Contributor Author

shwina commented Mar 9, 2023

/merge

@rapids-bot rapids-bot bot merged commit 02d3751 into rapidsai:branch-23.04 Mar 9, 2023
@galipremsagar galipremsagar added the breaking (Breaking change) label Mar 10, 2023
Labels
  • breaking: Breaking change
  • bug: Something isn't working
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[BUG] Series.str.character_ngrams(as_list=True) resets index when it shouldn't
3 participants