Make string methods return a Series with a useful Index #12814
@shwina was this PR intentionally marked as draft?
@vyasr - should be ready for review now
One question about null behavior but I'm approving and will let you resolve accordingly.
dtype: object
"""
result_col = libstrings.character_tokenize(self._column)
if isinstance(self._parent, cudf.Series):
    return cudf.Series(result_col, name=self._parent.name)
lengths = self.len().fillna(0)
Is the fillna(0) matching pandas behavior? It looks like the test constructs the expected output manually in the same way, so it's not clear whether the result matches pandas or whether the test just reproduces what this code outputs.
There is no Pandas behaviour (Pandas doesn't have a tokenize).
The fillna(0) is there because nulls are ignored during tokenization, corresponding to 0 rows in the result (and thus, length = 0).
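To make the null handling concrete, here is a minimal pure-Python sketch of the idea, not the cuDF implementation. The helper name `character_tokenize_with_index` and the list-based inputs are assumptions made for illustration; `None` stands in for a null row. Each input label is repeated once per character produced, and a null contributes a length of 0, so its label does not appear in the result.

```python
def character_tokenize_with_index(values, index):
    """Hypothetical helper: split each string into characters and repeat
    each input label once per character it produced."""
    # Nulls (None) are skipped by tokenization, so they count as length 0.
    lengths = [0 if v is None else len(v) for v in values]
    # Flattened characters of all non-null strings, in input order.
    tokens = [ch for v in values if v is not None for ch in v]
    # Repeat each label by the matching length; length 0 drops the label.
    labels = [label for label, n in zip(index, lengths) for _ in range(n)]
    return labels, tokens


labels, tokens = character_tokenize_with_index(["ab", None, "c"], ["x", "y", "z"])
print(labels)
print(tokens)
```

The label "y" belongs to the null row, so it is absent from `labels`, which is the effect the fillna(0) is meant to achieve.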
/merge
Description
Closes #12806
Many string methods like character_ngrams currently return a Series with the default index (RangeIndex). This PR makes it so that the index of the result corresponds to the index of the input.

More specifically, this PR changes the index of the result of the following string methods:
character_ngrams
tokenize
character_tokenize
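The difference between the two index choices can be illustrated with a small pure-Python sketch. This is not cuDF code: `tokenize_with_both_indexes` is a hypothetical helper, whitespace splitting stands in for the real tokenizer, and plain lists stand in for a Series and its index.

```python
def tokenize_with_both_indexes(values, index):
    """Hypothetical helper: whitespace-tokenize each string and return the
    tokens with a default positional index and an input-aligned index."""
    tokens, default_index, aligned_index = [], [], []
    for label, value in zip(index, values):
        for token in value.split():
            # Default behavior: position in the flattened output (0, 1, 2, ...).
            default_index.append(len(tokens))
            # Behavior described in this PR: label of the originating row.
            aligned_index.append(label)
            tokens.append(token)
    return tokens, default_index, aligned_index


tokens, default_index, aligned_index = tokenize_with_both_indexes(
    ["the quick", "fox"], ["a", "b"]
)
print(tokens)
print(default_index)
print(aligned_index)
```

With the aligned index, every token can be traced back to the input row that produced it, which the positional index alone does not allow.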