Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[BEAM-13947] Add split() and rsplit(), non-deferred column operations on categorical columns #16677
[BEAM-13947] Add split() and rsplit(), non-deferred column operations on categorical columns #16677
Changes from 2 commits
6b382a3
5f6123b
0437851
43ff9b8
20fda68
a08094c
c64e42c
92df15a
8e29388
f094db7
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If an entry in a series is
np.nan
, and then is converted to dtype CategoricalDtype, then pandas behavior is to propogate the NaN. Example:In row 1, where which the string does not get split, in order to propagate
None
into other columns, I do.replace(np.nan, value=None)
. However this makes row 2 be allNone
instead ofNaN
.Is there a way to only choose specific rows to be
NaN
and notNone
? i.e. uses.isna()
to find those indices?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah this should be doable, I'm not sure if it can be done in a one-liner though. You could define a multi-line function, something like
Making it multi-line might improve readability anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sounds good! I was playing around with very similar logic to this, and was running into issues passing indices
~s.isna()
into the original series and doing subsequent updates to the result. So thought that it wasn't possible. I guess it there was some sort of oversight of mine.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mm well it's certainly possible I'm missing something, but I think this should work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I wasn't clear. The code you have suggested works (just pushed a commit with it). My point was that I was trying to do something similar -- with logic that sort of looked like yours (though I don't remember exactly how it looked) -- but for some reason I was having issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
rsplit
API does not support regex as an argument even though its documentation does. This is probably not a bug, but rather documentation inconsistency. I can file an issue w/ pandas.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah let's file an issue with pandas about this, if you want to get a commit in pandas I think it's a valid use of your time to contribute a fix too. If not I'm sure someone will pick it up, it's a very active community.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like it was filed a while time ago, but no one has picked it up. I'll assign it to myself to do. In the meantime, I can add a subtask (under the non-deferred ticket?) to remind us to update the function api once the
rsplit
fix in pandas is released.