ENH: Added str.normalize to use unicodedata.normalize #10031
@@ -24,6 +24,8 @@ Enhancements

- Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`)
- Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`)
- Added ``StringMethods.normalize()`` which behave as the same as standard :func:`unicodedata.normalizes` (:issue:`10031`)
minor typo, should be "which behaves the same as"
I never used this, but I can see this could be a useful string method (and I think it is OK for 0.16.1). And good catch for the unicode index! (For Series this did work, as it allows all object types.)
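For readers who have not used it either, here is a minimal sketch of what the new method is meant to do, assuming a pandas version that ships `Series.str.normalize` and a Python 3 interpreter. The sample values mirror the full-width and half-width test strings from this PR; the variable names are only illustrative.

```python
import numpy as np
import pandas as pd

# Full-width "ABC" and "123", a missing value, and half-width katakana,
# written as escapes so the file itself stays ASCII.
s = pd.Series(["\uff21\uff22\uff23", "\uff11\uff12\uff13", np.nan, "\uff71\uff72\uff74"])

# Each string element is passed through unicodedata.normalize with the
# given form; the missing value stays missing.
normalized = s.str.normalize("NFKC")
print(normalized.tolist())
```

NFKC is used here because it folds the compatibility (full-width and half-width) forms into their ordinary equivalents; the other forms accepted by `unicodedata.normalize` can be passed the same way.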
unistr([0xFF21, 0xFF22, 0xFF23]),  # ABC
unistr([0xFF11, 0xFF12, 0xFF13]),  # 123
np.nan,
unistr([0xFF71, 0xFF72, 0xFF74])]  # アイエ
I don't know to what extent we want to really have such unicode characters in our source files?
This is an alternative way to express a normal unicode string, such as u'ABC', that works on both 2.x and 3.x. I was unable to use six.u here, because it escapes the unicode literal and changes the result.
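To illustrate the idea of building the test strings from code points, here is a rough Python 3 only sketch. `from_codepoints` is a hypothetical stand-in for the `unistr` helper used in the tests, not its actual implementation, and it does not attempt the 2.x compatibility discussed above.

```python
import unicodedata


def from_codepoints(codepoints):
    """Hypothetical stand-in for the PR's unistr helper (Python 3 only)."""
    return "".join(chr(cp) for cp in codepoints)


# Full-width Latin letters, built without any non-ASCII source characters.
fullwidth_abc = from_codepoints([0xFF21, 0xFF22, 0xFF23])
# Half-width katakana.
halfwidth_kana = from_codepoints([0xFF71, 0xFF72, 0xFF74])

# NFKC maps both to their ordinary-width equivalents.
ascii_abc = unicodedata.normalize("NFKC", fullwidth_abc)
fullwidth_kana = unicodedata.normalize("NFKC", halfwidth_kana)
print(ascii_abc, fullwidth_kana)
```

Building the inputs this way keeps the raw characters out of the source; only the trailing comments in the test data contain them.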
Once we remove Python 3.2 support we can finally use u'....' literals.
@shoyer 0.17? see #9118, maybe we should just decide when the 'finally' will be
@sinhrks sorry for being unclear, I just meant the unicode in the comment (what the unistr forms). I know our source files are unicode (or at least this one is), but I was just wondering to what extent we should really use such characters (e.g. for people with older or misconfigured editors looking at this file). But probably not a big deal.
828ac70 to 0281f6c
Yes, added tests for
allowed_types = ('string', 'unicode', 'mixed', 'mixed-integer')
if self.inferred_type not in allowed_types:
    message = ("Can only use .str accessor with string values "
               "(i.e. inferred_type is 'string', 'unicode' or 'mixed')")
I don't think you can accept mixed-integer, as these are in general Python integer objects which need stringification first (which we could do, but that's a separate issue).
In [10]: pd.lib.infer_dtype([1,2,'a'])
Out[10]: 'mixed-integer'
@jreback This is for compatibility with the current Series behavior. In the above case, Series.str is applied to all the elements and leaves non-str values as NaN.
s = pd.Series([1, 2, 'a'])
s.str.len()
# 0 NaN
# 1 NaN
# 2 1
# dtype: float64
i c...ok
ENH: Added str.normalize to use unicodedata.normalize
@sinhrks thanks!
Derived from #9111. Can this be considered in v0.16.1? Otherwise will change the milestone.
Another point I'd like to discuss here is the condition under which Index.str can be used. Currently, inferred_type must be string. I think the preferable condition is:

- Index must be a normal Index, not MultiIndex.
- inferred_type should be either string, unicode or mixed.

This PR adds unicode currently, not mixed.

CC: @mortada
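The proposed condition could be sketched roughly as follows. `can_use_str_accessor` and `ALLOWED_TYPES` are hypothetical names for illustration, not code from this PR, and the sketch uses the public `pandas.api.types.infer_dtype` from current pandas rather than the `inferred_type` attribute discussed here. On Python 3, `infer_dtype` reports text as `'string'`, so the `'unicode'` entry is kept only to mirror the list above.

```python
import pandas as pd
from pandas.api.types import infer_dtype

# Inferred types proposed in the description (hypothetical constant name).
ALLOWED_TYPES = ("string", "unicode", "mixed")


def can_use_str_accessor(index):
    """Hypothetical check mirroring the proposed condition for Index.str."""
    # A MultiIndex is rejected outright, whatever its levels contain.
    if isinstance(index, pd.MultiIndex):
        return False
    # Otherwise the inferred type of the values decides.
    return infer_dtype(index, skipna=True) in ALLOWED_TYPES


print(can_use_str_accessor(pd.Index(["a", "b"])))
print(can_use_str_accessor(pd.Index([1, 2])))
print(can_use_str_accessor(pd.MultiIndex.from_tuples([("a", "b")])))
```

Whether `'mixed-integer'` should also be accepted is the open question in the review thread; this sketch follows the narrower list from the description.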