ENH: Added str.normalize to use unicodedata.normalize #10031

Merged
merged 1 commit into from
May 6, 2015

Conversation

sinhrks
Member

@sinhrks sinhrks commented Apr 30, 2015

Derived from #9111. Can this be considered for v0.16.1? Otherwise I will change the milestone.

unicodedata.normalize is quite useful for standardizing multi-byte characters. I think it would be nice if StringMethods.normalize could do this.

import pandas as pd
s = pd.Series([u'ＡＢＣＤＥ', u'１２３４５'])
s
#0    ＡＢＣＤＥ
#1    １２３４５
# dtype: object

s.str.normalize()
#0    ABCDE
#1    12345
# dtype: object
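
The effect described above can be reproduced with the standard library alone. This is a minimal sketch, assuming the `NFKC` form (compatibility composition) is the one wanted; the strings are built from code points so that the snippet itself stays ASCII-only.

```python
import unicodedata

# Full-width forms built from code points so the source stays ASCII-only.
fullwidth = [
    ''.join(chr(c) for c in range(0xFF21, 0xFF26)),  # full-width A to E
    ''.join(chr(c) for c in range(0xFF11, 0xFF16)),  # full-width 1 to 5
]

# NFKC folds the full-width compatibility characters to their ASCII forms.
normalized = [unicodedata.normalize('NFKC', s) for s in fullwidth]
print(normalized)  # ['ABCDE', '12345']
```

In released pandas the accessor method takes the form name as its argument, for example `s.str.normalize('NFKC')`.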

Another point I'd like to discuss here is the condition under which Index.str can be used. Currently, inferred_type must be string. I think the preferable condition is:

  • The Index must be a normal Index, not a MultiIndex.
  • Its inferred_type should be either string, unicode or mixed.

This PR currently adds unicode, but not mixed.

pd.Index([u'a', u'B']).inferred_type
# unicode
pd.Index(['a', u'B']).inferred_type
# mixed

# if we allow "mixed" to expose .str, we should exclude the MultiIndex case.
pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b')]).inferred_type
# mixed
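
The proposed condition can be written as a small predicate. This is only a sketch of the rule as stated in this comment: `can_use_str_accessor`, its parameters and `PROPOSED_TYPES` are hypothetical names for illustration, not pandas API, and the actual check in the PR lives inside Index.

```python
# Hypothetical helper illustrating the proposed rule; not part of pandas.
PROPOSED_TYPES = ('string', 'unicode', 'mixed')


def can_use_str_accessor(inferred_type, is_multiindex):
    """Return True when .str would be allowed under the proposed rule."""
    if is_multiindex:
        # A MultiIndex also reports 'mixed', so it has to be excluded first.
        return False
    return inferred_type in PROPOSED_TYPES


print(can_use_str_accessor('unicode', False))  # True
print(can_use_str_accessor('mixed', True))     # False
```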

CC: @mortada

@sinhrks sinhrks added the Strings String extension data type and string data label Apr 30, 2015
@sinhrks sinhrks added this to the 0.16.1 milestone Apr 30, 2015
@@ -24,6 +24,8 @@ Enhancements

- Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`)
- Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`)
- Added ``StringMethods.normalize()`` which behave as the same as standard :func:`unicodedata.normalizes` (:issue:`10031`)

Contributor

minor typo, should be "which behaves the same as"

@jorisvandenbossche
Member

I never used this, but I can see this could be a useful string method (and I think it is OK for 0.16.1)

And good catch for the unicode index! (for series this did work as it allows all object types)
Maybe add an explicit test for this? (it is now tested in your normalize tests, but for clarity?)

unistr([0xFF21, 0xFF22, 0xFF23]), # ＡＢＣ
unistr([0xFF11, 0xFF12, 0xFF13]), # １２３
np.nan,
unistr([0xFF71, 0xFF72, 0xFF74])] # ｱｲｴ
Member

I don't know to what extent we want to really have such unicode characters in our source files?

Member Author

This is an alternative way to write a normal unicode string, such as "u'ABC'", that works on both 2.x and 3.x. I can't use six.u here, because it escapes the unicode literal and changes the result.
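
For readers unfamiliar with the idea, a helper like `unistr` can be sketched as follows. This is an assumption-laden reconstruction for Python 3 only, where `chr` accepts any code point; the helper in the PR has to run on 2.x as well, so its body presumably differs.

```python
import unicodedata


def unistr(codes):
    """Build a string from a list of Unicode code points (Python 3 sketch)."""
    return ''.join(chr(c) for c in codes)


fullwidth_abc = unistr([0xFF21, 0xFF22, 0xFF23])

# The full-width string is distinct from ASCII 'ABC' until it is normalized.
print(fullwidth_abc == 'ABC')                                  # False
print(unicodedata.normalize('NFKC', fullwidth_abc) == 'ABC')   # True
```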

Member

Once we remove Python 3.2 support we can finally use u....

Member

@shoyer 0.17? see #9118, maybe we should just decide when the 'finally' will be

@sinhrks sorry to be unclear, I just meant the unicode in the comment (what the unistr call produces). I know our source files are unicode (or at least this one is), but I was just wondering to what extent we should also really use such characters (e.g. for people looking at this file with older or misconfigured editors). But probably not a big deal.

@sinhrks sinhrks force-pushed the str_normalize branch 2 times, most recently from 828ac70 to 0281f6c Compare May 4, 2015 03:20
@sinhrks
Member Author

sinhrks commented May 5, 2015

Maybe add an explicit test for this?

Yes, I added tests for Index to confirm that it works the same as Series.

allowed_types = ('string', 'unicode', 'mixed', 'mixed-integer')
if self.inferred_type not in allowed_types:
    message = ("Can only use .str accessor with string values "
               "(i.e. inferred_type is 'string', 'unicode' or 'mixed')")
Contributor

I don't think you can accept mixed-integer, as these are in general python integer objects which need stringification first (which we could do, but that's a separate issue).

In [10]: pd.lib.infer_dtype([1,2,'a'])
Out[10]: 'mixed-integer'

Member Author

@jreback It is there to be compatible with the current Series behavior. In the above case, Series.str is applied to all the elements and leaves non-str values as NaN.

s = pd.Series([1, 2, 'a'])
s.str.len()
# 0   NaN
# 1   NaN
# 2     1
# dtype: float64
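
The behavior shown above amounts to applying the string operation element by element and substituting NaN wherever the element is not a string. A rough standard-library sketch of that idea follows; `str_len_or_nan` is a hypothetical name and this is not how pandas implements it internally.

```python
import math


def str_len_or_nan(values):
    """Apply len() to str elements and leave every other element as NaN."""
    return [float(len(v)) if isinstance(v, str) else math.nan for v in values]


result = str_len_or_nan([1, 2, 'a'])
print(result)  # [nan, nan, 1.0]
```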

Contributor

i c...ok

jreback added a commit that referenced this pull request May 6, 2015
ENH: Added str.normalize to use unicodedata.normalize
@jreback jreback merged commit 976e683 into pandas-dev:master May 6, 2015
@jreback
Contributor

jreback commented May 6, 2015

@sinhrks thanks!

@sinhrks sinhrks deleted the str_normalize branch May 7, 2015 02:39