ENH: Added str.normalize to use unicodedata.normalize #10031

sinhrks · 2015-04-30T16:12:18Z

Derived from #9111. Can this be considered for v0.16.1? Otherwise I will change the milestone.

unicodedata.normalize is quite useful for standardizing multi-byte characters. I think it would be nice if StringMethods.normalize could perform this.

import pandas as pd
s = pd.Series([u'ＡＢＣＤＥ', u'１２３４５'])
s
#0    ＡＢＣＤＥ
#1    １２３４５
# dtype: object

s.str.normalize()
#0    ABCDE
#1    12345
# dtype: object
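For readers unfamiliar with the underlying call, here is a minimal standard-library sketch of what the example above relies on. The call shown above does not say which normalization form is used; the sketch assumes "NFKC", a compatibility form that folds full-width characters to ASCII as in the output shown. The helper name normalize_values is hypothetical and is not part of pandas.

```python
import unicodedata


def normalize_values(values, form="NFKC"):
    """Normalize every str in `values`; pass other objects through unchanged.

    Hypothetical helper for illustration only. The "NFKC" default is an
    assumption chosen because it reproduces the output in the example.
    """
    return [
        unicodedata.normalize(form, v) if isinstance(v, str) else v
        for v in values
    ]


# Full-width letters and digits, plus a missing value.
result = normalize_values(["ＡＢＣＤＥ", "１２３４５", None])
print(result)

# The canonical form NFC leaves full-width characters untouched.
unchanged = unicodedata.normalize("NFC", "ＡＢＣ")
print(unchanged == "ＡＢＣ")
```

The choice of form matters: only the compatibility forms (NFKC, NFKD) map full-width characters to their ASCII counterparts.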

Another point I'd like to discuss here is the condition under which Index.str can be used. Currently, inferred_type must be string. I think the preferable condition is:

The Index must be a normal Index, not a MultiIndex.
Its inferred_type should be either string, unicode or mixed.

This PR currently adds unicode, but not mixed.

pd.Index([u'a', u'B']).inferred_type
# unicode
pd.Index(['a', u'B']).inferred_type
# mixed

# If we allow "mixed" to expose .str, we should exclude the MultiIndex case.
pd.MultiIndex.from_tuples([('a', 'a'), ('a', 'b')]).inferred_type
# mixed
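The proposed condition can be sketched as a small predicate. This is only an illustration of the rule described above, not the code in this PR: the function name can_use_str_accessor and the is_multiindex flag are hypothetical, and the inferred type is passed in as a plain string rather than read from an Index.

```python
# Inferred types proposed above as acceptable for Index.str.
ALLOWED_INFERRED_TYPES = ("string", "unicode", "mixed")


def can_use_str_accessor(inferred_type, is_multiindex):
    """Return True if .str should be available under the proposed rule.

    Hypothetical helper: a MultiIndex is always rejected, even though its
    inferred_type is reported as "mixed".
    """
    if is_multiindex:
        return False
    return inferred_type in ALLOWED_INFERRED_TYPES


print(can_use_str_accessor("unicode", is_multiindex=False))
print(can_use_str_accessor("mixed", is_multiindex=True))
print(can_use_str_accessor("integer", is_multiindex=False))
```

Checking the MultiIndex case first is what keeps a tuple-valued index from slipping through once "mixed" is allowed.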

CC: @mortada

mortada · 2015-04-30T19:34:49Z

doc/source/whatsnew/v0.16.1.txt

@@ -24,6 +24,8 @@ Enhancements

 - Added ``StringMethods.capitalize()`` and ``swapcase`` which behave as the same as standard ``str`` (:issue:`9766`)
 - Added ``StringMethods`` (.str accessor) to ``Index`` (:issue:`9068`)
+- Added ``StringMethods.normalize()`` which behave as the same as standard :func:`unicodedata.normalizes` (:issue:`10031`)
+


minor typo, should be "which behaves the same as"

jorisvandenbossche · 2015-05-01T12:41:04Z

I never used this, but I can see this could be a useful string method (and I think it is OK for 0.16.1)

And good catch for the unicode index! (for series this did work as it allows all object types)
Maybe add an explicit test for this? (it is now tested in your normalize tests, but for clarity?)

jorisvandenbossche · 2015-05-01T12:42:08Z

pandas/tests/test_strings.py

+                  unistr([0xFF21, 0xFF22, 0xFF23]), # ＡＢＣ
+                  unistr([0xFF11, 0xFF12, 0xFF13]), # １２３
+                  np.nan,
+                  unistr([0xFF71, 0xFF72, 0xFF74])] # ｱｲｴ


I don't know to what extent we want to really have such unicode characters in our source files?

This is an alternative way to write a normal unicode string, such as "u'ＡＢＣ'", that works on both 2.x and 3.x. I was unable to use six.u here, because it escapes the unicode literal and changes the result.
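To make the code-point approach concrete, here is a rough Python 3 sketch. The definition of unistr used in the test file is not shown in this thread, so the version below is an assumed stand-in that simply joins code points; the "NFKC" form is likewise an assumption.

```python
import unicodedata


def unistr(codepoints):
    """Build a string from integer code points (assumed stand-in, Python 3 only)."""
    return "".join(chr(cp) for cp in codepoints)


fullwidth_abc = unistr([0xFF21, 0xFF22, 0xFF23])   # full-width A, B, C
halfwidth_kana = unistr([0xFF71, 0xFF72, 0xFF74])  # half-width katakana

# Compatibility normalization folds both to their standard-width forms.
normalized_abc = unicodedata.normalize("NFKC", fullwidth_abc)
normalized_kana = unicodedata.normalize("NFKC", halfwidth_kana)

print(normalized_abc)
print([hex(ord(ch)) for ch in normalized_kana])
```

Writing the inputs as code points keeps the source file ASCII-only apart from comments, which is the trade-off being discussed here.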

Once we remove Python 3.2 support we can finally use u....

@shoyer 0.17? see #9118, maybe we should just decide when the 'finally' will be

@sinhrks sorry to be unclear, I just meant the unicode in the comment (what the unistr forms). I know our source files are unicode (or at least this one is), but I was just wondering to what extent we should also really use such characters (e.g. for when people with older or misconfigured editors are looking at this file). But it's probably not a big deal.

sinhrks · 2015-05-05T03:14:59Z

Maybe add an explicit test for this?

Yes, I added tests for Index to confirm that it works the same as Series.

jreback · 2015-05-05T10:49:07Z

pandas/core/base.py

+            allowed_types = ('string', 'unicode', 'mixed', 'mixed-integer')
+            if self.inferred_type not in allowed_types:
+                message = ("Can only use .str accessor with string values "
+                           "(i.e. inferred_type is 'string', 'unicode' or 'mixed')")


I don't think you can accept mixed-integer, as these are in general Python integer objects which need stringification first (which we could do, but that's a separate issue).

In [10]: pd.lib.infer_dtype([1,2,'a'])
Out[10]: 'mixed-integer'

@jreback It is there to be compatible with the current Series behavior. In the above case, Series.str is applied to all the elements and leaves non-str values as NaN.

s = pd.Series([1, 2, 'a'])
s.str.len()
# 0    NaN
# 1    NaN
# 2      1
# dtype: float64
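The behavior described above (apply the string method to str elements, leave everything else as NaN) can be sketched without pandas. The helper name str_len_or_nan is hypothetical and only mirrors the element-wise rule; it is not how pandas implements Series.str.

```python
import math


def str_len_or_nan(values):
    """Return len() for str elements and NaN for anything else.

    Hypothetical helper mirroring the rule described above; results are
    floats because NaN forces a float result, as in the output shown.
    """
    return [float(len(v)) if isinstance(v, str) else math.nan for v in values]


lengths = str_len_or_nan([1, 2, "a"])
print(lengths)
```

Under this rule a mixed-integer input never raises; the integers simply produce missing values.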


jreback · 2015-05-06T10:40:29Z

@sinhrks thanks!

sinhrks added the Strings (String extension data type and string data) label Apr 30, 2015

sinhrks added this to the 0.16.1 milestone Apr 30, 2015

sinhrks force-pushed the str_normalize branch from f4c1676 to 755963e Compare April 30, 2015 16:13

mortada reviewed Apr 30, 2015

jorisvandenbossche reviewed May 1, 2015

sinhrks force-pushed the str_normalize branch 2 times, most recently from 828ac70 to 0281f6c Compare May 4, 2015 03:20

ENH: Added str.normalize to use unicodedata.normalize

84afe26

sinhrks force-pushed the str_normalize branch from 0281f6c to 84afe26 Compare May 4, 2015 04:17

jreback reviewed May 5, 2015

jreback added a commit that referenced this pull request May 6, 2015

Merge pull request #10031 from sinhrks/str_normalize

976e683

ENH: Added str.normalize to use unicodedata.normalize

jreback merged commit 976e683 into pandas-dev:master May 6, 2015

sinhrks mentioned this pull request May 6, 2015

StringMethods should have the same methods as standard str #9111

Closed

24 tasks

sinhrks deleted the str_normalize branch May 7, 2015 02:39
