PERF: speed up certain string operations #10081

Closed
jreback opened this issue May 8, 2015 · 11 comments · Fixed by #10090
Labels: Performance, Strings

jreback (Contributor) commented May 8, 2015

From SO:

On big enough strings this might be quite useful for a number of string ops.

import pandas as pd
import random
import numpy as np
from StringIO import StringIO  # Python 2; on Python 3 this is io.StringIO

def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))

df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
# 1 loops, best of 3: 3.06 s per loop

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.')
# 10 loops, best of 3: 46.4 ms per loop
cgevans (Contributor) commented May 8, 2015

Interestingly, my initial thought that this was slow because pandas' split just iterates through in Python was wrong: the obvious, pure-Python [ x.split('.') for x in list(df['ip']) ] is actually 400 times faster than going through pandas. So something else is causing the problem.

It seems like vast amounts of time are spent in maybe_convert_objects and _possibly_cast_to_datetime, amongst other things. I'll have to look into it further.
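For anyone who wants to reproduce the profile, IPython's %prun is enough to surface those hot spots. A sketch of the approach (the output, and the exact hot spots, will vary by pandas version):

%prun -s cumulative -l 10 df.ip.str.split('.', return_type='frame')
# sort by cumulative time and look for maybe_convert_objects and
# _possibly_cast_to_datetime near the top of the listing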

jreback (Contributor, Author) commented May 8, 2015

It seems that this operation creates a Series and then passes it to the DataFrame constructor. There's no need to do this; the list-like operation should effectively be this:

In [33]: %timeit pd.DataFrame([ x.split('.') for x in list(df['ip']) ])
100 loops, best of 3: 13.4 ms per loop

jreback (Contributor, Author) commented May 8, 2015

A simple change fixes this:

In [8]: %timeit df['ip'].str.split('.',return_type='frame')
100 loops, best of 3: 20 ms per loop

In [9]: quit()
[jreback-~/pandas] git diff
diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 6e603f6..b8ea27f 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -723,7 +723,7 @@ def str_split(arr, pat=None, n=None, return_type='series'):
             regex = re.compile(pat)
             f = lambda x: regex.split(x, maxsplit=n)
     if return_type == 'frame':
-        res = DataFrame((Series(x) for x in _na_map(f, arr)), index=arr.index)
+        res = DataFrame([x for x in _na_map(f, arr)], index=arr.index)
     else:
         res = _na_map(f, arr)
     return res

cgevans (Contributor) commented May 8, 2015

It seems like avoiding the list creation could be done as well, though it may not make much difference, and I'm not quite sure how Pandas handles iterators internally:

diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 6e603f6..8fb7f10 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -723,7 +723,7 @@ def str_split(arr, pat=None, n=None, return_type='series'):
             regex = re.compile(pat)
             f = lambda x: regex.split(x, maxsplit=n)
     if return_type == 'frame':
-        res = DataFrame((Series(x) for x in _na_map(f, arr)), index=arr.index)
+        res = DataFrame((x for x in _na_map(f, arr)), index=arr.index)
     else:
         res = _na_map(f, arr)
     return res

Ah; it just converts to a list. Oh well.

sinhrks (Member) commented May 8, 2015

Because this changes the current behavior (non-str values are all converted to str), it may be an option to add a fast path for all-string values. On my environment, numpy's string method is faster than the above workaround.

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None,index=None)),sep='.')
# 10 loops, best of 3: 80.9 ms per loop

%timeit pd.DataFrame(np.core.defchararray.split(df['ip'].values.astype(str), '.'))
# 10 loops, best of 3: 29.9 ms per loop

cgevans (Contributor) commented May 8, 2015

@sinhrks: the CSV write/read method is a horrible hack. I don't think jreback's legitimate solution changes current behavior, and it is significantly faster than the hack, likely in line with numpy's performance.

@jreback: while I'd be happy to do a PR for this if necessary, I assume you have it dealt with?

jreback (Contributor, Author) commented May 8, 2015

pull requests are welcome on this

sinhrks (Member) commented May 9, 2015

@cgevans Thanks for your cooperation :) What I meant by current behavior is:

s = pd.Series([1.1, '2.2'])

# current behavior (non-strings are left as NaN)
s.str.split('.', expand=True)
#      0    1
# 0  NaN  NaN
# 1    2    2

# numpy (non-strings are forced to be converted)
pd.DataFrame(list(np.core.defchararray.split(s.values.astype(str), '.')))
#    0  1
# 0  1  1
# 1  2  2

# can't use the numpy method without changing the dtype
pd.DataFrame(list(np.core.defchararray.split(s.values, '.')))
# TypeError: string operation on non-string array

I think the numpy funcs can be used when the values are all string or unicode (probably the common case). One idea is to use pandas.lib.is_string_array for the check and use the numpy logic when possible.
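A minimal sketch of that idea, assuming pandas.lib.is_string_array takes an object ndarray and returns True only when every element is a string (the function below is illustrative, not a proposed patch):

import numpy as np
import pandas.lib as lib

def split_with_fast_path(values, pat):
    # values: object-dtype ndarray pulled from the Series
    if lib.is_string_array(values):
        # all-string input: use numpy's vectorized split
        return np.core.defchararray.split(values.astype(str), pat)
    # mixed input (e.g. NaN present): element-wise split, preserving NaN
    out = np.empty(len(values), dtype=object)
    for i, x in enumerate(values):
        # basestring covers both str and unicode on Python 2
        out[i] = x.split(pat) if isinstance(x, basestring) else np.nan
    return out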

cgevans (Contributor) commented May 9, 2015

@sinhrks, I think what you're showing is the behavior in the branches for 10085 / 9847, not the current pydata/master. I don't have those branches, and am not sure where that work stands right now. But I think what you keep referring to as changing behavior is the workaround in the first post here, which is not what I'm discussing, and not what jreback is discussing. I'll make a PR momentarily.

With that said, there is a NaN issue that I'm working on addressing, but it's not quite the same.

cgevans (Contributor) commented May 24, 2015

Looking into this further, the problem is not necessarily the string operations themselves, but oddness and seeming inconsistencies with DataFrame construction.

In the current code, str_split itself takes a Series, splits each string into a list of strings, and returns an ndarray of dtype object. If there are NaNs in the input data, they stay as NaNs and are not lists.

If you take this array of list objects (and potentially NaNs) and pass it to pd.DataFrame, it outputs a frame with one column containing the objects; no expansion takes place.

If you convert the array to a list (of list objects) and pass it to pd.DataFrame, it outputs a frame with multiple columns containing the values in the list objects. Shorter lists are padded with Nones. If there are any NaN values, the constructor fails, because it assumes every element is list-like with a length.

If you instead convert the array to a list of Series objects and pass it to pd.DataFrame, it outputs a frame with multiple columns containing the values of the Series objects. Shorter lists are padded with NaNs, not Nones, and NaN values are expanded to rows of all NaNs.

This leads to a few questions:

  • Is having None for missing values from a list of lists, and NaN for missing values from a list of Series, the desired behavior, or should the DataFrame constructor only use one of them?
  • Is there a way to handle NaN values when constructing a DataFrame from a list of lists?
  • While I can somewhat understand the differences in behavior, do we want all these slightly different input types containing the same data to lead to different DataFrames?

Here is an example:

In [1]: import pandas as pd

In [2]: import pandas.core.strings as st

In [3]: import numpy as np

In [4]: s4 = pd.Series(['asdf.asdf','asdf',np.nan,'asdf.asdf.asdf','asdf.asdf.asdf'])

In [5]: s4.str.split('.')
Out[5]:
0          [asdf, asdf]
1                [asdf]
2                   NaN
3    [asdf, asdf, asdf]
4    [asdf, asdf, asdf]
dtype: object

In [6]: s4.str.split('.',expand=True)
Out[6]:
      0     1     2
0  asdf  asdf   NaN
1  asdf   NaN   NaN
2   NaN   NaN   NaN
3  asdf  asdf  asdf
4  asdf  asdf  asdf

In [7]: st.str_split(s4,'.')
Out[7]:
array([['asdf', 'asdf'], ['asdf'], nan, ['asdf', 'asdf', 'asdf'],
       ['asdf', 'asdf', 'asdf']], dtype=object)

In [8]: pd.DataFrame( st.str_split(s4,'.') )
Out[8]:
                    0
0        [asdf, asdf]
1              [asdf]
2                 NaN
3  [asdf, asdf, asdf]
4  [asdf, asdf, asdf]

In [9]: pd.DataFrame( list( st.str_split(s4,'.') ) )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-0aa1ea9b709b> in <module>()
----> 1 pd.DataFrame( list( st.str_split(s4,'.') ) )

/Users/cge/Dev/pandas/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    249             if len(data) > 0:
    250                 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
--> 251                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    252                     columns = _ensure_index(columns)
    253

/Users/cge/Dev/pandas/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   4907     if isinstance(data[0], (list, tuple)):
   4908         return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 4909                                dtype=dtype)
   4910     elif isinstance(data[0], collections.Mapping):
   4911         return _list_of_dict_to_arrays(data, columns,

/Users/cge/Dev/pandas/pandas/core/frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
   4988     else:
   4989         # list of lists
-> 4990         content = list(lib.to_object_array(data).T)
   4991     return _convert_object_array(content, columns, dtype=dtype,
   4992                                  coerce_float=coerce_float)

/Users/cge/Dev/pandas/pandas/src/inference.pyx in pandas.lib.to_object_array (pandas/lib.c:58812)()
   1090     k = 0
   1091     for i from 0 <= i < n:
-> 1092         tmp = len(rows[i])
   1093         if tmp > k:
   1094             k = tmp

TypeError: object of type 'float' has no len()

In [10]: s4[2] = 'asdf'

In [11]: pd.DataFrame( list( st.str_split(s4,'.') ) )
Out[11]:
      0     1     2
0  asdf  asdf  None
1  asdf  None  None
2  asdf  None  None
3  asdf  asdf  asdf
4  asdf  asdf  asdf

In [12]: pd.DataFrame( [pd.Series(x) for x in st.str_split(s4,'.')] )
Out[12]:
      0     1     2
0  asdf  asdf   NaN
1  asdf   NaN   NaN
2  asdf   NaN   NaN
3  asdf  asdf  asdf
4  asdf  asdf  asdf

In [13]: s4[2] = np.nan

In [14]: pd.DataFrame( [pd.Series(x) for x in st.str_split(s4,'.')] )
Out[14]:
      0     1     2
0  asdf  asdf   NaN
1  asdf   NaN   NaN
2   NaN   NaN   NaN
3  asdf  asdf  asdf
4  asdf  asdf  asdf

jreback (Contributor, Author) commented May 26, 2015

np.nan is the marker for missing values. None is accepted (and converted), but it is somewhat ambiguous, as it can also be a 'valid' element. A list-of-lists needs specific conversions, e.g. to check whether a given element is actually a Series.
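For illustration (a minimal example of that conversion, written for this note rather than taken from the thread; display formatting is approximate):

In [1]: pd.Series([1.0, None])    # None is converted to NaN in a float column
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: pd.Series(['a', None])    # object dtype: None survives as a 'valid' element
Out[2]:
0       a
1    None
dtype: object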
