PERF: speed up certain string operations #10081

Closed
jreback opened this issue May 8, 2015 · 11 comments · Fixed by #10090
Labels: Performance, Strings

jreback (Contributor) commented May 8, 2015

From SO:

On big enough strings this might be quite useful for a number of string ops.

import pandas as pd
import random
import numpy as np
from StringIO import StringIO  # Python 2; on Python 3 this is io.StringIO

def make_ip():
    return '.'.join(str(random.randint(0, 255)) for n in range(4))

df = pd.DataFrame({'ip': [make_ip() for i in range(20000)]})

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df.ip.str.split('.', return_type='frame')
# 1 loops, best of 3: 3.06 s per loop

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = df['ip'].apply(lambda x: pd.Series(x.split('.')))
# 1 loops, best of 3: 3.1 s per loop

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None, index=None)), sep='.')
# 10 loops, best of 3: 46.4 ms per loop
cgevans (Contributor) commented May 8, 2015

Interestingly, my initial thought that this was slow because pandas' split just iterates through in Python was wrong: the obvious, pure-Python [ x.split('.') for x in list(df['ip']) ] is actually 400 times faster than going through pandas. So something else is causing the problem.

It seems like vast amounts of time are spent in maybe_convert_objects and _possibly_cast_to_datetime, amongst other things. I'll have to look into it further.
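For anyone who wants to reproduce the profile, IPython's %prun is enough to surface those hot spots. A sketch of the approach (the output, and the exact hot spots, will vary by pandas version):

%prun -s cumulative -l 10 df.ip.str.split('.', return_type='frame')
# sort by cumulative time and look for maybe_convert_objects and
# _possibly_cast_to_datetime near the top of the listing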

jreback (Contributor, Author) commented May 8, 2015

It seems that this operation creates a Series and then passes it to the DataFrame constructor. There's no need to do this; the list-like operation should effectively be this:

In [33]: %timeit pd.DataFrame([ x.split('.') for x in list(df['ip']) ])
100 loops, best of 3: 13.4 ms per loop

jreback (Contributor, Author) commented May 8, 2015

A simple change fixes this:

In [8]: %timeit df['ip'].str.split('.',return_type='frame')
100 loops, best of 3: 20 ms per loop

In [9]: quit()
[jreback-~/pandas] git diff
diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 6e603f6..b8ea27f 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -723,7 +723,7 @@ def str_split(arr, pat=None, n=None, return_type='series'):
             regex = re.compile(pat)
             f = lambda x: regex.split(x, maxsplit=n)
     if return_type == 'frame':
-        res = DataFrame((Series(x) for x in _na_map(f, arr)), index=arr.index)
+        res = DataFrame([x for x in _na_map(f, arr)], index=arr.index)
     else:
         res = _na_map(f, arr)
     return res

cgevans (Contributor) commented May 8, 2015

It seems like avoiding the list creation could be done as well, though it may not make much difference, and I'm not quite sure how Pandas handles iterators internally:

diff --git a/pandas/core/strings.py b/pandas/core/strings.py
index 6e603f6..8fb7f10 100644
--- a/pandas/core/strings.py
+++ b/pandas/core/strings.py
@@ -723,7 +723,7 @@ def str_split(arr, pat=None, n=None, return_type='series'):
             regex = re.compile(pat)
             f = lambda x: regex.split(x, maxsplit=n)
     if return_type == 'frame':
-        res = DataFrame((Series(x) for x in _na_map(f, arr)), index=arr.index)
+        res = DataFrame((x for x in _na_map(f, arr)), index=arr.index)
     else:
         res = _na_map(f, arr)
     return res

Ah; it just converts to a list. Oh well.

sinhrks (Member) commented May 8, 2015

Because this changes the current behavior (non-str values are all converted to str), it may be an option to add a fast path for all-string values. On my environment, numpy's string method is faster than the above workaround.

%timeit df[['ip0', 'ip1', 'ip2', 'ip3']] = \
    pd.read_table(StringIO(df['ip'].to_csv(None,index=None)),sep='.')
# 10 loops, best of 3: 80.9 ms per loop

%timeit pd.DataFrame(np.core.defchararray.split(df['ip'].values.astype(str), '.'))
# 10 loops, best of 3: 29.9 ms per loop

cgevans (Contributor) commented May 8, 2015

@sinhrks: the CSV write/read method is a horrible hack. I don't think jreback's legitimate solution changes current behavior, and it is significantly faster than the hack, likely in line with numpy's performance.

@jreback: while I'd be happy to do a PR for this if necessary, I assume you have it dealt with?

jreback (Contributor, Author) commented May 8, 2015

pull requests are welcome on this

sinhrks (Member) commented May 9, 2015

@cgevans Thanks for your cooperation :) What I meant by current behavior is:

s = pd.Series([1.1, '2.2'])

# current behavior (non-strings are left as NaN)
s.str.split('.', expand=True)
#      0    1
# 0  NaN  NaN
# 1    2    2

# numpy (non-strings are forced to be converted)
pd.DataFrame(list(np.core.defchararray.split(s.values.astype(str), '.')))
#    0  1
# 0  1  1
# 1  2  2

# can't use the numpy method without changing the dtype
pd.DataFrame(list(np.core.defchararray.split(s.values, '.')))
# TypeError: string operation on non-string array

I think the numpy funcs can be used when the values are all string or unicode (probably the common case). One idea is to use pandas.lib.is_string_array for the check and use the numpy logic when possible.
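A minimal sketch of that idea, assuming pandas.lib.is_string_array takes an object ndarray and returns True only when every element is a string (the function below is illustrative, not a proposed patch):

import numpy as np
import pandas.lib as lib

def split_with_fast_path(values, pat):
    # values: object-dtype ndarray pulled from the Series
    if lib.is_string_array(values):
        # all-string input: use numpy's vectorized split
        return np.core.defchararray.split(values.astype(str), pat)
    # mixed input (e.g. NaN present): element-wise split, preserving NaN
    out = np.empty(len(values), dtype=object)
    for i, x in enumerate(values):
        # basestring covers both str and unicode on Python 2
        out[i] = x.split(pat) if isinstance(x, basestring) else np.nan
    return out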

cgevans (Contributor) commented May 9, 2015

@sinhrks, I think what you're showing is the behavior in the branches for 10085 / 9847, not the current pydata/master. I don't have those branches, and am not sure where that work stands right now. But I think what you keep referring to as changing behavior is the workaround in the first post here, which is not what I'm discussing, and not what jreback is discussing. I'll make a PR momentarily.

With that said, there is a NaN issue that I'm working on addressing, but it's not quite the same.

cgevans (Contributor) commented May 24, 2015

Looking into this further, the problem is not necessarily the string operations themselves, but oddness and seeming inconsistencies with DataFrame construction.

In the current code, str_split itself takes a Series, splits each string into a list of strings, and returns an ndarray of dtype object. If there are NaNs in the input data, they stay as NaNs and are not lists.

If you take this array of list objects (and potentially NaNs) and pass it to pd.DataFrame, it outputs a frame with one column containing the objects; no expansion takes place.

If you convert the array to a list (of list objects) and pass it to pd.DataFrame, it outputs a frame with multiple columns containing the values in the list objects. Shorter lists are padded with Nones. If there are any NaN values, the constructor fails, because it assumes every element is list-like with a length.

If you instead convert the array to a list of Series objects and pass it to pd.DataFrame, it outputs a frame with multiple columns containing the values of the Series objects. Shorter lists are padded with NaNs, not Nones, and NaN values are expanded to rows of all NaNs.

This leads to a few questions:

  • Is having None for missing values from a list of lists, and NaN for missing values from a list of Series, the desired behavior, or should the DataFrame constructor only use one of them?
  • Is there a way to handle NaN values when constructing a DataFrame from a list of lists?
  • While I can somewhat understand the differences in behavior, do we want all these slightly different input types containing the same data to lead to different DataFrames?

Here is an example:

In [1]: import pandas as pd

In [2]: import pandas.core.strings as st

In [3]: import numpy as np

In [4]: s4 = pd.Series(['asdf.asdf','asdf',np.nan,'asdf.asdf.asdf','asdf.asdf.asdf'])

In [5]: s4.str.split('.')
Out[5]:
0          [asdf, asdf]
1                [asdf]
2                   NaN
3    [asdf, asdf, asdf]
4    [asdf, asdf, asdf]
dtype: object

In [6]: s4.str.split('.',expand=True)
Out[6]:
      0     1     2
0  asdf  asdf   NaN
1  asdf   NaN   NaN
2   NaN   NaN   NaN
3  asdf  asdf  asdf
4  asdf  asdf  asdf

In [7]: st.str_split(s4,'.')
Out[7]:
array([['asdf', 'asdf'], ['asdf'], nan, ['asdf', 'asdf', 'asdf'],
       ['asdf', 'asdf', 'asdf']], dtype=object)

In [8]: pd.DataFrame( st.str_split(s4,'.') )
Out[8]:
                    0
0        [asdf, asdf]
1              [asdf]
2                 NaN
3  [asdf, asdf, asdf]
4  [asdf, asdf, asdf]

In [9]: pd.DataFrame( list( st.str_split(s4,'.') ) )
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-41-0aa1ea9b709b> in <module>()
----> 1 pd.DataFrame( list( st.str_split(s4,'.') ) )

/Users/cge/Dev/pandas/pandas/core/frame.pyc in __init__(self, data, index, columns, dtype, copy)
    249             if len(data) > 0:
    250                 if is_list_like(data[0]) and getattr(data[0], 'ndim', 1) == 1:
--> 251                     arrays, columns = _to_arrays(data, columns, dtype=dtype)
    252                     columns = _ensure_index(columns)
    253

/Users/cge/Dev/pandas/pandas/core/frame.pyc in _to_arrays(data, columns, coerce_float, dtype)
   4907     if isinstance(data[0], (list, tuple)):
   4908         return _list_to_arrays(data, columns, coerce_float=coerce_float,
-> 4909                                dtype=dtype)
   4910     elif isinstance(data[0], collections.Mapping):
   4911         return _list_of_dict_to_arrays(data, columns,

/Users/cge/Dev/pandas/pandas/core/frame.pyc in _list_to_arrays(data, columns, coerce_float, dtype)
   4988     else:
   4989         # list of lists
-> 4990         content = list(lib.to_object_array(data).T)
   4991     return _convert_object_array(content, columns, dtype=dtype,
   4992                                  coerce_float=coerce_float)

/Users/cge/Dev/pandas/pandas/src/inference.pyx in pandas.lib.to_object_array (pandas/lib.c:58812)()
   1090     k = 0
   1091     for i from 0 <= i < n:
-> 1092         tmp = len(rows[i])
   1093         if tmp > k:
   1094             k = tmp

TypeError: object of type 'float' has no len()

In [10]: s4[2] = 'asdf'

In [11]: pd.DataFrame( list( st.str_split(s4,'.') ) )
Out[11]:
      0     1     2
0  asdf  asdf  None
1  asdf  None  None
2  asdf  None  None
3  asdf  asdf  asdf
4  asdf  asdf  asdf

In [12]: pd.DataFrame( [pd.Series(x) for x in st.str_split(s4,'.')] )
Out[12]:
      0     1     2
0  asdf  asdf   NaN
1  asdf   NaN   NaN
2  asdf   NaN   NaN
3  asdf  asdf  asdf
4  asdf  asdf  asdf

In [13]: s4[2] = np.nan

In [14]: pd.DataFrame( [pd.Series(x) for x in st.str_split(s4,'.')] )
Out[14]:
      0     1     2
0  asdf  asdf   NaN
1  asdf   NaN   NaN
2   NaN   NaN   NaN
3  asdf  asdf  asdf
4  asdf  asdf  asdf

jreback (Contributor, Author) commented May 26, 2015

np.nan is the marker for missing values. None is accepted (and converted), but it is somewhat ambiguous, as it can also be a 'valid' element. A list-of-lists needs specific conversions, e.g. to check whether a given element is actually a Series.
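For illustration (a minimal example of that conversion, written for this note rather than taken from the thread; display formatting is approximate):

In [1]: pd.Series([1.0, None])    # None is converted to NaN in a float column
Out[1]:
0    1.0
1    NaN
dtype: float64

In [2]: pd.Series(['a', None])    # object dtype: None survives as a 'valid' element
Out[2]:
0       a
1    None
dtype: object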
