PERF: speed up certain string operations #10081
Interestingly, my initial thought that this was slow because pandas' split just iterates through in Python was wrong: the obvious pure-Python approach performs comparably. It seems like vast amounts of time are spent in the DataFrame construction instead.
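A minimal sketch of that comparison (hypothetical data, not the original benchmark): pandas' vectorised `str.split` against the obvious pure-Python list comprehension, which produce the same result, so the slowdown is not simply Python-level iteration.

```python
import pandas as pd

# hypothetical sample data, not the original benchmark input
s = pd.Series(['a,b,c', 'd,e', 'f'] * 1000)

# pandas' built-in vectorised split
vectorised = s.str.split(',')

# the "obvious" pure-Python equivalent
pure_python = pd.Series([x.split(',') for x in s], index=s.index)

# both approaches give the same list-valued Series; timing them
# (e.g. with %timeit) shows the gap is elsewhere
assert list(vectorised) == list(pure_python)
```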
It seems that this operation creates a Series and then passes it to the DataFrame constructor. There is no need to do this; the list-like operation should build the DataFrame directly from the split results. A simple change fixes this.
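A sketch of the suggested shortcut (variable names are illustrative, not the actual patch): build the DataFrame straight from the list of split results instead of routing through an intermediate Series.

```python
import pandas as pd

s = pd.Series(['a,b', 'c,d,e', 'f'])
split_values = [x.split(',') for x in s]

# roughly the old path: wrap in a Series first, then hand that to DataFrame
via_series = pd.DataFrame(pd.Series(split_values, index=s.index).tolist(),
                          index=s.index)

# the shortcut: pass the list of lists directly to the DataFrame constructor
direct = pd.DataFrame(split_values, index=s.index)

# both produce the same expanded frame; the direct path skips a construction
assert via_series.equals(direct)
```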
It seems like the list creation could be avoided as well, though it may not make much difference, and I'm not quite sure how pandas handles iterators internally:
Ah; it just converts to a list. Oh well.
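A small illustration of that point (my own example, not from the thread's code): passing a generator to the Series constructor just materialises it up front, so streaming through an iterator buys nothing.

```python
import pandas as pd

gen = (x.upper() for x in ['a', 'b', 'c'])

# pandas consumes the generator immediately and stores a materialised array
s = pd.Series(gen)
assert list(s) == ['A', 'B', 'C']
```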
Because this changes current behavior (non-str values are all converted to str), it may be an option to add a fast path for all-string values. On my environment, NumPy's string methods are faster than the workaround above.
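A sketch of the kind of comparison described here (illustrative data; the actual benchmark is not preserved in this copy of the thread): for an all-string array, NumPy's `np.char` routines can be applied directly and give the same result as the pandas path.

```python
import numpy as np
import pandas as pd

s = pd.Series(['a,b', 'c,d,e', 'f'])

# pandas' element-wise path
pandas_result = s.str.split(',')

# NumPy's vectorised string routine on the underlying all-string array
numpy_result = np.char.split(s.to_numpy(dtype=str), ',')

# identical results; the NumPy route can be faster for all-string input
assert list(pandas_result) == list(numpy_result)
```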
sinhrks: the CSV write/read method is a horrible hack. I don't think jreback's legitimate solution changes current behavior, and it is significantly faster than the hack, likely in line with NumPy's performance. jreback: while I'd be happy to do a PR for this if necessary, I assume you have it dealt with?
pull requests are welcome on this
@cgevans Thanks for your cooperation :) What I meant in the comment above is that NumPy funcs can be used when the values are all string or unicode (maybe the regular case). One idea is to use …
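The suggestion above is truncated in this copy of the thread; one possible way to gate such a fast path (an assumption on my part, not necessarily what was meant) is pandas' own dtype inference:

```python
import pandas as pd

# infer_dtype reports 'string' only when every (non-NA) value is a string,
# which could gate a NumPy-based fast path for all-string data
all_strings = pd.api.types.infer_dtype(pd.Series(['a', 'b', 'c']))
mixed = pd.api.types.infer_dtype(pd.Series(['a', 1, 'c']))

assert all_strings == 'string'
assert mixed != 'string'
```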
@sinhrks, I think what you're showing is behavior in the branches for 10085 / 9847, not the current pydata/master. I don't have those branches, and I'm not sure where that work stands right now. But I think what you keep referring to as changing behavior is the workaround in the first post here, which is not what I'm discussing, and not what jreback is discussing. I'll make a PR momentarily. That said, there is a NaN issue that I'm working on addressing, but it's not quite the same.
Looking into this further, the problem is not necessarily the string operations themselves, but oddness and seeming inconsistencies in DataFrame construction. In the current code:

- If you take this array of list objects (and potentially NaNs) and pass it to pd.DataFrame, it outputs a frame with one column containing the list objects. No expansion takes place.
- If you convert the array to a list (of list objects) and pass it to pd.DataFrame, it outputs a frame with multiple columns containing the values in the lists. Shorter lists are padded with NaN.
- If you instead convert the array to a list of Series objects and pass it to pd.DataFrame, it outputs a frame with multiple columns containing the values of the Series. Shorter lists are padded with NaN.

This thus leads to a few questions:
Here is an example:
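A sketch reconstructing the three behaviors described above (not the author's original snippet, which is lost in this copy of the thread):

```python
import numpy as np
import pandas as pd

# an object ndarray whose elements are lists of unequal length
arr = np.empty(3, dtype=object)
arr[0] = ['a', 'b', 'c']
arr[1] = ['d']
arr[2] = ['e', 'f']

# 1) object ndarray in: one column of list objects, no expansion
one_col = pd.DataFrame(arr)
assert one_col.shape == (3, 1)

# 2) plain list of lists in: expanded to columns, short rows padded with NaN
expanded = pd.DataFrame(list(arr))
assert expanded.shape == (3, 3)
assert expanded.isna().sum().sum() == 3

# 3) list of Series in: also expanded to columns, with the same NaN padding
from_series = pd.DataFrame([pd.Series(x) for x in arr])
assert from_series.shape == (3, 3)
```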
From SO: on big enough strings this might be quite useful for a number of string ops.