Pandas Series Construction Extremely Slow for Array of Large Series #25364
Code Sample:

```python
import pandas as pd
a = pd.Series([[100.0]] * 1000000)
b = pd.Series([a] * 100)
```

Problem description: It is easy to trace through the code that, in the new version, the bottleneck is line 1228 in `pd.core.dtypes.cast.construct_1d_object_array_from_listlike`:

```python
result[:] = values
```

The original performance can be restored by replacing the above line with:

```python
for i, value in enumerate(values):
    result[i] = value
```

`pd.show_versions()` for pandas 0.23.4:
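For readers following along, here is a minimal sketch of the function in question, simplified from the description above (it is not the exact pandas source), with the reported slow line and the proposed replacement side by side:

```python
import numpy as np

def construct_1d_object_array_from_listlike(values):
    # Simplified sketch; the real function lives in pandas/core/dtypes/cast.py.
    result = np.empty(len(values), dtype='object')
    # Slow path reported above: NumPy inspects each element as a nested
    # sequence, which is expensive when each element is a large Series.
    # result[:] = values
    # Proposed replacement: store each element as a single opaque object.
    for i, value in enumerate(values):
        result[i] = value
    return result
```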
Could you please check your code sample? I couldn't copy / paste, as the variable `a` is not defined.
Sorry, I missed copying one line. That is, `a` is a pandas Series of a million rows.
Do you see the same performance regression on master? IIRC we fixed something related to this recently @jbrockmendel
Looks like a dupe of #24368

```python
for i, value in enumerate(values):
    result[i] = value
```

Does that pass all the constructor tests? Especially for Series with nested data of different lengths?
From the discussion chain of #23368, I don't think the problem has been fixed yet. In fact, the performance degradation is worse on 0.24.1 and on master: on 0.23.4, the test above took 3.4 seconds, but from 0.24.1 onward, the same test takes over 12 seconds.
As for #23368, it does have a lot in common with this one, although that one seems to be addressing a lot of other issues with DataFrames. The relevant constructor code is:

```python
if dtype is not None:
    try:
        subarr = _try_cast(data, False, dtype, copy,
                           raise_cast_failure)
    except Exception:
        if raise_cast_failure:  # pragma: no cover
            raise
        subarr = np.array(data, dtype=object, copy=copy)
        subarr = lib.maybe_convert_objects(subarr)
else:
    subarr = maybe_convert_platform(data)
```

A new place to fix is:

```python
subarr = np.array(data, dtype=object, copy=copy)
```

It is just as slow as the `else` branch that follows (which eventually calls `construct_1d_object_array_from_listlike`, the slow function).
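This is easy to demonstrate outside of pandas internals. Below is a standalone timing sketch (my own harness, with sizes reduced to keep memory modest) comparing `np.array(..., dtype=object)` against an explicit fill loop:

```python
import time

import numpy as np
import pandas as pd

a = pd.Series(np.arange(100000, dtype=float))
data = [a] * 100

t0 = time.perf_counter()
# np.array descends into each Series to discover its shape and contents,
# producing a (100, 100000) object array instead of 100 opaque objects.
slow = np.array(data, dtype=object)
t1 = time.perf_counter()

# The fill loop stores each Series as a single opaque object.
fast = np.empty(len(data), dtype=object)
for i, value in enumerate(data):
    fast[i] = value
t2 = time.perf_counter()

print(f"np.array: {t1 - t0:.3f}s, fill loop: {t2 - t1:.5f}s")
```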
Are you actually trying to create a Series with list elements?
As a workaround, we can use a NumPy array of pd.Series objects. But this requires the user to actually know about the pitfall.
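A sketch of that workaround (my wording; it mirrors the fill-loop idea above):

```python
import numpy as np
import pandas as pd

a = pd.Series([[100.0]] * 1000000)

# Pre-build an object-dtype array and fill it element by element, so the
# large Series are never inspected as nested sequences.
arr = np.empty(100, dtype=object)
for i in range(100):
    arr[i] = a

b = pd.Series(arr)  # wraps the object array without re-converting it
```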
What is the use case for doing this? Understood it's a performance regression, but unless I'm missing something, why wouldn't you opt for a DataFrame here? Just asking, as I'd be hesitant to burden the codebase with a potential fix for this if it doesn't have a practical application.
Right, it's still open.
Potentially for ragged arrays, where each element is of a different length (see the illustration below). @ChoiwahChow can you edit the original post to include a (nicely formatted) minimal example? We'll keep this issue open specifically for the Series constructor, likely focusing on fixing or avoiding the line in `construct_1d_object_array_from_listlike`.
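For illustration (my example, not from the thread), here is a ragged case where a rectangular DataFrame is awkward but an object Series is natural:

```python
import pandas as pd

# Each element has a different length, so the rows don't form a rectangle.
ragged = pd.Series([[1.0], [2.0, 3.0], [4.0, 5.0, 6.0]])
```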
Right, but why is a Series within a Series useful here, as opposed to say a DataFrame with SparseSeries or built-in containers? Not trying to be overly difficult here, just hesitant to optimize what I would perceive (perhaps mistakenly) as very non-idiomatic code.
Here is another case of possibly the same problem:

```python
In [1]: import numpy as np, pandas as pd, pyarrow as pa

In [2]: df = pd.DataFrame({'x': np.arange(1000000)})

In [3]: %time t = pa.Table.from_pandas(df)
CPU times: user 5.36 ms, sys: 3.22 ms, total: 8.58 ms
Wall time: 7.47 ms

In [4]: %time s = pd.Series([t], dtype=object)
CPU times: user 2.7 s, sys: 114 ms, total: 2.82 s
Wall time: 2.81 s
```

Originally posted in #25389
I think "Series within a Series" doesn't fully capture the issue. I believe this affects any iterable object that doesn't implement the buffer protocol.

```python
In [19]: %timeit pd.Series([ser])
14.8 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: %timeit pd.Series([arr])
91.2 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit pd.Series([mem])
98.1 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
```
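The setup for `ser`, `arr`, and `mem` isn't shown in the comment; a plausible reconstruction (an assumption on my part, chosen to match the three cases) would be:

```python
import numpy as np
import pandas as pd

data = list(range(100000))
ser = pd.Series(data)   # iterable without the buffer protocol -> slow path
arr = np.array(data)    # ndarray exposes the buffer protocol -> fast path
mem = memoryview(arr)   # memoryview also exposes the buffer protocol -> fast path
```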
The use case was not specifically about creating a pandas Series of pandas Series; that was just an easy way to demonstrate the problem. A more general use case: suppose we have a list of homogeneous data structures, which could be scalars, pandas DataFrames, dictionaries, etc. (but we don't know the data type in advance), and we want to put them into a pandas Series so that we can attach an index to them, for example a pandas datetime index. The problem does not show up when the data structures are integers or floats, but it does when they are pandas Series.
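A sketch of that use case (hypothetical names, sized down for illustration):

```python
import pandas as pd

# Homogeneous objects of a type not known in advance; here, Series.
objects = [pd.Series(range(1000)) for _ in range(3)]

# Attach a datetime index to the opaque objects.
idx = pd.date_range("2019-01-01", periods=3, freq="D")
s = pd.Series(objects, index=idx, dtype=object)  # hits the slow path above
```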
@ChoiwahChow Regarding your comment above about replacing
Possible NumPy issue: numpy/numpy#13308, but I wouldn't be surprised if "won't fix" is the answer. This is a strange case for NumPy. Note that if the object doesn't define
The linked NumPy issue is now fixed. Is this still an issue in pandas?
Can you compare the timings of some of those code snippets with older and newer versions of numpy? |
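One way to check (a minimal sketch, assuming the original reproduction; run it once per installed NumPy version and compare):

```python
import time

import numpy as np
import pandas as pd

a = pd.Series([[100.0]] * 1000000)

start = time.perf_counter()
b = pd.Series([a] * 100)
elapsed = time.perf_counter() - start

print(f"numpy {np.__version__}, pandas {pd.__version__}: {elapsed:.2f}s")
```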