Pandas Series Construction Extremely Slow for Array of Large Series #25364

Open
ChoiwahChow opened this issue Feb 19, 2019 · 19 comments
Labels
Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.), Performance (memory or execution speed performance)

Comments


ChoiwahChow (Author) commented Feb 19, 2019

Code Sample:

import pandas as pd

a = pd.Series([[100.0]]*1000000)
b = pd.Series([a]*100)

Problem description:
The above code snippet completes in less than 1 ms with pandas 0.22.0, but takes 3.376 seconds with pandas 0.23.4 - more than a 3000-fold slowdown. We have use cases that are impacted by this change.
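
(As a rough way to reproduce the timing in a plain Python session; exact numbers will depend on the installed pandas version:)

import timeit
import pandas as pd

a = pd.Series([[100.0]]*1000000)
print(timeit.timeit(lambda: pd.Series([a]*100), number=1))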

Tracing through the code shows that in the new version the bottleneck is line 1228 in pd.core.dtypes.cast.construct_1d_object_array_from_listlike:

    result[:] = values

The original performance can be restored by replacing the above line with

for i, value in enumerate(values):
    result[i] = value
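
For reference, a minimal sketch of what the patched helper could look like, assuming the 0.23.x function body described above (an illustration, not the actual committed code):

import numpy as np

def construct_1d_object_array_from_listlike(values):
    # Allocate an object-dtype array and fill it element by element.
    # This avoids result[:] = values, which makes NumPy inspect each
    # (possibly very large) element while discovering dimensions.
    result = np.empty(len(values), dtype='object')
    for i, value in enumerate(values):
        result[i] = value
    return result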

pd.show_versions() for pandas 0.23.4:
INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.3
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

pd.show_versions() for pandas 0.22.0
INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 3.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd (Member) commented Feb 19, 2019

Could you please check your code sample? I couldn't copy/paste it because the variable a is undefined.

WillAyd added the Needs Info label Feb 19, 2019
ChoiwahChow (Author) commented:

Sorry, I missed copying one line:
a = pd.Series([[100.0]]*1000000)

That is, a is a pandas Series with a million rows.

WillAyd (Member) commented Feb 19, 2019

Do you see the same performance regression on master? IIRC we fixed something related to this recently @jbrockmendel

gfyoung added the Performance label and removed the Needs Info label Feb 19, 2019
TomAugspurger (Contributor) commented:

Looks like a dupe of #24368

The original performance can be restored by replacing the above line with

for i, value in enumerate(values):
    result[i] = value

does that pass all the constructor tests? Especially for Series with nested data of different lengths?

ChoiwahChow (Author) commented:

From the discussion chain of #24368, I don't think the problem has been fixed yet. In fact, the performance degradation is worse on 0.24.1 and on master - on 0.23.4 it took 3.4 seconds, but from 0.24.1 onward the same test takes over 12 seconds.

ChoiwahChow (Author) commented:

As for #24368, it does have a lot in common with this one, although that one seems to address many other issues with DataFrames.
As for the solution to this problem, no, I have not done thorough testing yet (although it does work for our use cases), as there may be many competing (or even contradictory) ways of fixing the same problem, so it is just a starting point.
Judging from the code in master, there are more places to fix:

        if dtype is not None:
            try:
                subarr = _try_cast(data, False, dtype, copy,
                                   raise_cast_failure)
            except Exception:
                if raise_cast_failure:  # pragma: no cover
                    raise
                subarr = np.array(data, dtype=object, copy=copy)
                subarr = lib.maybe_convert_objects(subarr)

        else:
            subarr = maybe_convert_platform(data)

A new place to fix is

                subarr = np.array(data, dtype=object, copy=copy)

This line is just as slow as the "else" branch that follows (that branch eventually calls construct_1d_object_array_from_listlike, which is the slow part).

jreback (Contributor) commented Feb 20, 2019

are you actually trying to create a Series with list elements?

ChoiwahChow (Author) commented:

As a workaround, we can use a numpy array of pd.Series objects. But this requires the user to actually know about the pitfall.
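
For illustration, a sketch of that workaround (hypothetical variable names; the idea is to pre-allocate an object-dtype NumPy array and fill it before handing it to the Series constructor):

import numpy as np
import pandas as pd

a = pd.Series([[100.0]]*1000000)

# Fill a pre-allocated object array, then wrap it in a Series; this avoids
# handing a plain list of large Series to the constructor.
values = np.empty(100, dtype=object)
for i in range(100):
    values[i] = a
b = pd.Series(values)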

WillAyd (Member) commented Feb 20, 2019

What is the use case for doing this? Understood, it's a performance regression, but unless I'm missing something, why wouldn't you opt for a DataFrame here?

Just asking, as I'd be hesitant to burden the codebase with a potential fix for this if it doesn't have a practical application.

TomAugspurger (Contributor) commented Feb 20, 2019

From the discussion chain of 23368, I don't think the problem was fixed yet.

Right, it's still open.

why wouldn't you opt for a DataFrame here?

Potentially for ragged arrays, where each element is of a different length.


@ChoiwahChow can you edit the original post to include a (nicely formatted) minimal example? We'll keep this issue open specifically for the Series constructor, likely focusing on fixing or avoiding the line in pd.core.dtypes.cast.construct_1d_object_array_from_listlike. We'll leave #24368 for the dataframe constructor.

TomAugspurger added this to the Contributions Welcome milestone Feb 20, 2019
WillAyd (Member) commented Feb 20, 2019

Potentially for ragged arrays, where each element is of a different length.

Right, but why is a Series within a Series useful here, as opposed to, say, a DataFrame with SparseSeries or built-in containers? I'm not trying to be overly difficult here, just hesitant to optimize what I would perceive (perhaps mistakenly) as very non-idiomatic code.

mrocklin (Contributor) commented:

Here is another case of possibly the same problem:

In [1]: import numpy as np, pandas as pd, pyarrow as pa

In [2]: df = pd.DataFrame({'x': np.arange(1000000)})

In [3]: %time t = pa.Table.from_pandas(df)
CPU times: user 5.36 ms, sys: 3.22 ms, total: 8.58 ms
Wall time: 7.47 ms

In [4]: %time s = pd.Series([t], dtype=object)
CPU times: user 2.7 s, sys: 114 ms, total: 2.82 s
Wall time: 2.81 s

Originally posted in #25389

TomAugspurger (Contributor) commented:

Right but why is a Series within a Series useful here as opposed to say a DataFrame with SparseSeries or built-in containers?

I think the "Series within as Series" doesn't fully capture the issue. I believe this affects any iterable object that doesn't implement the buffer protocol.

In [19]: %timeit pd.Series([ser])
14.8 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: %timeit pd.Series([arr])
91.2 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit pd.Series([mem])
98.1 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
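
(The definitions of ser, arr and mem used in the timings above are not shown in the comment; a plausible setup matching the point about the buffer protocol, purely as an assumption, would be:)

import numpy as np
import pandas as pd

ser = pd.Series(range(1000000))  # a Series does not expose the buffer protocol
arr = np.arange(1000000)         # an ndarray does
mem = memoryview(arr)            # a memoryview does as well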

ChoiwahChow (Author) commented Feb 22, 2019

What is the use case for doing this? Understood it's a performance regression but unless I'm missing
something why wouldn't you opt for a DataFrame here?

The use case was not specifically creating a pandas Series of pandas Series; the problem was just easy to demonstrate that way.

A more general use case: suppose we have a list of homogeneous data structures, which could be scalars, pandas DataFrames, dictionaries, etc. (we don't know the data type in advance), and we want to put them into a pandas Series so that we can attach an index to them (for example, a pandas datetime index). The problem does not show up when the elements are integers or floats, but it does when they are pandas Series.
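
To make that concrete, a small sketch of the kind of construction described (made-up data, purely illustrative):

import pandas as pd

# A list of homogeneous containers whose type is not known in advance,
# to be indexed by timestamps.
chunks = [pd.Series(range(3)) for _ in range(5)]
idx = pd.date_range("2019-01-01", periods=5, freq="D")
wrapped = pd.Series(chunks, index=idx, dtype=object)  # slow path when elements are Series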

jorisvandenbossche (Member) commented:

@ChoiwahChow Regarding your comment above about replacing result[:] = values with an explicit loop to restore the original performance: I don't see a speed-up with a simple test case, but rather a slowdown:

In [23]: data = [[1, 2], [2, 3], [3, 4]] * 10000

In [26]: %%timeit 
    ...: result = np.empty(30000, dtype=object) 
    ...: result[:] = data 
    ...:  
1.03 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: %%timeit 
    ...: result = np.empty(30000, dtype=object) 
    ...: for i, value in enumerate(data): 
    ...:     result[i] = value 
    ...:       
2.83 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

TomAugspurger (Contributor) commented:

Possible NumPy issue numpy/numpy#13308, but I wouldn't be surprised if "won't fix" is the answer. This is a strange case for NumPy.

Note that if the object doesn't define __len__, we don't go down NumPy's slow route. Perhaps someone can come up with a clever way to use that to our advantage.
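
As a rough illustration of that observation (the Opaque wrapper is hypothetical, purely for demonstration):

import numpy as np
import pandas as pd

class Opaque:
    # No __len__ defined, so NumPy treats each instance as a plain scalar object.
    def __init__(self, obj):
        self.obj = obj

a = pd.Series([[100.0]]*1000000)

result = np.empty(3, dtype=object)
result[:] = [Opaque(a), Opaque(a), Opaque(a)]  # no per-element length inspection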

jbrockmendel added the Nested Data label Sep 22, 2020
tqa236 (Contributor) commented Oct 16, 2020

The linked NumPy issue is now fixed. Is this still an issue in pandas?

TomAugspurger (Contributor) commented:

Can you compare the timings of some of those code snippets with older and newer versions of numpy?

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022