Pandas Series Construction Extremely Slow for Array of Large Series #25364

Open
ChoiwahChow opened this issue Feb 19, 2019 · 19 comments
Labels
Nested Data (data where the values are collections: lists, sets, dicts, objects, etc.), Performance (memory or execution speed performance)

Comments


ChoiwahChow (Author) commented Feb 19, 2019

Code Sample:

import pandas as pd

a = pd.Series([[100.0]]*1000000)
b = pd.Series([a]*100)

Problem description:
The above code snippet completes in less than 1 ms with pandas 0.22.0, but takes 3.376 seconds with pandas 0.23.4 - more than a 3000-fold slowdown. We have use cases that are impacted by this change.
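
(As a rough way to reproduce the timing in a plain Python session; exact numbers will depend on the installed pandas version:)

import timeit
import pandas as pd

a = pd.Series([[100.0]]*1000000)
print(timeit.timeit(lambda: pd.Series([a]*100), number=1))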

Tracing through the code shows that in the new version the bottleneck is line 1228 in pd.core.dtypes.cast.construct_1d_object_array_from_listlike:

    result[:] = values

The original performance can be restored by replacing the above line with

for i, value in enumerate(values):
    result[i] = value
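
For reference, a minimal sketch of what the patched helper could look like, assuming the 0.23.x function body described above (an illustration, not the actual committed code):

import numpy as np

def construct_1d_object_array_from_listlike(values):
    # Allocate an object-dtype array and fill it element by element.
    # This avoids result[:] = values, which makes NumPy inspect each
    # (possibly very large) element while discovering dimensions.
    result = np.empty(len(values), dtype='object')
    for i, value in enumerate(values):
        result[i] = value
    return result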

pd.show_versions() for pandas 0.23.4:
INSTALLED VERSIONS

commit: None
python: 3.7.1.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.23.4
pytest: 4.0.2
pip: 18.1
setuptools: 40.6.3
Cython: 0.29.2
numpy: 1.15.4
scipy: 1.2.0
pyarrow: None
xarray: None
IPython: 7.2.0
sphinx: 1.8.3
patsy: 0.5.0
dateutil: 2.7.5
pytz: 2018.7
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.8
feather: None
matplotlib: 3.0.2
openpyxl: 2.5.12
xlrd: 1.2.0
xlwt: 1.3.0
xlsxwriter: 1.1.2
lxml: 4.2.5
bs4: 4.6.3
html5lib: 1.0.1
sqlalchemy: 1.2.15
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

pd.show_versions() for pandas 0.22.0
INSTALLED VERSIONS

commit: None
python: 3.6.4.final.0
python-bits: 64
OS: Windows
OS-release: 7
machine: AMD64
processor: Intel64 Family 6 Model 63 Stepping 2, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 3.4.0
pip: 9.0.1
setuptools: 38.5.1
Cython: None
numpy: 1.14.0
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 6.2.1
sphinx: None
patsy: 0.5.0
dateutil: 2.6.1
pytz: 2018.3
blosc: None
bottleneck: 1.2.1
tables: None
numexpr: 2.6.4
feather: None
matplotlib: 3.1.2
openpyxl: 2.5.0
xlrd: 1.1.0
xlwt: 1.3.0
xlsxwriter: None
lxml: 4.1.1
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

WillAyd (Member) commented Feb 19, 2019

Could you please check your code sample? I couldn't copy/paste it because the variable a is undefined.

WillAyd added the Needs Info label Feb 19, 2019
ChoiwahChow (Author) commented:

Sorry, I missed copying one line:
a = pd.Series([[100.0]]*1000000)

That is, a is a pandas Series with a million rows.

WillAyd (Member) commented Feb 19, 2019

Do you see the same performance regression on master? IIRC we fixed something related to this recently @jbrockmendel

gfyoung added the Performance label and removed the Needs Info label Feb 19, 2019
TomAugspurger (Contributor) commented:

Looks like a dupe of #24368

The original performance can be restored by replacing the above line with

for i, value in enumerate(values):
    result[i] = value

does that pass all the constructor tests? Especially for Series with nested data of different lengths?

ChoiwahChow (Author) commented:

From the discussion chain of #24368, I don't think the problem has been fixed yet. In fact, the performance degradation is worse on 0.24.1 and on master - on 0.23.4 it took 3.4 seconds, but from 0.24.1 onward the same test takes over 12 seconds.

ChoiwahChow (Author) commented:

As for #24368, it does have a lot in common with this one, although that one seems to address many other issues with DataFrames.
As for the solution to this problem, no, I have not done thorough testing yet (although it does work for our use cases), as there may be many competing (or even contradictory) ways of fixing the same problem, so it is just a starting point.
Judging from the code in master, there are more places to fix:

        if dtype is not None:
            try:
                subarr = _try_cast(data, False, dtype, copy,
                                   raise_cast_failure)
            except Exception:
                if raise_cast_failure:  # pragma: no cover
                    raise
                subarr = np.array(data, dtype=object, copy=copy)
                subarr = lib.maybe_convert_objects(subarr)

        else:
            subarr = maybe_convert_platform(data)

A new place to fix is

                subarr = np.array(data, dtype=object, copy=copy)

This line is just as slow as the "else" branch that follows (that branch eventually calls construct_1d_object_array_from_listlike, which is the slow part).

jreback (Contributor) commented Feb 20, 2019

are you actually trying to create a Series with list elements?

ChoiwahChow (Author) commented:

As a workaround, we can use a numpy array of pd.Series objects. But this requires the user to actually know about the pitfall.
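
For illustration, a sketch of that workaround (hypothetical variable names; the idea is to pre-allocate an object-dtype NumPy array and fill it before handing it to the Series constructor):

import numpy as np
import pandas as pd

a = pd.Series([[100.0]]*1000000)

# Fill a pre-allocated object array, then wrap it in a Series; this avoids
# handing a plain list of large Series to the constructor.
values = np.empty(100, dtype=object)
for i in range(100):
    values[i] = a
b = pd.Series(values)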

WillAyd (Member) commented Feb 20, 2019

What is the use case for doing this? Understood, it's a performance regression, but unless I'm missing something, why wouldn't you opt for a DataFrame here?

Just asking, as I'd be hesitant to burden the codebase with a potential fix for this if it doesn't have a practical application.

TomAugspurger (Contributor) commented Feb 20, 2019

From the discussion chain of 23368, I don't think the problem was fixed yet.

Right, it's still open.

why wouldn't you opt for a DataFrame here?

Potentially for ragged arrays, where each element is of a different length.


@ChoiwahChow can you edit the original post to include a (nicely formatted) minimal example? We'll keep this issue open specifically for the Series constructor, likely focusing on fixing or avoiding the line in pd.core.dtypes.cast.construct_1d_object_array_from_listlike. We'll leave #24368 for the dataframe constructor.

TomAugspurger added this to the Contributions Welcome milestone Feb 20, 2019
WillAyd (Member) commented Feb 20, 2019

Potentially for ragged arrays, where each element is of a different length.

Right, but why is a Series within a Series useful here, as opposed to, say, a DataFrame with SparseSeries or built-in containers? I'm not trying to be overly difficult here, just hesitant to optimize what I would perceive (perhaps mistakenly) as very non-idiomatic code.

mrocklin (Contributor) commented:

Here is another case of possibly the same problem:

In [1]: import numpy as np, pandas as pd, pyarrow as pa

In [2]: df = pd.DataFrame({'x': np.arange(1000000)})

In [3]: %time t = pa.Table.from_pandas(df)
CPU times: user 5.36 ms, sys: 3.22 ms, total: 8.58 ms
Wall time: 7.47 ms

In [4]: %time s = pd.Series([t], dtype=object)
CPU times: user 2.7 s, sys: 114 ms, total: 2.82 s
Wall time: 2.81 s

Originally posted in #25389

TomAugspurger (Contributor) commented:

Right but why is a Series within a Series useful here as opposed to say a DataFrame with SparseSeries or built-in containers?

I think the "Series within as Series" doesn't fully capture the issue. I believe this affects any iterable object that doesn't implement the buffer protocol.

In [19]: %timeit pd.Series([ser])
14.8 ms ± 270 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [20]: %timeit pd.Series([arr])
91.2 µs ± 1.73 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [21]: %timeit pd.Series([mem])
98.1 µs ± 4.34 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)
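
(The definitions of ser, arr and mem used in the timings above are not shown in the comment; a plausible setup matching the point about the buffer protocol, purely as an assumption, would be:)

import numpy as np
import pandas as pd

ser = pd.Series(range(1000000))  # a Series does not expose the buffer protocol
arr = np.arange(1000000)         # an ndarray does
mem = memoryview(arr)            # a memoryview does as well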

ChoiwahChow (Author) commented Feb 22, 2019

What is the use case for doing this? Understood it's a performance regression but unless I'm missing
something why wouldn't you opt for a DataFrame here?

The use case was not specifically creating a pandas Series of pandas Series; the problem was just easy to demonstrate that way.

A more general use case: suppose we have a list of homogeneous data structures, which could be scalars, pandas DataFrames, dictionaries, etc. (we don't know the data type in advance), and we want to put them into a pandas Series so that we can attach an index to them (for example, a pandas datetime index). The problem does not show up when the elements are integers or floats, but it does when they are pandas Series.
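
To make that concrete, a small sketch of the kind of construction described (made-up data, purely illustrative):

import pandas as pd

# A list of homogeneous containers whose type is not known in advance,
# to be indexed by timestamps.
chunks = [pd.Series(range(3)) for _ in range(5)]
idx = pd.date_range("2019-01-01", periods=5, freq="D")
wrapped = pd.Series(chunks, index=idx, dtype=object)  # slow path when elements are Series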

jorisvandenbossche (Member) commented:

@ChoiwahChow Regarding your comment above about replacing result[:] = values with an explicit loop to restore the original performance: I don't see a speed-up with a simple test case, but rather a slowdown:

In [23]: data = [[1, 2], [2, 3], [3, 4]] * 10000

In [26]: %%timeit 
    ...: result = np.empty(30000, dtype=object) 
    ...: result[:] = data 
    ...:  
1.03 ms ± 39.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: %%timeit 
    ...: result = np.empty(30000, dtype=object) 
    ...: for i, value in enumerate(data): 
    ...:     result[i] = value 
    ...:       
2.83 ms ± 105 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

TomAugspurger (Contributor) commented:

Possible NumPy issue numpy/numpy#13308, but I wouldn't be surprised if "won't fix" is the answer. This is a strange case for NumPy.

Note that if the object doesn't define __len__, we don't go down NumPy's slow route. Perhaps someone can come up with a clever way to use that to our advantage.
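
As a rough illustration of that observation (the Opaque wrapper is hypothetical, purely for demonstration):

import numpy as np
import pandas as pd

class Opaque:
    # No __len__ defined, so NumPy treats each instance as a plain scalar object.
    def __init__(self, obj):
        self.obj = obj

a = pd.Series([[100.0]]*1000000)

result = np.empty(3, dtype=object)
result[:] = [Opaque(a), Opaque(a), Opaque(a)]  # no per-element length inspection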

jbrockmendel added the Nested Data label Sep 22, 2020
tqa236 (Contributor) commented Oct 16, 2020

The linked NumPy issue is now fixed. Is this still an issue in pandas?

TomAugspurger (Contributor) commented:

Can you compare the timings of some of those code snippets with older and newer versions of numpy?

mroeschke removed this from the Contributions Welcome milestone Oct 13, 2022