PERF: regression in Series constructor #30564

Closed
TomAugspurger opened this issue Dec 30, 2019 · 2 comments · Fixed by #30571
Labels
Performance Memory or execution speed performance

Comments

@TomAugspurger
Contributor

TomAugspurger commented Dec 30, 2019

This is captured in the SeriesConstructor.time_constructor asv.

In [8]: import pandas as pd
   ...: import numpy as np
   ...:
   ...: data = np.arange(1000)
   ...: index = pd.date_range('2000', periods=len(data))
   ...: data = dict(zip(index, data))
   ...: s = pd.Series(data, index=index)

On master

848 ms ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

On 0.25.x

   ...: %timeit pd.Series(data, index=index)
82.5 ms ± 2.05 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Looking into it a bit now. We're spending a lot more time in ensure_index -> is_period_dtype / is_dtype / construct_from_string.
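One way to find a hot path like this is to run the call under cProfile and sort by cumulative time. The helper below is a minimal sketch using only the standard library; `top_cumulative` and `workload` are hypothetical names, and the workload is a stand-in rather than the pandas call from the report.

```python
import cProfile
import io
import pstats


def top_cumulative(func, limit=10):
    """Run func() under cProfile and return a report sorted by cumulative time."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func()
    finally:
        profiler.disable()
    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(limit)
    return buffer.getvalue()


def workload():
    # Stand-in for the slow call; not the pandas constructor itself.
    return sorted(str(i) for i in range(10_000))


report = top_cumulative(workload)
print(report)
```

With pandas installed, passing `lambda: pd.Series(data, index=index)` instead of `workload` should list the slowest callees first.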

TomAugspurger added the Performance label Dec 30, 2019
@TomAugspurger
Contributor Author

Hmm, so in Series._init_dict we convert the keys (originally a DatetimeIndex) to a tuple of Timestamps:

        if data:
            keys, values = zip(*data.items())
            values = list(values)

and then go back through Series.__init__ with index=keys, the tuple of Timestamps, which is slow.
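To make the unpacking step concrete, here is a small sketch of what `zip(*data.items())` produces. It uses standard-library `datetime` objects as a stand-in for pandas Timestamps (an assumption, so that it runs without pandas); the variable names are illustrative.

```python
from datetime import datetime, timedelta

# Stand-in for the DatetimeIndex: 1000 consecutive days starting in 2000.
start = datetime(2000, 1, 1)
index = [start + timedelta(days=i) for i in range(1000)]
data = dict(zip(index, range(1000)))

# Same unpacking as in the snippet above: the keys come back as a plain tuple.
keys, values = zip(*data.items())
values = list(values)

print(type(keys).__name__, len(keys))
```

The index-like structure is lost at this point: `keys` is an ordinary tuple, so anything that receives it has to inspect or convert it element by element.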

@TomAugspurger
Contributor Author

Oh fun.

In Index.__init__ we check is_period_dtype(data) (the tuple of Timestamps), which eventually calls PeriodDtype.construct_from_string with the data, to support is_period_dtype("Period[D]"). But of course a tuple of Timestamps isn't a PeriodDtype, so we raise a TypeError.

In #3047, we include the string in the error message, which takes a long time to format for the long tuples. A solution is to raise with a different error message for non-string inputs.
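The idea can be sketched as follows. This is not the pandas implementation: `parse_period_dtype_before`, `parse_period_dtype_after`, and `error_message` are hypothetical names, the message wording is assumed, and only the failure path is modelled (parsing of valid strings such as "Period[D]" is omitted). Standard-library `datetime` objects stand in for Timestamps.

```python
from datetime import datetime, timedelta


def parse_period_dtype_before(value):
    # Always interpolates the input, so a long tuple is fully repr'd.
    raise TypeError(f"Cannot construct a 'PeriodDtype' from '{value}'")


def parse_period_dtype_after(value):
    # Non-string inputs get a short message that never formats the value.
    if not isinstance(value, str):
        raise TypeError(f"Expected a string, got {type(value).__name__}")
    raise TypeError(f"Cannot construct a 'PeriodDtype' from '{value}'")


def error_message(func, value):
    """Call func(value) and return the TypeError message it raises."""
    try:
        func(value)
    except TypeError as exc:
        return str(exc)
    return None


start = datetime(2000, 1, 1)
keys = tuple(start + timedelta(days=i) for i in range(1000))

slow_message = error_message(parse_period_dtype_before, keys)
fast_message = error_message(parse_period_dtype_after, keys)
print(len(slow_message), len(fast_message))
```

The first message grows with the length of the tuple, while the second stays constant in size, which is why the type check removes the formatting cost.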

TomAugspurger added a commit to TomAugspurger/pandas that referenced this issue Dec 30, 2019
jbrockmendel pushed a commit that referenced this issue Dec 31, 2019
* PERF: Fixed performance regression in Series init

Closes #30564

* avoid calling
hweecat pushed a commit to hweecat/pandas that referenced this issue Jan 1, 2020
* PERF: Fixed performance regression in Series init

Closes pandas-dev#30564

* avoid calling