PERF: Improve perf initializing DataFrame with a range #30171

Merged: topper-123 merged 2 commits into pandas-dev:master from the frame_range branch on Dec 10, 2019

Contributor

@topper-123 topper-123 commented Dec 9, 2019

Performance improvement when initializing a DataFrame with a range:

>>> %timeit pd.DataFrame(range(1_000_000)) 
347 ms ± 4.46 ms per loop  # master
5.98 ms ± 61.9 µs per loop  # this PR
# other initialization types don't seem to be affected by this problem:
>>> %timeit pd.Series(range(1_000_000))
5.75 ms ± 27.6 µs per loop  # master and this PR
>>> %timeit pd.DataFrame({"a": range(1_000_000)})
11.2 ms ± 119 µs per loop  # master and this PR

@WillAyd WillAyd added the Performance Memory or execution speed performance label Dec 9, 2019
Member

@WillAyd WillAyd left a comment

lgtm

@WillAyd
Member

WillAyd commented Dec 9, 2019

Can you check we already have a test for this construction?

@WillAyd WillAyd added this to the 1.0 milestone Dec 9, 2019
@jbrockmendel
Member

nice. does this need/merit an asv?

@topper-123
Contributor Author

> Can you check we already have a test for this construction?

It's tested in pandas\tests\frame\test_constructors.py line 1079 and used in a few more places.

@topper-123
Contributor Author

> nice. does this need/merit an asv?

IMO an asv isn't really important here, though I'm not against adding one either. This pattern is probably used mostly interactively (IPython, Jupyter, etc.) and rarely elsewhere.

@topper-123 topper-123 added Constructors Series/DataFrame/Index/pd.array Constructors DataFrame DataFrame data structure and removed Constructors Series/DataFrame/Index/pd.array Constructors labels Dec 9, 2019
if not isinstance(values, (np.ndarray, ABCSeries, Index)):
    if len(values) == 0:
        return np.empty((0, 0), dtype=object)
    elif isinstance(values, range):
        arr = np.arange(values.start, values.stop, values.step, dtype="int64")
Member

@jschendel jschendel Dec 10, 2019

It seems like some care is needed here with respect to dtypes, specifically if the range contains values that only fit in uint64, or values that only fit in Python integers.

For example, the following works on master:

In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
Out[2]: 
                     0
0  9223372036854775808
1  9223372036854775809
2  9223372036854775810
3  9223372036854775811

In [3]: _.dtypes
Out[3]: 
0    uint64
dtype: object

In [4]: pd.DataFrame(range(2**73, 2**73 + 4))
Out[4]: 
                        0
0  9444732965739290427392
1  9444732965739290427393
2  9444732965739290427394
3  9444732965739290427395

In [5]: _.dtypes
Out[5]: 
0    object
dtype: object

But both fail with the changes in this PR:

In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
---------------------------------------------------------------------------
OverflowError: Python int too large to convert to C long

In [3]: pd.DataFrame(range(2**73, 2**73 + 4))
---------------------------------------------------------------------------
OverflowError: Python int too large to convert to C long

Admittedly, this is a bit of a corner case. It also looks like the issue isn't limited to this PR, as the Series equivalent of the above fails on master.
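
To make the dtype concern concrete, here is a minimal sketch of how a dtype could be chosen from the endpoints of a range before converting it. This is only an illustration under stated assumptions, not the fix that was adopted: `pick_range_dtype` is a hypothetical helper name, and it simply mirrors the int64 / uint64 / object results that master produces in the examples above.

```python
# Hypothetical helper, not part of pandas: choose a dtype name that can
# hold every value of a range by looking only at its two extremes.
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def pick_range_dtype(rng: range) -> str:
    """Return "int64", "uint64" or "object" for the values in ``rng``."""
    if not rng:  # empty range: nothing can overflow
        return "int64"
    # A range is monotonic, so its extremes are its first and last items.
    lo, hi = sorted((rng[0], rng[-1]))
    if INT64_MIN <= lo and hi <= INT64_MAX:
        return "int64"
    if lo >= 0 and hi <= UINT64_MAX:
        return "uint64"
    return "object"  # values only representable as Python integers


print(pick_range_dtype(range(1_000_000)))         # int64
print(pick_range_dtype(range(2**63, 2**63 + 4)))  # uint64
print(pick_range_dtype(range(2**73, 2**73 + 4)))  # object
```

Checking only the first and last element keeps the check O(1), so it would not give back the speedup that motivated the PR.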

Member

xref #30173 for the failing Series example on master; I'd expect that both issues could be solved in a similar way.

@jreback
Contributor

jreback commented Dec 10, 2019

bugs can certainly be addressed in a follow-up, but can you add an asv? I think we have a number of construction asvs already

@topper-123
Contributor Author

I've added an ASV. I can take on #30173 in a follow-up for both Series and DataFrames.
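
For readers unfamiliar with asv, a benchmark for this construction might look roughly like the sketch below. The class name, method name and size are illustrative assumptions and are not taken from the benchmark actually added in this PR; the only asv conventions relied on are that `setup` runs before each measurement and that methods prefixed with `time_` are timed.

```python
import pandas as pd


class FrameFromRange:
    # Hypothetical benchmark class; N matches the size used in the
    # %timeit examples at the top of this PR.
    N = 1_000_000

    def setup(self):
        # Build the input outside the timed method so only the
        # DataFrame construction itself is measured.
        self.data = range(self.N)

    def time_frame_from_range(self):
        pd.DataFrame(self.data)
```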

@jreback
Contributor

jreback commented Dec 10, 2019

lgtm. merge on green.

@topper-123 topper-123 merged commit 2470690 into pandas-dev:master Dec 10, 2019
@topper-123 topper-123 deleted the frame_range branch December 10, 2019 15:03
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019
Labels
DataFrame DataFrame data structure Performance Memory or execution speed performance