PERF: Improve perf initializing DataFrame with a range #30171

topper-123 · 2019-12-09T22:11:14Z

Performance improvement when initializing a DataFrame with a range:

>>> %timeit pd.DataFrame(range(1_000_000)) 
347 ms ± 4.46 ms per loop  # master
5.98 ms ± 61.9 µs per loop  # this PR
# other initialization types don't seem to be affected by this problem:
>>> %timeit pd.Series(range(1_000_000))
5.75 ms ± 27.6 µs per loop  # master and this PR
>>> %timeit pd.DataFrame({"a": range(1_000_000)})
11.2 ms ± 119 µs per loop  # master and this PR

WillAyd

lgtm

WillAyd · 2019-12-09T22:47:05Z

Can you check we already have a test for this construction?

jbrockmendel · 2019-12-09T23:04:43Z

nice. does this need/merit an asv?

topper-123 · 2019-12-09T23:18:21Z

> Can you check we already have a test for this construction?

It's tested in pandas\tests\frame\test_constructors.py line 1079 and used in a few more places.

topper-123 · 2019-12-09T23:21:44Z

> nice. does this need/merit an asv?

IMO an asv isn't really important here, though I'm not against adding one either. This pattern is probably used mostly interactively (in IPython, Jupyter, etc.) and rarely otherwise.

jschendel · 2019-12-10T00:29:45Z

pandas/core/internals/construction.py

    if not isinstance(values, (np.ndarray, ABCSeries, Index)):
        if len(values) == 0:
            return np.empty((0, 0), dtype=object)
+        elif isinstance(values, range):
+            arr = np.arange(values.start, values.stop, values.step, dtype="int64")
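
A minimal sketch of what the added branch does, assuming only NumPy: `np.arange` is called with the range's `start`, `stop` and `step`, so the array is built without iterating over the range in Python. The helper name `range_to_int64` is hypothetical and is not part of the PR; the list-based conversion serves only as a reference for comparison.

```python
import numpy as np


def range_to_int64(values: range) -> np.ndarray:
    # Hypothetical helper name; the body mirrors the np.arange call in the diff.
    return np.arange(values.start, values.stop, values.step, dtype="int64")


# Compare against an element-by-element conversion for a few shapes of range,
# including a negative step and an empty range.
for r in (range(5), range(2, 20, 3), range(10, 0, -2), range(0)):
    fast = range_to_int64(r)
    slow = np.array(list(r), dtype="int64")
    assert fast.dtype == np.int64
    assert np.array_equal(fast, slow)
```

The sketch hard-codes `int64`, exactly as the diff does, so it only covers ranges whose bounds fit in a signed 64-bit integer.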


It seems like some care is needed here with respect to dtypes, specifically if the range contains values that are only supported by uint64, or only by Python integers.

For example, the following works on master:

In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
Out[2]:
                     0
0  9223372036854775808
1  9223372036854775809
2  9223372036854775810
3  9223372036854775811

In [3]: _.dtypes
Out[3]:
0    uint64
dtype: object

In [4]: pd.DataFrame(range(2**73, 2**73 + 4))
Out[4]:
                        0
0  9444732965739290427392
1  9444732965739290427393
2  9444732965739290427394
3  9444732965739290427395

In [5]: _.dtypes
Out[5]:
0    object
dtype: object

But both fail with the changes in this PR:

In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
---------------------------------------------------------------------------
OverflowError: Python int too large to convert to C long

In [3]: pd.DataFrame(range(2**73, 2**73 + 4))
---------------------------------------------------------------------------
OverflowError: Python int too large to convert to C long

Admittedly, this is a bit of a corner case. It also looks like the issue isn't limited to this PR, as the Series equivalent of the above fails on master.

xref #30173 for the failing Series example on master; I'd expect that both issues could be solved in a similar way.
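
The dtype concern can be illustrated with a small pure-Python sketch that decides which dtype a range would need before any array is built. This is not the fix adopted for #30173; the helper name `smallest_dtype_for_range` is hypothetical, and returning `int64` for an empty range is an assumption made here for simplicity.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def smallest_dtype_for_range(values: range) -> str:
    """Hypothetical helper: name a dtype able to hold every element of the range."""
    if not values:
        # Assumption: an empty range defaults to int64.
        return "int64"
    # A range is monotonic, so its extremes are its first and last elements.
    first, last = values[0], values[-1]
    lo, hi = min(first, last), max(first, last)
    if INT64_MIN <= lo and hi <= INT64_MAX:
        return "int64"
    if lo >= 0 and hi <= UINT64_MAX:
        return "uint64"
    # Values beyond 64 bits can only be kept as Python integers.
    return "object"


print(smallest_dtype_for_range(range(1_000_000)))        # int64
print(smallest_dtype_for_range(range(2**63, 2**63 + 4)))  # uint64
print(smallest_dtype_for_range(range(2**73, 2**73 + 4)))  # object
```

The three results line up with the dtypes reported for master in the examples above: the fast path covers the `int64` case, while the other two need a fallback.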

jreback · 2019-12-10T12:31:49Z

Bugs can certainly be addressed in a followup, but can you add an asv? I think we have a number of construction asvs already.

topper-123 · 2019-12-10T14:29:42Z

I've added an ASV. I can take on #30173 in a followup for both Series and DataFrames.
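
For readers unfamiliar with asv, a benchmark for this construction might look roughly like the sketch below. It is not the benchmark added in commit b0a3e94; the class name `FrameFromRange` and the size are assumptions chosen to mirror the timings at the top of the thread. asv calls `setup` before timing and times methods whose names start with `time_`.

```python
import pandas as pd


class FrameFromRange:
    # Hypothetical benchmark class, not the one added in this PR.

    def setup(self):
        # Built outside the timed method so only construction is measured.
        self.data = range(1_000_000)

    def time_frame_from_range(self):
        pd.DataFrame(self.data)


# The class can also be exercised by hand, outside of asv:
bench = FrameFromRange()
bench.setup()
bench.time_frame_from_range()
```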

jreback · 2019-12-10T14:48:48Z

lgtm. merge on green.

topper-123 force-pushed the frame_range branch from 31f1a6e to fca3760 Compare December 9, 2019 22:12

WillAyd added the Performance label Dec 9, 2019

WillAyd approved these changes Dec 9, 2019

WillAyd added this to the 1.0 milestone Dec 9, 2019

topper-123 added the Constructors and DataFrame labels and removed the Constructors label Dec 9, 2019

jschendel reviewed Dec 10, 2019

topper-123 mentioned this pull request Dec 10, 2019

BUG: Series(range(...)) fails when range contains values not supported by int64 #30173

Closed

topper-123 added 2 commits December 10, 2019 14:24

Improve perf initalizing DataFrame with a range

50f9100

Add ASV

b0a3e94

topper-123 force-pushed the frame_range branch from fca3760 to b0a3e94 Compare December 10, 2019 14:25

topper-123 merged commit 2470690 into pandas-dev:master Dec 10, 2019

topper-123 deleted the frame_range branch December 10, 2019 15:03

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

PERF: Improve perf initalizing DataFrame with a range (pandas-dev#30171)

a08142d

proost pushed a commit to proost/pandas that referenced this pull request Dec 19, 2019

PERF: Improve perf initalizing DataFrame with a range (pandas-dev#30171)

56bbce9
