-
-
Notifications
You must be signed in to change notification settings - Fork 18.1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PERF: Improve perf initalizing DataFrame with a range #30171
Conversation
31f1a6e
to
fca3760
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Can you check we already have a test for this construction? |
nice. does this need/merit an asv? |
It's tested in |
IMO that would not be really important, though I'm not against it either. This pattern is probably mostly used interactively, in ipython and jupyter etc. and little otherwise. |
if not isinstance(values, (np.ndarray, ABCSeries, Index)): | ||
if len(values) == 0: | ||
return np.empty((0, 0), dtype=object) | ||
elif isinstance(values, range): | ||
arr = np.arange(values.start, values.stop, values.step, dtype="int64") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems like some care is needed here in respect to dtypes. Specifically if the range
contains values only supported by uint64
, or values only supported by Python integers.
For example, the following works on master
:
In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
Out[2]:
0
0 9223372036854775808
1 9223372036854775809
2 9223372036854775810
3 9223372036854775811
In [3]: _.dtypes
Out[3]:
0 uint64
dtype: object
In [4]: pd.DataFrame(range(2**73, 2**73 + 4))
Out[4]:
0
0 9444732965739290427392
1 9444732965739290427393
2 9444732965739290427394
3 9444732965739290427395
In [5]: _.dtypes
Out[5]:
0 object
dtype: object
But both fail with the changes in this PR:
In [2]: pd.DataFrame(range(2**63, 2**63 + 4))
---------------------------------------------------------------------------
OverflowError: Python int too large to convert to C long
In [3]: pd.DataFrame(range(2**73, 2**73 + 4))
---------------------------------------------------------------------------
OverflowError: Python int too large to convert to C long
Admittedly, this is a bit of a corner case. It also looks like the issue isn't limited to the PR, as the Series
equivalent of the above fails on master
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
xref #30173 for the failing Series
example on master
; I'd expect that both issues could be solved in a similar way.
bugs can certainly be addressed in a followup, but can you add an asv ? I think we have a number of construction asv's already |
fca3760
to
b0a3e94
Compare
I've added a ASV. I can take on #30173 in a followup for both Series and Dataframes. |
lgtm. merge on green. |
Performance improvement when initalizing a DataFrame with a range: