-
-
Notifications
You must be signed in to change notification settings - Fork 18.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
REGR: DataFrame.agg with axis=1, EA dtype, and duplicate index #42449
Conversation
…anspose_regression � Conflicts: � doc/source/whatsnew/v1.3.1.rst
pandas/core/frame.py
Outdated
result = self._constructor( | ||
dict(zip(self.index, new_values)), index=self.columns | ||
values.T, index=self.columns, columns=self.index, dtype=dtype |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
values.T is going to cast to ndarray, which often means object
i think could use DataFrame.from_arrays?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see a MultiIndex.from_arrays, but not DataFrame. Here are some timings:
size = 1000
df = pd.DataFrame({'a': size * [1, 2, 3]}, dtype='category')
df_T = df.T
%timeit df.T
%timeit df_T.T
PR:
249 ms ± 6.39 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
92.3 ms ± 6.51 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Master:
237 ms ± 4.78 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
89.7 ms ± 6.87 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
so certainly a slight regression. Will continue looking to see if avoiding object is possible.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, found it - _from_arrays
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New timing with the update
234 ms ± 4.46 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
85 ms ± 4.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
thanks @rhshadrach |
@meeseeksdev backport 1.3.x |
…pe, and duplicate index
Something went wrong ... Please have a look at my logs. |
…plicate index (#42460) Co-authored-by: Richard Shadrach <[email protected]>
in pandas 0.25.3 the returned DataFrame had object dtype.
on master, we now have value dependent behaviour
|
@simonjayhawkins I think the column dtypes, not values, are what determines the resulting dtype. As such, I would not describe this as value-dependent behavior. The result looks like the correct one to me, but can open an issue if you would like this discussed further. |
The result is now certainly consistent with the result for a unique index. (which also changed behavior between 0.25.3 and 1.0.0)
this has been backported for a patch release. and we could release anytime dependent on number and severity of reported regression. My preference is to not having an open issue, but to keep the 1.3.x ready for release. Any reason not just to revert #40428 instead? |
Are you saying we should not fix the transpose regression from 0.25.3 -> 1.0.0 in 1.3.1? If that's the case, then I would support reverting #40428, assuming it does fix the issue in |
I am not saying that. I basically just wanted clarification of why #40428 was not reverted. I assume that because the root cause was determined as the change in #30091, that was fixed instead? However, that change does not revert to the behavior in 0.25.3 because of the dtype change. (but that is also reasonable as the new behavior matches that of a unique index)
it is ok to fix prior regressions or bug fixes in patch releases, so that alone is not a reason to prefer the revert option.
that is odd. |
Thanks @simonjayhawkins, your assessment is correct - the root cause of the agg regression was the transpose regression. Both are tested in this PR.
The behavior in 0.25.3 is undesirable. When a DataFrame has a single EA dtype, 0.25.3's transpose would convert it to object whereas it should remain the EA dtype. #30091 fixed this, but at the expense of silently dropping data when the index has duplicates. This PR maintains the desired behavior of #30091, but in a way that does not drop data when the index has duplicates. |
The issue in transpose was a bug that was introduced in 1.0.0, but induced a regression in 1.3.0 due to its use in
DataFrame.agg
.