-
-
Notifications
You must be signed in to change notification settings - Fork 18.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
BUG: Preserve Series/DataFrame subclasses through groupby operations #33884
BUG: Preserve Series/DataFrame subclasses through groupby operations #33884
Conversation
I updated my branch with additional tests, and a revised entry in the whatsnew rst file. Please let me know if there are any additional recommendations for how to proceed with this PR. |
Thanks for the mypy advice -- I managed to refactor that section a bit to only call The one additional change made here is to make sure |
@jreback -- please let me know if there are any additional changes to this PR that you think are necessary. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
small comment, ping on green.
@jreback I made the recommend changes:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JBGreisman generally lgtm. one question.
@simonjayhawkins Thanks for the comment -- I agree that line should have used |
pandas/core/groupby/groupby.py
Outdated
@@ -1182,6 +1182,16 @@ class GroupBy(_GroupBy[FrameOrSeries]): | |||
more | |||
""" | |||
|
|||
@property | |||
def _constructor(self) -> Type["Series"]: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm this should be FrameOrSeries I think.
why is there an else here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I had thought this should be a Series because if self.obj
is a DataFrame, it returns self.obj._constructor_sliced
. Do you think I should name this property _series_constructor
or something comparable to make that behavior more apparent?
The else was there because mypy
was complaining about a missing return statement. I can restructure this with an assertion to avoid an else statement and keep mypy
from complaining:
@property
def _series_constructor(self) -> Type["Series"]:
# GH28330 preserve subclassed Series/DataFrames
if isinstance(self.obj, DataFrame):
return self.obj._constructor_sliced
assert isinstance(self.obj, Series)
return self.obj._constructor
Please let me know if you have a better way to structure this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes pls use an assertion
maybe @simonjayhawkins or @WillAyd can help with the annotation itself
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks -- I changed _constructor
to use the assertion as above. I think it would also make sense to change the name of the property to _series_constructor
in order to clarify the return type, but I'll hold off for additional comments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the annotation is correct given the way it is implemented, though I am not sure about the implementation. Why do we need to dispatch to constructor_sliced
for DataFrames? Seems slightly unnatural to have to force that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This ends up being used in GroupBy.ngroup()
, GroupBy.cumcount()
, and GroupBy.size()
. Prior to these changes, these methods had returned Series
, regardless of whether they were called from a SeriesGroupBy
or DataFrameGroupBy
object. I had updated this to still return a Series
-type while preserving the subclasses -- to avoid things reverting back to pd.Series
if they were called from a subclassed DataFrame/Series.
As such, the motivation for dispatching to constructor_sliced
for DataFrames was to avoid changing the return type for these different GroupBy
methods from their prior behavior. Do you think it would make sense to make a larger change here that alters these functions to have different return types if called from SeriesGroupBy
vs. DataFrameGroupBy
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm that's unfortunate... Can you rename this property to _obj_constructor
instead? I find the current name a little confusing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Or maybe even _obj_1d_constructor
to be even more explicit. This is definitely for special cases so want to signal as such
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah -- I think _obj_1d_constructor
is most clear. I'll make the changes.
As per the discussion above, the GroupBy property has been renamed to |
thanks @JBGreisman very nice, thanks for sticking with it and being responsive! |
no worries -- thanks for the help/suggestions! |
This pull request fixes a bug that caused subclassed Series/DataFrames to revert to
pd.Series
andpd.DataFrame
aftergroupby()
operations. This is a follow-up on a previously abandoned pull request (#28573)