Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Preserve Series/DataFrame subclasses through groupby operations #33884

Merged
merged 38 commits into from
May 13, 2020

Conversation

JBGreisman
Copy link
Contributor

@JBGreisman JBGreisman commented Apr 30, 2020

This pull request fixes a bug that caused subclassed Series/DataFrames to revert to pd.Series and pd.DataFrame after groupby() operations. This is a follow-up on a previously abandoned pull request (#28573)

doc/source/whatsnew/v1.1.0.rst Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_groupby.py Outdated Show resolved Hide resolved
pandas/tests/groupby/test_groupby.py Outdated Show resolved Hide resolved
@JBGreisman
Copy link
Contributor Author

I updated my branch with additional tests, and a revised entry in the whatsnew rst file.

Please let me know if there are any additional recommendations for how to proceed with this PR.

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@JBGreisman JBGreisman requested a review from jreback May 1, 2020 21:08
@jreback jreback added this to the 1.1 milestone May 2, 2020
pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@JBGreisman
Copy link
Contributor Author

Thanks for the mypy advice -- I managed to refactor that section a bit to only call cast() if needed and avoid calling the constructor again.

The one additional change made here is to make sure GroupBy.size(), GroupBy.ngroup(), and GroupBy.cumcount() preserve subclassed types. While you had previously mentioned just keeping these return types as Series, I think these changes minimize the possibility for surprise when subclassing DataFrame/Series by ensuring that groupby reductions/transformations preserve the subclasses.

@JBGreisman JBGreisman requested a review from jreback May 4, 2020 14:51
@JBGreisman
Copy link
Contributor Author

@jreback -- please let me know if there are any additional changes to this PR that you think are necessary.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

small comment, ping on green.

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
pandas/core/groupby/groupby.py Outdated Show resolved Hide resolved
@JBGreisman
Copy link
Contributor Author

@jreback I made the recommend changes:

  • I reverted the changes in DataFrameGroupBy._cython_agg_blocks() because they did not prove necessary to pass my tests.
  • I added a GroupBy._constructor property to simplify the calls in ngroup, cumcount, and size.

Copy link
Member

@simonjayhawkins simonjayhawkins left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JBGreisman generally lgtm. one question.

pandas/core/groupby/generic.py Outdated Show resolved Hide resolved
@JBGreisman
Copy link
Contributor Author

@simonjayhawkins Thanks for the comment -- I agree that line should have used self.obj._constructor, and changed it

@@ -1182,6 +1182,16 @@ class GroupBy(_GroupBy[FrameOrSeries]):
more
"""

@property
def _constructor(self) -> Type["Series"]:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm this should be FrameOrSeries I think.

why is there an else here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had thought this should be a Series because if self.obj is a DataFrame, it returns self.obj._constructor_sliced. Do you think I should name this property _series_constructor or something comparable to make that behavior more apparent?

The else was there because mypy was complaining about a missing return statement. I can restructure this with an assertion to avoid an else statement and keep mypy from complaining:

@property
def _series_constructor(self) -> Type["Series"]:
    # GH28330 preserve subclassed Series/DataFrames                                                                                                                                                   
    if isinstance(self.obj, DataFrame):
        return self.obj._constructor_sliced
    assert isinstance(self.obj, Series)
    return self.obj._constructor

Please let me know if you have a better way to structure this.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes pls use an assertion

maybe @simonjayhawkins or @WillAyd can help with the annotation itself

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks -- I changed _constructor to use the assertion as above. I think it would also make sense to change the name of the property to _series_constructor in order to clarify the return type, but I'll hold off for additional comments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the annotation is correct given the way it is implemented, though I am not sure about the implementation. Why do we need to dispatch to constructor_sliced for DataFrames? Seems slightly unnatural to have to force that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ends up being used in GroupBy.ngroup(), GroupBy.cumcount(), and GroupBy.size(). Prior to these changes, these methods had returned Series, regardless of whether they were called from a SeriesGroupBy or DataFrameGroupBy object. I had updated this to still return a Series-type while preserving the subclasses -- to avoid things reverting back to pd.Series if they were called from a subclassed DataFrame/Series.

As such, the motivation for dispatching to constructor_sliced for DataFrames was to avoid changing the return type for these different GroupBy methods from their prior behavior. Do you think it would make sense to make a larger change here that alters these functions to have different return types if called from SeriesGroupBy vs. DataFrameGroupBy?

Copy link
Member

@WillAyd WillAyd May 13, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm that's unfortunate... Can you rename this property to _obj_constructor instead? I find the current name a little confusing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Or maybe even _obj_1d_constructor to be even more explicit. This is definitely for special cases so want to signal as such

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah -- I think _obj_1d_constructor is most clear. I'll make the changes.

@JBGreisman
Copy link
Contributor Author

As per the discussion above, the GroupBy property has been renamed to _obj_1d_constructor to clarify its expected return type, and I refactored it to avoid using an else statement.

@jreback jreback merged commit aa3e611 into pandas-dev:master May 13, 2020
@jreback
Copy link
Contributor

jreback commented May 13, 2020

thanks @JBGreisman very nice, thanks for sticking with it and being responsive!

@JBGreisman
Copy link
Contributor Author

no worries -- thanks for the help/suggestions!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

groupby and resample methods do not preserve subclassed data structures
4 participants