Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

API: _constructor and subclasses #32638

Open
TomAugspurger opened this issue Mar 11, 2020 · 8 comments
Open

API: _constructor and subclasses #32638

TomAugspurger opened this issue Mar 11, 2020 · 8 comments
Labels
API Design Enhancement Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects

Comments

@TomAugspurger
Copy link
Contributor

Followup to #31925.

Over there, we removed _ensure_type because it broke things like

import pandas as pd
import pandas.testing as pdt

class MyDataFrame(pd.DataFrame):
    pass

mdf = MyDataFrame()
df = pd.DataFrame()
# In DataFrame.reindex_like, the derived class is compared to
# the base class. The following line will throw
# 'AssertionError: <class 'pandas.core.frame.DataFrame'>'
mdf.reindex_like(df)

However, I don't think that example is very compelling. It's strange for MyDataFrame.reindex to return anything other than MyDataFrame. The root cause is MyDataFrame._constructor returning DataFrame, rather than MyDataFrame.

I think we should somehow get subclasses to have their _constructor return something that returns their subclass. We can

  1. Require this, by doing a runtime check on the DataFrame._constructor
  2. Change DataFrame._constructor to return type(self)

2 is nicer, but it requires that the subclasses __init__ method be compatible enough with DataFrame's. It's probably safe to assume that, but it might be an API breaking change.

@jbrockmendel
Copy link
Member

If we were to go for option 2, why not just get rid of _constructor and use type(self) in the relevant places? The important ones are _constructor_sliced and _constructor_expanddim

@jorisvandenbossche
Copy link
Member

I don't think we need to require this (we also don't require _constructor to be a class, it can be a constructor function).

When having a DataFrame subclass, not every operation in pandas necessarily needs to return that subclass, depending on your use case and characteristics of your subclass.
For example, in GeoPandas, an operation like reindex can remove the geometry column, in which case it makes sense to return a normal DataFrame.

@jorisvandenbossche
Copy link
Member

Whether to change the default (item 2), that might make sense (it seems a more sensible default, maybe).
But, that's also a breaking change, so if we want to change that, I would keep that for 2.0.

If we want to do larger changes on how this _constructor mechanism is working, I think we should also seek feedback of more people that are actually using subclasses.

@jorisvandenbossche jorisvandenbossche added API Design Needs Discussion Requires discussion from core team before further action labels Mar 12, 2020
@jorisvandenbossche jorisvandenbossche changed the title _constructor and subclasses API: _constructor and subclasses Mar 12, 2020
@jbrockmendel
Copy link
Member

Another candidate criterion that I'm finding in #33357 would be desirable: make this a class attribute instead of an instance attribute, like Timestamp.min, so that DataFrame._constructor is DataFrame instead of a property object.

Granted, the semantics of implementing that are pretty ugly.

@jbrockmendel
Copy link
Member

If we want to do larger changes on how this _constructor mechanism is working, I think we should also seek feedback of more people that are actually using subclasses.

@jorisvandenbossche you mentioned geopandas returns a callable instead of a class. How difficult would it be to change that?

Are we aware of any other downstream libraries that do that?

@jorisvandenbossche
Copy link
Member

See my explanation above #32638 (comment), about that this is actually a feature for geopandas that we don't have to return a class.

I suppose I could make a dummy class with a __new__ that has the same behaviour as the callable, but not sure what the value of that would be then.

So my question back is: what are the use cases for requiring it to be a class? (apart from the PR with the _init_mgr)

@jbrockmendel
Copy link
Member

Just grepping for \._constructor(_sliced|_expanddim)?\. shows that in core.groupby.generic we use self._obj._constructor._from_arrays

@jorisvandenbossche
Copy link
Member

That was recently added: #33884

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
API Design Enhancement Needs Discussion Requires discussion from core team before further action Subclassing Subclassing pandas objects
Projects
None yet
Development

No branches or pull requests

4 participants