_metadata items of subclassed pd.Series are not propagated into corresponding SubclassedDataFrame #32860

Open
johannes-mueller opened this issue Mar 20, 2020 · 9 comments
Labels
Bug · metadata (_metadata, .attrs) · Subclassing (Subclassing pandas objects)

Comments

@johannes-mueller
Contributor

Code Sample

import pandas as pd

# Define a subclass for pd.Series with metadata myprop
class SubclassedSeries(pd.Series):

    _metadata = ['myprop']

    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        return SubclassedDataFrame

# Define a subclass for pd.DataFrame that slices to SubclassedSeries
class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries

# make an instance of SubclassedSeries and set myprop
sr = SubclassedSeries([1,2,3])
sr.myprop = 'foo'
print("myprob is", sr.myprop) # Works

# put the SubclassedSeries object into a SubclassedDataFrame and try to get myprop
df = SubclassedDataFrame({'a': sr})
print(type(df.a)) # Works (prints <class '__main__.SubclassedSeries'>)
print("myprob is", df.a.myprop) # does not work (AttributeError: 'SubclassedSeries' object has no attribute 'myprop')

Problem description

_metadata items of pd.Series subclasses are not propagated when the SubclassedSeries object is put into a SubclassedDataFrame. I would expect myprop to be available in the new SubclassedDataFrame.

Expected Output

myprob is foo
<class '__main__.SubclassedSeries'>
myprob is foo

Output of pd.show_versions()

INSTALLED VERSIONS
------------------
commit : None
python : 3.8.1.final.0
python-bits : 64
OS : Linux
OS-release : 5.3.0-42-lowlatency
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : de_DE.UTF-8
LOCALE : de_DE.UTF-8

pandas : 1.0.2
numpy : 1.18.1
pytz : 2019.3
dateutil : 2.8.1
pip : 20.0.2
setuptools : 46.0.0.post20200309
Cython : 0.29.15
pytest : 5.4.1
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.1
IPython : 7.13.0
pandas_datareader: None
bs4 : None
bottleneck : None
fastparquet : None
gcsfs : None
lxml.etree : None
matplotlib : 3.1.3
numexpr : 2.7.1
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : None
pytables : None
pytest : 5.4.1
pyxlsb : None
s3fs : None
scipy : 1.4.1
sqlalchemy : None
tables : 3.6.1
tabulate : None
xarray : 0.15.0
xlrd : 1.2.0
xlwt : None
xlsxwriter : None
numba : None

@johannes-mueller
Contributor Author

Maybe related to #24685

@jorisvandenbossche
Member

@johannes-mueller the metadata item lives on the Series object. But when you put a Series in a DataFrame, pandas does not actually store the columns as Series objects; it stores them as arrays in an internal data structure (the BlockManager). So when you access a column of a SubclassedDataFrame, a new SubclassedSeries is created (using the _constructor_sliced callable).
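The behavior described above can be checked with a short script. This is only a sketch that reuses the class names from the issue; the exact result of the last line may differ between pandas versions (a missing attribute or None), so it reads the attribute defensively with getattr.

```python
import pandas as pd


class SubclassedSeries(pd.Series):
    _metadata = ["myprop"]

    @property
    def _constructor(self):
        return SubclassedSeries


class SubclassedDataFrame(pd.DataFrame):
    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries


sr = SubclassedSeries([1, 2, 3])
sr.myprop = "foo"

df = SubclassedDataFrame({"a": sr})
col = df["a"]

# The column is rebuilt from the stored array, so it is a different object
# than the Series that was passed in.
print(type(col).__name__)
print(col is sr)
# The metadata set on sr did not travel with the data.
print(getattr(col, "myprop", None))
```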

So what you want is not possible right now.

Some options:

  • if you are using ExtensionArrays, you can try to store the metadata at the array level, and then both SubclassedSeries and SubclassedDataFrame can get the metadata value from there
  • Doing more work in the constructors to preserve this information between dataframe / series (eg the SubclassedDataFrame constructor could check for the metadata of the passed values, and store this as well. Then when accessing a column, it could again be passed through to the SubclassedSeries)
  • there is also a new attrs machinery in the latest pandas release, see REF: Store metadata in an attrs dict #29062. But I am not yet familiar enough with that to know how the series/dataframe interaction works with that (cc @TomAugspurger)
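The second option (doing more work in the constructors) could look roughly like the sketch below. It is not an official pandas pattern: the `_col_props` attribute and the `__getitem__` override are hypothetical names and choices made for illustration, and only dict input is inspected. Operations that rename, reorder, or combine columns are not handled.

```python
import pandas as pd
from pandas.api.types import is_hashable


class SubclassedSeries(pd.Series):
    _metadata = ["myprop"]

    @property
    def _constructor(self):
        return SubclassedSeries


class SubclassedDataFrame(pd.DataFrame):
    # Hypothetical per-column store; not part of pandas itself.
    _metadata = ["_col_props"]

    def __init__(self, data=None, *args, **kwargs):
        super().__init__(data, *args, **kwargs)
        props = {}
        # Only dict input is inspected in this sketch.
        if isinstance(data, dict):
            for key, value in data.items():
                if isinstance(value, SubclassedSeries) and hasattr(value, "myprop"):
                    props[key] = value.myprop
        self._col_props = props

    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries

    def __getitem__(self, key):
        result = super().__getitem__(key)
        # Fall back to an empty dict if the store is missing or was reset.
        props = getattr(self, "_col_props", None) or {}
        if isinstance(result, SubclassedSeries) and is_hashable(key) and key in props:
            # Re-attach the value remembered at construction time.
            result.myprop = props[key]
        return result


sr = SubclassedSeries([1, 2, 3])
sr.myprop = "foo"
df = SubclassedDataFrame({"a": sr, "b": [4, 5, 6]})
print("myprop is", df["a"].myprop)
```

The design choice here is to keep the metadata on the DataFrame, keyed by column label, because the Series objects themselves are not retained once the data is stored.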

@TomAugspurger
Contributor

attrs doesn't currently propagate through, since DataFrame.__getitem__ doesn't call __finalize__ on a single-key indexer. It probably should.
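Whether attrs reaches a column depends on the installed pandas version, so a small probe like the following can be used to check locally. It is a sketch using only DataFrame.attrs and Series.attrs; no particular outcome is assumed for the `survived` flag.

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3]})
df.attrs["myprop"] = "foo"

# Single-key column access, the case discussed above.
col = df["a"]
survived = col.attrs.get("myprop") == "foo"

print("pandas", pd.__version__, "- attrs reached the column:", survived)
```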

@jbrockmendel added the metadata (_metadata, .attrs) label on Jun 7, 2020
@Flix6x
Contributor

Flix6x commented Jul 4, 2020

Issue #19850 seems related (thanks for that metadata label), which is about keeping around the metadata when going the other way, from a SubclassedDataFrame to a SubclassedSeries. Adapting the workaround posted there #19850 (comment) solved this problem for me, although I created my SubclassedDataFrame using the to_frame method rather than initializing with a dict as @johannes-mueller did.

First, add the _metadata to the SubclassedDataFrame, too:

_metadata = ['myprop']

Second, call __finalize__ when constructing:

    @property
    def _constructor_expanddim(self):
        def f(*args, **kwargs):
            # adapted from https://github.com/pandas-dev/pandas/issues/19850#issuecomment-367934440
            return SubclassedDataFrame(*args, **kwargs).__finalize__(self, method='inherit')

        return f

Then use:

df = sr.to_frame()
print("myprob is", df.myprop)

to obtain what you expected:

myprob is foo

However, for me, combining this workaround with initializing with a dict as you did still behaves unexpectedly. It stopped the AttributeError, but sets the property to None:

df = SubclassedDataFrame({'a': sr})
print("myprob is", df.a.myprop)

gives:

myprob is None
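For reference, here are the to_frame pieces above assembled into one self-contained script. This is a sketch of the workaround as described in this comment, not a recommended pandas API; the dict-initialization case is left out because its behavior is version dependent.

```python
import pandas as pd


class SubclassedSeries(pd.Series):
    _metadata = ["myprop"]

    @property
    def _constructor(self):
        return SubclassedSeries

    @property
    def _constructor_expanddim(self):
        def f(*args, **kwargs):
            # Build the frame, then copy the shared _metadata items
            # over from this Series.
            return SubclassedDataFrame(*args, **kwargs).__finalize__(
                self, method="inherit"
            )

        return f


class SubclassedDataFrame(pd.DataFrame):
    # The frame must declare the same _metadata item to receive it.
    _metadata = ["myprop"]

    @property
    def _constructor(self):
        return SubclassedDataFrame

    @property
    def _constructor_sliced(self):
        return SubclassedSeries


sr = SubclassedSeries([1, 2, 3])
sr.myprop = "foo"

df = sr.to_frame()
print("myprop is", df.myprop)
```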

@samuelduchesne

@Flix6x Somehow, returning a function instead of a class fails at line 396:

# Standardize axis parameter to int
if isinstance(sample, ABCSeries):
    axis = sample._constructor_expanddim._get_axis_number(axis)
else:
    axis = sample._get_axis_number(axis)

with error AttributeError: 'function' object has no attribute '_get_axis_number'

Any workarounds?

@samuelduchesne

samuelduchesne commented Sep 23, 2020

with error AttributeError: 'function' object has no attribute '_get_axis_number'

The issue happens on repr when the number of lines is high enough that concat gets called. One workaround is to set the attribute on the function f:

@property
def _constructor_expanddim(self):
    def f(*args, **kwargs):
        # adapted from https://github.com/pandas-dev/pandas/issues/19850#issuecomment-367934440
        return SubclassedDataFrame(*args, **kwargs).__finalize__(self, method="inherit")
    # EnergySeries is the Series subclass from the commenter's own code.
    f._get_axis_number = super(EnergySeries, self)._get_axis_number

    return f

It seems that this is the only place where an attribute of _constructor_expanddim is accessed rather than it simply being called.

@jorisvandenbossche
Member

@samuelduchesne I would say that is a bug: we generally should not assume that _constructor_expanddim is a class with such a method, but only assume it is a general callable.

It seems this was introduced as part of trying to avoid a DataFrame import in #34837. A PR to remove this usage of ._constructor_expanddim._get_axis_number is certainly welcome.

@samuelduchesne

@jorisvandenbossche I'm not sure what the usage should be replaced with in an eventual PR. I'm not familiar with _get_axis_number...

@jorisvandenbossche
Member

I opened a PR to fix this _get_axis_number case: #46018


7 participants