Failure in pandas TestDataFrameToXArray.test_to_xarray_index_types #9661

Open
shoyer opened this issue Oct 22, 2024 · 9 comments

@shoyer
Member

shoyer commented Oct 22, 2024

It appears that #9520 may have broken some upstream pandas tests, specifically testing round-trips with various index types:
https://github.com/pandas-dev/pandas/blob/e78ebd3f845c086af1d71c0604701ec49df97228/pandas/tests/generic/test_to_xarray.py#L32

Here's a minimal test case:

import pandas as pd
import numpy as np

cat = pd.Categorical(list("abcd"))
df = pd.DataFrame({"f": cat}, index=cat)
restored = df.to_xarray().to_dataframe()
print(restored.index)  # Index(['a', 'b', 'c', 'd'], dtype='object', name='index')
print(df.index)  # CategoricalIndex(['a', 'b', 'c', 'd'], categories=['a', 'b', 'c', 'd'], ordered=False, dtype='category')

I'm not sure if this is a pandas or xarray issue, but it's one or the other!

(My guess is that most of these tests in pandas should probably live in xarray instead, given that we implement all the conversion logic.)

Originally posted by @shoyer in #9520 (comment)
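
To see which dtypes a round trip changes, a small comparison helper can be useful. This is only a sketch: `dtype_changes` is a hypothetical name, not part of pandas or xarray, and the `restored` frame below is a hand-built stand-in that mimics the reported object index rather than the output of an actual `df.to_xarray().to_dataframe()` call.

```python
import pandas as pd


def dtype_changes(before: pd.DataFrame, after: pd.DataFrame) -> dict:
    """Map each column (and the index) to (old, new) dtype when they differ."""
    changes = {}
    if before.index.dtype != after.index.dtype:
        changes["<index>"] = (str(before.index.dtype), str(after.index.dtype))
    for name in before.columns:
        if name in after.columns and before[name].dtype != after[name].dtype:
            changes[name] = (str(before[name].dtype), str(after[name].dtype))
    return changes


cat = pd.Categorical(list("abcd"))
df = pd.DataFrame({"f": cat}, index=cat)

# Stand-in for df.to_xarray().to_dataframe() (assumption): rebuild the index
# as plain object dtype, as in the printed output above.
restored = df.copy()
restored.index = pd.Index(list("abcd"), dtype=object, name="index")

print(dtype_changes(df, restored))
```

With xarray installed, `restored` could be replaced by the real round trip to list every column whose dtype is not preserved.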

@shoyer
Member Author

shoyer commented Oct 22, 2024

Here's the error message from pandas's TestDataFrameToXArray.test_to_xarray_index_types[string]:

AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different

Attribute "dtype" are different
[left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
[right]: object
self = <pandas.tests.generic.test_to_xarray.TestDataFrameToXArray object at 0x13d4fa7cbe90>
index_flat = Index(['pandas_0', 'pandas_1', 'pandas_2', 'pandas_3', 'pandas_4', 'pandas_5',
       'pandas_6', 'pandas_7', 'pandas_...pandas_93', 'pandas_94', 'pandas_95',
       'pandas_96', 'pandas_97', 'pandas_98', 'pandas_99'],
      dtype='object')
df = bar       a  b  c    d      e  f          g                         h
foo                                             ....0   True  c 2013-01-03 2013-01-03 00:00:00-05:00
pandas_3  d  4  6  7.0  False  d 2013-01-04 2013-01-04 00:00:00-05:00
using_infer_string = False

    def test_to_xarray_index_types(self, index_flat, df, using_infer_string):
        index = index_flat
        # MultiIndex is tested in test_to_xarray_with_multiindex
        if len(index) == 0:
            pytest.skip("Test doesn't make sense for empty index")
    
        from xarray import Dataset
    
        df.index = index[:4]
        df.index.name = "foo"
        df.columns.name = "bar"
        result = df.to_xarray()
        assert result.sizes["foo"] == 4
        assert len(result.coords) == 1
        assert len(result.data_vars) == 8
        tm.assert_almost_equal(list(result.coords.keys()), ["foo"])
        assert isinstance(result, Dataset)
    
        # idempotency
        # datetimes w/tz are preserved
        # column names are lost
        expected = df.copy()
        expected["f"] = expected["f"].astype(
            object if not using_infer_string else "string[pyarrow_numpy]"
        )
        expected.columns.name = None
>       tm.assert_frame_equal(result.to_dataframe(), expected)
E       AssertionError: Attributes of DataFrame.iloc[:, 5] (column name="f") are different
E       
E       Attribute "dtype" are different
E       [left]:  CategoricalDtype(categories=['a', 'b', 'c', 'd'], ordered=False, categories_dtype=object)
E       [right]: object

tests/generic/test_to_xarray.py:58: AssertionError

@shoyer
Member Author

shoyer commented Oct 22, 2024

cc @ilan-gold

@shoyer shoyer added the bug label Oct 22, 2024
@ilan-gold
Contributor

ilan-gold commented Oct 23, 2024

On it! More generally, @shoyer, with this extension array stuff, I would be happy to have a Zoom call to go over all the various pandas adapters in the codebase (I think they can be cut down somewhat, since a lot of the code has to do with numpy conversion) and/or sound out running the pandas integration tests in this repo. We are doing that now in https://github.com/scverse/integration-testing/pull/1/files, where we check out everyone's repo and then test it against the core data structure on main of both. We could do something similar here if you wanted! I think that would minimize friction (i.e., no need to migrate tests).

@ilan-gold
Contributor

@shoyer This issue is too tied up with datetimes, see #9618. I will need to redo what I've done to work off that branch now. The issue is that pandas>2.0 handles datetimes as extension arrays, so if we start letting categorical indices into our indexing adapter, we let everything in, which breaks almost all of the datetime conversion.

@dcherian
Contributor

dcherian commented Oct 23, 2024

Can we explicitly cast DatetimeArray to datetime64[ns] for now? This won't always work, but we can just error out in that case.
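
A minimal sketch of that kind of explicit cast, using only pandas and NumPy. The helper name `to_datetime64_ns` is hypothetical, and treating timezone-aware data as the case that errors out is an assumption; this is not xarray's actual implementation.

```python
import numpy as np
import pandas as pd


def to_datetime64_ns(values) -> np.ndarray:
    """Cast datetime-like pandas data to a NumPy datetime64[ns] array.

    Raises TypeError for timezone-aware data, which datetime64[ns]
    cannot represent without dropping the zone (assumed failure case).
    """
    dtype = getattr(values, "dtype", None)
    if isinstance(dtype, pd.DatetimeTZDtype):
        raise TypeError(f"cannot cast timezone-aware dtype {dtype} to datetime64[ns]")
    return np.asarray(values, dtype="datetime64[ns]")


# Timezone-naive DatetimeArray: the cast succeeds.
naive = pd.date_range("2022-08-15", periods=3, freq="7D").array
print(to_datetime64_ns(naive).dtype)

# Timezone-aware DatetimeArray: the helper errors out instead of guessing.
aware = pd.date_range("2022-08-15", periods=3, freq="7D", tz="UTC").array
try:
    to_datetime64_ns(aware)
except TypeError as err:
    print(err)
```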

@nataziel

This is definitely causing problems on v2024.10.0. I'm now getting an error when going from DataFrame -> Dataset -(error here)> DataArray. I'm starting with a DataFrame with a datetime index and 20ish columns. Relevant parts of the error trace:

/usr/local/lib/python3.10/dist-packages/xarray/core/dataset.py:7274: in to_dataarray
    data = duck_array_ops.stack([b.data for b in broadcast_vars], axis=0)
/usr/local/lib/python3.10/dist-packages/xarray/core/duck_array_ops.py:384: in stack
    return xp.stack(as_shared_dtype(arrays, xp=xp), axis=axis)
/usr/local/lib/python3.10/dist-packages/xarray/core/extension_array.py:100: in __array_function__
    res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

arr = [<FloatingArray>
[-0.7418799885470463, -0.8171209666969853, -0.8805057639294221]
Length: 3, dtype: Float64, <FloatingA...: Float64, <FloatingArray>
[0.8056743037815669, 0.9631290662236986, 0.9960250670661343]
Length: 3, dtype: Float64, ...]
axis = 0

    @implements(np.stack)
    def __extension_duck_array__stack(arr: T_ExtensionArray, axis: int):
>       raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")
E       NotImplementedError: Cannot stack 1d-only pandas categorical array.

/usr/local/lib/python3.10/dist-packages/xarray/core/extension_array.py:41: NotImplementedError

@ilan-gold
Contributor

@nataziel Can you send the dataframe you were using? Or ideally a minimal reproducer?

@shoyer I was not aware xarray "stacked" pandas 1D objects - we can just make it do this with .to_numpy or similar and hope for the best.
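
A rough sketch of the `.to_numpy` idea, outside of xarray. It assumes the inputs are nullable numeric arrays such as `Float64`; the helper name `stack_as_numpy` is hypothetical and this is not how xarray's `__extension_duck_array__stack` is written.

```python
import numpy as np
import pandas as pd


def stack_as_numpy(arrays, axis: int = 0) -> np.ndarray:
    """Stack 1D arrays after converting pandas extension arrays to NumPy.

    Assumes numeric data: missing values (pd.NA) become NaN in float64.
    """
    dense = [
        a.to_numpy(dtype="float64", na_value=np.nan)
        if isinstance(a, pd.api.extensions.ExtensionArray)
        else np.asarray(a)
        for a in arrays
    ]
    return np.stack(dense, axis=axis)


a = pd.array([1.0, None], dtype="Float64")
b = pd.array([3.0, 4.0], dtype="Float64")
stacked = stack_as_numpy([a, b])
print(stacked.shape, stacked.dtype)
```

The trade-off is that the nullable dtype is lost: the result is a plain float64 array with NaN in place of `pd.NA`.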

@nataziel

nataziel commented Oct 25, 2024

df = pd.DataFrame(
    {
        "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
        "date": [
            Timestamp("2022-08-15 00:00:00"),
            Timestamp("2022-08-22 00:00:00"),
            Timestamp("2022-08-29 00:00:00"),
        ],
    },
)
df = df.astype("Float64", errors="ignore")

mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()

the above works in 2024.9.0 but not 2024.10.0

@nataziel

nataziel commented Dec 6, 2024

I was just testing the above with xarray==2024.11.0 and realised it's not as easily reproducible as it could be so try this:

import pandas as pd
import xarray as xr
from pandas import Timestamp

def main():
    df = pd.DataFrame(
        {
            "sin_order_1_year": [-0.7418799885470463, -0.8171209666969853, -0.8805057639294221],
            "date": [
                Timestamp("2022-08-15 00:00:00"),
                Timestamp("2022-08-22 00:00:00"),
                Timestamp("2022-08-29 00:00:00"),
            ],
        },
    )
    df = df.astype("Float64", errors="ignore")

    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()


if __name__ == "__main__":
    main()

still failing in 2024.11.0 with this error

Traceback (most recent call last):
  File "C:\Users\me\git\xarray_test\main.py", line 22, in <module>
    main()
    ~~~~^^
  File "C:\Users\me\git\xarray_test\main.py", line 18, in main
    mydataarray = xr.Dataset.from_dataframe(df.set_index("date")).to_array()
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\dataset.py", line 7374, in to_array
    return self.to_dataarray(dim=dim, name=name)
           ~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\dataset.py", line 7357, in to_dataarray
    data = duck_array_ops.stack([b.data for b in broadcast_vars], axis=0)
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\duck_array_ops.py", line 397, in stack
    return xp.stack(as_shared_dtype(arrays, xp=xp), axis=axis)
           ~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\extension_array.py", line 100, in __array_function__
    res = HANDLED_EXTENSION_ARRAY_FUNCTIONS[func](*args, **kwargs)
  File "C:\Users\me\git\xarray_test\.venv\Lib\site-packages\xarray\core\extension_array.py", line 41, in __extension_duck_array__stack
    raise NotImplementedError("Cannot stack 1d-only pandas categorical array.")
NotImplementedError: Cannot stack 1d-only pandas categorical array.
