
BUG: multi-index on columns with bool level values does not roundtrip through parquet #60508

Open
3 tasks done
zpincus opened this issue Dec 6, 2024 · 4 comments · May be fixed by #60519
Labels: Bug · IO Parquet (parquet, feather) · Upstream issue (related to a pandas dependency)

Comments


zpincus commented Dec 6, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(True, 'B'), (False, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # fails

# now save out with multi-index on index instead of columns:
df.T.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds

# now save out with int instead of bool index:
df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(1, 'B'), (0, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds

Issue Description

Parquet IO with multi-index indices or columns is supported. However, if the multi-index contains a level with bool values and that multi-index is on the columns, the parquet file can be written with the pyarrow engine but cannot be read back in with pyarrow.

The traceback I get is below:

Traceback (most recent call last):
  File "<python-input-0>", line 5, in <module>
    pd.read_parquet('test.parquet', engine='pyarrow') # fails
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 649, in read_parquet
    return impl.read(
           ~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 270, in read
    result = arrow_table_to_pandas(
        pa_table,
        dtype_backend=dtype_backend,
        to_pandas_kwargs=to_pandas_kwargs,
    )
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/_util.py", line 86, in arrow_table_to_pandas
    df = table.to_pandas(types_mapper=types_mapper, **to_pandas_kwargs)
  File "pyarrow/array.pxi", line 887, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 5132, in pyarrow.lib.Table._to_pandas
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 790, in table_to_dataframe
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 928, in _deserialize_column_index
    columns = _reconstruct_columns_from_metadata(columns, column_indexes)
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 1145, in _reconstruct_columns_from_metadata
    return pd.MultiIndex(new_levels, labels, names=columns.names)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 341, in __new__
    new_codes = result._verify_integrity()
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 427, in _verify_integrity
    raise ValueError(
        f"Level values must be unique: {list(level)} on level {i}"
    )
ValueError: Level values must be unique: [True, True] on level 0

Further note that fastparquet can neither read nor write such dataframes. A panoply of different errors arise on read/write of a multi-index with fastparquet, depending on whether the multi-index is on the index or the columns, and whether the index has level names or not. I (or someone) should probably open separate bugs on that...

NB: the issue reproduces in a clean environment with only python, pip, pandas (dev), and pyarrow/fastparquet installed directly.

Expected Behavior

Parquet IO should support bool multi-index levels on columns.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : a36c44e129bd2f70c25d5dec89cb2893716bdbf6
python                : 3.13.1
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Wed Jul 31 20:50:00 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T6031
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1757.ga36c44e129
numpy                 : 2.1.3
dateutil              : 2.9.0.post0
pip                   : 24.3.1
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
fastparquet           : 2024.11.0
fsspec                : 2024.10.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : 18.1.0
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : 2024.1
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
@zpincus zpincus added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 6, 2024

kevkle commented Dec 6, 2024

I would like to contribute.

sunlight798 (Contributor) commented

  1. The cause of the problem appears to be that the type of the index is not correctly marked when reading back with pyarrow, so the index that should be of bool type cannot be converted back to bool during pandas' subsequent type conversion.
  2. Another point is a quirk of converting object type to bool type: any non-zero, non-empty value is converted to True, which is why all of the index values shown in the error message are True.
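The second point can be seen directly with NumPy (a minimal illustration, not from the report):

```python
import numpy as np

# The level values come back from the Parquet metadata as strings;
# casting those strings with astype(bool) treats every non-empty
# string as truthy, so 'False' also becomes True.
level = np.array(['True', 'False'], dtype=object)
print(level.astype(bool))  # [ True  True]
```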

sunlight798 (Contributor) commented

take

rhshadrach (Member) commented

Thanks for the report! As @sunlight798 found, this is due to PyArrow's to_pandas converting values:

https://github.com/apache/arrow/blob/1b3caf6b232b7855956d3ec45ee95ede0492e78f/python/pyarrow/pandas_compat.py#L1136-L1137

PyArrow stores the column values as strings. I think what needs to happen here is that PyArrow needs to special-case dtype == bool and not use astype.

cc @jorisvandenbossche
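A minimal sketch of that special case (hypothetical; `restore_level` is an illustrative helper, not the actual PyArrow code):

```python
import numpy as np

def restore_level(values, dtype):
    """Convert level values stored as strings back to their original dtype."""
    arr = np.asarray(values, dtype=object)
    if dtype == np.dtype(bool):
        # astype(bool) would mark every non-empty string True,
        # so compare against the string 'True' instead.
        return arr == 'True'
    return arr.astype(dtype)

print(restore_level(['True', 'False'], np.dtype(bool)))  # [ True False]
```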

@rhshadrach rhshadrach added IO Parquet parquet, feather Upstream issue Issue related to pandas dependency and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 8, 2024