
BUG: multi-index on columns with bool level values does not roundtrip through parquet #60508

Open
3 tasks done
zpincus opened this issue Dec 6, 2024 · 4 comments · May be fixed by #60519
Labels: Bug · IO Parquet (parquet, feather) · Upstream issue (related to a pandas dependency)

Comments


zpincus commented Dec 6, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

import pandas as pd

df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(True, 'B'), (False, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # fails

# now save out with multi-index on index instead of columns:
df.T.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds

# now save out with int instead of bool index:
df = pd.DataFrame([[1, 2], [4, 5]], columns=pd.MultiIndex.from_tuples([(1, 'B'), (0, 'C')]))
df.to_parquet('test.parquet', engine='pyarrow')
pd.read_parquet('test.parquet', engine='pyarrow') # succeeds

Issue Description

Parquet IO with multi-index indices or columns is supported. However, if the multi-index contains a level with bool values and that multi-index is on the columns, the parquet file can be written with the pyarrow engine but cannot be read back in with pyarrow.

The traceback I get is below:

Traceback (most recent call last):
  File "<python-input-0>", line 5, in <module>
    pd.read_parquet('test.parquet', engine='pyarrow') # fails
    ~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 649, in read_parquet
    return impl.read(
           ~~~~~~~~~^
        path,
        ^^^^^
    ...<6 lines>...
        **kwargs,
        ^^^^^^^^^
    )
    ^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/parquet.py", line 270, in read
    result = arrow_table_to_pandas(
        pa_table,
        dtype_backend=dtype_backend,
        to_pandas_kwargs=to_pandas_kwargs,
    )
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/io/_util.py", line 86, in arrow_table_to_pandas
    df = table.to_pandas(types_mapper=types_mapper, **to_pandas_kwargs)
  File "pyarrow/array.pxi", line 887, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 5132, in pyarrow.lib.Table._to_pandas
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 790, in table_to_dataframe
    columns = _deserialize_column_index(table, all_columns, column_indexes)
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 928, in _deserialize_column_index
    columns = _reconstruct_columns_from_metadata(columns, column_indexes)
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pyarrow/pandas_compat.py", line 1145, in _reconstruct_columns_from_metadata
    return pd.MultiIndex(new_levels, labels, names=columns.names)
           ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 341, in __new__
    new_codes = result._verify_integrity()
  File "/Users/zpincus/Documents/Research/Hexagon/pandas-test/.pixi/envs/default/lib/python3.13/site-packages/pandas/core/indexes/multi.py", line 427, in _verify_integrity
    raise ValueError(
        f"Level values must be unique: {list(level)} on level {i}"
    )
ValueError: Level values must be unique: [True, True] on level 0

Further note that fastparquet can neither read nor write such dataframes. A panoply of different errors arise on read/write of a multi-index with fastparquet, depending on whether the multi-index is on the index or the columns, and whether the index has level names or not. I (or someone) should probably open separate bugs on that...

NB: the issue reproduces in a clean environment with only python, pip, pandas (dev), and pyarrow/fastparquet installed directly.

Expected Behavior

Parquet IO should support bool multi-index levels on columns.

Installed Versions

INSTALLED VERSIONS
------------------
commit                : a36c44e129bd2f70c25d5dec89cb2893716bdbf6
python                : 3.13.1
python-bits           : 64
OS                    : Darwin
OS-release            : 23.6.0
Version               : Darwin Kernel Version 23.6.0: Wed Jul 31 20:50:00 PDT 2024; root:xnu-10063.141.1.700.5~1/RELEASE_ARM64_T6031
machine               : arm64
processor             : arm
byteorder             : little
LC_ALL                : None
LANG                  : en_US.UTF-8
LOCALE                : en_US.UTF-8

pandas                : 3.0.0.dev0+1757.ga36c44e129
numpy                 : 2.1.3
dateutil              : 2.9.0.post0
pip                   : 24.3.1
Cython                : None
sphinx                : None
IPython               : None
adbc-driver-postgresql: None
adbc-driver-sqlite    : None
bs4                   : None
blosc                 : None
bottleneck            : None
fastparquet           : 2024.11.0
fsspec                : 2024.10.0
html5lib              : None
hypothesis            : None
gcsfs                 : None
jinja2                : None
lxml.etree            : None
matplotlib            : None
numba                 : None
numexpr               : None
odfpy                 : None
openpyxl              : None
psycopg2              : None
pymysql               : None
pyarrow               : 18.1.0
pyreadstat            : None
pytest                : None
python-calamine       : None
pytz                  : 2024.1
pyxlsb                : None
s3fs                  : None
scipy                 : None
sqlalchemy            : None
tables                : None
tabulate              : None
xarray                : None
xlrd                  : None
xlsxwriter            : None
zstandard             : None
tzdata                : 2024.2
qtpy                  : None
pyqt5                 : None
@zpincus zpincus added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 6, 2024

kevkle commented Dec 6, 2024

I would like to contribute.

sunlight798 (Contributor) commented

  1. The cause of the problem appears to be that the type of the index is not correctly marked when reading back with pyarrow, so the index that should be of bool type cannot be converted back to bool during pandas' subsequent type conversion.
  2. Another point is a quirk of converting object type to bool type: any non-zero, non-empty value is converted to True, which is why all of the index values shown in the error message are True.
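The second point can be seen directly with NumPy (a minimal illustration, not from the report):

```python
import numpy as np

# The level values come back from the Parquet metadata as strings;
# casting those strings with astype(bool) treats every non-empty
# string as truthy, so 'False' also becomes True.
level = np.array(['True', 'False'], dtype=object)
print(level.astype(bool))  # [ True  True]
```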

sunlight798 (Contributor) commented

take

rhshadrach (Member) commented

Thanks for the report! As @sunlight798 found, this is due to PyArrow's to_pandas converting values:

https://github.com/apache/arrow/blob/1b3caf6b232b7855956d3ec45ee95ede0492e78f/python/pyarrow/pandas_compat.py#L1136-L1137

PyArrow stores the column values as strings. I think what needs to happen here is that PyArrow needs to special-case dtype == bool and not use astype.

cc @jorisvandenbossche
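A minimal sketch of that special case (hypothetical; `restore_level` is an illustrative helper, not the actual PyArrow code):

```python
import numpy as np

def restore_level(values, dtype):
    """Convert level values stored as strings back to their original dtype."""
    arr = np.asarray(values, dtype=object)
    if dtype == np.dtype(bool):
        # astype(bool) would mark every non-empty string True,
        # so compare against the string 'True' instead.
        return arr == 'True'
    return arr.astype(dtype)

print(restore_level(['True', 'False'], np.dtype(bool)))  # [ True False]
```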

@rhshadrach rhshadrach added IO Parquet parquet, feather Upstream issue Issue related to pandas dependency and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Dec 8, 2024