adrienpacifico changed the title from "BUG: parquet roundtrip does not work with numerical categorical dtype" to "BUG: Parquet roundtrip fails with numerical categorical dtype" on Dec 4, 2024
I've tried using the fastparquet engine, and it seems to work. Whatever the problem is, it lies with the way the pyarrow engine reads the Parquet file.
Here is the code example:
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})
df = df.astype({'A': 'category'})
print(df.dtypes)
# A    category
# B       int64
# dtype: object

df.to_parquet('test.parquet')
df_roundtrip = pd.read_parquet('test.parquet')
print(df_roundtrip.dtypes)
# A    int64
# B    int64
# dtype: object

df_roundtrip_fp = pd.read_parquet('test.parquet', engine='fastparquet')
print(df_roundtrip_fp.dtypes)
# A    category
# B       int64
# dtype: object

result = df_roundtrip.equals(df)
print(result)
# False
result_fp = df_roundtrip_fp.equals(df)
print(result_fp)
# True
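Until the pyarrow read path is sorted out, a possible stopgap is to re-cast the affected columns after reading. The sketch below is only an illustration: `restore_categoricals` is a hypothetical helper (not a pandas API), and `df_read` is built by hand to stand in for what `pd.read_parquet` returned with the pyarrow engine in the report, so the snippet runs without writing a file. It assumes the original frame, or at least its dtypes, is still available.

```python
import pandas as pd


def restore_categoricals(df_read: pd.DataFrame, reference: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: re-cast columns of df_read to the categorical
    dtypes found in reference, leaving all other columns untouched."""
    cat_dtypes = {
        col: dtype
        for col, dtype in reference.dtypes.items()
        if isinstance(dtype, pd.CategoricalDtype) and col in df_read.columns
    }
    return df_read.astype(cat_dtypes)


# Original frame from the report: column A is categorical.
df = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]}).astype({"A": "category"})

# Stand-in for the frame read back with the pyarrow engine (A came back as int64).
df_read = pd.DataFrame({"A": [1, 2, 3], "B": [4, 5, 6]})

fixed = restore_categoricals(df_read, df)
print(fixed.dtypes)
print(fixed.equals(df))
```

This reuses the full `CategoricalDtype` from the reference frame, so the categories and their order are carried over rather than re-inferred from the values that happen to be present.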
Pandas version checks
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
Issue Description
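The Parquet roundtrip does not preserve the dtype: a numerical categorical column written with to_parquet comes back as int64 when read with the default pyarrow engine.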
Expected Behavior
df_roundtrip.dtypes should be the same as df.dtypes.
Hot-Fix
Installed Versions
INSTALLED VERSIONS
commit : 0691c5c
python : 3.12.2
python-bits : 64
OS : Darwin
OS-release : 23.5.0
Version : Darwin Kernel Version 23.5.0: Wed May 1 20:16:51 PDT 2024; root:xnu-10063.121.3~5/RELEASE_ARM64_T8103
machine : arm64
processor : arm
byteorder : little
LC_ALL : None
LANG : fr_FR.UTF-8
LOCALE : fr_FR.UTF-8
pandas : 2.2.3
numpy : 2.0.2
pytz : 2024.2
dateutil : 2.9.0.post0
pip : 24.0
Cython : None
sphinx : None
IPython : 8.24.0
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : 4.12.3
blosc : None
bottleneck : 1.4.2
dataframe-api-compat : None
fastparquet : 2024.11.0
fsspec : 2024.10.0
html5lib : 1.1
hypothesis : 6.122.1
gcsfs : 2024.10.0
jinja2 : 3.1.4
lxml.etree : 5.3.0
matplotlib : 3.9.3
numba : 0.60.0
numexpr : 2.10.2
odfpy : None
openpyxl : 3.1.5
pandas_gbq : 0.24.0
psycopg2 : 2.9.10
pymysql : 1.4.6
pyarrow : 18.1.0
pyreadstat : 1.2.8
pytest : 8.3.4
python-calamine : None
pyxlsb : 1.0.10
s3fs : 2024.10.0
scipy : 1.14.1
sqlalchemy : 2.0.36
tables : 3.10.1
tabulate : 0.9.0
xarray : 2024.11.0
xlrd : 2.0.1
xlsxwriter : 3.2.0
zstandard : 0.23.0
tzdata : 2024.2
qtpy : 2.4.2
pyqt5 : None