BUG: uint32 is not being preserved while round-tripping through parquet file #37327

galipremsagar · 2020-10-22T00:19:07Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

In[42]: df = pd.DataFrame({'a':pd.Series([1, 2, 3], dtype="uint64")})
In[43]: df.to_parquet('a')
In[44]: pd.read_parquet('a').dtypes
Out[44]: 
a    uint64
dtype: object
In[45]: df = pd.DataFrame({'a':pd.Series([1, 2, 3], dtype="uint32")})
In[46]: df.to_parquet('a')
In[47]: pd.read_parquet('a').dtypes
Out[47]: 
a    int64
dtype: object

Problem description

uint32 does not appear to be preserved when round-tripping through a parquet file, whereas uint64 is.

Expected Output

Preserve the uint32 dtype.

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit : db08276
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-52-generic
Version : #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.1
hypothesis : 5.37.3
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2

Crosslinking to cudf fuzz-testing for tracking purpose: rapidsai/cudf#6001

allenmac347 · 2020-11-10T03:43:57Z

take
Hello, I am a first-time contributor and I'm willing to examine this further!

allenmac347 · 2020-11-25T05:06:35Z

@galipremsagar Hi, after further testing it seems that uint32 is preserved when using 'fastparquet' as the engine for to_parquet and read_parquet. However, the closed issue #31896 seems to acknowledge this behavior, and a fix was merged into the main branch to make pandas interpret the written uint32 data as int64 data. Do you think this could be expected behavior, or would it still be considered an ongoing issue?

galipremsagar · 2020-11-28T15:40:00Z

@allenmac347 I think it would still be considered an issue; we would probably need a fix similar to #31918.

allenmac347 · 2020-12-01T03:43:13Z

@jorisvandenbossche Hey, I noticed that in your fix for issue #31896 you said that parquet does not seem to be able to store uint32. Would you happen to know more about this, and whether it is an issue with pyarrow or with pandas? Thanks!

allenmac347 · 2020-12-12T05:39:01Z

@phofl Hi phofl. I'm currently trying to debug this issue, but it seems like this might be an external problem with pyarrow. Here's some interesting output I get:

import pandas as pd

# Write with fastparquet, read with fastparquet: the uint32 dtype is preserved
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='fastparquet')

# Write with fastparquet, read with pyarrow: the uint32 dtype is preserved
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='fastparquet')
dataframe = pd.read_parquet('a', engine='pyarrow')

# Write with pyarrow, read with fastparquet: uint32 is read back as int64
df = pd.DataFrame({'a': pd.Series([1, 2, 3], dtype="uint32")})
df.to_parquet('a', engine='pyarrow')
dataframe = pd.read_parquet('a', engine='fastparquet')

I feel like this means there's something wrong with how pyarrow writes uint32 to a file. Do you have any suggestions? I've tried using the pandas metadata of the parquet file to convert the dataframe back to uint32 after reading it in, but that made a lot of test cases fail.

phofl · 2020-12-19T19:15:58Z

Unfortunately I am not that familiar with this area. Do you know, or could you find out, who implemented the pyarrow engine?

jorisvandenbossche · 2021-02-03T10:29:02Z

Sorry for the slow reply here. This is not directly related to #31896 (that was a bug in the conversion on our side, specifically for nullable dtypes), but it is actually a limitation of pyarrow.

You can specify version="2.0", and then pyarrow will use additional type annotations in the parquet file, in which case it can actually preserve uint32. But by default it indeed does not.

So there is nothing to do on the pandas side about it (apart from maybe better documenting this). A similar issue about this on the pyarrow side is https://issues.apache.org/jira/browse/ARROW-9215

miccoli · 2022-03-18T11:40:51Z

While googling for a solution to this very same problem, I found this very useful thread.

Let me just add that version="2.0" is now deprecated; use "2.4" or "2.6" instead.

I understand that it may sound obvious, but adding an explicit link in the pandas docs to where the additional **kwargs passed to the underlying engine are documented could be useful. In fact it took me a while to figure out that the relevant docs for pyarrow are in pyarrow.parquet.ParquetWriter.

galipremsagar added the Bug and Needs Triage labels on Oct 22, 2020

github-actions bot assigned allenmac347 on Nov 10, 2020

jorisvandenbossche added the Upstream issue and IO Parquet labels and removed the Bug and Needs Triage labels on Feb 3, 2021

mroeschke added the Docs label and removed the Upstream issue label on Aug 13, 2021
