BUG: uint32 is not being preserved while round-tripping through parquet file #37327
Comments
take |
@galipremsagar Hi, after further testing it seems that uint32 is preserved when using 'fastparquet' as the engine for to_parquet and read_parquet. However, the closed issue #31896 seems to acknowledge this behavior, and a fix was merged into the main branch to make pandas interpret the written uint32 data as int64 data. Do you think this could be expected behavior, or would it still be considered an ongoing issue? |
@allenmac347 I think it'd still be considered an issue; we probably need a fix similar to #31918 |
@jorisvandenbossche Hey, I noticed that in your fix for issue #31896 you said that parquet does not seem to be able to store uint32. Would you happen to know more about this, and whether it is an issue with pyarrow or with pandas? Thanks! |
@phofl Hi phofl. I'm currently trying to debug this issue, but it seems like this might be an external problem with pyarrow. Here's some interesting output I get:
// The datatype uint32 is preserved here
// The datatype uint32 is preserved here
// The datatype uint32 is read in as an int64 here
I feel like this means there's something wrong with how pyarrow writes uint32 to a file. Do you have any suggestions? I've tried using the pandas metadata of the parquet file to just convert the dataframe back to uint32 after reading it in, but that made a lot of test cases fail. |
Unfortunately I am not that familiar with this area. Do you know, or could you find out, who implemented the pyarrow engine? |
Sorry for the slow reply here. This is not directly related to #31896 (that was a bug in the conversion on our side, specifically for nullable dtypes), but it is actually a limitation of [...] You can specify [...] So there is nothing to do on the pandas side about it (apart from maybe better documenting this). A similar issue about this on the pyarrow side is https://issues.apache.org/jira/browse/ARROW-9215 |
While googling for a solution to this very same problem, I found this very useful thread. Let me just add, and I understand that it may sound obvious, that it might help to add to the pandas docs an explicit link on where to look for the additional [...] |
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Note: Please read this guide detailing how to provide the necessary information for us to reproduce your bug.
Code Sample, a copy-pastable example
Problem description
It appears that uint32 is not being preserved, while uint64 is, when round-tripping through a parquet file.
Expected Output
Preserve the uint32 dtype.
Output of pd.show_versions()
INSTALLED VERSIONS
commit : db08276
python : 3.7.8.final.0
python-bits : 64
OS : Linux
OS-release : 5.4.0-52-generic
Version : #57-Ubuntu SMP Thu Oct 15 10:57:00 UTC 2020
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8
pandas : 1.1.3
numpy : 1.19.2
pytz : 2020.1
dateutil : 2.8.1
pip : 20.2.4
setuptools : 49.6.0.post20201009
Cython : 0.29.21
pytest : 6.1.1
hypothesis : 5.37.3
sphinx : 3.2.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : 2.11.2
IPython : 7.18.1
pandas_datareader: None
bs4 : None
bottleneck : None
fsspec : 0.8.4
fastparquet : None
gcsfs : None
matplotlib : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 1.0.1
pytables : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
xlwt : None
numba : 0.51.2
Crosslinking to cudf fuzz-testing for tracking purposes: rapidsai/cudf#6001