[BUG] Written Parquet File Cannot be Loaded by Other Packages (pandas & dask) #13250
Comments
Thank you for filing the issue! The nrows/ncols isolation is interesting; I'm hoping that can help root-cause the issue.
Assigning myself to provide more isolation info.
Here's a hint: when you try to read the file with pandas' fastparquet engine, the error looks like an out-of-bounds dictionary-encoding problem.
Thank you @kkranen for reporting this. The next thing I noticed is that the error is intermittent: I wrote the file 100 times with cuDF and it failed 27 times while succeeding 73 times, so this is an intermittent parquet writer failure. I'm going to keep digging.
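For reference, a minimal sketch of that kind of write/read loop (file and variable names here are illustrative, not the exact script used):

```python
import cudf
import pandas as pd

gdf = cudf.read_parquet("error.parquet")  # the attached data, loaded with cuDF

failures = 0
for _ in range(100):
    gdf.to_parquet("roundtrip.parquet")   # re-write with cuDF's parquet writer
    try:
        pd.read_parquet("roundtrip.parquet", engine="fastparquet")
    except Exception:
        failures += 1

print(f"{failures} of 100 writes produced a file pandas could not read")
```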
Doing the same test for each column gives the following failure counts. The column …

Here are passing and failing variants of the same dataframe, only containing …

To my horror, there is a small region of this file getting randomized from one write to the next! For …

Converting the data to …

There is a weird interaction with compression. Setting compression to …
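A rough sketch of that per-column and per-compression isolation (the exact column names, failure counts, and compression settings from the original comments were not preserved here, so this is an illustrative reconstruction only):

```python
import cudf
import pandas as pd

gdf = cudf.read_parquet("error.parquet")

# Count failures per single-column dataframe.
for col in gdf.columns:
    col_failures = 0
    for _ in range(20):
        gdf[[col]].to_parquet("one_col.parquet")
        try:
            pd.read_parquet("one_col.parquet", engine="fastparquet")
        except Exception:
            col_failures += 1
    print(col, col_failures)

# Compression interacts with the failure as well; compare the default
# ("snappy") against writing without compression.
gdf.to_parquet("no_compression.parquet", compression=None)
pd.read_parquet("no_compression.parquet", engine="fastparquet")
```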
Update: …
Fixes #13250.

The page size estimation for dictionary-encoded pages adds a term to estimate the overhead bytes for the `bit-packed-header` used when encoding bit-packed literal runs. This term originally used a value of `256`, but it's hard to see where that value comes from. This PR changes the value to `8`, a possible justification being that the minimum length of a literal run is 8 values. The worst case would be multiple runs of 8, with the required overhead bytes then being `num_values/8`. This also adds a test that has been verified to fail for values larger than 16 in the problematic term.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)
- Nghia Truong (https://github.com/ttnghia)

URL: #13364
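To illustrate the arithmetic behind the change (a standalone sketch, not libcudf's actual estimation code): if every literal run has the minimum length of 8 values and each run contributes roughly one header byte, the reserved overhead must grow with the number of values rather than stay at a fixed 256 bytes.

```python
def worst_case_header_overhead(num_values: int) -> int:
    # Worst case assumed in the fix: back-to-back literal runs of 8 values,
    # i.e. roughly one bit-packed run header per 8 encoded values.
    return (num_values + 7) // 8

# A fixed estimate of 256 bytes is far too small for large pages,
# e.g. the ~25,000-row slices mentioned in the issue:
print(worst_case_header_overhead(25_000))  # 3125 header bytes vs. the old 256
```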
Describe the bug
After a parquet file is written to disk through the `cudf.core.dataframe.DataFrame.to_parquet` function, this parquet file can't be loaded by pandas using `pandas.read_parquet`.

Steps/Code to reproduce bug
We have tried our best to narrow down the size of the parquet file by binary searching over which rows and columns of the dataset cause this issue. A full parquet file that triggers this issue is also available, and we can attach it here if necessary. The issue can be reproduced with the following parquet file and code:
error.parquet.zip
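The original reproduction snippet was not preserved in this thread; a minimal sketch of it, assuming `error.parquet` (extracted from the attachment) is the dataset in question and that file names are illustrative:

```python
import cudf
import pandas as pd

# Load the data with cuDF and write it back out with cuDF's parquet writer.
gdf = cudf.read_parquet("error.parquet")
gdf.to_parquet("cudf_written.parquet")

# Reading the cuDF-written file back with pandas raises an error;
# with the fastparquet engine it surfaces as a dictionary-encoding problem.
pdf = pd.read_parquet("cudf_written.parquet", engine="fastparquet")
print(pdf.shape)
```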
Interestingly, if we continue to bisect the dataframe to `nrows < 25000` or `ncols < 25`, export the smaller parts to parquet files, and use pandas to load them back, then the error disappears and we can successfully load the dataframe through pandas. For instance, see the sketch below.
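A sketch of that slicing (the original example was not preserved; row/column counts and file names are illustrative):

```python
import cudf
import pandas as pd

gdf = cudf.read_parquet("error.parquet")

# Keeping fewer than 25000 rows (or fewer than 25 columns) makes the
# round trip through pandas succeed again.
small = gdf.iloc[:24999, :24]
small.to_parquet("small.parquet")
print(pd.read_parquet("small.parquet").shape)  # loads without error
```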
Expected behavior
We expect this parquet file, exported from cuDF, to be loadable by pandas.
Environment overview (please complete the following information)
docker pull nvcr.io/nvidia/pytorch:23.03-py3
docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it nvcr.io/nvidia/pytorch:23.03-py3 /bin/bash
Environment details
Additional context