[BUG] Written Parquet File Cannot be Loaded by Other Packages (pandas & dask) #13250

Elnifio · 2023-04-28T21:30:59Z

Describe the bug

After a parquet file is written to disk with cudf.core.dataframe.DataFrame.to_parquet, it can't be loaded into pandas using pandas.read_parquet.

Steps/Code to reproduce bug

We have tried our best to shrink the parquet file by binary searching for the rows and columns of the dataset that cause this issue. A full parquet file that triggers the issue is also available, and we can attach it here if necessary. The issue can be reproduced with the following code and parquet file:

import cudf
import pandas

df = cudf.read_parquet("error.parquet")
df.shape # works correctly, should have shape (25000, 25)

pandas.read_parquet("error.parquet").shape # OSError: Unexpected End of Stream

error.parquet.zip

Interestingly, if we continue to bisect the dataframe to nrows < 25000 or ncols < 25, export the smaller part to parquet files, and use pandas to load them back, then the error disappears and we can successfully load the dataframe through pandas. For instance:

# select the first 10000 rows
cudf.read_parquet("error.parquet")[:10000].to_parquet("temp.parquet")
pandas.read_parquet("temp.parquet").shape # correct (10000, 25) without error

# select the first 10 columns
dataframe = cudf.read_parquet("error.parquet")
columns = dataframe.columns

dataframe[list(columns)[:10]].to_parquet("temp.parquet")
pandas.read_parquet("temp.parquet").shape # correct (25000, 10) without error
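The row bisection described above can be sketched as a plain binary search. This is only an illustration: `smallest_failing_prefix` and the `fails` callback are hypothetical names, and the sketch assumes that failure is monotonic in the prefix length (once a prefix fails, every longer prefix fails too), which may not hold for an intermittent bug.

```python
def smallest_failing_prefix(total, fails):
    """Return the smallest k in [1, total] for which fails(k) is True.

    Assumes fails(total) is True and that fails is monotonic:
    if a prefix of length k fails, every longer prefix fails too.
    """
    lo, hi = 1, total
    while lo < hi:
        mid = (lo + hi) // 2
        if fails(mid):
            hi = mid        # the failure is already present in the first `mid` rows
        else:
            lo = mid + 1    # the first `mid` rows are fine, look further right
    return lo
```

In this setting, `fails(k)` would write the first `k` rows with cudf, try to read them back with pandas, and return True if the read raises.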

Expected behavior

We expect a parquet file exported from cudf to be loadable by pandas.

Environment overview (please complete the following information)

Environment location: Docker
Method of cuDF install: Docker
- docker pull nvcr.io/nvidia/pytorch:23.03-py3
- docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it nvcr.io/nvidia/pytorch:23.03-py3 /bin/bash

Environment details


 ***git***
 Not inside a git repository
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=20.04
 DISTRIB_CODENAME=focal
 DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
 NAME="Ubuntu"
 VERSION="20.04.5 LTS (Focal Fossa)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 20.04.5 LTS"
 VERSION_ID="20.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=focal
 UBUNTU_CODENAME=focal
 Linux kkranen-ws 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Fri Apr 28 20:44:20 2023
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0 Off |                  Off |
 | 30%   50C    P8    27W / 300W |    750MiB / 49140MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   1  NVIDIA RTX 6000...  Off  | 00000000:21:00.0 Off |                  Off |
 | 30%   47C    P8    23W / 300W |     13MiB / 49140MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 +-----------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:                    x86_64
 CPU op-mode(s):                  32-bit, 64-bit
 Byte Order:                      Little Endian
 Address sizes:                   43 bits physical, 48 bits virtual
 CPU(s):                          48
 On-line CPU(s) list:             0-47
 Thread(s) per core:              2
 Core(s) per socket:              24
 Socket(s):                       1
 NUMA node(s):                    1
 Vendor ID:                       AuthenticAMD
 CPU family:                      23
 Model:                           49
 Model name:                      AMD Ryzen Threadripper 3960X 24-Core Processor
 Stepping:                        0
 Frequency boost:                 enabled
 CPU MHz:                         2200.000
 CPU max MHz:                     3800.0000
 CPU min MHz:                     2200.0000
 BogoMIPS:                        7586.19
 Virtualization:                  AMD-V
 L1d cache:                       768 KiB
 L1i cache:                       768 KiB
 L2 cache:                        12 MiB
 L3 cache:                        128 MiB
 NUMA node0 CPU(s):               0-47
 Vulnerability Itlb multihit:     Not affected
 Vulnerability L1tf:              Not affected
 Vulnerability Mds:               Not affected
 Vulnerability Meltdown:          Not affected
 Vulnerability Mmio stale data:   Not affected
 Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
 Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
 Vulnerability Srbds:             Not affected
 Vulnerability Tsx async abort:   Not affected
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
 
 ***CMake***
 /usr/local/bin/cmake
 cmake version 3.24.1
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
 Copyright (C) 2019 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 /usr/local/cuda/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2023 NVIDIA Corporation
 Built on Tue_Feb__7_19:32:13_PST_2023
 Cuda compilation tools, release 12.1, V12.1.66
 Build cuda_12.1.r12.1/compiler.32415258_0
 
 ***Python***
 /usr/bin/python
 Python 3.8.10
 
 ***Environment Variables***
 PATH                            : /usr/local/nvm/versions/node/v16.19.1/bin:/usr/local/lib/python3.8/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
 LD_LIBRARY_PATH                 : /usr/local/cuda/compat/lib.real:/usr/local/lib/python3.8/dist-packages/torch/lib:/usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-11/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    :
 PYTHON_PATH                     :
 
 conda not found
 ***pip packages***
 /usr/local/bin/pip
 Package                 Version
 ----------------------- -------------------------------
 absl-py                 1.4.0
 apex                    0.1
 argon2-cffi             21.3.0
 argon2-cffi-bindings    21.2.0
 asttokens               2.2.1
 astunparse              1.6.3
 attrs                   22.2.0
 audioread               3.0.0
 backcall                0.2.0
 beautifulsoup4          4.11.2
 bleach                  6.0.0
 blis                    0.7.9
 cachetools              5.3.0
 catalogue               2.0.8
 certifi                 2022.12.7
 cffi                    1.15.1
 charset-normalizer      3.1.0
 click                   8.1.3
 cloudpickle             2.2.1
 cmake                   3.24.1.1
 comm                    0.1.2
 confection              0.0.4
 contourpy               1.0.7
 cubinlinker             0.2.2+2.g4de3e99
 cuda-python             12.1.0rc5+1.gc7fd38c.dirty
 cudf                    23.2.0
 cugraph                 23.2.0
 cugraph-dgl             23.2.0
 cugraph-service-client  23.2.0
 cugraph-service-server  23.2.0
 cuml                    23.2.0
 cupy-cuda12x            12.0.0b3
 cycler                  0.11.0
 cymem                   2.0.7
 Cython                  0.29.33
 dask                    2023.1.1
 dask-cuda               23.2.0
 dask-cudf               23.2.0
 debugpy                 1.6.6
 decorator               5.1.1
 defusedxml              0.7.1
 distributed             2023.1.1
 exceptiongroup          1.1.1
 execnet                 1.9.0
 executing               1.2.0
 expecttest              0.1.3
 fastjsonschema          2.16.3
 fastrlock               0.8.1
 filelock                3.10.0
 flash-attn              0.2.8.dev0
 fonttools               4.38.0
 fsspec                  2023.1.0
 gast                    0.4.0
 google-auth             2.16.2
 google-auth-oauthlib    0.4.6
 graphsurgeon            0.4.6
 grpcio                  1.51.3
 HeapDict                1.0.1
 hypothesis              5.35.1
 idna                    3.4
 importlib-metadata      6.0.0
 importlib-resources     5.12.0
 iniconfig               2.0.0
 intel-openmp            2021.4.0
 ipykernel               6.21.3
 ipython                 8.11.0
 ipython-genutils        0.2.0
 jedi                    0.18.2
 Jinja2                  3.1.2
 joblib                  1.2.0
 json5                   0.9.11
 jsonschema              4.17.3
 jupyter_client          8.0.3
 jupyter_core            5.2.0
 jupyter-tensorboard     0.2.0
 jupyterlab              2.3.2
 jupyterlab-pygments     0.2.2
 jupyterlab-server       1.2.0
 jupytext                1.14.5
 kiwisolver              1.4.4
 langcodes               3.3.0
 librosa                 0.9.2
 lit                     15.0.7
 llvmlite                0.39.1
 locket                  1.0.0
 Markdown                3.4.1
 markdown-it-py          2.2.0
 MarkupSafe              2.1.2
 matplotlib              3.7.0
 matplotlib-inline       0.1.6
 mdit-py-plugins         0.3.5
 mdurl                   0.1.2
 mistune                 2.0.5
 mkl                     2021.1.1
 mkl-devel               2021.1.1
 mkl-include             2021.1.1
 mock                    5.0.1
 mpmath                  1.3.0
 msgpack                 1.0.4
 murmurhash              1.0.9
 nbclient                0.7.2
 nbconvert               7.2.10
 nbformat                5.7.3
 nest-asyncio            1.5.6
 networkx                2.6.3
 notebook                6.4.10
 numba                   0.56.4+1.g9a03de713
 numpy                   1.22.2
 nvidia-dali-cuda110     1.23.0
 nvidia-pyindex          1.0.9
 nvtx                    0.2.5
 oauthlib                3.2.2
 onnx                    1.13.0
 opencv                  4.6.0
 packaging               23.0
 pandas                  1.5.2
 pandocfilters           1.5.0
 parso                   0.8.3
 partd                   1.3.0
 pathy                   0.10.1
 pexpect                 4.8.0
 pickleshare             0.7.5
 Pillow                  9.2.0
 pip                     21.2.4
 pkgutil_resolve_name    1.3.10
 platformdirs            3.1.1
 pluggy                  1.0.0
 ply                     3.11
 polygraphy              0.44.2
 pooch                   1.7.0
 preshed                 3.0.8
 prettytable             3.6.0
 prometheus-client       0.16.0
 prompt-toolkit          3.0.38
 protobuf                3.20.3
 psutil                  5.9.4
 ptxcompiler             0.7.0+27.gbcb4096
 ptyprocess              0.7.0
 pure-eval               0.2.2
 pyarrow                 10.0.1.dev0+ga6eabc2b.d20230220
 pyasn1                  0.4.8
 pyasn1-modules          0.2.8
 pybind11                2.10.3
 pycocotools             2.0+nv0.7.1
 pycparser               2.21
 pydantic                1.10.6
 Pygments                2.14.0
 pylibcugraph            23.2.0
 pylibcugraphops         23.2.0
 pylibraft               23.2.0
 pynvml                  11.5.0
 pyparsing               3.0.9
 pyrsistent              0.19.3
 pytest                  7.2.2
 pytest-rerunfailures    11.1.2
 pytest-shard            0.1.2
 pytest-xdist            3.2.1
 python-dateutil         2.8.2
 python-hostlist         1.23.0
 pytorch-quantization    2.1.2
 pytz                    2022.7.1
 PyYAML                  6.0
 pyzmq                   25.0.1
 raft-dask               23.2.0
 regex                   2022.10.31
 requests                2.28.2
 requests-oauthlib       1.3.1
 resampy                 0.4.2
 rmm                     23.2.0
 rsa                     4.9
 scikit-learn            1.2.0
 scipy                   1.6.3
 seaborn                 0.12.2
 Send2Trash              1.8.0
 setuptools              65.5.1
 six                     1.16.0
 smart-open              6.3.0
 sortedcontainers        2.4.0
 soundfile               0.12.1
 soupsieve               2.4
 spacy                   3.5.1
 spacy-legacy            3.0.12
 spacy-loggers           1.0.4
 sphinx-glpi-theme       0.3
 srsly                   2.4.6
 stack-data              0.6.2
 strings-udf             23.2.0
 sympy                   1.11.1
 tbb                     2021.8.0
 tblib                   1.7.0
 tensorboard             2.9.0
 tensorboard-data-server 0.6.1
 tensorboard-plugin-wit  1.8.1
 tensorrt                8.5.3.1
 terminado               0.17.1
 thinc                   8.1.9
 threadpoolctl           3.1.0
 thriftpy2               0.4.16
 tinycss2                1.2.1
 toml                    0.10.2
 tomli                   2.0.1
 toolz                   0.12.0
 torch                   2.0.0a0+1767026
 torch-tensorrt          1.4.0.dev0
 torchtext               0.13.0a0+fae8e8c
 torchvision             0.15.0a0
 tornado                 6.2
 tqdm                    4.65.0
 traitlets               5.9.0
 transformer-engine      0.6.0
 treelite                3.1.0
 treelite-runtime        3.1.0
 triton                  2.0.0
 typer                   0.7.0
 types-dataclasses       0.6.6
 typing_extensions       4.5.0
 ucx-py                  0.30.0
 uff                     0.6.9
 urllib3                 1.26.14
 wasabi                  1.1.1
 wcwidth                 0.2.6
 webencodings            0.5.1
 Werkzeug                2.2.3
 wheel                   0.38.4
 xdoctest                1.0.2
 xgboost                 1.7.1
 zict                    2.2.0
 zipp                    3.14.0

Additional context


vuule · 2023-04-28T23:25:10Z

Thank you for filing the issue! The nrows/ncols isolation is interesting; I'm hoping it can help root-cause the issue.

vuule · 2023-04-28T23:31:12Z

Assigning myself to provide more isolation info.

GregoryKimball · 2023-05-09T23:30:43Z

Here's a hint: when you try to read the file with pandas' fastparquet engine, the error looks like an out-of-bounds dictionary-encoding problem.

>>> df = pd.read_parquet('temp.parquet', engine='fastparquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/pandas/io/parquet.py", line 347, in read
    result = parquet_file.to_pandas(columns=columns, **kwargs)
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/api.py", line 778, in to_pandas
    self.read_row_group_file(rg, columns, categories, index,
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/api.py", line 380, in read_row_group_file
    core.read_row_group(
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/core.py", line 621, in read_row_group
    read_row_group_arrays(file, rg, columns, categories, schema_helper,
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/core.py", line 591, in read_row_group_arrays
    read_col(column, schema_helper, file, use_cat=name+'-catdef' in out,
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/core.py", line 558, in read_col
    piece[:] = dic[val]
IndexError: index 8164 is out of bounds for axis 0 with size 8131
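To illustrate what this traceback suggests, here is a minimal pure-Python sketch of dictionary decoding. It is not fastparquet's code; `decode_dictionary_page` is a hypothetical helper. A dictionary-encoded page stores small integer indices into a dictionary of distinct values, so every index must be smaller than the dictionary size, and an index such as 8164 against 8131 entries cannot be decoded.

```python
def decode_dictionary_page(dictionary, indices):
    """Map each index to its dictionary value, checking bounds first."""
    size = len(dictionary)
    for pos, idx in enumerate(indices):
        if not 0 <= idx < size:
            # Same class of failure as `piece[:] = dic[val]` in the traceback
            raise IndexError(
                f"index {idx} at position {pos} is out of bounds "
                f"for a dictionary of size {size}"
            )
    return [dictionary[idx] for idx in indices]
```

A reader that skips the bounds check would either crash, as above, or silently return wrong values, depending on how it indexes the dictionary.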

kkranen · 2023-05-11T21:18:32Z

@vuule any updates on this? @Elnifio, can you provide more info from dataset ablation tests to determine exactly where the error occurs?

GregoryKimball · 2023-05-12T22:30:32Z

Thank you @kkranen for reporting this. The next thing I noticed was that the error is intermittent! I wrote the file 100 times with cuDF and it failed 27 times, while succeeding 73 times. It is an intermittent parquet writer failure. I'm going to keep digging.

import cudf
import pandas as pd

df = cudf.read_parquet('temp.parquet')
fail = 0
for i in range(100):
    df.to_parquet('f.pq')  # rewrite the same dataframe on every iteration
    try:
        pd.read_parquet('f.pq')
    except Exception:
        print('Failure', i)
        fail += 1
        continue
    print('Success', i)
print(fail)  # number of unreadable writes out of 100

Doing the same test for each column shows that the column feat_17 fails to write correctly in 25 of 100 attempts. Oddly, the column feat_15 failed once.

Here are passing and failing variants of the same dataframe, only containing feat_17:
pass.zip
fail.zip

To my horror, there is a small region of this file getting randomized from one write to the next!

For feat_17, the error always occurs at these physical (iloc) positions:
[19979, 19980, 19981, 19982, 19984, 19985, 19986, 19987, 19988, 19989, 19990, 19991]
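A list of positions like the one above can be produced by comparing a known-good copy of the column with a copy read back from a suspect file. The following is a minimal sketch over plain Python sequences; `mismatched_positions` is a hypothetical helper, and it assumes both copies have already been loaded into lists of equal length.

```python
def mismatched_positions(expected, actual):
    """Return the positions where two equal-length sequences differ.

    Two NaN values at the same position are treated as equal.
    """
    if len(expected) != len(actual):
        raise ValueError("sequences must have the same length")

    def differs(e, a):
        both_nan = e != e and a != a  # NaN is the only value not equal to itself
        return e != a and not both_nan

    return [i for i, (e, a) in enumerate(zip(expected, actual)) if differs(e, a)]
```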

Converting the data to float64 prevents this error, but only when also writing with snappy compression. Converting the data to float64 does NOT prevent the error when writing uncompressed!

There is a weird interaction with compression. feat_17 fails 25% of the time with compression='snappy', but 100% of the time with compression=None or compression='ZSTD'.

Setting compression to None makes the failures more reproducible.

feat_15: fails with Unexpected end of stream
feat_17: fails with Unexpected end of stream
feat_6: pandas doesn't fail - but 14 data points are corrupted!
feat_20: pandas doesn't fail - but 13 data points are corrupted!

Update:
It seems that the root cause is in dictionary encoding, where cudf somehow writes dictionary keys that exceed the number of dictionary entries.

Fixes #13250. The page size estimation for dictionary-encoded pages adds a term to estimate the overhead bytes for the `bit-packed-header` used when encoding bit-packed literal runs. This term originally used a value of `256`, but it's hard to see where that value comes from. This PR changes the value to `8`, with a possible justification being that the minimum length of a literal run is 8 values. The worst case would be multiple runs of 8, with the required overhead bytes then being `num_values/8`. This also adds a test that has been verified to fail for values larger than 16 in the problematic term.

Authors:
- Ed Seidl (https://github.com/etseidl)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)
- Nghia Truong (https://github.com/ttnghia)

URL: #13364
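The `num_values/8` worst case can be checked with a small sketch based on the Parquet RLE/bit-packing hybrid format, where a bit-packed run starts with a varint header holding `(groups << 1) | 1` and each group covers 8 values. This is not the libcudf implementation; the function names are hypothetical and serve only to illustrate the arithmetic.

```python
def varint(n):
    """Encode a non-negative integer as an unsigned LEB128 varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def bit_packed_header(num_groups):
    """Header of a bit-packed run covering num_groups groups of 8 values."""
    return varint((num_groups << 1) | 1)


def worst_case_header_bytes(num_values):
    """Header overhead if every group of 8 values becomes its own run."""
    groups = -(-num_values // 8)  # ceiling division
    return groups * len(bit_packed_header(1))
```

A run of a single group has a one-byte header, so splitting a page into the shortest possible literal runs costs about one header byte per 8 values.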

Elnifio added the Needs Triage and bug labels Apr 28, 2023

vuule added the cuIO label Apr 28, 2023

vuule self-assigned this Apr 28, 2023

GregoryKimball added this to libcudf May 15, 2023

GregoryKimball moved this to Needs owner in libcudf May 15, 2023

GregoryKimball added the 1 - On Deck and libcudf labels and removed the Needs Triage label May 15, 2023

etseidl mentioned this issue May 16, 2023

Fix page size estimation in Parquet writer #13364

Merged

3 tasks

rapids-bot bot closed this as completed in #13364 May 18, 2023

GregoryKimball mentioned this issue Jun 27, 2023

[FEA] Add Parquet and ORC unit tests based on Apache sample files #13627

Open

GregoryKimball removed the status in libcudf Jun 28, 2023

GregoryKimball removed this from libcudf Jun 28, 2023
