[BUG] Written Parquet File Cannot be Loaded by Other Packages (pandas & dask) #13250

Closed
Elnifio opened this issue Apr 28, 2023 · 5 comments · Fixed by #13364
Assignees
Labels: 1 - On Deck (to be worked on next); bug (something isn't working); cuIO (cuIO issue); libcudf (affects libcudf (C++/CUDA) code)

Comments


Elnifio commented Apr 28, 2023

Describe the bug

After a Parquet file is written to disk with cudf.core.dataframe.DataFrame.to_parquet, it cannot be loaded into pandas with pandas.read_parquet.

Steps/Code to reproduce bug

We have tried our best to shrink the Parquet file by binary searching for the rows and columns of the dataset that cause this issue. The full Parquet file that triggers the issue is also available, and we can attach it here if necessary. The issue can be reproduced with the following code and Parquet file:

import cudf
import pandas

df = cudf.read_parquet("error.parquet")
df.shape # works correctly, should have shape (25000, 25)

pandas.read_parquet("error.parquet").shape # OSError: Unexpected End of Stream

error.parquet.zip

Interestingly, if we continue to bisect the dataframe to nrows < 25000 or ncols < 25, export the smaller part to parquet files, and use pandas to load them back, then the error disappears and we can successfully load the dataframe through pandas. For instance:

# select the first 10000 rows
cudf.read_parquet("error.parquet")[:10000].to_parquet("temp.parquet")
pandas.read_parquet("temp.parquet").shape # correct (10000, 25) without error

# select the first 10 columns
dataframe = cudf.read_parquet("error.parquet")
columns = dataframe.columns

dataframe[list(columns)[:10]].to_parquet("temp.parquet")
pandas.read_parquet("temp.parquet").shape # correct (25000, 10) without error

Expected behavior

We expect that a Parquet file exported from cuDF can be loaded by pandas.

Environment overview (please complete the following information)

  • Environment location: Docker
  • Method of cuDF install: Docker
    • docker pull nvcr.io/nvidia/pytorch:23.03-py3
    • docker run --gpus all --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 -it nvcr.io/nvidia/pytorch:23.03-py3 /bin/bash

Environment details

 ***git***
 Not inside a git repository
 
 ***OS Information***
 DISTRIB_ID=Ubuntu
 DISTRIB_RELEASE=20.04
 DISTRIB_CODENAME=focal
 DISTRIB_DESCRIPTION="Ubuntu 20.04.5 LTS"
 NAME="Ubuntu"
 VERSION="20.04.5 LTS (Focal Fossa)"
 ID=ubuntu
 ID_LIKE=debian
 PRETTY_NAME="Ubuntu 20.04.5 LTS"
 VERSION_ID="20.04"
 HOME_URL="https://www.ubuntu.com/"
 SUPPORT_URL="https://help.ubuntu.com/"
 BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
 PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
 VERSION_CODENAME=focal
 UBUNTU_CODENAME=focal
 Linux kkranen-ws 5.15.0-69-generic #76~20.04.1-Ubuntu SMP Mon Mar 20 15:54:19 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux
 
 ***GPU Information***
 Fri Apr 28 20:44:20 2023
 +-----------------------------------------------------------------------------+
 | NVIDIA-SMI 525.105.17   Driver Version: 525.105.17   CUDA Version: 12.1     |
 |-------------------------------+----------------------+----------------------+
 | GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
 | Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
 |                               |                      |               MIG M. |
 |===============================+======================+======================|
 |   0  NVIDIA RTX 6000...  Off  | 00000000:01:00.0 Off |                  Off |
 | 30%   50C    P8    27W / 300W |    750MiB / 49140MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 |   1  NVIDIA RTX 6000...  Off  | 00000000:21:00.0 Off |                  Off |
 | 30%   47C    P8    23W / 300W |     13MiB / 49140MiB |      0%      Default |
 |                               |                      |                  N/A |
 +-------------------------------+----------------------+----------------------+
 
 +-----------------------------------------------------------------------------+
 | Processes:                                                                  |
 |  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
 |        ID   ID                                                   Usage      |
 |=============================================================================|
 +-----------------------------------------------------------------------------+
 
 ***CPU***
 Architecture:                    x86_64
 CPU op-mode(s):                  32-bit, 64-bit
 Byte Order:                      Little Endian
 Address sizes:                   43 bits physical, 48 bits virtual
 CPU(s):                          48
 On-line CPU(s) list:             0-47
 Thread(s) per core:              2
 Core(s) per socket:              24
 Socket(s):                       1
 NUMA node(s):                    1
 Vendor ID:                       AuthenticAMD
 CPU family:                      23
 Model:                           49
 Model name:                      AMD Ryzen Threadripper 3960X 24-Core Processor
 Stepping:                        0
 Frequency boost:                 enabled
 CPU MHz:                         2200.000
 CPU max MHz:                     3800.0000
 CPU min MHz:                     2200.0000
 BogoMIPS:                        7586.19
 Virtualization:                  AMD-V
 L1d cache:                       768 KiB
 L1i cache:                       768 KiB
 L2 cache:                        12 MiB
 L3 cache:                        128 MiB
 NUMA node0 CPU(s):               0-47
 Vulnerability Itlb multihit:     Not affected
 Vulnerability L1tf:              Not affected
 Vulnerability Mds:               Not affected
 Vulnerability Meltdown:          Not affected
 Vulnerability Mmio stale data:   Not affected
 Vulnerability Retbleed:          Mitigation; untrained return thunk; SMT enabled with STIBP protection
 Vulnerability Spec store bypass: Mitigation; Speculative Store Bypass disabled via prctl and seccomp
 Vulnerability Spectre v1:        Mitigation; usercopy/swapgs barriers and __user pointer sanitization
 Vulnerability Spectre v2:        Mitigation; Retpolines, IBPB conditional, STIBP always-on, RSB filling, PBRSB-eIBRS Not affected
 Vulnerability Srbds:             Not affected
 Vulnerability Tsx async abort:   Not affected
 Flags:                           fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl nonstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_llc mwaitx cpb cat_l3 cdp_l3 hw_pstate ssbd mba ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr rdpru wbnoinvd arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif v_spec_ctrl umip rdpid overflow_recov succor smca sme sev sev_es
 
 ***CMake***
 /usr/local/bin/cmake
 cmake version 3.24.1
 
 CMake suite maintained and supported by Kitware (kitware.com/cmake).
 
 ***g++***
 /usr/bin/g++
 g++ (Ubuntu 9.4.0-1ubuntu1~20.04.1) 9.4.0
 Copyright (C) 2019 Free Software Foundation, Inc.
 This is free software; see the source for copying conditions.  There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 
 
 ***nvcc***
 /usr/local/cuda/bin/nvcc
 nvcc: NVIDIA (R) Cuda compiler driver
 Copyright (c) 2005-2023 NVIDIA Corporation
 Built on Tue_Feb__7_19:32:13_PST_2023
 Cuda compilation tools, release 12.1, V12.1.66
 Build cuda_12.1.r12.1/compiler.32415258_0
 
 ***Python***
 /usr/bin/python
 Python 3.8.10
 
 ***Environment Variables***
 PATH                            : /usr/local/nvm/versions/node/v16.19.1/bin:/usr/local/lib/python3.8/dist-packages/torch_tensorrt/bin:/usr/local/mpi/bin:/usr/local/nvidia/bin:/usr/local/cuda/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/local/ucx/bin:/opt/tensorrt/bin
 LD_LIBRARY_PATH                 : /usr/local/cuda/compat/lib.real:/usr/local/lib/python3.8/dist-packages/torch/lib:/usr/local/lib/python3.8/dist-packages/torch_tensorrt/lib:/usr/local/cuda/compat/lib:/usr/local/nvidia/lib:/usr/local/nvidia/lib64:/usr/local/cuda-11/lib64
 NUMBAPRO_NVVM                   :
 NUMBAPRO_LIBDEVICE              :
 CONDA_PREFIX                    :
 PYTHON_PATH                     :
 
 conda not found
 ***pip packages***
 /usr/local/bin/pip
 Package                 Version
 ----------------------- -------------------------------
 absl-py                 1.4.0
 apex                    0.1
 argon2-cffi             21.3.0
 argon2-cffi-bindings    21.2.0
 asttokens               2.2.1
 astunparse              1.6.3
 attrs                   22.2.0
 audioread               3.0.0
 backcall                0.2.0
 beautifulsoup4          4.11.2
 bleach                  6.0.0
 blis                    0.7.9
 cachetools              5.3.0
 catalogue               2.0.8
 certifi                 2022.12.7
 cffi                    1.15.1
 charset-normalizer      3.1.0
 click                   8.1.3
 cloudpickle             2.2.1
 cmake                   3.24.1.1
 comm                    0.1.2
 confection              0.0.4
 contourpy               1.0.7
 cubinlinker             0.2.2+2.g4de3e99
 cuda-python             12.1.0rc5+1.gc7fd38c.dirty
 cudf                    23.2.0
 cugraph                 23.2.0
 cugraph-dgl             23.2.0
 cugraph-service-client  23.2.0
 cugraph-service-server  23.2.0
 cuml                    23.2.0
 cupy-cuda12x            12.0.0b3
 cycler                  0.11.0
 cymem                   2.0.7
 Cython                  0.29.33
 dask                    2023.1.1
 dask-cuda               23.2.0
 dask-cudf               23.2.0
 debugpy                 1.6.6
 decorator               5.1.1
 defusedxml              0.7.1
 distributed             2023.1.1
 exceptiongroup          1.1.1
 execnet                 1.9.0
 executing               1.2.0
 expecttest              0.1.3
 fastjsonschema          2.16.3
 fastrlock               0.8.1
 filelock                3.10.0
 flash-attn              0.2.8.dev0
 fonttools               4.38.0
 fsspec                  2023.1.0
 gast                    0.4.0
 google-auth             2.16.2
 google-auth-oauthlib    0.4.6
 graphsurgeon            0.4.6
 grpcio                  1.51.3
 HeapDict                1.0.1
 hypothesis              5.35.1
 idna                    3.4
 importlib-metadata      6.0.0
 importlib-resources     5.12.0
 iniconfig               2.0.0
 intel-openmp            2021.4.0
 ipykernel               6.21.3
 ipython                 8.11.0
 ipython-genutils        0.2.0
 jedi                    0.18.2
 Jinja2                  3.1.2
 joblib                  1.2.0
 json5                   0.9.11
 jsonschema              4.17.3
 jupyter_client          8.0.3
 jupyter_core            5.2.0
 jupyter-tensorboard     0.2.0
 jupyterlab              2.3.2
 jupyterlab-pygments     0.2.2
 jupyterlab-server       1.2.0
 jupytext                1.14.5
 kiwisolver              1.4.4
 langcodes               3.3.0
 librosa                 0.9.2
 lit                     15.0.7
 llvmlite                0.39.1
 locket                  1.0.0
 Markdown                3.4.1
 markdown-it-py          2.2.0
 MarkupSafe              2.1.2
 matplotlib              3.7.0
 matplotlib-inline       0.1.6
 mdit-py-plugins         0.3.5
 mdurl                   0.1.2
 mistune                 2.0.5
 mkl                     2021.1.1
 mkl-devel               2021.1.1
 mkl-include             2021.1.1
 mock                    5.0.1
 mpmath                  1.3.0
 msgpack                 1.0.4
 murmurhash              1.0.9
 nbclient                0.7.2
 nbconvert               7.2.10
 nbformat                5.7.3
 nest-asyncio            1.5.6
 networkx                2.6.3
 notebook                6.4.10
 numba                   0.56.4+1.g9a03de713
 numpy                   1.22.2
 nvidia-dali-cuda110     1.23.0
 nvidia-pyindex          1.0.9
 nvtx                    0.2.5
 oauthlib                3.2.2
 onnx                    1.13.0
 opencv                  4.6.0
 packaging               23.0
 pandas                  1.5.2
 pandocfilters           1.5.0
 parso                   0.8.3
 partd                   1.3.0
 pathy                   0.10.1
 pexpect                 4.8.0
 pickleshare             0.7.5
 Pillow                  9.2.0
 pip                     21.2.4
 pkgutil_resolve_name    1.3.10
 platformdirs            3.1.1
 pluggy                  1.0.0
 ply                     3.11
 polygraphy              0.44.2
 pooch                   1.7.0
 preshed                 3.0.8
 prettytable             3.6.0
 prometheus-client       0.16.0
 prompt-toolkit          3.0.38
 protobuf                3.20.3
 psutil                  5.9.4
 ptxcompiler             0.7.0+27.gbcb4096
 ptyprocess              0.7.0
 pure-eval               0.2.2
 pyarrow                 10.0.1.dev0+ga6eabc2b.d20230220
 pyasn1                  0.4.8
 pyasn1-modules          0.2.8
 pybind11                2.10.3
 pycocotools             2.0+nv0.7.1
 pycparser               2.21
 pydantic                1.10.6
 Pygments                2.14.0
 pylibcugraph            23.2.0
 pylibcugraphops         23.2.0
 pylibraft               23.2.0
 pynvml                  11.5.0
 pyparsing               3.0.9
 pyrsistent              0.19.3
 pytest                  7.2.2
 pytest-rerunfailures    11.1.2
 pytest-shard            0.1.2
 pytest-xdist            3.2.1
 python-dateutil         2.8.2
 python-hostlist         1.23.0
 pytorch-quantization    2.1.2
 pytz                    2022.7.1
 PyYAML                  6.0
 pyzmq                   25.0.1
 raft-dask               23.2.0
 regex                   2022.10.31
 requests                2.28.2
 requests-oauthlib       1.3.1
 resampy                 0.4.2
 rmm                     23.2.0
 rsa                     4.9
 scikit-learn            1.2.0
 scipy                   1.6.3
 seaborn                 0.12.2
 Send2Trash              1.8.0
 setuptools              65.5.1
 six                     1.16.0
 smart-open              6.3.0
 sortedcontainers        2.4.0
 soundfile               0.12.1
 soupsieve               2.4
 spacy                   3.5.1
 spacy-legacy            3.0.12
 spacy-loggers           1.0.4
 sphinx-glpi-theme       0.3
 srsly                   2.4.6
 stack-data              0.6.2
 strings-udf             23.2.0
 sympy                   1.11.1
 tbb                     2021.8.0
 tblib                   1.7.0
 tensorboard             2.9.0
 tensorboard-data-server 0.6.1
 tensorboard-plugin-wit  1.8.1
 tensorrt                8.5.3.1
 terminado               0.17.1
 thinc                   8.1.9
 threadpoolctl           3.1.0
 thriftpy2               0.4.16
 tinycss2                1.2.1
 toml                    0.10.2
 tomli                   2.0.1
 toolz                   0.12.0
 torch                   2.0.0a0+1767026
 torch-tensorrt          1.4.0.dev0
 torchtext               0.13.0a0+fae8e8c
 torchvision             0.15.0a0
 tornado                 6.2
 tqdm                    4.65.0
 traitlets               5.9.0
 transformer-engine      0.6.0
 treelite                3.1.0
 treelite-runtime        3.1.0
 triton                  2.0.0
 typer                   0.7.0
 types-dataclasses       0.6.6
 typing_extensions       4.5.0
 ucx-py                  0.30.0
 uff                     0.6.9
 urllib3                 1.26.14
 wasabi                  1.1.1
 wcwidth                 0.2.6
 webencodings            0.5.1
 Werkzeug                2.2.3
 wheel                   0.38.4
 xdoctest                1.0.2
 xgboost                 1.7.1
 zict                    2.2.0
 zipp                    3.14.0

Additional context

@Elnifio Elnifio added Needs Triage Need team to review and classify bug Something isn't working labels Apr 28, 2023
@vuule vuule added the cuIO cuIO issue label Apr 28, 2023
Contributor

vuule commented Apr 28, 2023

Thank you for filing the issue! The nrows/ncols isolation is interesting, I'm hoping that can help root cause the issue.

@vuule vuule self-assigned this Apr 28, 2023
Contributor

vuule commented Apr 28, 2023

Assigning myself to provide more isolation info.

@GregoryKimball
Contributor

Here's a hint: when you try to read the file with pandas' fastparquet engine, the error looks like an out-of-bounds dictionary-encoding problem.

>>> df = pd.read_parquet('temp.parquet', engine='fastparquet')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/pandas/io/parquet.py", line 493, in read_parquet
    return impl.read(
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/pandas/io/parquet.py", line 347, in read
    result = parquet_file.to_pandas(columns=columns, **kwargs)
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/api.py", line 778, in to_pandas
    self.read_row_group_file(rg, columns, categories, index,
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/api.py", line 380, in read_row_group_file
    core.read_row_group(
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/core.py", line 621, in read_row_group
    read_row_group_arrays(file, rg, columns, categories, schema_helper,
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/core.py", line 591, in read_row_group_arrays
    read_col(column, schema_helper, file, use_cat=name+'-catdef' in out,
  File "/Users/gregorykimball/mambaforge/envs/cenv/lib/python3.9/site-packages/fastparquet/core.py", line 558, in read_col
    piece[:] = dic[val]
IndexError: index 8164 is out of bounds for axis 0 with size 8131
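The failing line in the traceback, `piece[:] = dic[val]`, is a dictionary lookup: each stored key selects one dictionary entry. As a rough illustration only (toy data, not values from the attached file, and not fastparquet's actual implementation), the following sketch shows how a single key beyond the dictionary size produces this kind of failure. The function names are hypothetical.

```python
from typing import List, Sequence


def find_out_of_range_keys(keys: Sequence[int], dictionary_size: int) -> List[int]:
    """Return the positions whose key has no matching dictionary entry."""
    return [pos for pos, key in enumerate(keys) if not 0 <= key < dictionary_size]


def decode(keys: Sequence[int], dictionary: Sequence[float]) -> List[float]:
    """Look up each key in the dictionary, as a reader does for a dictionary-encoded page."""
    bad = find_out_of_range_keys(keys, len(dictionary))
    if bad:
        raise IndexError(
            f"{len(bad)} key(s) out of range for dictionary of size "
            f"{len(dictionary)}; first at position {bad[0]}"
        )
    return [dictionary[k] for k in keys]


# Toy data: 4 dictionary entries, and one key (7) that points past the end.
dictionary = [0.5, 1.5, 2.5, 3.5]
keys = [0, 3, 1, 7, 2]
print(find_out_of_range_keys(keys, len(dictionary)))  # [3]
```

Calling `decode(keys, dictionary)` on this toy input raises `IndexError`, which mirrors the shape of the error above.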


kkranen commented May 11, 2023

@vuule, any updates on this? @Elnifio can provide more info from dataset ablation tests to help determine exactly where the error occurs.

Contributor

GregoryKimball commented May 12, 2023

Thank you @kkranen for reporting this. The next thing I noticed was that the error is intermittent! I wrote the file 100 times with cuDF and it failed 27 times, while succeeding 73 times. It is an intermittent parquet writer failure. I'm going to keep digging.

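import cudf
import pandas as pd

df = cudf.read_parquet('temp.parquet')
fail = 0
for i in range(100):
    # Rewrite the same dataframe, then check that pandas can read it back.
    df.to_parquet('f.pq')
    try:
        pdf = pd.read_parquet('f.pq')
    except Exception as err:
        print('Failure', i)
        fail += 1
        continue
    print('Success', i)
print(fail)  # number of failed writes out of 100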

Running the same test on each column separately shows that feat_17 fails to write correctly in 25 of 100 attempts and, oddly, feat_15 failed once.
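The per-column variant of the loop was not posted in the thread. A minimal sketch of how it might look is below. The helper names `count_failures` and `failures_per_column` are hypothetical, and the writer and reader are passed in as callables so the harness itself does not depend on a GPU.

```python
from typing import Callable, Dict, Iterable


def count_failures(write: Callable[[], None],
                   read: Callable[[], object],
                   attempts: int = 100) -> int:
    """Write then read back `attempts` times; count reads that raise."""
    failures = 0
    for _ in range(attempts):
        write()
        try:
            read()
        except Exception:
            failures += 1
    return failures


def failures_per_column(columns: Iterable[str],
                        make_write: Callable[[str], Callable[[], None]],
                        read: Callable[[], object],
                        attempts: int = 100) -> Dict[str, int]:
    """Repeat the write/read test once per column name."""
    return {name: count_failures(make_write(name), read, attempts)
            for name in columns}


# Assumed wiring for the case in this issue (requires cudf and a GPU):
#   df = cudf.read_parquet('temp.parquet')
#   make_write = lambda name: (lambda: df[[name]].to_parquet('f.pq'))
#   read = lambda: pd.read_parquet('f.pq')
#   print(failures_per_column(df.columns, make_write, read))
```

Note that this only counts reads that raise. Silent corruption would need a value comparison against the source dataframe as well.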

Here are passing and failing variants of the same dataframe, only containing feat_17:
pass.zip
fail.zip

To my horror, there is a small region of this file getting randomized from one write to the next!
image
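For anyone who wants to locate such a region without a hex viewer, a byte-level comparison of two writes is enough. The sketch below is one way to do it with the standard library only. The function name `differing_ranges` and the file names in the trailing comment are assumptions, not something used in this thread.

```python
from typing import List, Tuple


def differing_ranges(a: bytes, b: bytes) -> List[Tuple[int, int]]:
    """Return half-open (start, end) byte ranges where a and b differ.

    If the inputs have different lengths, the extra tail counts as differing.
    """
    ranges: List[Tuple[int, int]] = []
    start = None
    common = min(len(a), len(b))
    end = max(len(a), len(b))
    for i in range(common):
        if a[i] != b[i]:
            if start is None:
                start = i  # a new differing run begins here
        elif start is not None:
            ranges.append((start, i))  # the run ended at the previous byte
            start = None
    if start is not None:
        ranges.append((start, end))
    elif common != end:
        ranges.append((common, end))
    return ranges


# Toy example: two buffers that differ only in bytes 6 and 7.
a = b"PAR1" + bytes(range(10)) + b"PAR1"
b = bytearray(a)
b[6], b[7] = 0xFF, 0xFE
print(differing_ranges(a, bytes(b)))  # [(6, 8)]

# Assumed usage on two writes of the same dataframe (hypothetical file names):
#   with open('pass.parquet', 'rb') as f1, open('fail.parquet', 'rb') as f2:
#       print(differing_ranges(f1.read(), f2.read()))
```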

For feat_17, the error always occurs at these physical (iloc) positions:
[19979, 19980, 19981, 19982, 19984, 19985, 19986, 19987, 19988, 19989, 19990, 19991]

Converting the data to float64 prevents this error, but only when also writing with snappy compression. Converting the data to float64 does NOT prevent the error when writing uncompressed!

There is a weird interaction with compression. feat_17 fails 25% of the time with compression='snappy', but 100% of the time with compression=None or compression='ZSTD'.

Setting compression to None makes the failures more reproducible.

  • feat_15: fails with Unexpected end of stream
  • feat_17: fails with Unexpected end of stream
  • feat_6: pandas doesn't fail - but 14 data points are corrupted!
  • feat_20: pandas doesn't fail - but 13 data points are corrupted!

Update:
It seems that the root cause is in dictionary encoding: somehow cuDF is writing dictionary keys that exceed the number of dictionary entries.
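One reason such keys can reach a reader undetected: dictionary keys are stored with a fixed bit width, and that width usually leaves room for values past the last entry. The arithmetic below uses the two numbers from the fastparquet traceback earlier in this thread (key 8164, dictionary size 8131). It assumes the writer picks the minimum bit width for the dictionary size, which is not confirmed here for cuDF.

```python
def index_bit_width(dictionary_size: int) -> int:
    """Bits needed to represent the largest valid key, dictionary_size - 1."""
    if dictionary_size <= 0:
        raise ValueError("dictionary_size must be positive")
    return (dictionary_size - 1).bit_length()


size, bad_key = 8131, 8164              # values from the traceback above
width = index_bit_width(size)           # minimum width for 8131 entries
largest_encodable = (1 << width) - 1    # largest key that fits in that width

print(width, largest_encodable)         # 13 8191
print(bad_key <= largest_encodable)     # True: the key fits in the bit width
print(bad_key < size)                   # False: but no dictionary entry exists
```

So the bit width alone cannot reject key 8164. Only a check against the dictionary size catches it.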

@GregoryKimball GregoryKimball moved this to Needs owner in libcudf May 15, 2023
@GregoryKimball GregoryKimball added 1 - On Deck To be worked on next libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels May 15, 2023
rapids-bot bot pushed a commit that referenced this issue May 18, 2023
Fixes #13250. The page size estimation for dictionary encoded pages adds a term to estimate overhead bytes for the `bit-packed-header` used when encoding bit-packed literal runs. This term originally used a value of `256`, but it's hard to see where that value comes from. This PR changes the value to `8`, with a possible justification being that the minimum length of a literal run is 8 values. The worst case would be multiple runs of 8, with the required overhead bytes then being `num_values/8`. This also adds a test that has been verified to fail for values larger than 16 in the problematic term.

Authors:
  - Ed Seidl (https://github.com/etseidl)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)
  - Nghia Truong (https://github.com/ttnghia)

URL: #13364
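To make the `num_values/8` bound in the commit message concrete: in Parquet's RLE/bit-packed hybrid encoding, each bit-packed run starts with a varint header holding `(run_length / 8) << 1 | 1`, and a run always covers a multiple of 8 values. The sketch below works through that arithmetic. It is not the libcudf code, and the function names are made up for illustration.

```python
def varint_len(value: int) -> int:
    """Number of bytes in the unsigned varint (ULEB128) encoding of value."""
    n = 1
    while value >= 0x80:
        value >>= 7
        n += 1
    return n


def bit_packed_header_len(num_values: int) -> int:
    """Header bytes for one bit-packed run of num_values (a multiple of 8)."""
    if num_values <= 0 or num_values % 8:
        raise ValueError("a bit-packed run holds a positive multiple of 8 values")
    groups = num_values // 8
    return varint_len((groups << 1) | 1)


def worst_case_header_overhead(num_values: int) -> int:
    """Bound from the commit message: one header byte per run of 8 values."""
    return -(-num_values // 8)  # ceiling division


# A minimal run of 8 values needs 1 header byte, and a run of 512 values needs 2.
print(bit_packed_header_len(8), bit_packed_header_len(512))  # 1 2
# For the 25000-row column in this issue, the bound is 3125 header bytes.
print(worst_case_header_overhead(25000))  # 3125
```

Longer runs need proportionally fewer header bytes, so the runs-of-8 case is the one that maximizes header overhead per value.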
@GregoryKimball GregoryKimball removed the status in libcudf Jun 28, 2023
@GregoryKimball GregoryKimball removed this from libcudf Jun 28, 2023