
[BUG] read_parquet/to_parquet don't handle empty dataframes correctly #12243

Closed
wence- opened this issue Nov 25, 2022 · 8 comments · Fixed by #15373
Labels
0 - Backlog (In queue waiting for assignment) · bug (Something isn't working) · cuIO (cuIO issue) · good first issue (Good for newcomers) · libcudf (Affects libcudf (C++/CUDA) code)

Comments


wence- commented Nov 25, 2022

Describe the bug

[This is (possibly) related to https://github.com/pandas-dev/pandas/issues/37897]

An empty dataframe cannot be round-tripped through to_parquet/read_parquet:

import cudf
from io import BytesIO

df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))

buf = BytesIO()
df.to_parquet(buf, index=True)
df_read = cudf.read_parquet(buf) # => IndexError
File parquet.pyx:123, in cudf._lib.parquet.read_parquet()

File parquet.pyx:145, in cudf._lib.parquet.read_parquet()

File utils.pyx:52, in cudf._lib.io.utils.make_source_info()

IndexError: Out of bounds on buffer access (axis 0)

If I try to read the buffer in pandas, I also get an error:

import cudf
import pandas as pd
from io import BytesIO

df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))

buf = BytesIO()
df.to_parquet(buf, index=True)
df_read = pd.read_parquet(buf) # => ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet file size is 0 bytes
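The "Parquet file size is 0 bytes" message suggests the reader sees an empty buffer. As a reader-independent diagnostic (not part of the original report), one can check whether the writer emitted anything that even looks like a parquet file; a minimal stdlib-only sketch:

```python
from io import BytesIO

def looks_like_parquet(buf: BytesIO) -> bool:
    """Check the 4-byte b"PAR1" magic that frames every parquet file."""
    data = buf.getvalue()
    return len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

# An empty buffer fails the check, matching the "file size is 0 bytes" error.
print(looks_like_parquet(BytesIO()))  # False
```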

If I create the buffer from pandas, I get a bad allocation in libcudf:

import cudf
import pandas as pd
from io import BytesIO

df = pd.DataFrame(index=pd.RangeIndex(0, 10))

buf = BytesIO()
df.to_parquet(buf, index=True)
df_read = cudf.read_parquet(buf)

File parquet.pyx:123, in cudf._lib.parquet.read_parquet()

File parquet.pyx:162, in cudf._lib.parquet.read_parquet()

MemoryError: std::bad_alloc

Expected behavior

cuDF should not segfault on invalid inputs. It should also produce a valid parquet file in this case.

cuDF should be able to round-trip empty dataframes through a parquet file.

@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify labels Nov 25, 2022
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 29, 2022
@wence- wence- mentioned this issue Dec 15, 2022

wence- commented Jan 19, 2023

In current 23.02 the segmentation faults are gone (good!). But it continues to be the case that none of these examples completes successfully.

It appears that read_parquet and to_parquet don't play well with empty dataframes.

@wence- wence- changed the title [BUG] Segfault in read_parquet for empty dataframe [BUG] read_parquet/to_parquet don't handle empty dataframes correctly Jan 19, 2023

wence- commented Jan 19, 2023

@GregoryKimball I think this probably needs some handling in libcudf; we could special-case empty dataframes in Python, but that feels hacky.

@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@GregoryKimball GregoryKimball assigned PointKernel and bdice and unassigned wence- Apr 11, 2023

wence- commented Nov 23, 2023

An update, with 24.02:

Case 1: segfaults
Case 2: KeyError: '__index_level_0__'
Case 3: Works

mhaseeb123 commented

Investigating this. Some insights:

  1. The problems occur when the index column is encoded into the binary data, i.e. cudf.to_parquet(index=True). The following works perfectly because the index column is encoded only in the metadata, as {"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}:
import cudf
import pandas as pd
from io import BytesIO
df = (cudf or pd).DataFrame(index=(cudf or pd).RangeIndex(0, 10))
buf = BytesIO()
df.to_parquet(buf, index=None)
df_read = cudf.read_parquet(buf)
  2. When the index column is encoded (index=True), the to_parquet() functions from pandas and cudf produce very different parquet files. Specifically, the pandas version writes an extra column (__index_level_0__) in both the binary data and the metadata, whereas the cudf version doesn't and leaves "columns": []. This leads to a segfault when trying to read the data back.
# written by pd.to_parquet()

b'PAR1\x15\x04\x15\xa0\x01\x15dL\x15\x14\x15\x00\x12\x00\x00P\x00\x00\r\x01\x00\x01\r\x08\x00
\x02\r\x08\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08<\x08\x00\x00
\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x15$,\x15\x14\x15\x10\x15\x06
\x15\x06\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00
\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x10<\x02\x00\x00\x00\x14\x01\x04\x05\x102Tv\x98\x00\x00\x00&\xa8\x02\x1c\x15\x04
\x195\x00\x06\x10\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16\xd8\x02\x16\xa0\x02&\x8a
\x01&\x08\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00
\x00\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x19,5\x00
\x18\x06schema\x15\x02\x00\x15\x04%\x02\x18\x11__index_level_0__\x00\x16\x14\x19\x1c\x19
\x1c&\xa8\x02\x1c\x15\x04\x195\x00\x06\x10\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16
\xd8\x02\x16\xa0\x02&\x8a\x01&\x08\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00
\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00
\x00\x00\x16\xd8\x02\x16\x14&\x08\x16\xa0\x02\x14\x00\x00\x19,\x18\x06pandas\x18\xf5\x02
{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, 
"pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "columns": [{"name": null, 
"field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], 
"creator": {"library": "pyarrow", "version": "14.0.2"}, "pandas_version": "2.2.1"}\x00\x18\x0cARROW:schema
\x18\x80\x06/////zgCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAKw
BAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAHUBAAB7ImluZGV4X2
NvbHVtbnMiOiBbIl9faW5kZXhfbGV2ZWxfMF9fIl0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAi
ZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgI
m1ldGFkYXRhIjogbnVsbH1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6ICJfX2luZGV
4X2xldmVsXzBfXyIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkY
XRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMTQuMC4yIn0sI
CJwYW5kYXNfdmVyc2lvbiI6ICIyLjIuMSJ9AAAAAQAAABQAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQI
QAAAALAAAAAQAAAAAAAAAEQAAAF9faW5kZXhfbGV2ZWxfMF9fAAAACAAMAAgABwAIAAAAAAAAAUAAA
AAAAAAA\x00\x18 parquet-cpp-arrow version 14.0.2\x19\x1c\x1c\x00\x00\x00e\x05\x00\x00PAR1'

# written by cudf
b'PAR1\x15\x00\x15\xa0\x01\x15R,\x15\x14\x15\x00\x15\x06\x15\x06\x00\x00P\x00\x00\r\x01\x00\x01\r\x08
\x00\x02\r\x08\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08\x00\x08\r\x08\x00
\t\r\x08\x15\x02\x19,H\x06schema\x15\x02\x00\x15\x04%\x00\x18\x11__index_level_0__\x00\x16\x14\x19\x1c
\x19\x1c&\x00\x1c\x15\x04\x19\x15\x00\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16\xc4\x01\x16v&\x08
<6\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16\xc4
\x01\x16\x14\x00\x19\x1c\x18\x06pandas\x18\x84\x02{"index_columns": ["__index_level_0__"], "column_indexes":
 [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], 
"columns": [], "creator": {"library": "pyarrow", "version": "14.0.2"}, "pandas_version": "2.2.1"}\x00\x18\x15cudf 
version 24.06.00\x19\x1c\x1c\x00\x00\x00\xa3\x01\x00\x00PAR1'
  3. Still investigating where exactly it fails, so a fix can be added in either the writer or the reader. Debugging with printfs isn't revealing the point of failure, as the code may be crashing at a later point and the debug prints fire throughout the libcudf.read_parquet() API.
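The two dumps above can be compared mechanically. Per the parquet format, a file ends with the Thrift-encoded footer, a 4-byte little-endian footer length, and the b"PAR1" magic; here is a small stdlib-only sketch for pulling the footer out of a raw buffer (illustrative only, not part of any cudf API):

```python
import struct

def extract_footer(data: bytes) -> bytes:
    """Return the raw Thrift footer bytes of a parquet blob.

    Layout per the parquet format spec:
      b"PAR1" | ...column chunks... | footer | <u32 LE footer length> | b"PAR1"
    """
    assert data[:4] == b"PAR1" and data[-4:] == b"PAR1", "not a parquet file"
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return data[-8 - footer_len:-8]

# Build a toy blob to exercise the helper.
footer = b"\x15\x00fake-footer"
blob = b"PAR1" + b"data" + footer + struct.pack("<I", len(footer)) + b"PAR1"
assert extract_footer(blob) == footer
```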


wence- commented Mar 21, 2024

Could you try with this patch?

diff --git a/python/cudf/cudf/_lib/utils.pyx b/python/cudf/cudf/_lib/utils.pyx
index b6637e9df0..aac2fb1de0 100644
--- a/python/cudf/cudf/_lib/utils.pyx
+++ b/python/cudf/cudf/_lib/utils.pyx
@@ -59,7 +59,7 @@ cpdef generate_pandas_metadata(table, index):
     types = []
     index_levels = []
     index_descriptors = []
-
+    columns_to_convert = list(table._columns)
     # Columns
     for name, col in table._data.items():
         if cudf.get_option("mode.pandas_compatible"):
@@ -109,20 +109,23 @@ cpdef generate_pandas_metadata(table, index):
                 else:
                     # When `index=True`, RangeIndex needs to be materialized.
                     materialized_idx = cudf.Index(idx._values, name=idx.name)
-                    descr = \
-                        _index_level_name(
-                            index_name=materialized_idx.name,
-                            level=level,
-                            column_names=col_names
-                        )
-                    index_levels.append(materialized_idx)
-            else:
-                descr = \
-                    _index_level_name(
-                        index_name=idx.name,
+                    descr = _index_level_name(
+                        index_name=materialized_idx.name,
                         level=level,
                         column_names=col_names
                     )
+                    index_levels.append(materialized_idx)
+                    columns_to_convert.append(materialized_idx._values)
+                    col_names.append(descr)
+                    types.append(np_to_pa_dtype(idx.dtype))
+            else:
+                descr = _index_level_name(
+                    index_name=idx.name,
+                    level=level,
+                    column_names=col_names
+                )
+                columns_to_convert.append(idx._values)
+                col_names.append(descr)
                 if isinstance(idx.dtype, cudf.CategoricalDtype):
                     raise ValueError(
                         "'category' column dtypes are currently not "
@@ -141,14 +144,10 @@ cpdef generate_pandas_metadata(table, index):
                         types.append(np_to_pa_dtype(idx.dtype))
 
                 index_levels.append(idx)
-            col_names.append(name)
             index_descriptors.append(descr)
 
     metadata = pa.pandas_compat.construct_metadata(
-        columns_to_convert=[
-            col
-            for col in table._columns
-        ],
+        columns_to_convert=columns_to_convert,
         # It is OKAY to do `.head(0).to_pandas()` because
         # this method will extract `.columns` metadata only
         df=table.head(0).to_pandas(),

I think that should produce the correct metadata in the write.

I think one should still not segfault on incorrect data though.
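For reference, the shape difference the write patch targets can be stated as plain data. This is a hedged sketch distilled from the two hexdumps in the previous comment, not output generated by either library:

```python
# "pandas"-key metadata as pandas/pyarrow writes it: the materialized index
# appears both in "index_columns" and as an entry in "columns".
good = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": None, "field_name": "__index_level_0__",
         "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
    ],
}

# As cudf wrote it before the patch: the index name is referenced in
# "index_columns" but "columns" is empty, so readers find no matching column.
bad = {"index_columns": ["__index_level_0__"], "columns": []}

referenced = set(good["index_columns"])
described = {c["field_name"] for c in good["columns"]}
assert referenced <= described  # consistent metadata
assert not referenced <= {c["field_name"] for c in bad["columns"]}  # broken
```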


wence- commented Mar 21, 2024

To fix the segfault in read with invalid metadata, can you please try:

diff --git a/python/cudf/cudf/_lib/parquet.pyx b/python/cudf/cudf/_lib/parquet.pyx
index d3f5b42337..0fdf2cc287 100644
--- a/python/cudf/cudf/_lib/parquet.pyx
+++ b/python/cudf/cudf/_lib/parquet.pyx
@@ -297,6 +297,8 @@ cpdef read_parquet(filepaths_or_buffers, columns=None, row_groups=None,
         elif set(index_col).issubset(names):
             index_data = df[index_col]
             actual_index_names = list(index_col_names.values())
+            if len(actual_index_names) != len(index_data._data):
+                raise RuntimeError("Encoded metadata for index names doesn't match index data")
             if len(index_data._data) == 1:
                 idx = cudf.Index(
                     index_data._data.columns[0],

See also #15360.
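The guard in the patch is just a length consistency check between the metadata and the index columns actually read back; a standalone sketch of the invariant (with a hypothetical helper name, not the cudf internals):

```python
def check_index_metadata(actual_index_names: list, index_columns: list) -> None:
    """Raise instead of letting unchecked indexing walk off an empty list."""
    if len(actual_index_names) != len(index_columns):
        raise RuntimeError(
            "Encoded metadata for index names doesn't match index data"
        )

# Zero decoded index names but one index column's worth of data: broken.
try:
    check_index_metadata([], ["__index_level_0__"])
except RuntimeError as exc:
    print(exc)
```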

How did I debug this:

  1. Install gdb from conda (this gives us pretty-printers for Python object types)
  2. Compile parquet.pyx with debug symbols:
    a. Edit cudf/_lib/CMakeLists.txt to add set_source_files_properties(parquet.cxx PROPERTIES COMPILE_OPTIONS "-O0;-g")
    b. Build cudf python library without stripping symbols from the installed library: export SKBUILD_INSTALL_STRIP=false and then build-cudf-python or however you do it.
  3. Run under gdb and hit the segfault: gdb --args python bug.py
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__pyx_f_4cudf_4_lib_7parquet_read_parquet (__pyx_v_filepaths_or_buffers=[<_io.BytesIO at remote 0x7ffff78365c0>], __pyx_skip_dispatch=0, __pyx_optional_args=0x7fffffffabb0)
    at /home/coder/cudf/python/cudf/build/conda/cuda-12.2/release/cudf/_lib/parquet.cxx:28731

28731           if (PyDict_SetItem(__pyx_t_5, __pyx_n_s_name, PyList_GET_ITEM(__pyx_v_actual_index_names, 0)) < 0) __PYX_ERR(0, 303, __pyx_L1_error)
(gdb) 
(gdb) p __pyx_v_actual_index_names
$1 = []
(gdb)

Hmm, interesting, what does PyList_GET_ITEM do? (Helps to have a cpython checkout)

#define PyList_GET_ITEM(op, i) (((PyListObject *)(op))->ob_item[i])

So it just indexes the data without error checking (due to the boundscheck=False line in the cython file):

(gdb) p *(PyListObject *)(__pyx_v_actual_index_names)
$3 = {ob_base = {ob_base = {ob_refcnt = 1, ob_type = 0x5555558dbc40 <PyList_Type>}, ob_size = 0}, ob_item = 0x0, allocated = 0}

And there's our null pointer dereference.
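The same access in pure Python is safe because list indexing is bounds-checked; boundscheck=False compiles it down to the raw PyList_GET_ITEM macro, which on an empty list dereferences a NULL ob_item. A minimal Python-level illustration of the checked path:

```python
actual_index_names = []  # the state gdb showed: an empty list

try:
    name = actual_index_names[0]  # CPython bounds-checks and raises
except IndexError as exc:
    print("caught:", exc)

# With boundscheck=False the equivalent C is
#   PyList_GET_ITEM(op, 0)  ->  ((PyListObject *)op)->ob_item[0]
# and ob_item is NULL for an empty list: a null-pointer dereference.
```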

mhaseeb123 commented

@wence- Thank you for taking up from my notes and investigating this further. Also thank you for the helpful instructions on debugging Cython code with gdb, which I didn't know was possible (hence my using prints everywhere); it will be very useful in my ongoing investigation of other Cython issues. Let me apply the above two patches and circle back to let you know whether they solve the problem. Thanks again!

mhaseeb123 commented

@wence- : Confirming that the above patches solve the problem. Please feel free to commit and add a PR. 😊

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from io import BytesIO
import cudf


def empty_dfs():
    df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))
    buf = BytesIO()
    df.to_parquet(buf, index=True)
    df_read = cudf.read_parquet(buf)
    pd_read = pq.read_table(buf)

    print(df, df_read, pd_read)

    df2 = pd.DataFrame(index=pd.RangeIndex(0, 10))
    buf2 = BytesIO()
    df2.to_parquet(buf2, index=True)
    pd2_read = pd.read_parquet(buf2)
    df2_read = cudf.read_parquet(buf2)

    print(df2, df2_read, pd2_read)

if __name__ == "__main__":
    empty_dfs()

wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024
For an empty data frame (with no columns) we would previously not
write the correct metadata, resulting in not correctly round-tripping
through the parquet read/write cycle.

- Closes rapidsai#12243
@rapids-bot rapids-bot bot closed this as completed in dda3f31 Mar 23, 2024