
[BUG] read_parquet/to_parquet don't handle empty dataframes correctly #12243

Closed
wence- opened this issue Nov 25, 2022 · 8 comments · Fixed by #15373
Labels
0 - Backlog (In queue waiting for assignment) · bug (Something isn't working) · cuIO (cuIO issue) · good first issue (Good for newcomers) · libcudf (Affects libcudf (C++/CUDA) code)

Comments


wence- commented Nov 25, 2022

Describe the bug

[This is (possibly) related to https://github.com/pandas-dev/pandas/issues/37897]

An empty dataframe cannot be round-tripped through to_parquet/read_parquet:

import cudf
from io import BytesIO

df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))

buf = BytesIO()
df.to_parquet(buf, index=True)
df_read = cudf.read_parquet(buf) # => IndexError
File parquet.pyx:123, in cudf._lib.parquet.read_parquet()

File parquet.pyx:145, in cudf._lib.parquet.read_parquet()

File utils.pyx:52, in cudf._lib.io.utils.make_source_info()

IndexError: Out of bounds on buffer access (axis 0)

If I try to read the buffer in pandas, I also get an error:

import cudf
import pandas as pd
from io import BytesIO

df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))

buf = BytesIO()
df.to_parquet(buf, index=True)
df_read = pd.read_parquet(buf) # => ArrowInvalid: Could not open Parquet input source '<Buffer>': Parquet file size is 0 bytes
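The "Parquet file size is 0 bytes" message suggests the reader sees an empty buffer. As a reader-independent diagnostic (not part of the original report), one can check whether the writer emitted anything that even looks like a parquet file; a minimal stdlib-only sketch:

```python
from io import BytesIO

def looks_like_parquet(buf: BytesIO) -> bool:
    """Check the 4-byte b"PAR1" magic that frames every parquet file."""
    data = buf.getvalue()
    return len(data) >= 8 and data[:4] == b"PAR1" and data[-4:] == b"PAR1"

# An empty buffer fails the check, matching the "file size is 0 bytes" error.
print(looks_like_parquet(BytesIO()))  # False
```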

If I create the buffer from pandas, I get a bad allocation in libcudf:

import cudf
import pandas as pd
from io import BytesIO

df = pd.DataFrame(index=pd.RangeIndex(0, 10))

buf = BytesIO()
df.to_parquet(buf, index=True)
df_read = cudf.read_parquet(buf)

File parquet.pyx:123, in cudf._lib.parquet.read_parquet()

File parquet.pyx:162, in cudf._lib.parquet.read_parquet()

MemoryError: std::bad_alloc

Expected behavior

cuDF should not segfault on invalid inputs. It should also produce a valid parquet file in this case.

cuDF should be able to round-trip empty dataframes through a parquet file.

@wence- wence- added bug Something isn't working Needs Triage Need team to review and classify labels Nov 25, 2022
@GregoryKimball GregoryKimball added 0 - Backlog In queue waiting for assignment cuIO cuIO issue and removed Needs Triage Need team to review and classify labels Nov 29, 2022
@wence- wence- mentioned this issue Dec 15, 2022

wence- commented Jan 19, 2023

In current 23.02 the segmentation faults are gone (good!). But it continues to be the case that none of these examples completes successfully.

It appears that read_parquet and to_parquet don't play well with empty dataframes.

@wence- wence- changed the title [BUG] Segfault in read_parquet for empty dataframe [BUG] read_parquet/to_parquet don't handle empty dataframes correctly Jan 19, 2023

wence- commented Jan 19, 2023

@GregoryKimball I think this probably needs some handling in libcudf; we could special-case empty dataframes in Python, but that feels hacky.

@GregoryKimball GregoryKimball added the libcudf Affects libcudf (C++/CUDA) code. label Apr 2, 2023
@GregoryKimball GregoryKimball assigned PointKernel and bdice and unassigned wence- Apr 11, 2023

wence- commented Nov 23, 2023

An update, with 24.02:

Case 1: segfaults
Case 2: KeyError: '__index_level_0__'
Case 3: Works

mhaseeb123 commented

Investigating this. Some insights:

  1. The problems occur when the index column is encoded into the binary data, i.e. cudf.to_parquet(index=True). The following works perfectly because the index column is encoded only in the metadata, as {"kind": "range", "name": null, "start": 0, "stop": 3, "step": 1}:
import cudf
import pandas as pd
from io import BytesIO
df = (cudf or pd).DataFrame(index=(cudf or pd).RangeIndex(0, 10))
buf = BytesIO()
df.to_parquet(buf, index=None)
df_read = cudf.read_parquet(buf)
  2. When the index column is encoded (index=True), the to_parquet() functions from pandas and cudf produce very different parquet files. Specifically, the pandas version writes an extra column (__index_level_0__) in both the binary data and the metadata, whereas the cudf version doesn't and leaves "columns": []. This leads to a segfault when trying to read the data back.
# written by pd.to_parquet()

b'PAR1\x15\x04\x15\xa0\x01\x15dL\x15\x14\x15\x00\x12\x00\x00P\x00\x00\r\x01\x00\x01\r\x08\x00
\x02\r\x08\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08<\x08\x00\x00
\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x15$,\x15\x14\x15\x10\x15\x06
\x15\x06\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00
\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x10<\x02\x00\x00\x00\x14\x01\x04\x05\x102Tv\x98\x00\x00\x00&\xa8\x02\x1c\x15\x04
\x195\x00\x06\x10\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16\xd8\x02\x16\xa0\x02&\x8a
\x01&\x08\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00
\x00\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x19,5\x00
\x18\x06schema\x15\x02\x00\x15\x04%\x02\x18\x11__index_level_0__\x00\x16\x14\x19\x1c\x19
\x1c&\xa8\x02\x1c\x15\x04\x195\x00\x06\x10\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16
\xd8\x02\x16\xa0\x02&\x8a\x01&\x08\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00
\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00
\x00\x00\x16\xd8\x02\x16\x14&\x08\x16\xa0\x02\x14\x00\x00\x19,\x18\x06pandas\x18\xf5\x02
{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null, 
"pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "columns": [{"name": null, 
"field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}], 
"creator": {"library": "pyarrow", "version": "14.0.2"}, "pandas_version": "2.2.1"}\x00\x18\x0cARROW:schema
\x18\x80\x06/////zgCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAKw
BAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAHUBAAB7ImluZGV4X2
NvbHVtbnMiOiBbIl9faW5kZXhfbGV2ZWxfMF9fIl0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAi
ZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgI
m1ldGFkYXRhIjogbnVsbH1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6ICJfX2luZGV
4X2xldmVsXzBfXyIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkY
XRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMTQuMC4yIn0sI
CJwYW5kYXNfdmVyc2lvbiI6ICIyLjIuMSJ9AAAAAQAAABQAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQI
QAAAALAAAAAQAAAAAAAAAEQAAAF9faW5kZXhfbGV2ZWxfMF9fAAAACAAMAAgABwAIAAAAAAAAAUAAA
AAAAAAA\x00\x18 parquet-cpp-arrow version 14.0.2\x19\x1c\x1c\x00\x00\x00e\x05\x00\x00PAR1'

# written by cudf
b'PAR1\x15\x00\x15\xa0\x01\x15R,\x15\x14\x15\x00\x15\x06\x15\x06\x00\x00P\x00\x00\r\x01\x00\x01\r\x08
\x00\x02\r\x08\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08\x00\x08\r\x08\x00
\t\r\x08\x15\x02\x19,H\x06schema\x15\x02\x00\x15\x04%\x00\x18\x11__index_level_0__\x00\x16\x14\x19\x1c
\x19\x1c&\x00\x1c\x15\x04\x19\x15\x00\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16\xc4\x01\x16v&\x08
<6\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16\xc4
\x01\x16\x14\x00\x19\x1c\x18\x06pandas\x18\x84\x02{"index_columns": ["__index_level_0__"], "column_indexes":
 [{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}], 
"columns": [], "creator": {"library": "pyarrow", "version": "14.0.2"}, "pandas_version": "2.2.1"}\x00\x18\x15cudf 
version 24.06.00\x19\x1c\x1c\x00\x00\x00\xa3\x01\x00\x00PAR1'
  3. Still investigating where exactly it fails, so a fix can be added in either the writer or the reader. Debugging with printfs isn't revealing the point of failure, as the code may be crashing at a later point and the debug prints fire throughout the libcudf.read_parquet() API.
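The two dumps above can be compared mechanically. Per the parquet format, a file ends with the Thrift-encoded footer, a 4-byte little-endian footer length, and the b"PAR1" magic; here is a small stdlib-only sketch for pulling the footer out of a raw buffer (illustrative only, not part of any cudf API):

```python
import struct

def extract_footer(data: bytes) -> bytes:
    """Return the raw Thrift footer bytes of a parquet blob.

    Layout per the parquet format spec:
      b"PAR1" | ...column chunks... | footer | <u32 LE footer length> | b"PAR1"
    """
    assert data[:4] == b"PAR1" and data[-4:] == b"PAR1", "not a parquet file"
    (footer_len,) = struct.unpack("<I", data[-8:-4])
    return data[-8 - footer_len:-8]

# Build a toy blob to exercise the helper.
footer = b"\x15\x00fake-footer"
blob = b"PAR1" + b"data" + footer + struct.pack("<I", len(footer)) + b"PAR1"
assert extract_footer(blob) == footer
```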


wence- commented Mar 21, 2024

Could you try with this patch?

diff --git a/python/cudf/cudf/_lib/utils.pyx b/python/cudf/cudf/_lib/utils.pyx
index b6637e9df0..aac2fb1de0 100644
--- a/python/cudf/cudf/_lib/utils.pyx
+++ b/python/cudf/cudf/_lib/utils.pyx
@@ -59,7 +59,7 @@ cpdef generate_pandas_metadata(table, index):
     types = []
     index_levels = []
     index_descriptors = []
-
+    columns_to_convert = list(table._columns)
     # Columns
     for name, col in table._data.items():
         if cudf.get_option("mode.pandas_compatible"):
@@ -109,20 +109,23 @@ cpdef generate_pandas_metadata(table, index):
                 else:
                     # When `index=True`, RangeIndex needs to be materialized.
                     materialized_idx = cudf.Index(idx._values, name=idx.name)
-                    descr = \
-                        _index_level_name(
-                            index_name=materialized_idx.name,
-                            level=level,
-                            column_names=col_names
-                        )
-                    index_levels.append(materialized_idx)
-            else:
-                descr = \
-                    _index_level_name(
-                        index_name=idx.name,
+                    descr = _index_level_name(
+                        index_name=materialized_idx.name,
                         level=level,
                         column_names=col_names
                     )
+                    index_levels.append(materialized_idx)
+                    columns_to_convert.append(materialized_idx._values)
+                    col_names.append(descr)
+                    types.append(np_to_pa_dtype(idx.dtype))
+            else:
+                descr = _index_level_name(
+                    index_name=idx.name,
+                    level=level,
+                    column_names=col_names
+                )
+                columns_to_convert.append(idx._values)
+                col_names.append(descr)
                 if isinstance(idx.dtype, cudf.CategoricalDtype):
                     raise ValueError(
                         "'category' column dtypes are currently not "
@@ -141,14 +144,10 @@ cpdef generate_pandas_metadata(table, index):
                         types.append(np_to_pa_dtype(idx.dtype))
 
                 index_levels.append(idx)
-            col_names.append(name)
             index_descriptors.append(descr)
 
     metadata = pa.pandas_compat.construct_metadata(
-        columns_to_convert=[
-            col
-            for col in table._columns
-        ],
+        columns_to_convert=columns_to_convert,
         # It is OKAY to do `.head(0).to_pandas()` because
         # this method will extract `.columns` metadata only
         df=table.head(0).to_pandas(),

I think that should produce the correct metadata in the write.

I think one should still not segfault on incorrect data though.
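For reference, the shape difference the write patch targets can be stated as plain data. This is a hedged sketch distilled from the two hexdumps in the previous comment, not output generated by either library:

```python
# "pandas"-key metadata as pandas/pyarrow writes it: the materialized index
# appears both in "index_columns" and as an entry in "columns".
good = {
    "index_columns": ["__index_level_0__"],
    "columns": [
        {"name": None, "field_name": "__index_level_0__",
         "pandas_type": "int64", "numpy_type": "int64", "metadata": None},
    ],
}

# As cudf wrote it before the patch: the index name is referenced in
# "index_columns" but "columns" is empty, so readers find no matching column.
bad = {"index_columns": ["__index_level_0__"], "columns": []}

referenced = set(good["index_columns"])
described = {c["field_name"] for c in good["columns"]}
assert referenced <= described  # consistent metadata
assert not referenced <= {c["field_name"] for c in bad["columns"]}  # broken
```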


wence- commented Mar 21, 2024

To fix the segfault in read with invalid metadata, can you please try:

diff --git a/python/cudf/cudf/_lib/parquet.pyx b/python/cudf/cudf/_lib/parquet.pyx
index d3f5b42337..0fdf2cc287 100644
--- a/python/cudf/cudf/_lib/parquet.pyx
+++ b/python/cudf/cudf/_lib/parquet.pyx
@@ -297,6 +297,8 @@ cpdef read_parquet(filepaths_or_buffers, columns=None, row_groups=None,
         elif set(index_col).issubset(names):
             index_data = df[index_col]
             actual_index_names = list(index_col_names.values())
+            if len(actual_index_names) != len(index_data._data):
+                raise RuntimeError("Encoded metadata for index names doesn't match index data")
             if len(index_data._data) == 1:
                 idx = cudf.Index(
                     index_data._data.columns[0],

See also #15360.
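The guard in the patch is just a length consistency check between the metadata and the index columns actually read back; a standalone sketch of the invariant (with a hypothetical helper name, not the cudf internals):

```python
def check_index_metadata(actual_index_names: list, index_columns: list) -> None:
    """Raise instead of letting unchecked indexing walk off an empty list."""
    if len(actual_index_names) != len(index_columns):
        raise RuntimeError(
            "Encoded metadata for index names doesn't match index data"
        )

# Zero decoded index names but one index column's worth of data: broken.
try:
    check_index_metadata([], ["__index_level_0__"])
except RuntimeError as exc:
    print(exc)
```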

How did I debug this:

  1. Install gdb from conda (this gives us pretty-printers for Python object types)
  2. Compile parquet.pyx with debug symbols:
    a. Edit cudf/_lib/CMakeLists.txt to add set_source_files_properties(parquet.cxx PROPERTIES COMPILE_OPTIONS "-O0;-g")
    b. Build cudf python library without stripping symbols from the installed library: export SKBUILD_INSTALL_STRIP=false and then build-cudf-python or however you do it.
  3. Run under gdb and hit the segfault: gdb --args python bug.py
Thread 1 "python" received signal SIGSEGV, Segmentation fault.
__pyx_f_4cudf_4_lib_7parquet_read_parquet (__pyx_v_filepaths_or_buffers=[<_io.BytesIO at remote 0x7ffff78365c0>], __pyx_skip_dispatch=0, __pyx_optional_args=0x7fffffffabb0)
    at /home/coder/cudf/python/cudf/build/conda/cuda-12.2/release/cudf/_lib/parquet.cxx:28731

28731           if (PyDict_SetItem(__pyx_t_5, __pyx_n_s_name, PyList_GET_ITEM(__pyx_v_actual_index_names, 0)) < 0) __PYX_ERR(0, 303, __pyx_L1_error)
(gdb) 
(gdb) p __pyx_v_actual_index_names
$1 = []
(gdb)

Hmm, interesting, what does PyList_GET_ITEM do? (Helps to have a cpython checkout)

#define PyList_GET_ITEM(op, i) (((PyListObject *)(op))->ob_item[i])

So it just indexes the data without error checking (due to the boundscheck=False line in the cython file):

(gdb) p *(PyListObject *)(__pyx_v_actual_index_names)
$3 = {ob_base = {ob_base = {ob_refcnt = 1, ob_type = 0x5555558dbc40 <PyList_Type>}, ob_size = 0}, ob_item = 0x0, allocated = 0}

And there's our null pointer dereference.
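The same access in pure Python is safe because list indexing is bounds-checked; boundscheck=False compiles it down to the raw PyList_GET_ITEM macro, which on an empty list dereferences a NULL ob_item. A minimal Python-level illustration of the checked path:

```python
actual_index_names = []  # the state gdb showed: an empty list

try:
    name = actual_index_names[0]  # CPython bounds-checks and raises
except IndexError as exc:
    print("caught:", exc)

# With boundscheck=False the equivalent C is
#   PyList_GET_ITEM(op, 0)  ->  ((PyListObject *)op)->ob_item[0]
# and ob_item is NULL for an empty list: a null-pointer dereference.
```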

mhaseeb123 commented

@wence- Thank you for taking up from my notes and investigating this further. Also thank you for the helpful instructions on debugging Cython code with gdb, which I didn't know was possible (hence my using prints everywhere); it will be very useful in my ongoing investigation of other Cython issues. Let me apply the above two patches and circle back to let you know whether they solve the problem. Thanks again!

mhaseeb123 commented

@wence- : Confirming that the above patches solve the problem. Please feel free to commit and add a PR. 😊

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from io import BytesIO
import cudf


def empty_dfs():
    df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))
    buf = BytesIO()
    df.to_parquet(buf, index=True)
    df_read = cudf.read_parquet(buf)
    pd_read = pq.read_table(buf)

    print(df, df_read, pd_read)

    df2 = pd.DataFrame(index=pd.RangeIndex(0, 10))
    buf2 = BytesIO()
    df2.to_parquet(buf2, index=True)
    pd2_read = pd.read_parquet(buf2)
    df2_read = cudf.read_parquet(buf2)

    print(df2, df2_read, pd2_read)

if __name__ == "__main__":
    empty_dfs()

wence- added a commit to wence-/cudf that referenced this issue Mar 22, 2024
For an empty data frame (with no columns) we would previously not
write the correct metadata, resulting in not correctly round-tripping
through the parquet read/write cycle.

- Closes rapidsai#12243
@rapids-bot rapids-bot bot closed this as completed in dda3f31 Mar 23, 2024