[BUG] read_parquet/to_parquet don't handle empty dataframes correctly #12243
In current 23.02 the segmentation faults are gone (good!), but none of these examples completes successfully yet. It appears that …
@GregoryKimball I think this probably needs some handling in libcudf; although we could special-case empty dataframes in Python, that feels hacky.
An update, with 24.02: Case 1 segfaults.
Investigating this. Some insights:

import cudf
import pandas as pd
from io import BytesIO

# Same behaviour whether the frame is built with cudf or pandas;
# `(cudf or pd)` evaluates to cudf here, swap the operands to exercise the pandas path.
df = (cudf or pd).DataFrame(index=(cudf or pd).RangeIndex(0, 10))
buf = BytesIO()
df.to_parquet(buf, index=None)
df_read = cudf.read_parquet(buf)
# written by pd.to_parquet()
b'PAR1\x15\x04\x15\xa0\x01\x15dL\x15\x14\x15\x00\x12\x00\x00P\x00\x00\r\x01\x00\x01\r\x08\x00
\x02\r\x08\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08<\x08\x00\x00
\x00\x00\x00\x00\x00\t\x00\x00\x00\x00\x00\x00\x00\x15\x00\x15\x15$,\x15\x14\x15\x10\x15\x06
\x15\x06\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00
\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x00\x10<\x02\x00\x00\x00\x14\x01\x04\x05\x102Tv\x98\x00\x00\x00&\xa8\x02\x1c\x15\x04
\x195\x00\x06\x10\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16\xd8\x02\x16\xa0\x02&\x8a
\x01&\x08\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00
\x00\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00
\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00\x00\x00\x15\x04\x19,5\x00
\x18\x06schema\x15\x02\x00\x15\x04%\x02\x18\x11__index_level_0__\x00\x16\x14\x19\x1c\x19
\x1c&\xa8\x02\x1c\x15\x04\x195\x00\x06\x10\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16
\xd8\x02\x16\xa0\x02&\x8a\x01&\x08\x1c\x18\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00
\x00\x00\x00\x00\x00\x00\x00\x16\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00
\x00\x00\x00\x00\x00\x00\x00\x19,\x15\x04\x15\x00\x15\x02\x00\x15\x00\x15\x10\x15\x02\x00
\x00\x00\x16\xd8\x02\x16\x14&\x08\x16\xa0\x02\x14\x00\x00\x19,\x18\x06pandas\x18\xf5\x02
{"index_columns": ["__index_level_0__"], "column_indexes": [{"name": null, "field_name": null,
"pandas_type": "int64", "numpy_type": "int64", "metadata": null}], "columns": [{"name": null,
"field_name": "__index_level_0__", "pandas_type": "int64", "numpy_type": "int64", "metadata": null}],
"creator": {"library": "pyarrow", "version": "14.0.2"}, "pandas_version": "2.2.1"}\x00\x18\x0cARROW:schema
\x18\x80\x06/////zgCAAAQAAAAAAAKAA4ABgAFAAgACgAAAAABBAAQAAAAAAAKAAwAAAAEAAgACgAAAKw
BAAAEAAAAAQAAAAwAAAAIAAwABAAIAAgAAAAIAAAAEAAAAAYAAABwYW5kYXMAAHUBAAB7ImluZGV4X2
NvbHVtbnMiOiBbIl9faW5kZXhfbGV2ZWxfMF9fIl0sICJjb2x1bW5faW5kZXhlcyI6IFt7Im5hbWUiOiBudWxsLCAi
ZmllbGRfbmFtZSI6IG51bGwsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgI
m1ldGFkYXRhIjogbnVsbH1dLCAiY29sdW1ucyI6IFt7Im5hbWUiOiBudWxsLCAiZmllbGRfbmFtZSI6ICJfX2luZGV
4X2xldmVsXzBfXyIsICJwYW5kYXNfdHlwZSI6ICJpbnQ2NCIsICJudW1weV90eXBlIjogImludDY0IiwgIm1ldGFkY
XRhIjogbnVsbH1dLCAiY3JlYXRvciI6IHsibGlicmFyeSI6ICJweWFycm93IiwgInZlcnNpb24iOiAiMTQuMC4yIn0sI
CJwYW5kYXNfdmVyc2lvbiI6ICIyLjIuMSJ9AAAAAQAAABQAAAAQABQACAAGAAcADAAAABAAEAAAAAAAAQI
QAAAALAAAAAQAAAAAAAAAEQAAAF9faW5kZXhfbGV2ZWxfMF9fAAAACAAMAAgABwAIAAAAAAAAAUAAA
AAAAAAA\x00\x18 parquet-cpp-arrow version 14.0.2\x19\x1c\x1c\x00\x00\x00e\x05\x00\x00PAR1'
# written by cudf
b'PAR1\x15\x00\x15\xa0\x01\x15R,\x15\x14\x15\x00\x15\x06\x15\x06\x00\x00P\x00\x00\r\x01\x00\x01\r\x08
\x00\x02\r\x08\x00\x03\r\x08\x00\x04\r\x08\x00\x05\r\x08\x00\x06\r\x08\x00\x07\r\x08\x00\x08\r\x08\x00
\t\r\x08\x15\x02\x19,H\x06schema\x15\x02\x00\x15\x04%\x00\x18\x11__index_level_0__\x00\x16\x14\x19\x1c
\x19\x1c&\x00\x1c\x15\x04\x19\x15\x00\x19\x18\x11__index_level_0__\x15\x02\x16\x14\x16\xc4\x01\x16v&\x08
<6\x00(\x08\t\x00\x00\x00\x00\x00\x00\x00\x18\x08\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16\xc4
\x01\x16\x14\x00\x19\x1c\x18\x06pandas\x18\x84\x02{"index_columns": ["__index_level_0__"], "column_indexes":
[{"name": null, "field_name": null, "pandas_type": "empty", "numpy_type": "object", "metadata": null}],
"columns": [], "creator": {"library": "pyarrow", "version": "14.0.2"}, "pandas_version": "2.2.1"}\x00\x18\x15cudf
version 24.06.00\x19\x1c\x1c\x00\x00\x00\xa3\x01\x00\x00PAR1'
Could you try with this patch?

diff --git a/python/cudf/cudf/_lib/utils.pyx b/python/cudf/cudf/_lib/utils.pyx
index b6637e9df0..aac2fb1de0 100644
--- a/python/cudf/cudf/_lib/utils.pyx
+++ b/python/cudf/cudf/_lib/utils.pyx
@@ -59,7 +59,7 @@ cpdef generate_pandas_metadata(table, index):
types = []
index_levels = []
index_descriptors = []
-
+ columns_to_convert = list(table._columns)
# Columns
for name, col in table._data.items():
if cudf.get_option("mode.pandas_compatible"):
@@ -109,20 +109,23 @@ cpdef generate_pandas_metadata(table, index):
else:
# When `index=True`, RangeIndex needs to be materialized.
materialized_idx = cudf.Index(idx._values, name=idx.name)
- descr = \
- _index_level_name(
- index_name=materialized_idx.name,
- level=level,
- column_names=col_names
- )
- index_levels.append(materialized_idx)
- else:
- descr = \
- _index_level_name(
- index_name=idx.name,
+ descr = _index_level_name(
+ index_name=materialized_idx.name,
level=level,
column_names=col_names
)
+ index_levels.append(materialized_idx)
+ columns_to_convert.append(materialized_idx._values)
+ col_names.append(descr)
+ types.append(np_to_pa_dtype(idx.dtype))
+ else:
+ descr = _index_level_name(
+ index_name=idx.name,
+ level=level,
+ column_names=col_names
+ )
+ columns_to_convert.append(idx._values)
+ col_names.append(descr)
if isinstance(idx.dtype, cudf.CategoricalDtype):
raise ValueError(
"'category' column dtypes are currently not "
@@ -141,14 +144,10 @@ cpdef generate_pandas_metadata(table, index):
types.append(np_to_pa_dtype(idx.dtype))
index_levels.append(idx)
- col_names.append(name)
index_descriptors.append(descr)
metadata = pa.pandas_compat.construct_metadata(
- columns_to_convert=[
- col
- for col in table._columns
- ],
+ columns_to_convert=columns_to_convert,
# It is OKAY to do `.head(0).to_pandas()` because
# this method will extract `.columns` metadata only
df=table.head(0).to_pandas(),

I think that should produce the correct metadata in the write. One should still not segfault on incorrect data, though.
To fix the segfault in read with invalid metadata, can you please try:

diff --git a/python/cudf/cudf/_lib/parquet.pyx b/python/cudf/cudf/_lib/parquet.pyx
index d3f5b42337..0fdf2cc287 100644
--- a/python/cudf/cudf/_lib/parquet.pyx
+++ b/python/cudf/cudf/_lib/parquet.pyx
@@ -297,6 +297,8 @@ cpdef read_parquet(filepaths_or_buffers, columns=None, row_groups=None,
elif set(index_col).issubset(names):
index_data = df[index_col]
actual_index_names = list(index_col_names.values())
+ if len(actual_index_names) != len(index_data._data):
+ raise RuntimeError("Encoded metadata for index names doesn't match index data")
if len(index_data._data) == 1:
idx = cudf.Index(
index_data._data.columns[0],

See also #15360.

How did I debug this:
Hmm, interesting, what does …

So it just indexes the data without error checking (due to the …)

And there's our null pointer dereference.
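The guard in the read-side patch boils down to a length check between the index names promised by the metadata and the index columns actually read; the failure mode can be sketched in plain Python (the function and variable names here are illustrative stand-ins, not cudf internals):

```python
def check_index_metadata(actual_index_names, index_columns):
    # Mirrors the patch: if the metadata promises more index columns than
    # were actually read, refuse early instead of blindly indexing columns[0].
    if len(actual_index_names) != len(index_columns):
        raise RuntimeError(
            "Encoded metadata for index names doesn't match index data"
        )

# The broken cudf file: metadata names one index column, but no data columns exist.
try:
    check_index_metadata(["__index_level_0__"], [])
except RuntimeError as exc:
    print(exc)
```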
@wence- Thank you for picking this up from my notes and investigating it further. Thank you also for the helpful instructions on debugging Cython code with gdb, which I didn't know was possible (hence my use of prints everywhere); it will be very useful in my ongoing investigation of other Cython issues. Let me apply the two patches above and circle back to you on whether they solve the problem. Thanks again!
@wence- : Confirming that the above patches solve the problem. Please feel free to commit and open a PR. 😊

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd
from io import BytesIO
import cudf
def empty_dfs():
    df = cudf.DataFrame(index=cudf.RangeIndex(0, 10))
    buf = BytesIO()
    df.to_parquet(buf, index=True)
    df_read = cudf.read_parquet(buf)
    pd_read = pq.read_table(buf)
    print(df, df_read, pd_read)
    df2 = pd.DataFrame(index=pd.RangeIndex(0, 10))
    buf2 = BytesIO()
    df2.to_parquet(buf2, index=True)
    pd2_read = pd.read_parquet(buf2)
    df2_read = cudf.read_parquet(buf2)
    print(df2, df2_read, pd2_read)

if __name__ == "__main__":
    empty_dfs()
For an empty data frame (with no columns) we would previously not write the correct metadata, resulting in not correctly round-tripping through the parquet read/write cycle. - Closes rapidsai#12243
Describe the bug
[This is (possibly) related to https://github.com/pandas-dev/pandas/issues/37897]
An empty dataframe cannot be round-tripped through to_parquet/read_parquet.
If I try to read the buffer in pandas, I also get an issue.
If I create the buffer from pandas, I get a bad allocation in libcudf.
Expected behavior
cuDF should not segfault on invalid inputs. It should also produce a valid parquet file in this case. cuDF should be able to round-trip empty dataframes through a parquet file.