Fix bug in dask_cudf.read_parquet for index=False #9453

Merged · 1 commit · Oct 15, 2021
10 changes: 7 additions & 3 deletions python/dask_cudf/dask_cudf/io/parquet.py
@@ -228,9 +228,8 @@ def read_partition(

     if index and (index[0] in df.columns):
         df = df.set_index(index[0])
-    elif index is False and set(df.index.names).issubset(columns):
-        # If index=False, we need to make sure all of the
-        # names in `columns` are actually in `df.columns`
+    elif index is False and df.index.names != (None,):
+        # If index=False, we shouldn't have a named index
         df.reset_index(inplace=True)

     return df
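
For context: the old condition only reset the index when its names were a subset of the requested columns, which misses a named index that was not requested; the new condition resets any named index under index=False and leaves a default unnamed RangeIndex alone. A minimal sketch of the two index states, using standalone cudf (not part of this PR) and relying on Index.names being a tuple, as the new comparison implies:

import cudf

df = cudf.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})

# Default RangeIndex: no name, so index=False has nothing to reset.
assert df.index.names == (None,)

# Named index: under index=False it should become a regular column again.
df = df.set_index("a")
assert df.index.names == ("a",)
df.reset_index(inplace=True)
assert "a" in df.columns
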
@@ -331,6 +330,11 @@ def set_object_dtypes_from_pa_schema(df, schema):
     # pyarrow schema.
     if schema:
         for col_name, col in df._data.items():
+            if col_name is None:
+                # Pyarrow cannot handle `None` as a field name.
+                # However, this should be a simple range index that
+                # we can ignore anyway
+                continue
             typ = cudf_dtype_from_pa_type(schema.field(col_name).type)
             if (
                 col_name in schema.names
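
The new early continue guards the schema lookup that follows it: an unnamed index can surface in the frame's internal column mapping under a None key, and None is not a valid pyarrow field name. A minimal illustration with plain pyarrow (not part of this PR):

import pyarrow as pa

schema = pa.schema([("a", pa.int64())])
print(schema.field("a").type)  # int64

# A field lookup needs a string name (or an integer position);
# schema.field(None) raises rather than returning anything useful.
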
11 changes: 11 additions & 0 deletions python/dask_cudf/dask_cudf/io/tests/test_parquet.py
@@ -86,6 +86,17 @@ def test_roundtrip_from_dask_index_false(tmpdir):
     dd.assert_eq(ddf.reset_index(drop=False), ddf2)


+def test_roundtrip_from_dask_none_index_false(tmpdir):
+    tmpdir = str(tmpdir)
+    path = os.path.join(tmpdir, "test.parquet")
+
+    df2 = ddf.reset_index(drop=True).compute()
+    df2.to_parquet(path, engine="pyarrow")
+
+    ddf3 = dask_cudf.read_parquet(path, index=False)
+    dd.assert_eq(df2, ddf3)
+
Comment on lines +90 to +97
Contributor:
Suggested change:
-    tmpdir = str(tmpdir)
-    path = os.path.join(tmpdir, "test.parquet")
-    df2 = ddf.reset_index(drop=True).compute()
-    df2.to_parquet(path, engine="pyarrow")
-    ddf3 = dask_cudf.read_parquet(path, index=False)
-    dd.assert_eq(df2, ddf3)
+    bytes_buf = BytesIO()
+    df2 = ddf.reset_index(drop=True).compute()
+    df2.to_parquet(bytes_buf, engine="pyarrow")
+    ddf3 = dask_cudf.read_parquet(bytes_buf, index=False)
+    dd.assert_eq(df2, ddf3)

Can we use BytesIO instead of interacting with the filesystem?

Member Author:

Ooo - I like your thinking here, but we cannot pass a BytesIO object to dask_cudf.read_parquet without making much larger changes :)

Contributor:

Ah, I see. I thought dask_cudf.read_parquet accepted bytes-like objects, similar to pyarrow/cudf/pandas. Then let's not do it now.
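
For reference, cudf itself can read from an in-memory buffer, which is presumably what motivated the suggestion; it is the path-based, partition-planning dask_cudf reader that cannot. A minimal sketch outside this PR, assuming pandas, pyarrow, and cudf are installed:

from io import BytesIO

import cudf
import pandas as pd

buf = BytesIO()
pd.DataFrame({"a": [1, 2, 3]}).to_parquet(buf, engine="pyarrow")
buf.seek(0)

# cudf reads straight from the file-like object...
gdf = cudf.read_parquet(buf)
print(gdf)

# ...whereas dask_cudf.read_parquet expects a path (or list of paths)
# it can split into partitions, hence the filesystem round trip in the test.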



 @pytest.mark.parametrize("write_meta", [True, False])
 def test_roundtrip_from_dask_cudf(tmpdir, write_meta):
     tmpdir = str(tmpdir)