[BUG] dask-cudf.to_parquet(write_metadata_file=True, append=True) fails #17177

Closed
ayushdg opened this issue Oct 24, 2024 · 2 comments · Fixed by #17198
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

ayushdg (Member) commented Oct 24, 2024

Describe the bug
When writing a dask-cudf DataFrame with both append=True and write_metadata_file=True set, the operation fails while trying to merge the metadata files.

Steps/Code to reproduce bug

import dask_cudf
import cudf

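df = cudf.DataFrame({"a": [1, 2, 3]})
ddf = dask_cudf.from_cudf(df, npartitions=1)
# First write: no dataset exists yet, so there is no prior metadata to merge.
ddf.to_parquet("test.parquet", append=True, write_metadata_file=True, write_index=False)

# Second write: appends to the existing dataset and raises the TypeError below.
ddf.to_parquet("test.parquet", append=True, write_metadata_file=True, write_index=False)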
File /datasets/adattagupta/mambaforge/envs/curator-latest/lib/python3.10/site-packages/dask_cudf/io/parquet.py:423, in CudfEngine.write_metadata(parts, fmd, fs, path, append, **kwargs)
    420     _meta = [fmd]
    421 _meta.extend([parts[i][0]["meta"] for i in range(len(parts))])
    422 _meta = (
--> 423     cudf.io.merge_parquet_filemetadata(_meta)
    424     if len(_meta) > 1
    425     else _meta[0]
    426 )
    427 with fs.open(metadata_path, "wb") as fil:
    428     fil.write(memoryview(_meta))

File /datasets/adattagupta/mambaforge/envs/curator-latest/lib/python3.10/site-packages/cudf/io/parquet.py:1109, in merge_parquet_filemetadata(filemetadata_list)
   1105 @ioutils.doc_merge_parquet_filemetadata()
   1106 def merge_parquet_filemetadata(filemetadata_list):
   1107     """{docstring}"""
-> 1109     return libparquet.merge_filemetadata(filemetadata_list)

File parquet.pyx:829, in cudf._lib.parquet.merge_filemetadata()

File parquet.pyx:842, in cudf._lib.parquet.merge_filemetadata()

File <stringsource>:47, in vector.from_py.__pyx_convert_vector_from_py_uint8_t()

TypeError: 'pyarrow._parquet.FileMetaData' object is not iterable

Expected behavior
No errors

Environment overview (please complete the following information)

  • Environment location: [Bare-metal, Docker, Cloud(specify cloud provider)]
  • Method of cuDF install: [conda, Docker, or from source]
    • If method of install is [Docker], provide docker pull & docker run commands used

Environment details
Tested with cudf/dask-cudf 24.8, though it probably persists in newer releases as well.


@ayushdg ayushdg added the bug Something isn't working label Oct 24, 2024
@vyasr vyasr added the Python Affects Python cuDF API. label Oct 28, 2024
vyasr (Contributor) commented Oct 28, 2024

@rjzamora it looks like this issue has to do with how the pyarrow metadata is being handled at the dask layer when passed down to cudf, yeah?

@rjzamora rjzamora self-assigned this Oct 29, 2024
rjzamora (Member) commented

> @rjzamora it looks like this issue has to do with how the pyarrow metadata is being handled at the dask layer when passed down to cudf, yeah?

Yeah, we are passing down list[pq.FileMetaData, bytes], and cudf is expecting list[bytes, ...]. I can submit a fix.

rapids-bot bot pushed a commit that referenced this issue Oct 30, 2024
Closes #17177

When appending to a parquet dataset with Dask cuDF, the original metadata must be converted from `pq.FileMetaData` to `bytes` before it can be passed down to `cudf.io.merge_parquet_filemetadata`.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #17198
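
For readers who want to see the shape of the conversion described in the commit message, here is a minimal sketch. It is not the code from #17198. It assumes only that the existing metadata object exposes `write_metadata_file(where)`, as `pq.FileMetaData` does in pyarrow, and that entries from newly written parts are already bytes-like. `StandInFileMetaData`, `to_metadata_bytes`, and `normalize_metadata_list` are hypothetical names used for illustration, and the stand-in class replaces pyarrow so the example runs without a GPU or any third-party package.

```python
from io import BytesIO


class StandInFileMetaData:
    """Hypothetical stand-in for pyarrow.parquet.FileMetaData (illustration only)."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def write_metadata_file(self, where):
        # The real pyarrow method serializes the Parquet footer metadata to
        # `where`; this stand-in just writes a fixed payload.
        where.write(self._payload)


def to_metadata_bytes(meta):
    """Return `meta` as bytes, serializing FileMetaData-like objects first."""
    if hasattr(meta, "write_metadata_file"):
        with BytesIO() as buf:
            meta.write_metadata_file(buf)
            return buf.getvalue()
    # Assumed to be bytes-like already (bytes, bytearray, memoryview, ...).
    return bytes(meta)


def normalize_metadata_list(metas):
    """Turn a mixed [FileMetaData, bytes, ...] list into a list of bytes."""
    return [to_metadata_bytes(m) for m in metas]


# Mixed input, mirroring the list[pq.FileMetaData, bytes] case from the thread.
mixed = [StandInFileMetaData(b"existing-footer"), b"new-part-footer"]
normalized = normalize_metadata_list(mixed)
```

After this step every entry is bytes, which is the `list[bytes, ...]` form the thread says `cudf.io.merge_parquet_filemetadata` expects. The duck-typed `hasattr` check is a choice made for this sketch so that it stays free of a pyarrow import; an explicit `isinstance(meta, pq.FileMetaData)` test would be the stricter alternative.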
@github-project-automation github-project-automation bot moved this from Todo to Done in cuDF Python Oct 30, 2024