
[BUG] Null types are not promoted in cudf.io.merge_parquet_filemetadata #14326

Closed
rjzamora opened this issue Oct 25, 2023 · 0 comments · Fixed by #14322
Labels
bug Something isn't working libcudf Affects libcudf (C++/CUDA) code.

@rjzamora (Member)

Describe the bug
There is a minor bug in cudf.io.merge_parquet_filemetadata after #14264 was merged. Some details are discussed in that PR, e.g. #14264 (comment)

Steps/Code to reproduce bug

import pyarrow.dataset as ds
import cudf
import fsspec
from dask_cudf.io.parquet import CudfEngine
import pandas as pd

# Write two files: one with an all-null column, one with an integer column
paths = ["/tmp/part.0.parquet", "/tmp/part.1.parquet"]
pd.DataFrame({"a": [None] * 10}).to_parquet(paths[0])
pd.DataFrame({"a": range(10)}).to_parquet(paths[1])

# Collect the footer metadata, merge it, and write a _metadata file
md1 = CudfEngine.collect_file_metadata(paths[0], fsspec, "none")
md2 = CudfEngine.collect_file_metadata(paths[1], fsspec, "none")
md = cudf.io.merge_parquet_filemetadata([md1, md2])
with open("/tmp/_metadata", "wb") as fil:
    fil.write(memoryview(md))

# The merged schema should promote "a" to int64, not leave it as null
schema = ds.parquet_dataset("/tmp/_metadata").schema
assert schema.field(0).type != "null"
---------------------------------------------------------------------------
AssertionError                            Traceback (most recent call last)
Cell In[3], line 15
     13     fil.write(memoryview(md))
     14 schema = ds.parquet_dataset("/tmp/_metadata").schema
---> 15 assert schema.field(0).type != "null"

AssertionError: 

Expected behavior
Null column metadata should be "promoted" when it is merged with integer metadata for the same column.
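
For illustration, the expected promotion matches pyarrow's default schema-unification rules, under which a null-typed field merged with an int64 field resolves to int64. The sketch below is a minimal illustration of the expected semantics, not cudf's implementation:

import pyarrow as pa

# A null-typed field unified with an int64 field should be promoted to
# int64 (assumes pyarrow's default merge rules, which permit null promotion)
null_schema = pa.schema([pa.field("a", pa.null())])
int_schema = pa.schema([pa.field("a", pa.int64())])
merged = pa.unify_schemas([null_schema, int_schema])
assert merged.field("a").type == pa.int64()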

Additional context
This bug is unlikely to affect many users, and can be treated as a low priority if the fix proves difficult.

@rjzamora rjzamora added bug Something isn't working Needs Triage Need team to review and classify labels Oct 25, 2023
rapids-bot bot pushed a commit that referenced this issue Oct 25, 2023
We are not currently running any IO tests for `dask_cudf` in CI. This PR corrects that. It also modifies a test that *would* be failing due to #14326

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)
  - GALI PREM SAGAR (https://github.com/galipremsagar)

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Ray Douglass (https://github.com/raydouglass)

URL: #14327
@galipremsagar galipremsagar added libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Oct 27, 2023
rapids-bot bot pushed a commit that referenced this issue Oct 31, 2023
Closes #14315
Closes #14326

The Parquet writer writes time and timestamp types with a logical type whose `isAdjustedToUTC` is `false`. However, timestamps in libcudf tables are implicitly in UTC and don't need to be adjusted.
This PR changes `isAdjustedToUTC` to `true`.

It also adds a writer option to write timestamps as local, since this is the expected behavior on the Python side.

It also changes the way the logical type is handled for UNKNOWN-type columns in `merge_row_group_metadata`: the logical type is excluded from the merged metadata because of issues with type inference.
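
As a quick way to observe the effect of `isAdjustedToUTC`, the written file's schema can be inspected with pyarrow, which surfaces `isAdjustedToUTC=true` as a UTC-aware timestamp type. A minimal sketch, using a pandas-written file and a hypothetical path and column name purely for illustration:

import pandas as pd
import pyarrow.parquet as pq

# pyarrow reads a column whose Parquet logical type has
# isAdjustedToUTC=true back as timestamp[..., tz=UTC], and one with
# isAdjustedToUTC=false as a tz-naive timestamp.
pd.DataFrame({"t": pd.to_datetime(["2023-10-31"])}).to_parquet("/tmp/ts.parquet")
schema = pq.read_schema("/tmp/ts.parquet")
print(schema.field("t").type)  # tz-naive here, i.e. isAdjustedToUTC=false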

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - David Wendt (https://github.com/davidwendt)
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Mike Wilson (https://github.com/hyperbolic2346)

URL: #14322