Fix struct and list dtype logic in dask_cudf.read_parquet #9159

rjzamora · 2021-09-01T17:14:01Z

The dask_cudf.read_parquet logic currently uses a utility called set_object_dtypes_from_pa_schema to cast dtypes into proper cudf/pyarrow dtypes. The same utility is used both (1) to "reset" pandas.DataFrame metadata to agree with the pyarrow schema, and (2) to cast cudf.DataFrame partitions to the "correct" dtypes at IO time. In both cases, the pd/cudf.DataFrame columns cannot be cast to cudf list and/or struct columns. Therefore, this PR adds a simple check to avoid casting for list/struct dtypes. We also add a new test (test_cudf_dtypes_from_pandas) to cover the original bug.
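For readers unfamiliar with the pattern, the "skip nested dtypes" check described above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: `Field`, `is_nested`, and `columns_to_cast` are hypothetical names, and the schema is modeled with plain Python stand-ins instead of real pyarrow/cudf objects (real code would more likely rely on pyarrow's own type predicates such as `pyarrow.types.is_list` and `pyarrow.types.is_struct`).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """Hypothetical stand-in for a pyarrow schema field."""

    name: str
    type_name: str  # e.g. "int64", "string", "list<int64>", "struct<x: int64>"


# Assumption for this sketch: nested types are recognizable by their prefix.
NESTED_PREFIXES = ("list", "struct")


def is_nested(type_name: str) -> bool:
    """Return True for list/struct types, which must not be cast."""
    return type_name.startswith(NESTED_PREFIXES)


def columns_to_cast(object_columns, schema):
    """Map each object-dtype column to its schema type, skipping nested types.

    Columns that are missing from the schema, or whose schema type is a
    list or struct, are left out so the caller never attempts the cast.
    """
    by_name = {field.name: field for field in schema}
    targets = {}
    for col in object_columns:
        field = by_name.get(col)
        if field is None or is_nested(field.type_name):
            continue  # leave the column's dtype untouched
        targets[col] = field.type_name
    return targets


schema = [
    Field("a", "int64"),
    Field("b", "string"),
    Field("c", "list<int64>"),
    Field("d", "struct<x: int64>"),
]
# Only "b" is returned: "c" and "d" are nested, and "missing" is not in the schema.
safe_casts = columns_to_cast(["b", "c", "d", "missing"], schema)
```

The point of the guard is that the same helper can serve both the metadata "reset" and the partition cast without ever requesting a list or struct cast.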

codecov · 2021-09-01T18:41:25Z

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.10@1935a8a).
The diff coverage is n/a.

❗ Current head e5096b6 differs from pull request most recent head 0b60f21. Consider uploading reports for the commit 0b60f21 to get more accurate results.

@@               Coverage Diff               @@
##             branch-21.10    #9159   +/-   ##
===============================================
  Coverage                ?   10.85%           
===============================================
  Files                   ?      115           
  Lines                   ?    18742           
  Branches                ?        0           
===============================================
  Hits                    ?     2034           
  Misses                  ?    16708           
  Partials                ?        0


rjzamora · 2021-09-14T15:40:14Z

Closing in favor of #9203

A similar fix for this problem was recently submitted in #9159 and closed in favor of #9203. It seems that the test added in the latter PR was not actually capturing the original problem. However, after [dask#8072](dask/dask#8072) is merged, the new test will certainly start failing.

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #9314

rjzamora added 2 commits September 1, 2021 10:00

fix set_object_dtypes_from_pa_schema logic

2e227c4

comment tweak

0b60f21

rjzamora added the bug, 2 - In Progress, dask, and non-breaking labels Sep 1, 2021

rjzamora self-assigned this Sep 1, 2021

rjzamora requested a review from a team as a code owner September 1, 2021 17:14

github-actions bot added the Python label Sep 1, 2021

rjzamora added the 3 - Ready for Review label and removed the 2 - In Progress label Sep 1, 2021

rjzamora mentioned this pull request Sep 14, 2021

[REVIEW] Misc optimizations in cudf #9203

Merged

galipremsagar added a commit to galipremsagar/cudf that referenced this pull request Sep 14, 2021

add tests from rapidsai#9159

f11a180

rjzamora closed this Sep 14, 2021

rjzamora mentioned this pull request Sep 27, 2021

Avoid casting to list or struct dtypes in dask_cudf.read_parquet #9314

Merged
