Fix struct and list dtype logic in dask_cudf.read_parquet #9159
The dask_cudf.read_parquet logic currently uses a utility called set_object_dtypes_from_pa_schema to cast dtypes into proper cudf/pyarrow dtypes. The same utility is used both (1) to "reset" pandas.DataFrame metadata to agree with the pyarrow schema, and (2) to cast cudf.DataFrame partitions to the "correct" dtypes at IO time. In both cases, the pd/cudf.DataFrame columns cannot be cast to cudf list and/or struct columns. Therefore, this PR adds a simple check that skips casting for list/struct dtypes. We also add a new test (test_cudf_dtypes_from_pandas) to cover the original bug.