Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983)

Once dask/dask#10007 is merged, users will be able to pass a dictionary of hive-partitioning options to `dd.read_parquet` (using the `dataset=` kwarg). This new feature provides a workaround for the fact that `pyarrow.dataset.Partitioning` objects **cannot** be serialized in Python. In order for this feature to be supported in `dask_cudf`, the `CudfEngine.read_partition` method must account for the case that `partitioning` is a `dict`.

**NOTE**:
It is not possible to add test coverage for this change until dask#10007 is merged. However, I don't see any good reason not to merge this PR **before** dask#10007.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #12983
rjzamora authored Mar 23, 2023
1 parent 2818d45 commit 5cdb9d9
Showing 1 changed file with 4 additions and 0 deletions.
python/dask_cudf/dask_cudf/io/parquet.py (4 additions, 0 deletions):

```diff
@@ -121,6 +121,8 @@ def _read_paths(
                     if row_groups
                     else None,
                     strings_to_categorical=strings_to_categorical,
+                    dataset_kwargs=dataset_kwargs,
+                    categorical_partitions=False,
                     **kwargs,
                 )
                 for i, pof in enumerate(paths_or_fobs)
@@ -191,6 +193,8 @@ def read_partition(

         dataset_kwargs = kwargs.get("dataset", {})
         partitioning = partitioning or dataset_kwargs.get("partitioning", None)
+        if isinstance(partitioning, dict):
+            partitioning = pa_ds.partitioning(**partitioning)

         # Check if we are actually selecting any columns
         read_columns = columns
```
