Pre-emptive fix for upstream dask.dataframe.read_parquet changes (#12983)

Once dask/dask#10007 is merged, users will be able to pass a dictionary of hive-partitioning options to `dd.read_parquet` (using the `dataset=` kwarg). This new feature provides a workaround for the fact that `pyarrow.dataset.Partitioning` objects **cannot** be serialized in Python. In order for this feature to be supported in `dask_cudf`, the `CudfEngine.read_partition` method must account for the case that `partitioning` is a `dict`.

**NOTE**:
It is not possible to add test coverage for this change until dask#10007 is merged. However, I don't see any good reason not to merge this PR **before** dask#10007.

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)

URL: #12983
rjzamora authored Mar 23, 2023
1 parent 2818d45 commit 5cdb9d9
Showing 1 changed file with 4 additions and 0 deletions.
python/dask_cudf/dask_cudf/io/parquet.py (4 additions, 0 deletions):

```diff
@@ -121,6 +121,8 @@ def _read_paths(
                     if row_groups
                     else None,
                     strings_to_categorical=strings_to_categorical,
+                    dataset_kwargs=dataset_kwargs,
+                    categorical_partitions=False,
                     **kwargs,
                 )
                 for i, pof in enumerate(paths_or_fobs)
@@ -191,6 +193,8 @@ def read_partition(

         dataset_kwargs = kwargs.get("dataset", {})
         partitioning = partitioning or dataset_kwargs.get("partitioning", None)
+        if isinstance(partitioning, dict):
+            partitioning = pa_ds.partitioning(**partitioning)

         # Check if we are actually selecting any columns
         read_columns = columns
```
