Add new dask_cudf.read_parquet API #17250
Conversation
elif (
    filters is None
    and isinstance(dataset_kwargs, dict)
    and dataset_kwargs.get("partitioning") is None
):
    # Skip dataset processing if we have no filters
    # or hive/directory partitioning to deal with.
    return paths, row_groups, [], {}
The `pyarrow.dataset` logic below has non-negligible overhead on remote storage. This code block allows us to pass in `dataset_kwargs={"partitioning": None}` to skip unnecessary PyArrow processing when we know we are not reading from hive-partitioned data (and we aren't applying filters). By default (when we don't pass in `dataset_kwargs={"partitioning": None}`), cudf will still pre-process the dataset with PyArrow just in case it is hive partitioned.
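For illustration, here is a minimal usage sketch of the shortcut described above. It assumes `dataset_kwargs` is forwarded down to this code path from the public `cudf.read_parquet` call, and the S3 path is a placeholder:

```python
import cudf

# Hypothetical flat (non-hive-partitioned) dataset with no filters applied;
# passing partitioning=None opts out of the pyarrow.dataset pre-processing.
df = cudf.read_parquet(
    "s3://my-bucket/flat-files/",  # placeholder path
    dataset_kwargs={"partitioning": None},
)
```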
Thanks for the context @rjzamora! Do you happen to have rough numbers for the overhead, or at least its order of magnitude?
Nice work @rjzamora, looks good to me
Some small suggestions
    **to_pandas_kwargs,
)
class CudfReadParquetFSSpec(ReadParquetFSSpec):
    _STATS_CACHE: MutableMapping[str, Any] = {}
nit: Should this be a cache of bounded size (e.g. via `lru_cache`)?
I don't think I'm very worried about this cache getting too large. We're just storing a dict with two elements (schema name and storage size) for each column.
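For what it's worth, a bounded variant would only be a small change. The sketch below illustrates the idea only; the helper name and the exact statistics collected are hypothetical, not the PR's implementation:

```python
from functools import lru_cache

import pyarrow.parquet as pq


@lru_cache(maxsize=1024)
def _column_storage_sizes(path: str) -> dict:
    # Cache one small dict (column name -> total compressed bytes) per file.
    md = pq.ParquetFile(path).metadata
    sizes: dict = {}
    for rg in range(md.num_row_groups):
        row_group = md.row_group(rg)
        for c in range(md.num_columns):
            col = row_group.column(c)
            sizes[col.path_in_schema] = (
                sizes.get(col.path_in_schema, 0) + col.total_compressed_size
            )
    return sizes
```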
cudf python approval.
/merge
@GregoryKimball @vyasr - FYI: I'd really like this to be included in 24.12 (it makes it much easier to use/demonstrate the recent KvikIO + S3 improvements).
All good! I had this on the board slated for this release.
#17250 started using `pynvml` but did not add the proper dependency; this change fixes the missing dependency.

Authors:
- Peter Andreas Entschev (https://github.com/pentschev)
- https://github.com/jakirkham

Approvers:
- GALI PREM SAGAR (https://github.com/galipremsagar)
- https://github.com/jakirkham

URL: #17386
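For context, the `pynvml` usage in question amounts to querying device memory totals so a representative device size can be chosen. This is a hedged sketch of that kind of query, not the code from the PR (which selects the device from the active Dask cluster):

```python
import pynvml

pynvml.nvmlInit()
try:
    # Total memory of every visible NVIDIA device, in bytes.
    totals = [
        pynvml.nvmlDeviceGetMemoryInfo(
            pynvml.nvmlDeviceGetHandleByIndex(i)
        ).total
        for i in range(pynvml.nvmlDeviceGetCount())
    ]
finally:
    pynvml.nvmlShutdown()

# The default blocksize described below is 1/32 of the smallest device size.
default_blocksize = min(totals) // 32
```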
Description

It's time to clean up the `dask_cudf.read_parquet` API and prioritize GPU-specific optimizations. To this end, it makes sense to expose our own `read_parquet` API within Dask cuDF.

Notes:

- The new `dask_cudf.read_parquet` API is only relevant when query-planning is enabled (the default).
- `filesystem="arrow"` now uses `cudf.read_parquet` when reading from local storage (rather than PyArrow).
- The default `blocksize` argument is now specific to the "smallest" NVIDIA device detected within the active dask cluster (or the first device visible to the client). More specifically, we use `pynvml` to find this representative device size, and we set `blocksize` to be 1/32 of that size.
- Users can pass `blocksize=0.125` to use 1/8 the minimum device size (or `blocksize='1GiB'` to bypass the default logic altogether).
- When `blocksize` is `None`, we disable partition fusion at optimization time.
- When `blocksize` is not `None`, we use the parquet metadata from the first few files to inform partition fusion at optimization time (instead of a rough column-count ratio). A usage sketch follows this list.
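To make the notes above concrete, here is a hedged usage sketch; the dataset path is a placeholder, and the comments simply restate the behavior described above rather than quoting the merged implementation:

```python
import dask_cudf

ddf = dask_cudf.read_parquet(
    "s3://my-bucket/dataset/",  # placeholder path
    filesystem="arrow",         # Arrow filesystem; local reads still use cudf.read_parquet
    blocksize=0.125,            # fraction of the smallest visible device's memory (default is 1/32)
)

# blocksize=None disables partition fusion at optimization time, while an
# explicit byte size such as blocksize="1GiB" bypasses the device-based default.
```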
Checklist