
Add new dask_cudf.read_parquet API #17250

Merged
merged 28 commits into from
Nov 20, 2024

Conversation

rjzamora
Member

@rjzamora rjzamora commented Nov 5, 2024

Description

It's time to clean up the dask_cudf.read_parquet API and prioritize GPU-specific optimizations. To this end, it makes sense to expose our own read_parquet API within Dask cuDF.

Notes:

  • The "new" dask_cudf.read_parquet API is only relevant when query-planning is enabled (the default).
  • Passing filesystem="arrow" now uses cudf.read_parquet when reading from local storage (rather than PyArrow).
  • (specific to Dask cuDF): The default blocksize argument is now derived from the "smallest" NVIDIA device detected within the active dask cluster (or the first device visible to the client). More specifically, we use pynvml to find this representative device size, and we set blocksize to 1/32 of that size.
    • The user may also pass in something like blocksize=0.125 to use 1/8 the minimum device size (or blocksize='1GiB' to bypass the default logic altogether).
  • (specific to Dask cuDF): When blocksize is None, we disable partition fusion at optimization time.
  • (specific to Dask cuDF): When blocksize is not None, we use the parquet metadata from the first few files to inform partition fusion at optimization time (instead of a rough column-count ratio).
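The blocksize rules described above can be sketched as follows. This is an illustrative helper, not the actual dask_cudf implementation (the function name is hypothetical, and the real code queries pynvml for the smallest visible device's memory rather than taking it as an argument):

```python
def resolve_blocksize(blocksize, min_device_memory):
    """Illustrative sketch of the blocksize rules described above.

    - None disables partition fusion downstream (returned unchanged).
    - A float like 0.125 is a fraction of the smallest device's memory.
    - An explicit size (e.g. "1GiB") bypasses the device-based default.
    """
    if blocksize is None:
        return None
    if isinstance(blocksize, float):
        return int(min_device_memory * blocksize)
    return blocksize


# With a 40 GiB device, the default 1/32 fraction yields 1.25 GiB.
min_device_memory = 40 * 1024**3
print(resolve_blocksize(1 / 32, min_device_memory))  # 1342177280
print(resolve_blocksize(0.125, min_device_memory))   # 5368709120
print(resolve_blocksize("1GiB", min_device_memory))  # '1GiB'
```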

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

@rjzamora rjzamora added the 2 - In Progress Currently a work in progress label Nov 5, 2024
@rjzamora rjzamora self-assigned this Nov 5, 2024
@github-actions github-actions bot added the Python Affects Python cuDF API. label Nov 5, 2024
@rjzamora rjzamora added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 6, 2024
@rjzamora rjzamora changed the title [WIP] Add new dask_cudf.read_parquet API Add new dask_cudf.read_parquet API Nov 12, 2024
@rjzamora rjzamora marked this pull request as ready for review November 13, 2024 19:51
@rjzamora rjzamora requested review from a team as code owners November 13, 2024 19:51
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Nov 13, 2024
Comment on lines +371 to +378
elif (
filters is None
and isinstance(dataset_kwargs, dict)
and dataset_kwargs.get("partitioning") is None
):
# Skip dataset processing if we have no filters
# or hive/directory partitioning to deal with.
return paths, row_groups, [], {}
Member Author

The pyarrow.dataset logic below has non-negligible overhead on remote storage. This code block allows us to pass dataset_kwargs={"partitioning": None} to skip unnecessary PyArrow processing when we know we are not reading hive-partitioned data (and we aren't applying filters).

By default (when we don't pass in dataset_kwargs={"partitioning": None}), cudf will still pre-process the dataset with PyArrow just in case it is hive partitioned.
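A usage sketch of that fast path (the read_parquet call itself is left commented out, since it assumes a dask_cudf install and real data; the path is a placeholder):

```python
# Options that hit the skip path above: no filters are applied, and
# dataset_kwargs explicitly declares there is no hive/directory
# partitioning to resolve.
read_kwargs = {
    "filesystem": "arrow",                     # cudf.read_parquet on local paths
    "dataset_kwargs": {"partitioning": None},  # skip pyarrow.dataset probing
}

# Hypothetical call (placeholder path):
# df = dask_cudf.read_parquet("s3://bucket/dataset/*.parquet", **read_kwargs)
```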

Contributor
@galipremsagar galipremsagar Nov 18, 2024

Thanks for the context @rjzamora ! Do you happen to have the rough overhead numbers/magnitude?

Member Author
@rjzamora rjzamora Nov 18, 2024

[screenshot: profile-with-dataset-overhead]

The overhead is measurable but not huge (~4-5%) for the specific case I was benchmarking.

Member
@madsbk madsbk left a comment

Nice work @rjzamora, looks good to me

python/dask_cudf/dask_cudf/io/parquet.py (outdated review thread, resolved)
Contributor
@wence- wence- left a comment

Some small suggestions

python/dask_cudf/dask_cudf/io/parquet.py (two outdated review threads, resolved)
**to_pandas_kwargs,
)
class CudfReadParquetFSSpec(ReadParquetFSSpec):
    _STATS_CACHE: MutableMapping[str, Any] = {}
Contributor

nit: Should this be a cache of bounded size (via lru_cache)?

Member Author

I don't think I'm very worried about this cache getting too large. We're just storing a dict with two elements (schema name and storage size) for each column.

Contributor
@galipremsagar galipremsagar left a comment

cuDF Python approval.

@rjzamora rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels Nov 19, 2024
@rjzamora
Member Author

/merge

@rjzamora
Member Author

@GregoryKimball @vyasr - FYI: I'd really like this to be included in 24.12 (it makes it much easier to use/demonstrate the recent KvikIO + S3 improvements).

@rapids-bot rapids-bot bot merged commit 3111aa4 into rapidsai:branch-24.12 Nov 20, 2024
108 checks passed
@rjzamora rjzamora deleted the new-read-parquet-api branch November 20, 2024 14:14
@vyasr
Contributor

vyasr commented Nov 20, 2024

All good! I had this on the board slated for this release.

rapids-bot bot pushed a commit that referenced this pull request Nov 21, 2024
#17250 started using `pynvml` but did not add the proper dependency; this change adds the missing dependency.

Authors:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - https://github.com/jakirkham

Approvers:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - https://github.com/jakirkham

URL: #17386