
Use fsspec.parquet for improved read_parquet performance from remote storage #9589

Merged — 31 commits merged into rapidsai:branch-22.02 from rjzamora:use-fsspec-parquet on Jan 20, 2022

Conversation

rjzamora
Member

@rjzamora rjzamora commented Nov 3, 2021

Important Note: This was marked WIP until the fsspec.parquet module became available in a filesystem_spec release (the fsspec.parquet module is now available).

This PR modifies cudf.read_parquet and dask_cudf.read_parquet to leverage the new fsspec.parquet.open_parquet_file function for optimized data transfer/caching from remote storage. The long-term goal is to remove the temporary data-transfer optimizations that we currently use in cudf.read_parquet.
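For intuition, the win comes from transferring only the byte ranges that hold the requested columns, with nearby ranges merged into a single remote request. The sketch below is a rough pure-Python illustration of that range-coalescing idea, not fsspec's actual implementation; the function name and gap threshold are made up for the example.

```python
def coalesce_ranges(ranges, gap=4096):
    """Merge byte ranges that overlap or sit within `gap` bytes of each
    other, so nearby parquet column chunks can be fetched in one remote
    request. `ranges` is a list of (start, stop) tuples; returns a
    sorted, merged list. Illustration only, not fsspec's code.
    """
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1] + gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(r) for r in merged]


# Column chunks for two requested columns, scattered through the file:
chunks = [(0, 100), (150, 300), (10_000, 12_000), (12_500, 13_000)]
print(coalesce_ranges(chunks, gap=512))  # [(0, 300), (10000, 13000)]
```

With a naive reader, each of the four chunks (or the whole file) would be fetched; here they collapse into two contiguous transfers.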

Performance Motivation:

In [1]: import cudf, dask_cudf
   ...: path = [
   ...:     "gs://my-bucket/criteo-parquet/day_0.parquet",
   ...:     "gs://my-bucket/criteo-parquet/day_1.parquet",
   ...: ]

# cudf BEFORE
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 11.1 s, sys: 11.5 s, total: 22.6 s
Wall time: 24.4 s

# cudf AFTER
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 3.48 s, sys: 722 ms, total: 4.2 s
Wall time: 6.32 s

# (Threaded) Dask-cudf BEFORE
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 27.1 s, sys: 15.5 s, total: 42.6 s
Wall time: 57.6 s

# (Threaded) Dask-cudf AFTER
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 3.43 s, sys: 851 ms, total: 4.28 s
Wall time: 13.1 s

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. dask Dask issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 3, 2021
@rjzamora rjzamora self-assigned this Nov 3, 2021
@codecov

codecov bot commented Nov 4, 2021

Codecov Report

Merging #9589 (560bc0c) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is n/a.

❗ Current head 560bc0c differs from pull request most recent head 51ebe83. Consider uploading reports for the commit 51ebe83 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9589      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20604     +299     
================================================
+ Hits               2130     2148      +18     
- Misses            18175    18456     +281     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.66% <0.00%> (-0.25%) ⬇️
python/dask_cudf/dask_cudf/core.py 70.85% <0.00%> (-0.17%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dtypes.py 0.00% <0.00%> (ø)
... and 40 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8fd7dd2...51ebe83.

@rjzamora
Member Author

rjzamora commented Nov 4, 2021

Just a note that this test failure, although triggered by a bug that was "fixed" in acf3d08, is actually an existing bug in branch-21.12. When reading from remote storage with use_python_file_object=True, it seems that cudf fails to interpret RangeIndex information from the "pandas_metadata". This is because cudf outsources this logic to pyarrow, and the file source it tries to pass to pyarrow will already have been converted to a cudf._lib.io.datasource.NativeFileDatasource.

I will raise a separate issue about this and try to fix it before merging this PR (since taking advantage of the new fsspec.parquet optimization requires us to start using use_python_file_object=True by default).
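For context, the RangeIndex information in question lives in the parquet file's "pandas" key-value metadata as JSON. The sketch below is only an illustration of that metadata format; the helper name is hypothetical, and real readers (pyarrow, cudf) handle many more cases, such as serialized index columns and multi-indexes.

```python
import json

def range_index_from_pandas_metadata(pandas_metadata_json):
    """Extract RangeIndex parameters from parquet "pandas" metadata.

    Illustrative helper only: returns a Python range for a
    RangeIndex entry, or None if no such entry exists.
    """
    meta = json.loads(pandas_metadata_json)
    for idx in meta.get("index_columns", []):
        # A RangeIndex is stored inline as a dict; a materialized
        # index column would appear as a column-name string instead.
        if isinstance(idx, dict) and idx.get("kind") == "range":
            return range(idx["start"], idx["stop"], idx["step"])
    return None

# Minimal example of the metadata pandas/pyarrow write for a RangeIndex:
meta = json.dumps({
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 6, "step": 2}
    ]
})
print(range_index_from_pandas_metadata(meta))  # range(0, 6, 2)
```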

@rjzamora
Member Author

rjzamora commented Nov 4, 2021

Update: Copied the simple pyarrow-metadata fix into a stand-alone branch and submitted #9608

rapids-bot bot pushed a commit that referenced this pull request Nov 11, 2021
This fixes a `read_parquet` bug discovered while iterating on #9589

Without this fix, the optimized `read_parquet` code path will fail when the pandas metadata includes index-column information. It may also fail when the data includes list or struct columns (depending on the engine that wrote the parquet file).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - https://github.com/brandon-b-miller

URL: #9638
@rjzamora rjzamora changed the base branch from branch-21.12 to branch-22.02 November 29, 2021 20:11
@rjzamora rjzamora requested a review from a team as a code owner January 14, 2022 15:32
@rjzamora
Member Author

I'd like this to get in for 22.02, if possible (cc @quasiben)

@brandon-b-miller
Contributor

Looking this over today 👍

Contributor

@brandon-b-miller brandon-b-miller left a comment


couple of q's, looking great though.

@galipremsagar
Contributor

rerun tests

Contributor

@brandon-b-miller brandon-b-miller left a comment


Spoke to @rjzamora for a bit yesterday and got some context for this PR. This is a really worthwhile optimization (with great benchmarks). While there's a lot going on here, many of the changes actually delegate logic out of cuDF and into fsspec, so this should be pretty safe to merge for this release.

f"This version of fsspec ({fsspec.__version__}) does "
f"not support parquet-optimized precaching. Please upgrade "
f"to the latest fsspec version for better performance."
)
Contributor


are there any plans around making this a requirement at some point?

Member Author


Right - I was thinking about this for a while. The user will need fsspec>=2021.11.1 to benefit from the optimizations in this PR. However, if they are not reading parquet files from remote storage, then an older version should be fine. Therefore, I was hesitant to suggest any official version pinning.
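A soft gate like the warning quoted above avoids pinning while still steering users toward the fast path. The sketch below is a hypothetical version check, not the PR's actual code; production code would prefer packaging.version over naive string splitting.

```python
def supports_fsspec_parquet(version_string):
    """Return True if this fsspec version is new enough to ship a
    usable fsspec.parquet module (2021.11.1 or newer, per the
    discussion above). Illustrative sketch only: assumes a plain
    CalVer string like "2021.11.1" with no pre/post-release suffix.
    """
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    return parts >= (2021, 11, 1)


# With an older fsspec, a reader would fall back to the unoptimized
# code path and emit the precaching warning instead of failing.
print(supports_fsspec_parquet("2021.11.1"))  # True
print(supports_fsspec_parquet("2021.10.0"))  # False
```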

@quasiben
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 09035d6 into rapidsai:branch-22.02 Jan 20, 2022
@rjzamora rjzamora deleted the use-fsspec-parquet branch January 20, 2022 21:16