
Use fsspec.parquet for improved read_parquet performance from remote storage #9589

Merged — 31 commits merged into rapidsai:branch-22.02 from rjzamora:use-fsspec-parquet on Jan 20, 2022

Conversation

rjzamora
Member

@rjzamora rjzamora commented Nov 3, 2021

Important Note: This was marked WIP until the fsspec.parquet module became available in a filesystem_spec release (the fsspec.parquet module is now available).

This PR modifies cudf.read_parquet and dask_cudf.read_parquet to leverage the new fsspec.parquet.open_parquet_file function for optimized data transfer/caching from remote storage. The long-term goal is to remove the temporary data-transfer optimizations that we currently use in cudf.read_parquet.
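For intuition, the win comes from transferring only the byte ranges that hold the requested columns, with nearby ranges merged into a single remote request. The sketch below is a rough pure-Python illustration of that range-coalescing idea, not fsspec's actual implementation; the function name and gap threshold are made up for the example.

```python
def coalesce_ranges(ranges, gap=4096):
    """Merge byte ranges that overlap or sit within `gap` bytes of each
    other, so nearby parquet column chunks can be fetched in one remote
    request. `ranges` is a list of (start, stop) tuples; returns a
    sorted, merged list. Illustration only, not fsspec's code.
    """
    merged = []
    for start, stop in sorted(ranges):
        if merged and start <= merged[-1][1] + gap:
            # Close enough to the previous range: extend it instead of
            # issuing a separate request.
            merged[-1][1] = max(merged[-1][1], stop)
        else:
            merged.append([start, stop])
    return [tuple(r) for r in merged]


# Column chunks for two requested columns, scattered through the file:
chunks = [(0, 100), (150, 300), (10_000, 12_000), (12_500, 13_000)]
print(coalesce_ranges(chunks, gap=512))  # [(0, 300), (10000, 13000)]
```

With a naive reader, each of the four chunks (or the whole file) would be fetched; here they collapse into two contiguous transfers.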

Performance Motivation:

In [1]: import cudf, dask_cudf
   ...: path = [
   ...:     "gs://my-bucket/criteo-parquet/day_0.parquet",
   ...:     "gs://my-bucket/criteo-parquet/day_1.parquet",
   ...: ]

# cudf BEFORE
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 11.1 s, sys: 11.5 s, total: 22.6 s
Wall time: 24.4 s

# cudf AFTER
In [2]: %time df = cudf.read_parquet(path, columns=["I10"], storage_options=…)
CPU times: user 3.48 s, sys: 722 ms, total: 4.2 s
Wall time: 6.32 s

# (Threaded) Dask-cudf BEFORE
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 27.1 s, sys: 15.5 s, total: 42.6 s
Wall time: 57.6 s

# (Threaded) Dask-cudf AFTER
In [2]: %time df = dask_cudf.read_parquet(path, columns=["I10"], storage_options=…).compute()
CPU times: user 3.43 s, sys: 851 ms, total: 4.28 s
Wall time: 13.1 s

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. dask Dask issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Nov 3, 2021
@rjzamora rjzamora self-assigned this Nov 3, 2021
@codecov

codecov bot commented Nov 4, 2021

Codecov Report

Merging #9589 (560bc0c) into branch-22.02 (967a333) will decrease coverage by 0.06%.
The diff coverage is n/a.

❗ Current head 560bc0c differs from pull request most recent head 51ebe83. Consider uploading reports for the commit 51ebe83 to get more accurate results
Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9589      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%     
================================================
  Files               119      119              
  Lines             20305    20604     +299     
================================================
+ Hits               2130     2148      +18     
- Misses            18175    18456     +281     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.66% <0.00%> (-0.25%) ⬇️
python/dask_cudf/dask_cudf/core.py 70.85% <0.00%> (-0.17%) ⬇️
python/cudf/cudf/io/csv.py 0.00% <0.00%> (ø)
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/api/types.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/dtypes.py 0.00% <0.00%> (ø)
... and 40 more


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8fd7dd2...51ebe83.

@rjzamora
Member Author

rjzamora commented Nov 4, 2021

Just a note that this test failure, although triggered by a bug that was "fixed" in acf3d08, is actually an existing bug in branch-21.12. When reading from remote storage with use_python_file_object=True, it seems that cudf fails to interpret RangeIndex information from the "pandas_metadata". This is because cudf outsources this logic to pyarrow, and the file source it tries to pass to pyarrow will already have been converted to a cudf._lib.io.datasource.NativeFileDatasource.

I will raise a separate issue about this and try to fix it before merging this PR (since taking advantage of the new fsspec.parquet optimization requires us to start using use_python_file_object=True by default).
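For context, the RangeIndex information in question lives in the parquet file's "pandas" key-value metadata as JSON. The sketch below is only an illustration of that metadata format; the helper name is hypothetical, and real readers (pyarrow, cudf) handle many more cases, such as serialized index columns and multi-indexes.

```python
import json

def range_index_from_pandas_metadata(pandas_metadata_json):
    """Extract RangeIndex parameters from parquet "pandas" metadata.

    Illustrative helper only: returns a Python range for a
    RangeIndex entry, or None if no such entry exists.
    """
    meta = json.loads(pandas_metadata_json)
    for idx in meta.get("index_columns", []):
        # A RangeIndex is stored inline as a dict; a materialized
        # index column would appear as a column-name string instead.
        if isinstance(idx, dict) and idx.get("kind") == "range":
            return range(idx["start"], idx["stop"], idx["step"])
    return None

# Minimal example of the metadata pandas/pyarrow write for a RangeIndex:
meta = json.dumps({
    "index_columns": [
        {"kind": "range", "name": None, "start": 0, "stop": 6, "step": 2}
    ]
})
print(range_index_from_pandas_metadata(meta))  # range(0, 6, 2)
```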

@rjzamora
Member Author

rjzamora commented Nov 4, 2021

Update: Copied the simple pyarrow-metadata fix into a stand-alone branch and submitted #9608

rapids-bot bot pushed a commit that referenced this pull request Nov 11, 2021
This fixes a `read_parquet` bug discovered while iterating on #9589

Without this fix, the optimized `read_parquet` code path will fail when the pandas metadata includes index-column information. It may also fail when the data includes list or struct columns (depending on the engine that wrote the parquet file).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - https://github.com/brandon-b-miller

URL: #9638
@rjzamora rjzamora changed the base branch from branch-21.12 to branch-22.02 November 29, 2021 20:11
@rjzamora rjzamora requested a review from a team as a code owner January 14, 2022 15:32
@rjzamora
Member Author

I'd like this to get in for 22.02, if possible (cc @quasiben)

@brandon-b-miller
Contributor

Looking this over today 👍

Contributor

@brandon-b-miller brandon-b-miller left a comment


couple of q's, looking great though.

@galipremsagar
Contributor

rerun tests

Contributor

@brandon-b-miller brandon-b-miller left a comment


Spoke to @rjzamora for a bit yesterday and got some context for this PR. This is a really worthwhile optimization (with great benchmarks). While there's a lot going on here, many of the changes actually delegate logic out of cuDF and into fsspec, so this should be pretty safe to merge for this release.

f"This version of fsspec ({fsspec.__version__}) does "
f"not support parquet-optimized precaching. Please upgrade "
f"to the latest fsspec version for better performance."
)
Contributor


are there any plans around making this a requirement at some point?

Member Author


Right - I was thinking about this for a while. The user will need fsspec>=2021.11.1 to benefit from the optimizations in this PR. However, if they are not reading parquet files from remote storage, then an older version should be fine. Therefore, I was hesitant to suggest any official version pinning.
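A soft gate like the warning quoted above avoids pinning while still steering users toward the fast path. The sketch below is a hypothetical version check, not the PR's actual code; production code would prefer packaging.version over naive string splitting.

```python
def supports_fsspec_parquet(version_string):
    """Return True if this fsspec version is new enough to ship a
    usable fsspec.parquet module (2021.11.1 or newer, per the
    discussion above). Illustrative sketch only: assumes a plain
    CalVer string like "2021.11.1" with no pre/post-release suffix.
    """
    parts = tuple(int(p) for p in version_string.split(".")[:3])
    return parts >= (2021, 11, 1)


# With an older fsspec, a reader would fall back to the unoptimized
# code path and emit the precaching warning instead of failing.
print(supports_fsspec_parquet("2021.11.1"))  # True
print(supports_fsspec_parquet("2021.10.0"))  # False
```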

@quasiben
Member

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 09035d6 into rapidsai:branch-22.02 Jan 20, 2022
@rjzamora rjzamora deleted the use-fsspec-parquet branch January 20, 2022 21:16