Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf #9304

rjzamora · 2021-09-24T18:30:29Z

This PR implements a simple but critical subset of the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

What this PR DOES do:

Enables users to pass Arrow-based file objects directly to the cudf read_parquet and read_csv functions. For example:

import cudf
import pyarrow.fs as pa_fs

fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
with fs.open_input_file(path) as fil:
    gdf = cudf.read_parquet(fil)

Adds automatic conversion of fsspec AbstractBufferedFile objects into Arrow-backed PythonFile objects. For read_parquet, an Arrow-backed PythonFile object can be used (in place of an optimized fsspec transfer) by passing use_python_file_object=True:

import cudf

gdf = cudf.read_parquet(path, use_python_file_object=True)

or

import cudf
from fsspec.core import get_fs_token_paths

fs = get_fs_token_paths(path)[0]
with fs.open(path, mode="rb") as fil:
    gdf = cudf.read_parquet(fil, use_python_file_object=True)

What this PR does NOT do:

cudf will not automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/read_parquet if they do not want any Python overhead.
cudf will not accept NativeFile input for IO functions other than read_csv and read_parquet.
dask-cudf does not yet have a mechanism to open/process S3 files as "direct" NativeFile objects. These changes only apply to direct cudf usage.

Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

vyasr

C++ approval

cpp/include/cudf/io/datasource.hpp

jrhemstad

Default constructor is not safe.

python/cudf/cudf/_lib/io/datasource.pxd

python/cudf/cudf/_lib/io/datasource.pyx

cpp/include/cudf/io/datasource.hpp

rjzamora · 2021-10-04T21:38:31Z

@gpucibot merge

This is a simple follow-up to #9304 and #9265 meant to achieve the following:

- After this PR, the default behavior of `cudf.read_csv` will be to convert fsspec-based `AbstractBufferedFile` objects to Arrow `PythonFile` objects for non-local file systems. Since `PythonFile` objects inherit from `NativeFile` objects, libcudf can seek/read distinct byte ranges without requiring the entire file to be read into host memory (i.e. the default behavior enables proper partial IO from remote storage).
- #9265 recently added an fsspec-based optimization for transferring csv byte ranges into local memory. That optimization already allowed us to avoid a full file transfer when a specific `byte_range` is specified to the `cudf.read_csv` call. However, the simpler approach introduced in this PR is (1) more general, (2) easier to maintain, and (3) demonstrates comparable performance. Therefore, this PR also rolls back one of the less-maintainable optimizations added in #9265 (local buffer clipping).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- https://github.com/brandon-b-miller

URL: #9376
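To illustrate the partial-IO point in the #9376 description, here is a minimal stdlib-only sketch of why a seekable file object avoids a full transfer. It does not use cudf, Arrow, or fsspec: `CountingFile` and `read_byte_range` are hypothetical names, and an in-memory buffer stands in for remote storage.

```python
import io


class CountingFile(io.BytesIO):
    """In-memory stand-in for a remote file that tallies bytes read.

    Hypothetical helper for illustration only; not a cudf or Arrow class.
    """

    def __init__(self, data: bytes):
        super().__init__(data)
        self.bytes_read = 0

    def read(self, size=-1):
        chunk = super().read(size)
        self.bytes_read += len(chunk)
        return chunk


def read_byte_range(fileobj, offset: int, length: int) -> bytes:
    """Seek to `offset` and read only `length` bytes from a seekable file."""
    fileobj.seek(offset)
    return fileobj.read(length)


# 1000 fixed-width, CSV-like rows of 8 bytes each (7 digits plus newline).
data = b"".join(b"%07d\n" % i for i in range(1000))

remote = CountingFile(data)
chunk = read_byte_range(remote, offset=80, length=16)

print(chunk)              # rows 10 and 11
print(remote.bytes_read)  # 16 of the 8000 bytes in the file
```

Any reader that only needs `seek` and `read` can consume a byte range this way, which is the property the description attributes to `NativeFile`-style objects.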

This is a follow-up to #9304, and is more-or-less the ORC version of #9376. These changes will enable partial IO to behave "correctly" for `cudf.read_orc` from remote storage.

Simple multi-stripe file example:

```python
# After this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 579 ms, sys: 166 ms, total: 744 ms
Wall time: 2.38 s

# Before this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 3.9 s, sys: 1.47 s, total: 5.37 s
Wall time: 8.5 s
```

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9377
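The stripe-selection speedup reported for #9377 can be pictured with a toy striped container. This is only a sketch under stated assumptions: the layout below (stripe payloads, then an offset/length index, then a stripe count) is invented for illustration and is not the ORC format, and `write_striped` and `read_stripes` are hypothetical helpers, not cudf functions.

```python
import io
import struct


def write_striped(stripes):
    """Serialize stripes followed by an index footer.

    Hypothetical layout: [stripe bytes...][(offset, length) pairs][stripe count]
    """
    buf = io.BytesIO()
    index = []
    for payload in stripes:
        index.append((buf.tell(), len(payload)))
        buf.write(payload)
    for offset, length in index:
        buf.write(struct.pack("<QQ", offset, length))
    buf.write(struct.pack("<Q", len(index)))
    return buf.getvalue()


def read_stripes(fileobj, wanted):
    """Read the footer first, then fetch only the requested stripes."""
    # The last 8 bytes hold the number of stripes.
    fileobj.seek(-8, io.SEEK_END)
    (count,) = struct.unpack("<Q", fileobj.read(8))
    # The index of (offset, length) pairs sits just before the count.
    fileobj.seek(-8 - 16 * count, io.SEEK_END)
    index = [struct.unpack("<QQ", fileobj.read(16)) for _ in range(count)]
    out = []
    for i in wanted:
        offset, length = index[i]
        fileobj.seek(offset)
        out.append(fileobj.read(length))
    return out


blob = write_striped([b"a" * 1000, b"b" * 1000, b"c" * 1000])
(first,) = read_stripes(io.BytesIO(blob), wanted=[0])
print(len(blob), len(first))
```

In this toy case, selecting one stripe touches the footer (56 bytes) plus a single 1000-byte stripe rather than the whole 3056-byte blob, which is the kind of saving a seekable file object makes possible.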

rjzamora added 25 commits September 9, 2021 13:31

save work related to byte-range collection

82ab31d

enable byte_ranges optimization for open file-like

ef02f3d

fix bug for no column or row-group selection

a32a7ae

Merge remote-tracking branch 'upstream/branch-21.10' into nativefile-…

26b96f3

…parquet

add arrow_filesystem flag for dask_cudf

bd2e59a

use cat_ranges when available

53bb32e

expose arrow_filesystem and legacy_transfer

ada8451

most tests passing with reasonable defaults - arrow_filesystem=True n…

661ae58

…ot stable

fix bug

42c55c9

Merge remote-tracking branch 'upstream/branch-21.10' into nativefile-…

51021ab

…parquet

legacy_transfer fix

ddec4df

fix test failures

36e4c52

plumb in csv support since most of the work is already done

98efb58

remove unncessary BytesIO usage for optimized code path

63dd615

Merge remote-tracking branch 'upstream/branch-21.10' into nativefile-…

40639c2

…parquet

avoid memory leaks in _read_byte_ranges

5524538

avoid full-file transfer for read_csv with byte_range defined

fd2998a

avoid seeking before beginning of file

d1cb7a6

remove arrow_filesystem option from dask (for now)

491c69f

save state

5994fd9

simplify PR to require NativeFile input (no more uri inference for now)

a821ba7

Merge remote-tracking branch 'upstream/branch-21.12' into nativefile-…

587ee5b

…parquet

add test coverage (csv not passing yet)

ab18cab

fixng datasource bug - requires code duplication for now

1095029

add s3-specific tests

d0a17c4

rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. Cython Performance Performance related issue improvement Improvement / enhancement to an existing function labels Sep 24, 2021

vyasr approved these changes Oct 4, 2021

jrhemstad reviewed Oct 4, 2021

cpp/include/cudf/io/datasource.hpp (outdated, resolved)

jrhemstad requested changes Oct 4, 2021

shwina reviewed Oct 4, 2021

python/cudf/cudf/_lib/io/datasource.pxd (outdated, resolved)

shwina reviewed Oct 4, 2021

python/cudf/cudf/_lib/io/datasource.pyx (outdated, resolved)

shwina reviewed Oct 4, 2021

python/cudf/cudf/_lib/io/datasource.pyx (outdated, resolved)

avoid default ctor for arrow_io_source

7375af5

rjzamora commented Oct 4, 2021

cpp/include/cudf/io/datasource.hpp (outdated, resolved)

Update cpp/include/cudf/io/datasource.hpp

c3591c8

github-actions bot removed the libcudf Affects libcudf (C++/CUDA) code. label Oct 4, 2021

rjzamora removed the 4 - Needs Review Waiting for reviewer to review or respond label Oct 4, 2021

rjzamora requested a review from jrhemstad October 4, 2021 20:44

jrhemstad approved these changes Oct 4, 2021

rjzamora added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels Oct 4, 2021

rapids-bot bot merged commit fb18491 into rapidsai:branch-21.12 Oct 4, 2021

rjzamora deleted the native-file-simple branch October 4, 2021 21:44

This was referenced Oct 5, 2021

Use Arrow PythonFile for remote CSV storage #9376

Merged

Support Arrow NativeFile and PythonFile for remote ORC storage #9377

Merged

rjzamora mentioned this pull request Oct 18, 2021

[Experimental] Optimize cudf/dask-cudf read_parquet for s3/remote filesystems #9225

Closed

rjzamora mentioned this pull request Jan 31, 2022

read_csv for s3 data #8961

Closed
