Commit
Add Arrow-NativeFile and PythonFile support to read_parquet and read_csv in cudf (#9304)

This PR implements a simple but critical subset of the features implemented and discussed in #8961 and #9225. Note that I suggest those PRs be closed in favor of a few simpler PRs (like this one).

**What this PR DOES do**:

- Enables users to pass Arrow-based file objects directly to the cudf `read_parquet` and `read_csv` functions. For example:

  ```python
  import cudf
  import pyarrow.fs as pa_fs

  fs, path = pa_fs.FileSystem.from_uri("s3://my-bucket/some-file.parquet")
  with fs.open_input_file(path) as fil:
      gdf = cudf.read_parquet(fil)
  ```

- Adds automatic conversion of fsspec `AbstractBufferedFile` objects into Arrow-backed `PythonFile` objects. For `read_parquet`, an Arrow-backed `PythonFile` object can be used (in place of an optimized fsspec transfer) by passing `use_python_file_object=True`:

  ```python
  import cudf

  gdf = cudf.read_parquet(path, use_python_file_object=True)
  ```

  or

  ```python
  import cudf
  from fsspec.core import get_fs_token_paths

  fs = get_fs_token_paths(path)[0]
  with fs.open(path, mode="rb") as fil:
      gdf = cudf.read_parquet(fil, use_python_file_object=True)
  ```

**What this PR does NOT do**:

- cudf will **not** automatically produce "direct" (e.g. HadoopFileSystem/S3FileSystem-based) Arrow NativeFile objects for explicit file-path input. It is still up to the user to create/supply a direct NativeFile object to read_csv/parquet if they do not want any python overhead.
- cudf will **not** accept NativeFile input for IO functions other than read_csv and read_parquet
- dask-cudf does not yet have a mechanism to open/process s3 files as "direct" NativeFile objects
- Those changes only apply to direct cudf usage

Props to @shridharathi for doing most of the work for this in #8961 (this PR only extends that work to include parquet and add tests).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)
- Vyas Ramasubramani (https://github.com/vyasr)
- Jake Hemstad (https://github.com/jrhemstad)

URL: #9304
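The input handling described in the commit message for `read_parquet` can be modelled roughly as below. This is only a sketch and not cudf's implementation: `NativeFileStandIn`, `PythonFileStandIn`, and `classify_parquet_input` are hypothetical names, and the stand-in classes are not pyarrow's. It uses only the standard library so that it runs without a GPU, and it assumes that a buffered Python file object read without the flag is simply copied into memory.

```python
import io


class NativeFileStandIn:
    """Hypothetical stand-in for an Arrow NativeFile (not pyarrow's class)."""

    def __init__(self, raw):
        self.raw = raw


class PythonFileStandIn(NativeFileStandIn):
    """Hypothetical stand-in for an Arrow PythonFile wrapping a Python file object."""


def classify_parquet_input(source, use_python_file_object=False):
    """Return (kind, payload) describing how a file-object input would be consumed."""
    if isinstance(source, NativeFileStandIn):
        # Arrow file objects are passed through untouched.
        return "native_file", source
    if isinstance(source, io.BufferedIOBase):
        if use_python_file_object:
            # Opt-in path: wrap the Python file object instead of copying it.
            return "python_file", PythonFileStandIn(source)
        # Assumed default: transfer the bytes up front.
        return "bytes", source.read()
    raise TypeError(f"unsupported input type: {type(source).__name__}")


buf = io.BytesIO(b"PAR1")
print(classify_parquet_input(buf, use_python_file_object=True)[0])  # python_file
```

The sketch leaves out file-path input, which per the list above is never turned into a "direct" NativeFile automatically.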
Showing 11 changed files with 195 additions and 30 deletions.
```diff
@@ -1,10 +1,13 @@
 # Copyright (c) 2020, NVIDIA CORPORATION.
 
-from libcpp.memory cimport unique_ptr
+from libcpp.memory cimport shared_ptr
 
-from cudf._lib.cpp.io.types cimport datasource
+from cudf._lib.cpp.io.types cimport arrow_io_source, datasource
 
 
 cdef class Datasource:
-
     cdef datasource* get_datasource(self) nogil except *
+
+cdef class NativeFileDatasource(Datasource):
+    cdef shared_ptr[arrow_io_source] c_datasource
+    cdef datasource* get_datasource(self) nogil
```
```diff
@@ -1,12 +1,26 @@
 # Copyright (c) 2020, NVIDIA CORPORATION.
 
-from libcpp.memory cimport unique_ptr
+from libcpp.memory cimport shared_ptr
+from pyarrow.includes.libarrow cimport CRandomAccessFile
+from pyarrow.lib cimport NativeFile
 
-from cudf._lib.cpp.io.types cimport datasource
+from cudf._lib.cpp.io.types cimport arrow_io_source, datasource
 
 
 cdef class Datasource:
     cdef datasource* get_datasource(self) nogil except *:
         with gil:
             raise NotImplementedError("get_datasource() should not "
                                       + "be directly invoked here")
+
+cdef class NativeFileDatasource(Datasource):
+
+    def __cinit__(self, NativeFile native_file,):
+
+        cdef shared_ptr[CRandomAccessFile] ra_src
+
+        ra_src = native_file.get_random_access_file()
+        self.c_datasource.reset(new arrow_io_source(ra_src))
+
+    cdef datasource* get_datasource(self) nogil:
+        return <datasource *> (self.c_datasource.get())
```
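The class layout in this diff (a base `Datasource` whose accessor raises, plus a subclass that owns a handle to a random-access file) can be illustrated with a plain-Python analogue. This is a sketch under stated assumptions, not the Cython code: `FileObjectDatasource` and `read_range` are hypothetical names, and an `io.BytesIO` buffer stands in for the Arrow `CRandomAccessFile` that the real class wraps in an `arrow_io_source`.

```python
import io


class Datasource:
    """Base class: subclasses must supply the underlying source."""

    def get_datasource(self):
        raise NotImplementedError(
            "get_datasource() should not be directly invoked here"
        )


class FileObjectDatasource(Datasource):
    """Hypothetical stand-in for NativeFileDatasource.

    Keeps a reference to a seekable binary file object so that the handle
    stays alive for as long as the datasource does.
    """

    def __init__(self, file_obj):
        if not (file_obj.readable() and file_obj.seekable()):
            raise ValueError("expected a readable, seekable binary file object")
        self._source = file_obj

    def get_datasource(self):
        return self._source


def read_range(source, offset, size):
    """Read `size` bytes at `offset` using only the base-class interface."""
    handle = source.get_datasource()
    handle.seek(offset)
    return handle.read(size)


buf = io.BytesIO(b"PAR1-example-bytes")
print(read_range(FileObjectDatasource(buf), 0, 4))  # b'PAR1'
```

A reader written against the base class needs only `get_datasource()`, which is presumably why the Cython subclass can return its `arrow_io_source` pointer cast to `datasource *`.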