Use Arrow PythonFile for remote CSV storage #9376
…zation (seems like it was wrong)
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9376 +/- ##
================================================
- Coverage 10.79% 10.76% -0.03%
================================================
Files 116 116
Lines 18869 19476 +607
================================================
+ Hits 2036 2096 +60
- Misses 16833 17380 +547
one q otherwise lgtm
@gpucibot merge
This is a follow-up to #9304, and is more-or-less the ORC version of #9376.

These changes will enable partial IO to behave "correctly" for `cudf.read_orc` from remote storage. Simple multi-stripe file example:

```python
# After this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 579 ms, sys: 166 ms, total: 744 ms
Wall time: 2.38 s

# Before this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 3.9 s, sys: 1.47 s, total: 5.37 s
Wall time: 8.5 s
```

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9377
This is a simple follow-up to #9304 and #9265 meant to achieve the following:
- After this PR, the default behavior of `cudf.read_csv` will be to convert fsspec-based `AbstractBufferedFile` objects to Arrow `PythonFile` objects for non-local file systems. Since `PythonFile` objects inherit from `NativeFile` objects, libcudf can seek/read distinct byte ranges without requiring the entire file to be read into host memory (i.e. the default behavior enables proper partial IO from remote storage).
- Optimized fsspec data transfer for remote file-systems (#9265) recently added an fsspec-based optimization for transferring CSV byte ranges into local memory. That optimization already allowed us to avoid a full file transfer when a specific `byte_range` is passed to the `cudf.read_csv` call. However, the simpler approach introduced in this PR is (1) more general, (2) easier to maintain, and (3) demonstrates comparable performance. Therefore, this PR also rolls back one of the less-maintainable optimizations added in #9265 (local buffer clipping).
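
For illustration only, here is a minimal stdlib sketch of the partial-IO idea described above: a reader that can seek and read a distinct byte range transfers only those bytes, not the whole file. `CountingFile` and `read_byte_range` are hypothetical names invented for this example. They stand in for a remote file handle and for the reader, and are not part of cuDF, fsspec, or Arrow.

```python
import io


class CountingFile:
    """Hypothetical stand-in for a remote file handle.

    It wraps an in-memory buffer and records how many bytes were
    actually handed to the caller, so we can see what a seek/read
    access pattern would transfer.
    """

    def __init__(self, payload: bytes):
        self._buf = io.BytesIO(payload)
        self.bytes_transferred = 0

    def seek(self, offset: int, whence: int = io.SEEK_SET) -> int:
        # Seeking moves the cursor but transfers no data.
        return self._buf.seek(offset, whence)

    def read(self, size: int = -1) -> bytes:
        data = self._buf.read(size)
        self.bytes_transferred += len(data)
        return data


def read_byte_range(f, offset: int, size: int) -> bytes:
    """Read `size` bytes starting at `offset` using only seek/read."""
    f.seek(offset)
    return f.read(size)


# A small CSV payload: a 4-byte header followed by 1000 rows.
csv_bytes = b"a,b\n" + b"".join(f"{i},{i * 2}\n".encode() for i in range(1000))

remote = CountingFile(csv_bytes)
chunk = read_byte_range(remote, offset=4, size=8)

print(chunk)
print(remote.bytes_transferred, "of", len(csv_bytes), "bytes transferred")
```

In the PR itself this role is played by the Arrow `PythonFile` wrapper around the fsspec file object. The sketch only mirrors the access pattern and makes no claim about how libcudf issues its reads.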