Use Arrow PythonFile for remote CSV storage #9376

Merged
merged 4 commits into from
Oct 6, 2021

rjzamora
Member

@rjzamora rjzamora commented Oct 5, 2021

This is a simple follow-up to #9304 and #9265 meant to achieve the following:

  • After this PR, the default behavior of cudf.read_csv will be to convert fsspec-based AbstractBufferedFile objects to Arrow PythonFile objects for non-local file systems. Since PythonFile objects inherit from NativeFile objects, libcudf can seek/read distinct byte ranges without requiring the entire file to be read into host memory. In other words, the default behavior enables proper partial IO from remote storage.

  • #9265 (Optimized fsspec data transfer for remote file-systems) recently added an fsspec-based optimization for transferring CSV byte ranges into local memory. That optimization already allowed us to avoid a full file transfer when a specific byte_range is passed to the cudf.read_csv call. However, the simpler approach introduced in this PR is (1) more general, (2) easier to maintain, and (3) demonstrates comparable performance. Therefore, this PR also rolls back one of the less-maintainable optimizations added in #9265 (local buffer clipping).
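
To illustrate the partial-IO idea described above, here is a minimal, standard-library-only sketch. It is not the code from this PR: `CountingFile` and `read_byte_range` are hypothetical names, and `CountingFile` merely stands in for a remote file handle that exposes the same `seek`/`read` interface a consumer like libcudf relies on. It only shows that seeking to an offset and reading a fixed number of bytes avoids pulling the whole file into host memory.

```python
import io


class CountingFile:
    """Hypothetical stand-in for a remote file handle.

    Tracks how many bytes were actually read, as a proxy for
    the amount of data transferred from remote storage.
    """

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)
        self.bytes_transferred = 0

    def seek(self, offset, whence=0):
        # Seeking moves the cursor only; no data is transferred.
        return self._buf.seek(offset, whence)

    def read(self, size=-1):
        chunk = self._buf.read(size)
        self.bytes_transferred += len(chunk)
        return chunk


def read_byte_range(f, offset, size):
    """Read `size` bytes starting at `offset` from a seekable file-like object."""
    f.seek(offset)
    return f.read(size)


# Build a small in-memory "remote" CSV file (illustrative data only).
csv_bytes = b"a,b\n" + b"".join(f"{i},{i * 2}\n".encode() for i in range(1000))
remote = CountingFile(csv_bytes)

# Only the requested byte range is read, not the whole file.
chunk = read_byte_range(remote, offset=4, size=16)
```

After the call, `remote.bytes_transferred` equals the requested range size rather than `len(csv_bytes)`, which is the behavior the seekable file wrapper is meant to enable for remote storage.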

@rjzamora rjzamora added 2 - In Progress Currently a work in progress Python Affects Python cuDF API. improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Oct 5, 2021
@rjzamora rjzamora self-assigned this Oct 5, 2021
@rjzamora rjzamora requested a review from a team as a code owner October 5, 2021 15:01
@rjzamora rjzamora added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Oct 5, 2021
@codecov

codecov bot commented Oct 5, 2021

Codecov Report

Merging #9376 (95d2e83) into branch-21.12 (ab4bfaa) will decrease coverage by 0.02%.
The diff coverage is 1.73%.


@@               Coverage Diff                @@
##           branch-21.12    #9376      +/-   ##
================================================
- Coverage         10.79%   10.76%   -0.03%     
================================================
  Files               116      116              
  Lines             18869    19476     +607     
================================================
+ Hits               2036     2096      +60     
- Misses            16833    17380     +547     
| Impacted Files | Coverage Δ |
|---|---|
| `python/cudf/cudf/__init__.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/_lib/__init__.py` | `0.00% <ø> (ø)` |
| `python/cudf/cudf/core/_base_index.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/categorical.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/column.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/datetime.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/lists.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/numerical.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/string.py` | `0.00% <0.00%> (ø)` |
| `python/cudf/cudf/core/column/timedelta.py` | `0.00% <0.00%> (ø)` |
| ... and 80 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 88eefe5...95d2e83.

Contributor

@brandon-b-miller brandon-b-miller left a comment

one q otherwise lgtm

@rjzamora
Member Author

rjzamora commented Oct 6, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 68c56b7 into rapidsai:branch-21.12 Oct 6, 2021
@rjzamora rjzamora deleted the csv-pythonfile branch October 6, 2021 20:27
rapids-bot bot pushed a commit that referenced this pull request Oct 7, 2021
This is a follow-up to #9304, and is more-or-less the ORC version of #9376

These changes will enable partial IO to behave "correctly" for `cudf.read_orc` from remote storage. Simple multi-stripe file example:

```python
# After this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 579 ms, sys: 166 ms, total: 744 ms
Wall time: 2.38 s

# Before this PR
%time gdf = cudf.read_orc(orc_path, stripes=[0], storage_options=storage_options)
CPU times: user 3.9 s, sys: 1.47 s, total: 5.37 s
Wall time: 8.5 s
```

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Charles Blackmon-Luca (https://github.com/charlesbluca)

URL: #9377
Labels
3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.