Use fsspec.parquet for improved read_parquet performance from remote storage #9589
Codecov Report
@@               Coverage Diff                @@
##           branch-22.02    #9589      +/-   ##
================================================
- Coverage         10.49%   10.42%   -0.07%
================================================
  Files               119      119
  Lines             20305    20604     +299
================================================
+ Hits               2130     2148      +18
- Misses            18175    18456     +281
Just a note that this test failure, although triggered by a bug that was "fixed" in acf3d08, is actually an existing bug in I will raise a separate issue about this and try to fix it before merging this PR (since taking advantage of the new |
Update: Copied the simple pyarrow-metadata fix into a stand-alone branch and submitted #9608.
…rquet optimization
This fixes a `read_parquet` bug discovered while iterating on #9589. Without this fix, the optimized `read_parquet` code path will fail when the pandas metadata includes index-column information. It may also fail when the data includes list or struct columns (depending on the engine that wrote the parquet file).

Authors:
- Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
- https://github.com/brandon-b-miller

URL: #9638
I'd like this to get in for 22.02, if possible (cc @quasiben)
Looking this over today 👍
couple of q's, looking great though.
rerun tests
Spoke to @rjzamora for a bit yesterday and got some context for this PR. This is a really worthwhile optimization (with great benchmarks), and while there's a lot going on here, a lot of the changes are actually delegating logic out of cuDF and into fsspec, so this should be pretty safe to merge for this release.
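To make the delegation mentioned above concrete, here is a minimal sketch of how `fsspec.parquet.open_parquet_file` can be used for a column-selective read. It is not the code in this PR: `read_selected_columns` is a hypothetical helper, and pandas stands in for cuDF so the example does not need a GPU (`cudf.read_parquet` also accepts file-like objects).

```python
def read_selected_columns(path, columns, storage_options=None):
    """Read only `columns` from a (possibly remote) parquet file.

    Hypothetical helper for illustration; not part of cuDF or fsspec.
    """
    # Imported lazily: both are optional dependencies for this sketch.
    import pandas as pd
    from fsspec.parquet import open_parquet_file

    # open_parquet_file inspects the parquet footer first, then transfers
    # only the byte ranges needed for the requested columns.
    with open_parquet_file(
        path,
        columns=columns,
        storage_options=storage_options,
        engine="pyarrow",
    ) as f:
        return pd.read_parquet(f, columns=columns)
```

A call might look like `read_selected_columns("s3://my-bucket/data.parquet", ["a", "b"])`, where the bucket path is a placeholder. Passing the same `columns` to both calls matters, since the file object only has the selected byte ranges cached.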
f"This version of fsspec ({fsspec.__version__}) does "
f"not support parquet-optimized precaching. Please upgrade "
f"to the latest fsspec version for better performance."
)
are there any plans around making this a requirement at some point?
Right - I was thinking about this for a while. The user will need fsspec>=2021.11.1 to benefit from the optimizations in this PR. However, if they are not reading parquet files from remote storage, then an older version should be fine. Therefore, I was hesitant to suggest any official version pinning.
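For reference, the "warn instead of pin" behavior described here could look roughly like the sketch below. It assumes 2021.11.1 is the first fsspec release that ships `fsspec.parquet`; `MIN_FSSPEC`, `version_tuple`, and `check_fsspec` are hypothetical names, and the PR itself may detect support differently (for example, by attempting the import).

```python
import re
import warnings

# Assumed minimum version for fsspec.parquet; illustrative only.
MIN_FSSPEC = (2021, 11, 1)


def version_tuple(version):
    """Convert '2021.11.1' (or '2021.11.1+5.gabc123') to (2021, 11, 1)."""
    parts = []
    for piece in version.split(".")[:3]:
        match = re.match(r"\d+", piece)
        parts.append(int(match.group()) if match else 0)
    return tuple(parts)


def check_fsspec(fsspec_version, is_remote):
    """Return True if the optimized path is available.

    Warns only when a remote read would miss the optimization;
    local reads with an older fsspec stay silent.
    """
    supported = version_tuple(fsspec_version) >= MIN_FSSPEC
    if is_remote and not supported:
        warnings.warn(
            f"This version of fsspec ({fsspec_version}) does "
            f"not support parquet-optimized precaching. Please upgrade "
            f"to the latest fsspec version for better performance."
        )
    return supported
```

With this shape, users who only read local files never see the warning, which matches the reasoning for not pinning the dependency.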
@gpucibot merge
Important Note: Marking this as WIP until the `fsspec.parquet` module is available in a filesystem_spec release.

This PR modifies `cudf.read_parquet` and `dask_cudf.read_parquet` to leverage the new `fsspec.parquet.open_parquet_file` function for optimized data transfer/caching from remote storage. The long-term goal is to remove the temporary data-transfer optimizations that we currently use in `cudf.read_parquet`.

Performance Motivation: