[FEA] Upgrade Arrow support to 4.0.0 #7224

Closed
mughetto opened this issue Jan 27, 2021 · 11 comments · Fixed by #7495
Labels
CMake CMake build issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Comments

@mughetto

mughetto commented Jan 27, 2021

Hi,

we have developed internally a Python package that uses pandas to read Parquet files with the pyarrow engine. After some testing a few months ago, it appeared that pyarrow 2.0 was much faster than 1.0 for large files, so we decided to enforce pyarrow>=2.0 in our requirements.txt.

But now that we want to use this package in a RAPIDS environment (say 0.17 + our local package install with pyarrow 2.0), we have noticed that import cudf fails with this trace:

  File "<stdin>", line 1, in <module>
  File "/home/user/.conda/envs/rapids-0.17/lib/python3.8/site-packages/cudf/__init__.py", line 11, in <module>
    from cudf import core, datasets, testing
  File "/home/user/.conda/envs/rapids-0.17/lib/python3.8/site-packages/cudf/core/__init__.py", line 3, in <module>
    from cudf.core import buffer, column, common
  File "/home/user/.conda/envs/rapids-0.17/lib/python3.8/site-packages/cudf/core/column/__init__.py", line 3, in <module>
    from cudf.core.column.categorical import CategoricalColumn
  File "/home/user/.conda/envs/rapids-0.17/lib/python3.8/site-packages/cudf/core/column/categorical.py", line 8, in <module>
    from cudf import _lib as libcudf
  File "/home/user/.conda/envs/rapids-0.17/lib/python3.8/site-packages/cudf/_lib/__init__.py", line 4, in <module>
    from . import (
  File "cudf/_lib/gpuarrow.pyx", line 1, in init cudf._lib.gpuarrow
AttributeError: module 'pyarrow.lib' has no attribute '_CRecordBatchReader' 

Bumping back to pyarrow 1.0.1 solved the issue but represents a loss of performance for us on the pure pandas side.

Are we missing something obvious that would allow us to get pyarrow 2.0 to work with cudf? If not, is there any plan to make things compatible with more recent versions of pyarrow (3.0 as of yesterday)?

Please let me know if you need more details about our environments.

Thanks a lot!
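
The import failure above appears only when the installed pyarrow is newer than the one cudf 0.17 works with, so a pre-flight version check can turn the low-level `AttributeError` into a clearer message. The following is a minimal sketch, not part of cudf: the helper names (`parse_version`, `in_supported_range`, `check_pyarrow`) are hypothetical, and the default range `1.0.1 <= version < 2.0.0` is an assumption taken from what this thread reports (1.0.1 works, 2.0 fails).

```python
from importlib import metadata


def parse_version(text):
    """Turn '2.0.0' or '1.0.1.post1' into a 3-tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break  # stop at the first non-numeric component such as 'post1'
        parts.append(int(digits))
    # Pad so that '2.0' and '2.0.0' compare as equal.
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])


def in_supported_range(installed, minimum, upper_bound):
    """Return True when minimum <= installed < upper_bound."""
    return parse_version(minimum) <= parse_version(installed) < parse_version(upper_bound)


def check_pyarrow(minimum="1.0.1", upper_bound="2.0.0"):
    """Return (ok, message) using package metadata only; nothing is imported.

    The default bounds are an assumption based on this thread, not an
    official compatibility matrix.
    """
    try:
        installed = metadata.version("pyarrow")
    except metadata.PackageNotFoundError:
        return False, "pyarrow is not installed"
    ok = in_supported_range(installed, minimum, upper_bound)
    verdict = "inside" if ok else "outside"
    return ok, f"pyarrow {installed} is {verdict} the range [{minimum}, {upper_bound})"


if __name__ == "__main__":
    ok, message = check_pyarrow()
    print(message)
```

Running the check before `import cudf` reports the mismatch up front; the bounds would need updating for any RAPIDS release that supports a different Arrow version.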

@mughetto mughetto added Needs Triage Need team to review and classify question Further information is requested labels Jan 27, 2021
@kkraus14 kkraus14 added CMake CMake build issue Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. and removed Needs Triage Need team to review and classify labels Jan 27, 2021
@lmeyerov

Also relevant: arrow format version support vs. lib version dependency - see http://arrow.apache.org/docs/format/Versioning.html . I didn't find a clear breakdown on their sites for tracking these.

@kkraus14
Collaborator

Hey @mughetto we're currently evaluating our options here. While there are people wanting newer versions of Arrow, others need to continue using Arrow 1.x, so ideally we'd support 1.0.1+, but that's not 100% straightforward to do. We're currently looking into this.

@lmeyerov

To Keith's point: more important to us than the particular decision is clarity on RAPIDS release timing when changing Arrow formats and, ideally, a sense of what the expected impact area would be.

@mughetto
Author

> Also relevant: arrow format version support vs. lib version dependency - see http://arrow.apache.org/docs/format/Versioning.html . I didn't find a clear breakdown on their sites for tracking these.

Yeah the chain of dependencies and who fetches what where is a bit hard to track down at the moment :/

@mughetto
Author

@kkraus14 Ok thanks a lot for the quick answer!

@github-actions

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

@kkraus14
Collaborator

Pandas 1.2 support was added in #7375

@kkraus14 kkraus14 changed the title from "[QST] Pandas/Pyarrow/cudf version compatibilities" to "[FEA] Upgrade Arrow support to 4.0.0" Mar 26, 2021
@kkraus14 kkraus14 added feature request New feature or request and removed question Further information is requested labels Mar 26, 2021
@kkraus14
Collaborator

This was attempted in #7495, but it was found that there were blocking bugs in both Arrow 2.0.0 and 3.0.0 that prevented upgrading. These should be fixed in 4.0.0, where we'll try to upgrade.

@bdice
Contributor

bdice commented Apr 1, 2021

@kkraus14 @galipremsagar Just came across this issue while using cuDF -- I have some other requirements that rely on newer Arrow versions. Could you provide any details about the blocking bugs you encountered in Arrow 2 / 3? I didn't see anything obvious in the comments or gpuCI build logs of #7495.

@kkraus14
Collaborator

kkraus14 commented Apr 1, 2021

> @kkraus14 @galipremsagar Just came across this issue while using cuDF -- I have some other requirements that rely on newer Arrow versions. Could you provide any details about the blocking bugs you encountered in Arrow 2 / 3? I didn't see anything obvious in the comments or gpuCI build logs of #7495.

It was discussed on the Arrow mailing list, but on Arrow 3.0.0, you can't create Arrow Arrays or Arrow Tables from GPU-backed Buffer objects. In Arrow 2.0.0, there was a bug that prevents round-tripping lists-of-structs columns in the Parquet Reader/Writer.

We believe all of the current known issues on our side are fixed in the current tip of Arrow, and the 4.0.0 release is scheduled for April, which would give us plenty of time to upgrade in our 0.20 release.
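
For readers who want to check the Parquet issue described above against their own pyarrow install, here is a minimal CPU-only sketch of a round trip for a list-of-structs column. It is not the reproducer from the Arrow mailing list or from #7495; the function name `roundtrip_list_of_structs` and the sample data are hypothetical, and the guarded import is only there so the snippet still loads where pyarrow is absent. It does not cover the GPU-backed Buffer problem, which needs a CUDA device.

```python
try:
    import pyarrow as pa
    import pyarrow.parquet as pq
except ImportError:  # keep the sketch loadable where pyarrow is not installed
    pa = pq = None


def roundtrip_list_of_structs():
    """Write a list<struct> column to in-memory Parquet and read it back.

    Returns True if the values survive the round trip, False if they differ,
    and None if pyarrow is not available.
    """
    if pa is None:
        return None

    # Illustrative data: non-empty lists, an empty list, and a null field.
    rows = [
        [{"a": 1, "b": "x"}, {"a": 2, "b": "y"}],
        [],
        [{"a": 3, "b": None}],
    ]
    struct_type = pa.struct([("a", pa.int64()), ("b", pa.string())])
    column = pa.array(rows, type=pa.list_(struct_type))
    table = pa.table({"col": column})

    # Write to an in-memory buffer instead of a file on disk.
    sink = pa.BufferOutputStream()
    pq.write_table(table, sink)

    restored = pq.read_table(pa.BufferReader(sink.getvalue()))
    return restored.column("col").to_pylist() == rows


if __name__ == "__main__":
    print(pa.__version__ if pa is not None else "pyarrow not installed")
    print(roundtrip_list_of_structs())
```

On a version affected by the bug, the call may return False or raise during the write or read step; either outcome signals that the round trip is not reliable for that install.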

@github-actions

github-actions bot commented May 1, 2021

This issue has been labeled inactive-30d due to no recent activity in the past 30 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed. This issue will be labeled inactive-90d if there is no activity in the next 60 days.

rapids-bot bot pushed a commit that referenced this issue Jun 29, 2021
Fixes: #7224

This PR:

- [x] Adds support for arrow 4.0.1 in cudf.
- [x] Moves testing-related utilities to `cudf.testing` module.
- [x] Fixes miscellaneous errors related to arrow upgrade.

Authors:
  - GALI PREM SAGAR (https://github.com/galipremsagar)
  - Paul Taylor (https://github.com/trxcllnt)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Ram (Ramakrishna Prabhu) (https://github.com/rgsl888prabhu)
  - Jeremy Dyer (https://github.com/jdye64)
  - Paul Taylor (https://github.com/trxcllnt)
  - Dillon Cullinan (https://github.com/dillon-cullinan)
  - Devavret Makkar (https://github.com/devavret)
  - Keith Kraus (https://github.com/kkraus14)
  - Michael Wang (https://github.com/isVoid)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #7495