-
-
Notifications
You must be signed in to change notification settings - Fork 18k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
PDEP-10: Add pyarrow as a required dependency #52711
Conversation
by PyArrow including: | ||
|
||
- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations | ||
- Avoiding NumPy object data types more by default for analogous types that have native PyArrow support such as decimal, binary, and nested types |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I might be too optimistic, but having pyarrow as a required dependency has the potential to make the c/cython-code for read_csv and read_json obsolete (if they are on par and similarly fast).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
that would be a compile time dependency which we are not contemplating at the current time; possibly could propose in the future
- The minimum version of PyArrow will be bumped every major pandas release to the highest | ||
PyArrow version that has been released for at least 2 years. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
using the major.minor.patch terminology, major could be 2-3 years (ignoring for now the proposal by some to make this more frequent) and minor is 6-9 months.
It is not clear here, is the minimum supported version kept for all minor releases in this proposal?
Near the tail end of the major release cycle, the minimum supported version of pyarrow could be 5 years old?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is not clear here, is the minimum supported version kept for all minor releases in this proposal?
Correct
Will there also be a version cap? NumPy is extremely conservative with breaking changes. I can't recall a case where a cap would have helped avoid issues with NumPy changes, especially deprecations and removals. Is pyarrow similarly stable? Does pyarrow have an implicit or explicit deprecation policy? If not, would there need to be a cap on each release too? Recently in a number of projects I've been downstream of SciPy which has been doing a lot of long-needed but painful clean up. This has resulted in cases where not too old releases of downstream projects are breaking against the most recently released SciPy. |
Additionally, requiring PyArrow would simplify the related development within pandas and potentially improve NumPy functionality that would be better suited | ||
by PyArrow including: | ||
|
||
- Avoiding runtime checking if PyArrow is available to perform PyArrow object inference during constructor or indexing operations |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure this comment by Will has been addressed (unless I missed it?)
to make it easier to find: the link is here, and says:
Are there any small code samples we can add to drive this point home? I think still we would make a runtime determination whether to return a pyarrow or numpy-backed object even if both are installed, no?
This PDEP proposes that: | ||
|
||
- PyArrow becomes a runtime dependency starting pandas 2.1 | ||
- The minimum version of PyArrow supported starting pandas 2.1 is version 7. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would this version be consistent across the entire pandas API?
e.g. If I wanted to bump the pyarrow version for just the CSV parser to something higher, would I be able to do it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The minimum version would be consistent across the library, but IMO that shouldn't stop development of features that exist in newer versions of pyarrow (we already do this with version checking or try/except)
In one of the linked issues there is the comment about pandas being backend agnostic. Is it possible that this PDEP can broach that and consider how making pyarrow a dependency defines that objective as being unacheivable or how it would fit in to such a concept. This is not clear to me, or whether that is indeed a goal. Primarily I have used numpy for linear algebra and pandas as an extension for better indexing, data wrangling and output. the added bloat when incorporating into web apps with limited build space is concerning. |
This isn't a really useful discussion to have right now (and IMO shouldn't be in scope here). Even if we get to a point where NumPy is optional (this is years away), NumPy is still a required dependency of PyArrow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I made some comments, but I'd really like to understand the burden on development if we left things as they are in terms of it being an optional dependency. How much simpler does the code base become if PyArrow is required?
I'm also concerned about the lack of support on Alpine Linux for PyArrow, and maybe we should wait for that before accepting this PDEP.
Is it a good comparison to say that trying to develop arrow-backed arrays without pyarrow as a dependency is like trying to develop NumPy-backed arrays/blocks without NumPy as a dependency? |
Co-authored-by: Irv Lustig <[email protected]>
Co-authored-by: Irv Lustig <[email protected]>
Co-authored-by: Irv Lustig <[email protected]>
date and times are 2 additional dtypes that we can infer properly. Added them to the list as well.
|
IMHO, it's better that we get some feedback, rather than none. The wording as proposed doesn't commit us to saying we will not require pyarrow if we get negative feedback - it just says that we will get the feedback, which gives us the possibility of delaying the requirement based on that feedback. |
improve structure, list user benefits more clearly, add faq
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small comment about the word "horrendous"
Reworded the horrendous thing. Let's start the voting now |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, let's do this. After recent discussions and clarifications, I'm sold: mainly based on superseding object
dtype in string, list, and struct dtypes, which would be a real and immediate benefit to users
To emphasise: if accepted, this would not change the default for dtypes which are currently non-object numpy dtypes (that would be a separate discussion). It would only change the default for dtypes which are currently object
(which I don't think I'm alone in considering an embarrassment)
Agreed. I'd like to propose that we use the newly proposed voting procedure for PDEP-1 on this. I've created an issue at #54106 for the core team to place their votes. |
Not sure whether I mentioned this here already: Some reading material with Benchmarks for Arrow support/strings |
This was accepted in our vote. So updated status and will merge |
I've tried to summarize the motivations and drawbacks from #52509 in this PDEP. Please let me know if there are any reasons on either side that I am missing.
Feel free to continue the discussion in #52509, but as a reminder the voting will take place here.
cc @pandas-dev/pandas-core