RFC: Use Apache Arrow Parquet Crate #6735
Comments
cc @jorgecarleitao in case he has any thoughts as well. |
Thanks for starting this discussion! I think this would be a great addition for polars. Parquet is one of the most abundant and important data formats for the workloads that polars aims for. Given the importance of parquet to polars (and the work being done on
I very much agree on this one and I think that by working together on this, we will come up with things that close the gap. I have been studying the
This will be worth it. As mentioned above, parquet is important. :)
This might be a chance to come up with an abstraction one level higher. These might be traits only, where arrow-rs/arrow2 implement those traits. I am just thinking aloud here; it might not be possible.
In any case I think this can be great for both |
Works for me, this would also allow side-by-side comparison
I'm not sure traits are the way to go here, as then everything ends up generic, polluting all the interfaces. It should be possible to transparently do zero-copy conversion from one arrow representation to another, without needing to change any interfaces? |
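To make the zero-copy idea concrete, here is a minimal sketch using only the Rust standard library. `ArrayA` and `ArrayB` are hypothetical stand-ins for the array types of two Arrow implementations; neither name exists in arrow-rs, arrow2 or polars. A real conversion between Arrow implementations would more likely go through the Arrow C Data Interface; this only models the ownership hand-off, where the conversion moves a reference-counted buffer rather than copying values, so callers on either side keep their concrete, non-generic interfaces.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for an array type from one Arrow implementation.
#[derive(Debug, Clone)]
pub struct ArrayA {
    values: Arc<[i32]>,
}

/// Hypothetical stand-in for the equivalent array type in another implementation.
#[derive(Debug, Clone)]
pub struct ArrayB {
    buffer: Arc<[i32]>,
}

impl ArrayA {
    /// Builds the array; the values are placed in a shared, reference-counted buffer.
    pub fn new(values: Vec<i32>) -> Self {
        Self { values: Arc::from(values) }
    }

    /// Address of the first value, exposed only to demonstrate buffer identity.
    pub fn as_ptr(&self) -> *const i32 {
        self.values.as_ptr()
    }
}

impl ArrayB {
    pub fn values(&self) -> &[i32] {
        &self.buffer
    }

    /// Address of the first value, exposed only to demonstrate buffer identity.
    pub fn as_ptr(&self) -> *const i32 {
        self.buffer.as_ptr()
    }
}

impl From<ArrayA> for ArrayB {
    fn from(a: ArrayA) -> Self {
        // Moves the reference-counted pointer; the values themselves are not copied.
        Self { buffer: a.values }
    }
}
```

Converting with `ArrayB::from(a)` leaves the data at the same address, which is the property that keeps such a boundary conversion cheap.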
Yes, that should definitely be possible via FFI. I mean we already do this with pyarrow.
Fair enough! 😄 Do you think it would be possible to |
It would be hard to handle this outside the crate, as the schema inference logic is fairly complex, see here, and needs to interact with the record shredding to support all the various forms of list. This would likely require exposing implementation details I'm not sure we really wish to expose. Similarly when reading/writing dictionaries, there are scenarios where you need to go from a PLAIN to/from DICTIONARY representation. This could be reimplemented in parquet, and I actually had a PR that did this, but given the majority of arrow-cast is actually these dictionary conversions, you don't gain a huge amount by doing this. TLDR I wouldn't do this as part of a first-pass at an integration, longer-term who knows - perhaps polars might even use arrow-cast if the integration effort goes smoothly 😅 |
Right, I trust you on this. I agree this is very premature optimization on my part. 👍 |
Just starting to work on this now, here is the high-level outline of what I plan to do. I'm not familiar with the polars codebase, so please correct me if I'm about to undertake something stupid 😅 Add a Add a new Add a new Modify Modify I believe the above should be sufficient to enable apache parquet for streaming execution if This is predicated on my understanding that if streaming is set to true |
I think we should not dictate control flow by feature flags. This makes sense for Rust users, but for python users I want to be able to compile both readers and expose an
This is correct. Though I think once we have a
Yes that should do it. 👍 |
How would this be plumbed through, the reason for feature flags was to avoid having to make breaking API changes. ParquetOptions would be the obvious choice perhaps, but most of the APIs seem to list arguments manually? |
We could put the specific options in the enum itself and keep the generic ones as list arguments? Do you expect the arguments to differ much? Somewhat breaking the API is fine. 👍 |
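A rough sketch of what "specific options in the enum, generic ones as arguments" could look like, assuming a runtime choice between two readers. Every name here (`ParquetBackend`, `Native`, `Apache`, `low_memory`, `batch_size`, `plan_scan`) is hypothetical and chosen for illustration; none is taken from the polars API. The function only returns a description of the scan it would run, so the dispatch can be seen without either reader being present.

```rust
/// Hypothetical reader selection; backend-specific options live in the variants.
#[derive(Debug, Clone, PartialEq)]
pub enum ParquetBackend {
    /// Placeholder option for the existing reader.
    Native { low_memory: bool },
    /// Placeholder option for the Apache reader.
    Apache { batch_size: usize },
}

/// Generic arguments (`path`, `n_rows`) stay as plain parameters;
/// only the backend-specific settings travel inside the enum.
pub fn plan_scan(path: &str, n_rows: Option<usize>, backend: &ParquetBackend) -> String {
    let limit = match n_rows {
        Some(n) => format!("first {} rows", n),
        None => "all rows".to_string(),
    };
    // Dispatch happens at runtime, so both readers can be compiled into one build.
    match backend {
        ParquetBackend::Native { low_memory } => {
            format!("native reader: {}, {}, low_memory={}", path, limit, low_memory)
        }
        ParquetBackend::Apache { batch_size } => {
            format!("apache reader: {}, {}, batch_size={}", path, limit, batch_size)
        }
    }
}
```

Because the choice is a value rather than a feature flag, a Python binding could expose it as an ordinary argument, and adding a new backend-specific option touches only the relevant variant.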
This sounds like |
I'm not sure I follow what you are saying. Where would this enum be placed? I could add it to Should I make
I plan to see what shakes out of this effort; using FFI to convert between the two should be a case of a couple of lines of code, so it may not warrant a separate crate. It should be as simple as
|
Thanks for the cc and sorry for the late reply. Lots of great input here! My opinion of arrow-rs has not fundamentally changed - it continues to be a crate with a design that is prone to unsoundness, and it continues to have unsound cases being found on basic functionality (such as FFI). Exposing Polars to this crate (through the parquet crate) will make people using Polars on servers (e.g. through fastAPI) more likely to be vulnerable. Regarding parquet itself, my understanding is that:
One aspect to take into account is that the main contributors to parquet and arrow-rs are paid to work on them part or full time, while this is not the case for arrow2 / parquet2. I read somewhere that people cannot work on arrow2 / parquet2 because it is not part of the Apache foundation, but Polars is not either, so I am confused. Given the above, in my view, the main work that needs to happen to address the issues in this post is:
Doing this would result in a significantly simpler dependency tree, in likely faster reading, and in Polars having core dependencies that fulfill Rust's hypothesis that zero-cost abstractions and idiomatic safe code result in fewer memory bugs. Regarding JSON and CSV, my suggestion is to lift the abstraction into a crate that everyone can benefit from. I have done this many times in arrow2 so that others can benefit without having to depend on arrow2 itself. Examples: Case in point is: I love the implementation of the push-based JSON reader in arrow-rs, but I can't use it in arrow2 without dragging in arrow-rs as a dependency. |
I'm sorry you still feel this way, but I think we are going to have to agree to disagree at this point. I'm not aware of any major soundness issues in the last 6 months, and frankly I grow tired of this FUD. My stance remains unchanged from apache/arrow-rs#1176 (comment) and I would rather spend my time moving the community forward than continuing to fragment it. I will defer to @ritchie46, but given this crate makes significant use of unsafe, not to mention pyarrow, I'm not sure how important this is to polars.
DataFusion does not automatically parallelize parquet scans, polars does. The fact this is the only situation in which it is slower despite only using a single thread is perhaps telling. I suppose we shall see...
Or we could pool our efforts on making one implementation work well 😄 |
Ok, that was not my intention and I am sorry that you feel this way. This is certainly not productive. I replied to your comment there to try to find a way forward. |
Since we've recently incorporated |
Problem description
Following on from discussions between myself, @ritchie46 and @alamb, I would like to propose migrating polars to use the parquet crate. I am very happy to work on making this happen, but want to get consensus on the path forward first 😄
Disclaimer: I am one of the maintainers of the Apache Arrow crates
Why
List<u32>
jorgecarleitao/arrow2#1368
Why Not
The parquet crate is a non-trivial additional dependency, and whilst it doesn't depend on the top-level arrow crate, it does depend on arrow-cast
arrow2 and arrow