-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Discussion: Switch DataFusion to using arrow2? #1532
Comments
I believe the current proposal is to make an official arrow branch in datafusion: #68 (comment), which is probably a step towards switching to arrow2 |
Will |
I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide? |
Thank you @alamb for bringing this up!
Yes, this aligns with what I have in mind. The official arrow2 branch was proposed so we can close that long running PR and have a centralized location for folks to collaborate on the migration until we are happy with the master merge. If the community is happy with merging directly into master and iterate there, that would work as well.
I believe so. However, we could probably save this work and get it for free with the arrow2 switch.
IMHO, the main downside is the switch cost and downstream impact. But I think it's a one time cost that's worth paying. I think arrow2 at this point should have covered most of all our needs in datafusion as demonstrated in #68. All unit and integration tests are passing at the moment. |
I think someone just needs to put in the effort to make the current
If this is really true (I haven't checked) it certainly sounds compelling I like the idea of making an official "arrow2" branch in DataFusion, getting some more 👀 on it, and then propose it as a PR to merge to datafusion master. |
That's good enough for me |
Btw the latest arrow2 commit that still has RecordBatch is jorgecarleitao/arrow2@ef7937d it would probably be good to use that as a starting point for a temporary arrow2 fork ? That would allow me to integrate the necessary patches for some features such as decimal, without having to switch to having RecordBatch in datafusion. |
Thank you for considering using arrow2, very excited about this! To provide some selling points, the primary goals of the repo have been:
atm it is the fastest implementation of Apache Parquet IO and Apache Avro IO that I can find (both read and write), both supporting The crate is under active development, both in volume (~800 commits in a year), and also exploring different ideas, such as
(which is a major reason why it is 0.X, to allow space to try things out) The crate has been adopted by Polars, databend, grafana's SDK for Rust and is interoperable with connectorx. Releases have been happening about once a month (breaking), and on demand for bug fixes. The next is planned for end of this week. I hope this offers a general idea of what is the crate and where it is heading. |
FWIW it would be relatively straightforward to support async IO within the context of arrow-rs. You need buffered fetching in order to get reasonable IO performance anyway, and so you just do an async fetch into a buffer and then use the sync decoders to decode it. I believe this is what arrow2 is doing anyway?? I quickly cobbled something together showing how this can be done with parquet here. FWIW I have some optimisations to the arrow-rs parquet reader in flight that yield some pretty significant speedups apache/arrow-rs#1054, apache/arrow-rs#1082. And I am planning to work on dictionary preservation next which should yield orders of magnitude speedups for string dictionaries. I would personally prefer an approach that sees the great work on arrow2 cherry-picked into arrow-rs, with |
I have heard lots of excitement about |
So my summary of this ticket so far is that the next step is to get a PR up to datafusion with the most up to date code to get arrow2 working. In parallel, I will plan to start some discussions (hopefully later in the week) on the apache arrow dev list about potential ways to get Looking forward to seeing a PR so we can assess how close/far we are from this goal. |
My branch got merged into the fork so now we only need to address a few remaining issues that break tests. |
For me personally, on top of the highly optimized parquet, avro and json io modules, I really like it's transmute free design and the muttable array abstraction. The latter is the main reason why delta-rs is also in the process of migrating to arrow2. From previous discussions in the arrow dev list, I believe Jorge tried applying his arrow2 learnings back to arrow-rs last year, but decided that it's not worth the effort because it would require basically rewriting the majority of the code base. My main concern with cherry-picking arrow2 designs into arrow-rs is that we are spending all these efforts into making arrow-rs as good as arrow2 while on the other hand we could have spent the same amount of efforts into making arrow2 even better, which will not only benefit datafusion, but a much larger community including other projects that are currently using arrow2. IMHO, there is value in forking an open-source repo when fundamental design tradeoffs diverges. But from what I have seen so far, both arrow2 and arrow-rs contributors are pretty aligned on the direction of where an ideal arrow rust implementation should go?
I agree 100%. That's why I think it would be good if we can come up with a way to avoid cherry-picking commits from arrow2 into arrow-rs. Perhaps we can have arrow-rs build on top of arrow2 so they still share the majority of the code base? For example, arrow-rs could focus on providing a higher level and stable API for consumers while using arrow2 as the core. That way from contributors' point of view, it will be clear where they should send their patches to depending on which layer they work on. |
As for the datafusion arrow2 branch, the PR is now available for review at #1556. I encourage everyone to:
|
Sorry, I meant more cherry-picking ideas, not actual implementation. As in you might copy across arrow-2's I guess I've just had bad past experiences of simultaneously changing all the things at once 😆. Having looked at the This is therefore unlikely to be a strictly additive change, and I'm having a very hard time getting my head around all of its implications. That's all I really care about, that we can communicate something more than "everything may or may not be broken" 😆 1. FWIW this is the thing that makes reading parquet tricky, as pages don't delimit rows across columns or even semantic records within a column. If you just read row groups, it will be simple and fast, but recommendations are for row groups on the order of 1GB compressed so the memory footprint of such an approach is unfortunate 😅 |
I would like to thank all of you have have been working on the PR, and also to all of those that already gave it a spin. Incredibly humbled and thankful for it. @tusvold, thanks a lot for raising these concerns, much appreciated and valid. I agree with you that batch control is useful to avoid a very large memory footprint. I have added it as an issue on arrow2. Let me know if it captures your main point. wrt to the encoders, I have been challenged in finding parquet writers that can write such encodings, so that we can integration-test them against when implementing them. I would be happy to add them to our CI and prove correctness and completeness of the implementation (and fix what is missing) - the process in arrow2 wrt to formats has been that we need at least one official implementation or 2 non-official implementations to confirm correctness of arrow2's implementation. Since a comparison was made, I think that we could be fair and enumerate disadvantages and advantages of each other. For example, arrow-rs/parquet currently supports for deep nested parquet types, while arrow2 does not. Datafusion does not use them much, but there are important use-cases where they appear. Arrow-rs has a larger user-base by crate downloads and it is an official implementation. Arrow-rs also has pyarrow support out of the box (for those using pyo3), while arrow2 does not. OTOH, arrow2 implements the complete arrow specification (under the process mentioned above), has Now, under the argument that it is the same in-memory format after all and what matters is feature completeness, I could argue that we should then create a thin FFI wrapper for the C++ implementation in Rust, abandon the Rust implementations altogether, and all contribute to the official C++. Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer, it is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines, web applications leveraging arrow's forte, cool web-based streaming apps leveraging Arrow and Tokio, distributed query engines on AWS lambdas, etc., on a programming paradigm centred around correctness, easiness of use, and performance. The fact that datafusion never passed MIRI checks and that it has been like this since its inception shows that these are not simple to fix issues nor the arrows' internals are sufficiently appealing for the community to fix it (at least to the unpaid ones like myself). Furthermore
With that said, is there a conclusion that the root cause for the high memory usage results from not batching parquet column chunks in smaller arrays, or is it an hypothesis that we need to test? Is that the primary concern here and it is sufficiently important to block adoption? If yes, I would gladly work towards addressing it upstream. |
I believe Andrew intends to start a separate discussion about how to unify development effort around arrow2 and arrow-rs and this particular discussion is probably better had there. Apologies for derailing this thread, I appreciate that not everyone may share the perspective that they are the same issue. |
I have filed apache/arrow-rs#1176 for a discussion on what should we do with arrow / arrow2 if datafusion switched to using arrow2. FWIW I think the decision to switch datafusion or not should be made independently (based on whatever is best for DataFusion) but the switch I think would have major implications for arrow / arrow2 |
💯
💯 I am not currently actively involved in development with DataFusion, but if I were, I would be offering to help with the transition to arrow2. Now that a PR is up to move to arrow2 I will at least try and help with some testing and benchmarking. I am really excited to see this happening. 😍 |
Cross-posting related mailing list discussion: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo |
Given how arrow2 has fine grained controls over io features im wondering if it would make sense to pass that through to datafusion so you only have to install IO features that are needed. Im thinking of this in the use case of using datafusion in ETL jobs where each task has its own container and may only need to use 1 or 2 file types. this could be a way to help limit container size / speed up installation. |
That requires patching file formats with feature flags. |
Is this still planned? Seems like there has been little movement since July (besides #1039 in September) |
I believe @v0y4g3r is working on getting the arrow2 branch updated in #2855 as part of #2709 I'm not sure what the long-term plans for this effort are, especially as the implementations continue to diverge in functionality, in both directions. I don't believe a wholesale switch is likely in the foreseeable future, it certainly isn't planned, but there have been some discussions about allowing users to mix and match arrow implementations, including arrow-gpu. Anything is possible so long as people are motivated to achieve it 😄 |
Yeah -- short answer is that no one has gathered sufficient effort to get the code unified. |
I think it is also important to note that many of the ideas from @jorgecarleitao in arrow2 have now been incorporated into arrow-rs |
I think arrow-rs may be the safe bet for now. Especially since @jorgecarleitao has been fairly busy recently to to spearhead things. Pity, I really liked the direction of the arrow2 apis. |
An "update" on this is I intend to use pola-rs/polars#6735 as an opportunity to explore the possibilities for improved interoperability between arrow and arrow2. I'm fairly optimistic we can make use of the FFI APIs to provide inexpensive, zero-copy conversion between the two libraries, allowing people to mix and match as desired. |
Update is I believe rather than switching DataFusion to use arrow2, we are likely going to combine arrow-rs and arrow2 -- see discussions on apache/arrow-rs#1176 (comment) Let's move the discussion there |
Cool. Thanks for update! |
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Datafusion currently relies on the https://github.com/apache/arrow-rs implementation of Apache Arrow. This also means any project that is built on DataFusion is likely to end up using that implementation as well
There has been various talk / discussion / work on switching to arrow2 - https://github.com/jorgecarleitao/arrow2 from @jorgecarleitao
Describe the solution you'd like
A consensus on if we want to switch datafusion to using arrow2
Additional context
The text was updated successfully, but these errors were encountered: