
Discussion: Switch DataFusion to using arrow2? #1532

Closed
alamb opened this issue Jan 9, 2022 · 31 comments
Labels
documentation Improvements or additions to documentation

Comments

@alamb
Contributor

alamb commented Jan 9, 2022

Is your feature request related to a problem or challenge? Please describe what you are trying to do.
Datafusion currently relies on the https://github.com/apache/arrow-rs implementation of Apache Arrow. This also means any project that is built on DataFusion is likely to end up using that implementation as well

There has been various talk / discussion / work on switching to arrow2 - https://github.com/jorgecarleitao/arrow2 from @jorgecarleitao

Describe the solution you'd like
A consensus on whether we want to switch datafusion to using arrow2

Additional context

@alamb added the documentation and design labels Jan 9, 2022
@alamb
Contributor Author

alamb commented Jan 9, 2022

I believe the current proposal is to make an official arrow2 branch in datafusion: #68 (comment), which is probably a step towards switching to arrow2

@thinkharderdev
Contributor

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

@hntd187
Contributor

hntd187 commented Jan 9, 2022

I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide?

@houqp
Member

houqp commented Jan 9, 2022

Thank you @alamb for bringing this up!

I believe the current proposal is to make an official arrow2 branch in datafusion: #68 (comment), which is probably a step towards switching to arrow2

Yes, this aligns with what I have in mind. The official arrow2 branch was proposed so we can close that long-running PR and have a centralized location for folks to collaborate on the migration until we are happy with the master merge. If the community is happy with merging directly into master and iterating there, that would work as well.

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

I believe so. However, we could probably save this work and get it for free with the arrow2 switch.

I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide?

IMHO, the main downside is the switch cost and downstream impact. But I think it's a one-time cost that's worth paying. I think arrow2 at this point should have covered most, if not all, of our needs in datafusion as demonstrated in #68. All unit and integration tests are passing at the moment.

@alamb
Contributor Author

alamb commented Jan 9, 2022

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

I think someone just needs to put in the effort to make the current arrow-rs implementation async -- I don't know of any reason it can't be done, nor do I know of anyone who plans to do so yet.

at this point should have covered most, if not all, of our needs in datafusion as demonstrated in #68. All unit and integration tests are passing at the moment.

If this is really true (I haven't checked) it certainly sounds compelling

I like the idea of making an official "arrow2" branch in DataFusion, getting some more 👀 on it, and then proposing it as a PR to merge into datafusion master.

@houqp can you make a PR? Would you like me to? @yjshen ?

@houqp
Member

houqp commented Jan 9, 2022

@houqp can you make a PR? Would you like me to? @yjshen ?

For sure, I can help create that PR 👍

@hntd187
Contributor

hntd187 commented Jan 9, 2022

I guess, what are the reasons switching would be a bad idea? Like what is the delta between what they both currently provide?

IMHO, the main downside is the switch cost and downstream impact. But I think it's a one-time cost that's worth paying. I think arrow2 at this point should have covered most, if not all, of our needs in datafusion as demonstrated in #68. All unit and integration tests are passing at the moment.

That's good enough for me

@Igosuki
Contributor

Igosuki commented Jan 10, 2022

Btw the latest arrow2 commit that still has RecordBatch is jorgecarleitao/arrow2@ef7937d. It would probably be good to use that as a starting point for a temporary arrow2 fork? That would allow me to integrate the necessary patches for some features such as decimal, without having to switch datafusion away from RecordBatch.

@jorgecarleitao
Member

Thank you for considering using arrow2, very excited about this!

To provide some selling points, the primary goals of the repo have been:

  • be a place to innovate both on Arrow and on Rust
  • be sound and use the least amount of unsafe, pass MIRI checks, have a curated selection of dependencies
  • be panic-free on untrusted input to be usable in the context of web servers
  • be idiomatic via iterators, std::Vec, MutableArray API, Scalar API, easy to follow structs, etc.
  • be performant via simd, trusted len, and fast implementations
  • support sync and async IO with APIs that decouple blocking from non-blocking tasks
  • be interoperable with other formats such as Arrow, Avro, Parquet and ORC, including mandatory integration tests against corresponding reference implementations
  • be modular / easy to compile via feature flags over almost all functionality
  • support WASM
  • be maintainable via macros to reduce code duplication, user guide, examples and general avoidance of unsafe

atm it is the fastest implementation of Apache Parquet IO and Apache Avro IO that I can find (both read and write), both supporting sync and async execution, and implemented in safe Rust (all IO in the crate is unsafe-free).

The crate is under active development, both in volume (~800 commits in a year) and in exploring different ideas (which is a major reason why it is 0.X, to allow space to try things out).

The crate has been adopted by Polars, databend, grafana's SDK for Rust and is interoperable with connectorx.

Releases have been happening about once a month (breaking), and on demand for bug fixes. The next is planned for end of this week.

I hope this offers a general idea of what the crate is and where it is heading.

@tustvold
Contributor

Will arrow-rs eventually support async file IO? Requiring a synchronous ChunkReader is currently a major limitation in supporting alternate ObjectStores

FWIW it would be relatively straightforward to support async IO within the context of arrow-rs. You need buffered fetching in order to get reasonable IO performance anyway, and so you just do an async fetch into a buffer and then use the sync decoders to decode it. I believe this is what arrow2 is doing anyway?? I quickly cobbled something together showing how this can be done with parquet here.
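
To make the pattern concrete, here is a minimal Rust sketch of the "async fetch into a buffer, then decode synchronously" approach described above. It is illustrative only: `fetch_object` is a hypothetical stand-in for an async ObjectStore read, and the `ParquetRecordBatchReaderBuilder` API shown comes from later arrow-rs / parquet releases rather than the version being discussed in this thread.

```rust
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

// Hypothetical stand-in for an async ObjectStore GET that buffers the whole object in memory.
async fn fetch_object(path: &str) -> std::io::Result<Bytes> {
    let data = tokio::fs::read(path).await?;
    Ok(Bytes::from(data))
}

async fn read_parquet(path: &str) -> anyhow::Result<()> {
    // 1. Async fetch into an in-memory buffer (no blocking IO on the executor).
    let bytes = fetch_object(path).await?;
    // 2. Hand the buffer to the synchronous decoder; `Bytes` implements `ChunkReader`.
    //    In practice the CPU-bound decode could be moved to spawn_blocking.
    let reader = ParquetRecordBatchReaderBuilder::try_new(bytes)?.build()?;
    for batch in reader {
        println!("decoded {} rows", batch?.num_rows());
    }
    Ok(())
}
```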

FWIW I have some optimisations to the arrow-rs parquet reader in flight that yield some pretty significant speedups apache/arrow-rs#1054, apache/arrow-rs#1082. And I am planning to work on dictionary preservation next which should yield orders of magnitude speedups for string dictionaries.

I would personally prefer an approach that sees the great work on arrow2 cherry-picked into arrow-rs, with arrow2 serving as an incubator for new ideas. I am happy to help out with this if there are things people would particularly like to see ported across? The current ecosystem fragmentation is just unfortunate for both users and contributors imo...

@alamb
Contributor Author

alamb commented Jan 11, 2022

I am happy to help out with this if there are things people would particularly like to see ported across?

I have heard lots of excitement about async IO (for parquet, csv, avro, etc) and the performance of those readers.

@alamb
Contributor Author

alamb commented Jan 12, 2022

So my summary of this ticket so far is that the next step is to get a PR up to datafusion with the most up-to-date code to get arrow2 working. In parallel, I will plan to start some discussions (hopefully later in the week) on the apache arrow dev list about potential ways to get arrow2 unified with arrow-rs

Looking forward to seeing a PR so we can assess how close/far we are from this goal.

@Igosuki
Contributor

Igosuki commented Jan 13, 2022

My branch got merged into the fork so now we only need to address a few remaining issues that break tests.

@houqp
Member

houqp commented Jan 14, 2022

I would personally prefer an approach that sees the great work on arrow2 cherry-picked into arrow-rs, with arrow2 serving as an incubator for new ideas. I am happy to help out with this if there are things people would particularly like to see ported across?

For me personally, on top of the highly optimized parquet, avro and json io modules, I really like its transmute-free design and the mutable array abstraction. The latter is the main reason why delta-rs is also in the process of migrating to arrow2.

From previous discussions in the arrow dev list, I believe Jorge tried applying his arrow2 learnings back to arrow-rs last year, but decided that it was not worth the effort because it would require basically rewriting the majority of the code base. My main concern with cherry-picking arrow2 designs into arrow-rs is that we would be spending all this effort on making arrow-rs as good as arrow2 when we could have spent the same effort on making arrow2 even better, which would benefit not only datafusion but a much larger community, including other projects that are currently using arrow2.

IMHO, there is value in forking an open-source repo when fundamental design tradeoffs diverge. But from what I have seen so far, both arrow2 and arrow-rs contributors are pretty aligned on the direction of where an ideal arrow rust implementation should go?

The current ecosystem fragmentation is just unfortunate for both users and contributors imo...

I agree 100%. That's why I think it would be good if we can come up with a way to avoid cherry-picking commits from arrow2 into arrow-rs. Perhaps we can have arrow-rs build on top of arrow2 so they still share the majority of the code base? For example, arrow-rs could focus on providing a higher level and stable API for consumers while using arrow2 as the core. That way, from a contributor's point of view, it will be clear where to send patches depending on which layer they work on.

@houqp
Member

houqp commented Jan 14, 2022

As for the datafusion arrow2 branch, the PR is now available for review at #1556. I encourage everyone to:

  • Review the code change to get a feeling of the API UX.
  • Do your own performance tests with the code base and potentially downstream projects to see if the performance story actually holds up or not and whether there is any regression in workloads that haven't been tested.

@houqp added this to the arrow2 milestone Jan 14, 2022
@tustvold
Contributor

tustvold commented Jan 14, 2022

That's why I think it would be good if we can come up with a way to avoid cherry-picking commits from arrow2 into arrow-rs

Sorry, I meant more cherry-picking ideas, not actual implementation. As in, you might copy across arrow2's Buffer implementation, add a conversion to arrow-rs's Buffer implementation, and then migrate the array implementations across one-by-one. Or do something similar for MutableBuffer. Ultimately the in-memory format is the same arrow spec, just getting wrapped up in different ways - the whole point of arrow is that conversion between the two representations should be cheap...

I guess I've just had bad past experiences of simultaneously changing all the things at once 😆. Having looked at the arrow2 parquet implementation, as it is the part of the arrow-rs codebase I'm most familiar with, there is a fair amount of non-trivial functionality loss compared to arrow-rs. Some of it is esoteric, like nested structures, but there are also larger omissions like certain page encodings or batch size control [1] (it appears to read entire row groups into a single RecordBatch??).

This is therefore unlikely to be a strictly additive change, and I'm having a very hard time getting my head around all of its implications. That's all I really care about, that we can communicate something more than "everything may or may not be broken" 😆

[1] FWIW this is the thing that makes reading parquet tricky, as pages don't delimit rows across columns or even semantic records within a column. If you just read row groups, it will be simple and fast, but recommendations are for row groups on the order of 1GB compressed, so the memory footprint of such an approach is unfortunate 😅
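
For reference, the batch-size control being discussed looks roughly like the following with today's arrow-rs parquet reader. This is only a sketch using a later API (`with_batch_size` on the reader builder), shown to illustrate bounding memory per decoded RecordBatch rather than decoding a whole row group at once; it does not reflect the state of either crate at the time of this thread.

```rust
use std::fs::File;

use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;

fn read_in_bounded_batches(path: &str) -> anyhow::Result<()> {
    let file = File::open(path)?;
    // Decode at most 8192 rows per RecordBatch instead of one batch per row group,
    // keeping peak memory roughly proportional to the batch size.
    let reader = ParquetRecordBatchReaderBuilder::try_new(file)?
        .with_batch_size(8192)
        .build()?;
    for batch in reader {
        println!("{} rows", batch?.num_rows());
    }
    Ok(())
}
```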

@jorgecarleitao
Member

jorgecarleitao commented Jan 14, 2022

I would like to thank all of you who have been working on the PR, and also all of those that already gave it a spin. Incredibly humbled and thankful for it.

@tustvold, thanks a lot for raising these concerns, much appreciated and valid.

I agree with you that batch control is useful to avoid a very large memory footprint. I have added it as an issue on arrow2. Let me know if it captures your main point.

wrt the encoders, I have been challenged in finding parquet writers that can write such encodings, so that we have something to integration-test against when implementing them. I would be happy to add them to our CI and prove correctness and completeness of the implementation (and fix what is missing) - the process in arrow2 wrt formats has been that we need at least one official implementation or 2 non-official implementations to confirm correctness of arrow2's implementation.

Since a comparison was made, I think that we could be fair and enumerate the advantages and disadvantages of each. For example, arrow-rs/parquet currently supports deeply nested parquet types, while arrow2 does not. Datafusion does not use them much, but there are important use-cases where they appear. Arrow-rs has a larger user-base by crate downloads and it is an official implementation. Arrow-rs also has pyarrow support out of the box (for those using pyo3), while arrow2 does not.

OTOH, arrow2 implements the complete arrow specification (under the process mentioned above), has async support for all its IO except arrow stream read, all its MIRI tests pass, its IO reading except parquet is panic free, it actively implements the suggestions from the Rust security WG and portable simd WG, its IO is forbid(unsafe_code), it has faster compute kernels, and its internals are simpler to understand and use, leveraging strongly typed data structures.

Now, under the argument that it is the same in-memory format after all and what matters is feature completeness, I could argue that we should then create a thin FFI wrapper for the C++ implementation in Rust, abandon the Rust implementations altogether, and all contribute to the official C++.

Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer, it is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines, web applications leveraging arrow's forte, cool web-based streaming apps leveraging Arrow and Tokio, distributed query engines on AWS lambdas, etc., on a programming paradigm centred around correctness, ease of use, and performance.

The fact that datafusion never passed MIRI checks, and that it has been like this since its inception, shows that these are not simple issues to fix, nor are arrow-rs's internals sufficiently appealing for the community to fix them (at least to the unpaid ones like myself). Furthermore:

  • despite Andrew's amazing effort in validating input data to arrays, not all arrow-rs tests pass MIRI yet (arrow2 passes, including roundtrips on all its IO and FFI components)
  • the parquet crate is not tested under MIRI (both parquet2 and arrow2's IO are #[forbid(unsafe_code)])

With that said, is there a conclusion that the root cause of the high memory usage is not batching parquet column chunks into smaller arrays, or is it a hypothesis that we need to test? Is that the primary concern here, and is it sufficiently important to block adoption? If yes, I would gladly work towards addressing it upstream.

@tustvold
Contributor

I believe Andrew intends to start a separate discussion about how to unify development effort around arrow2 and arrow-rs and this particular discussion is probably better had there. Apologies for derailing this thread, I appreciate that not everyone may share the perspective that they are the same issue.

@alamb
Contributor Author

alamb commented Jan 14, 2022

I have filed apache/arrow-rs#1176 for a discussion on what we should do with arrow / arrow2 if datafusion switched to using arrow2.

FWIW I think the decision to switch datafusion or not should be made independently (based on whatever is best for DataFusion) but the switch I think would have major implications for arrow / arrow2

@andygrove
Member

FWIW I think the decision to switch datafusion or not should be made independently (based on whatever is best for DataFusion) but the switch I think would have major implications for arrow / arrow2

💯

Which brings me to my main point: imo this is not about which implementation has the fastest parquet reader or writer, it is about which code base has the most solid foundations for all of us to develop the next generation of columnar-based query engines

💯

I am not currently actively involved in development with DataFusion, but if I were, I would be offering to help with the transition to arrow2.

Now that a PR is up to move to arrow2 I will at least try and help with some testing and benchmarking. I am really excited to see this happening. 😍

@emkornfield
Contributor

Cross-posting related mailing list discussion: https://lists.apache.org/thread/dsyks2ylbonhs8ngnx6529dzfyfdjjzo

@matthewmturner
Contributor

Given how arrow2 has fine-grained controls over IO features, I'm wondering if it would make sense to pass that through to datafusion so you only have to install the IO features that are needed.

I'm thinking of this in the use case of using datafusion in ETL jobs where each task has its own container and may only need to use one or two file types. This could be a way to help limit container size / speed up installation.
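
A minimal sketch of what that pass-through could look like on the datafusion side, assuming hypothetical `io_parquet` / `io_avro` cargo features that simply forward to the corresponding arrow2 features (the feature names and module layout here are illustrative, not actual datafusion features):

```rust
// Each format-specific module is compiled only when its feature is enabled,
// so an ETL image that never touches Avro never pulls in its dependencies.

#[cfg(feature = "io_parquet")]
pub mod parquet_format {
    // Parquet file-format / TableProvider support would live here.
}

#[cfg(feature = "io_avro")]
pub mod avro_format {
    // Avro file-format support would live here.
}
```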

@Igosuki
Contributor

Igosuki commented Mar 9, 2022

That requires patching file formats with feature flags.
Edit: definitely doable with the current master.

@Alnaimi-

Alnaimi- commented Oct 4, 2022

Is this still planned? Seems like there has been little movement since July (besides #1039 in September)

@tustvold
Contributor

tustvold commented Oct 4, 2022

I believe @v0y4g3r is working on getting the arrow2 branch updated in #2855 as part of #2709

I'm not sure what the long-term plans for this effort are, especially as the implementations continue to diverge in functionality, in both directions. I don't believe a wholesale switch is likely in the foreseeable future, it certainly isn't planned, but there have been some discussions about allowing users to mix and match arrow implementations, including arrow-gpu. Anything is possible so long as people are motivated to achieve it 😄

@alamb
Contributor Author

alamb commented Oct 4, 2022

Yeah -- short answer is that no one has gathered sufficient effort to get the code unified.

@alamb
Contributor Author

alamb commented Oct 4, 2022

I think it is also important to note that many of the ideas from @jorgecarleitao in arrow2 have now been incorporated into arrow-rs

@Alnaimi-

Alnaimi- commented Oct 4, 2022

Thanks both @tustvold @alamb.

I think arrow-rs may be the safe bet for now. Especially since @jorgecarleitao has been too busy recently to spearhead things. Pity, I really liked the direction of the arrow2 APIs.

@tustvold
Contributor

tustvold commented Feb 9, 2023

An "update" on this is I intend to use pola-rs/polars#6735 as an opportunity to explore the possibilities for improved interoperability between arrow and arrow2. I'm fairly optimistic we can make use of the FFI APIs to provide inexpensive, zero-copy conversion between the two libraries, allowing people to mix and match as desired.

@alamb
Contributor Author

alamb commented Mar 1, 2023

The update is that, rather than switching DataFusion to use arrow2, I believe we are likely going to combine arrow-rs and arrow2 -- see the discussion on apache/arrow-rs#1176 (comment)

Let's move the discussion there

@alamb closed this as completed Mar 1, 2023
@Alnaimi-

Alnaimi- commented Mar 1, 2023

Cool. Thanks for the update!
