Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ARROW-12335: [Rust] [Ballista] Use latest DataFusion #9991

Closed
wants to merge 5 commits into from

Conversation

andygrove
Copy link
Member

@andygrove andygrove commented Apr 11, 2021

Updates Ballista to use the most recent DataFusion version.

Changes made:

  • Ballista overrides physical optimizer rules to remove Repartition
  • Added serde support for new TryCast expression
  • Updated DataFrame API usage to use Vec<_> instead of &[_]
  • Renamed some timestamp scalar variants
  • HashJoinExec updated to take new CollectLeft argument
  • Removed hard-coded batch size from serde code for CsvScanExec

@andygrove andygrove changed the title ARROW-12335: [Rust] [Ballista] Bump DataFusion version ARROW-12335: [Rust] [Ballista] Bump DataFusion version [WIP] Apr 11, 2021
@github-actions
Copy link

@codecov-io
Copy link

Codecov Report

Merging #9991 (a96fd26) into master (13c334e) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #9991      +/-   ##
==========================================
- Coverage   78.69%   78.68%   -0.01%     
==========================================
  Files         285      285              
  Lines       63900    63909       +9     
==========================================
+ Hits        50284    50285       +1     
- Misses      13616    13624       +8     
Impacted Files Coverage Δ
rust/ballista/rust/client/src/context.rs 0.00% <0.00%> (ø)
rust/ballista/rust/core/src/datasource.rs 0.00% <ø> (ø)
...sta/rust/core/src/serde/logical_plan/from_proto.rs 0.00% <0.00%> (ø)
...lista/rust/core/src/serde/logical_plan/to_proto.rs 0.00% <ø> (ø)
...ta/rust/core/src/serde/physical_plan/from_proto.rs 0.00% <0.00%> (ø)
rust/ballista/rust/scheduler/src/lib.rs 0.00% <0.00%> (ø)
rust/parquet/src/encodings/encoding.rs 95.05% <0.00%> (+0.19%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f1f4f2b...a96fd26. Read the comment docs.

@andygrove andygrove changed the title ARROW-12335: [Rust] [Ballista] Bump DataFusion version [WIP] ARROW-12335: [Rust] [Ballista] Bump DataFusion version Apr 14, 2021
@andygrove andygrove marked this pull request as ready for review April 14, 2021 13:51
@andygrove andygrove changed the title ARROW-12335: [Rust] [Ballista] Bump DataFusion version ARROW-12335: [Rust] [Ballista] Use relative path for Arrow dependencies Apr 14, 2021
@andygrove
Copy link
Member Author

@alamb @jorgecarleitao This makes Ballista depend on the version of Arrow and DataFusion in the repo and brings the code up to date.

I don't have strong feelings about keeping it this way and we can move back to depending on pinned commits if you think that would be better, and we can periodically update Ballista to upgrade to the latest DataFusion. Let me know what you think.

@andygrove
Copy link
Member Author

One thing to add to that, is that given that we're about to release DataFusion 4.0.0, it would be nice if Ballista could depend on that, so maybe it is worth keeping the hard in-repo dependency at least until 4.0.0 is released.

[profile.release]
lto = true
codegen-units = 1
#[profile.release]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@Dandandan This was an accidental commit. I had to comment this out so that I could build and test without really slow build times. Is there a better way for me to work around this?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would really be convenient for this to have this feature rust-lang/cargo#6988 as I agree if you just want to run it with "reasonable" performance it shouldn't take ages to compile Ballista. In my experience it roughly 2x as long for project to build with lto (maybe worse when you compare it with incremental builds).

It is possible to do via flags as well, but earlier this didn't work because of the structure of the Ballista projects (multiple binaries per crate as far as I remember), maybe we can just temporary remove this (and be a bit slower) and see if we can enable it in a different way.
Another route I saw is just appending lto = true in a build script when creating binaries.

@andygrove
Copy link
Member Author

So the integration tests break if I move to relative paths for dependencies so have reverted to pinned commits for now.

@andygrove andygrove changed the title ARROW-12335: [Rust] [Ballista] Use relative path for Arrow dependencies ARROW-12335: [Rust] [Ballista] Use latest DataFusion Apr 14, 2021
@andygrove
Copy link
Member Author

Java build failed, unrelated to this PR.

@andygrove
Copy link
Member Author

@alamb @jorgecarleitao I will merge this tonight if there are no objections

Copy link
Member

@jorgecarleitao jorgecarleitao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry, yes, ready to ship. I agree that DataFusion and Ballista versions should be tied. Not so certain about the arrow and parquet crates, but let's figure out that later. :)

@kszucs
Copy link
Member

kszucs commented Apr 15, 2021

@andygrove do you want to include this in 4.0?

@andygrove
Copy link
Member Author

@kszucs Yes, that would be great. Thanks!

@kszucs
Copy link
Member

kszucs commented Apr 15, 2021

Thanks, merging then!

@kszucs kszucs closed this in 958c19a Apr 15, 2021
GeorgeAp pushed a commit to sirensolutions/arrow that referenced this pull request Jun 7, 2021
Updates Ballista to use the most recent DataFusion version.

Changes made:

- Ballista overrides physical optimizer rules to remove `Repartition`
- Added serde support for new `TryCast` expression
- Updated DataFrame API usage to use `Vec<_>` instead of `&[_]`
- Renamed some timestamp scalar variants
- HashJoinExec updated to take new `CollectLeft` argument
- Removed hard-coded batch size from serde code for `CsvScanExec`

Closes apache#9991 from andygrove/ballista-bump-df-version

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 10, 2021
Updates Ballista to use the most recent DataFusion version.

Changes made:

- Ballista overrides physical optimizer rules to remove `Repartition`
- Added serde support for new `TryCast` expression
- Updated DataFrame API usage to use `Vec<_>` instead of `&[_]`
- Renamed some timestamp scalar variants
- HashJoinExec updated to take new `CollectLeft` argument
- Removed hard-coded batch size from serde code for `CsvScanExec`

Closes apache#9991 from andygrove/ballista-bump-df-version

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
michalursa pushed a commit to michalursa/arrow that referenced this pull request Jun 13, 2021
Updates Ballista to use the most recent DataFusion version.

Changes made:

- Ballista overrides physical optimizer rules to remove `Repartition`
- Added serde support for new `TryCast` expression
- Updated DataFrame API usage to use `Vec<_>` instead of `&[_]`
- Renamed some timestamp scalar variants
- HashJoinExec updated to take new `CollectLeft` argument
- Removed hard-coded batch size from serde code for `CsvScanExec`

Closes apache#9991 from andygrove/ballista-bump-df-version

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants