ARROW-12335: [Rust] [Ballista] Use latest DataFusion #9991

andygrove · 2021-04-11T22:23:53Z

Updates Ballista to use the most recent DataFusion version.

Changes made:

- Ballista overrides physical optimizer rules to remove `Repartition`
- Added serde support for new `TryCast` expression
- Updated DataFrame API usage to use `Vec<_>` instead of `&[_]`
- Renamed some timestamp scalar variants
- HashJoinExec updated to take new `CollectLeft` argument
- Removed hard-coded batch size from serde code for `CsvScanExec`
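
For readers unfamiliar with the `Vec<_>` versus `&[_]` change listed above, here is a minimal sketch of what that kind of call-site migration looks like. It deliberately uses stand-in types rather than the real DataFusion API: `Expr`, `col`, `DataFrame`, `select_old`, and `select_new` are hypothetical simplifications for illustration only, and the actual signatures in DataFusion may differ.

```rust
/// Stand-in for an expression type (hypothetical, not DataFusion's `Expr`).
#[derive(Debug, Clone, PartialEq)]
struct Expr(String);

/// Hypothetical helper that builds a column reference.
fn col(name: &str) -> Expr {
    Expr(name.to_string())
}

/// Stand-in for a DataFrame that only records its projection.
#[derive(Debug, Default)]
struct DataFrame {
    projection: Vec<Expr>,
}

impl DataFrame {
    /// Old style: borrows a slice and clones every expression.
    fn select_old(&self, exprs: &[Expr]) -> DataFrame {
        DataFrame { projection: exprs.to_vec() }
    }

    /// New style: takes ownership of the vector, so no clone is needed.
    fn select_new(&self, exprs: Vec<Expr>) -> DataFrame {
        DataFrame { projection: exprs }
    }
}

fn main() {
    let df = DataFrame::default();

    // Before: callers passed a borrowed slice.
    let before = df.select_old(&[col("a"), col("b")]);

    // After: callers pass an owned Vec, typically built with `vec![...]`.
    let after = df.select_new(vec![col("a"), col("b")]);

    // Both forms describe the same projection.
    assert_eq!(before.projection, after.projection);
    println!("{:?}", after.projection);
}
```

In a migration of this shape, most call sites only need `&[...]` replaced with `vec![...]`; the behavior of the query is unchanged.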

github-actions · 2021-04-11T22:24:09Z

https://issues.apache.org/jira/browse/ARROW-12335

codecov-io · 2021-04-11T22:39:17Z

Codecov Report

Merging #9991 (a96fd26) into master (13c334e) will decrease coverage by 0.00%.
The diff coverage is 0.00%.

@@            Coverage Diff             @@
##           master    #9991      +/-   ##
==========================================
- Coverage   78.69%   78.68%   -0.01%     
==========================================
  Files         285      285              
  Lines       63900    63909       +9     
==========================================
+ Hits        50284    50285       +1     
- Misses      13616    13624       +8

Impacted Files	Coverage Δ
rust/ballista/rust/client/src/context.rs	`0.00% <0.00%> (ø)`
rust/ballista/rust/core/src/datasource.rs	`0.00% <ø> (ø)`
...sta/rust/core/src/serde/logical_plan/from_proto.rs	`0.00% <0.00%> (ø)`
...lista/rust/core/src/serde/logical_plan/to_proto.rs	`0.00% <ø> (ø)`
...ta/rust/core/src/serde/physical_plan/from_proto.rs	`0.00% <0.00%> (ø)`
rust/ballista/rust/scheduler/src/lib.rs	`0.00% <0.00%> (ø)`
rust/parquet/src/encodings/encoding.rs	`95.05% <0.00%> (+0.19%)`	⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update f1f4f2b...a96fd26.

andygrove · 2021-04-14T13:56:53Z

@alamb @jorgecarleitao This makes Ballista depend on the version of Arrow and DataFusion in the repo and brings the code up to date.

I don't have strong feelings about keeping it this way and we can move back to depending on pinned commits if you think that would be better, and we can periodically update Ballista to upgrade to the latest DataFusion. Let me know what you think.

andygrove · 2021-04-14T13:57:59Z

One thing to add: given that we're about to release DataFusion 4.0.0, it would be nice if Ballista could depend on that, so maybe it is worth keeping the hard in-repo dependency at least until 4.0.0 is released.

andygrove · 2021-04-14T13:59:06Z

rust/ballista/rust/Cargo.toml

-[profile.release]
-lto = true
-codegen-units = 1
+#[profile.release]


@Dandandan This was an accidental commit. I had to comment this out so that I could build and test without really slow build times. Is there a better way for me to work around this?

Dandandan · 2021-04-14

It would be really convenient to have the feature from rust-lang/cargo#6988 for this. I agree that if you just want to run it with "reasonable" performance, it shouldn't take ages to compile Ballista. In my experience a project takes roughly 2x as long to build with lto (maybe worse when you compare it with incremental builds).

It is possible to do this via flags as well, but earlier this didn't work because of the structure of the Ballista projects (multiple binaries per crate, as far as I remember). Maybe we can just temporarily remove this (and be a bit slower) and see if we can enable it in a different way.
Another route I saw is just appending lto = true in a build script when creating binaries.

andygrove · 2021-04-14T14:22:38Z

The integration tests break if I move to relative paths for dependencies, so I have reverted to pinned commits for now.

andygrove · 2021-04-15T01:41:44Z

Java build failed, unrelated to this PR.

andygrove · 2021-04-15T15:12:36Z

@alamb @jorgecarleitao I will merge this tonight if there are no objections

jorgecarleitao

Sorry, yes, ready to ship. I agree that DataFusion and Ballista versions should be tied. Not so certain about the arrow and parquet crates, but let's figure that out later. :)

kszucs · 2021-04-15T18:48:08Z

@andygrove do you want to include this in 4.0?

andygrove · 2021-04-15T18:54:40Z

@kszucs Yes, that would be great. Thanks!

kszucs · 2021-04-15T18:56:41Z

Thanks, merging then!

Updates Ballista to use the most recent DataFusion version.

Changes made:

- Ballista overrides physical optimizer rules to remove `Repartition`
- Added serde support for new `TryCast` expression
- Updated DataFrame API usage to use `Vec<_>` instead of `&[_]`
- Renamed some timestamp scalar variants
- HashJoinExec updated to take new `CollectLeft` argument
- Removed hard-coded batch size from serde code for `CsvScanExec`

Closes apache#9991 from andygrove/ballista-bump-df-version

Authored-by: Andy Grove <[email protected]>
Signed-off-by: Krisztián Szűcs <[email protected]>

andygrove changed the title ~~ARROW-12335: [Rust] [Ballista] Bump DataFusion version~~ ARROW-12335: [Rust] [Ballista] Bump DataFusion version [WIP] Apr 11, 2021

github-actions bot added Component: Rust - Ballista Component: Rust labels Apr 11, 2021

andygrove added 3 commits April 14, 2021 07:20

Bump DataFusion version

08a2706

save

b1e5ec6

Fix regression

ddb51f5

andygrove force-pushed the ballista-bump-df-version branch from a96fd26 to ddb51f5 Compare April 14, 2021 13:51

andygrove changed the title ~~ARROW-12335: [Rust] [Ballista] Bump DataFusion version [WIP]~~ ARROW-12335: [Rust] [Ballista] Bump DataFusion version Apr 14, 2021

andygrove marked this pull request as ready for review April 14, 2021 13:51

andygrove changed the title ~~ARROW-12335: [Rust] [Ballista] Bump DataFusion version~~ ARROW-12335: [Rust] [Ballista] Use relative path for Arrow dependencies Apr 14, 2021

andygrove commented Apr 14, 2021

revert to pinned commit for deps on Arrow

74dfd8d

andygrove changed the title ~~ARROW-12335: [Rust] [Ballista] Use relative path for Arrow dependencies~~ ARROW-12335: [Rust] [Ballista] Use latest DataFusion Apr 14, 2021

Add ASF header to .dockerignore

d69494d

jorgecarleitao approved these changes Apr 15, 2021

kszucs closed this in 958c19a Apr 15, 2021

asfimport mentioned this pull request Apr 15, 2021

[Rust] [Ballista] Bump DataFusion version #28135

Closed
