-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
ARROW-12335: [Rust] [Ballista] Use latest DataFusion #9991
Conversation
Codecov Report
@@ Coverage Diff @@
## master #9991 +/- ##
==========================================
- Coverage 78.69% 78.68% -0.01%
==========================================
Files 285 285
Lines 63900 63909 +9
==========================================
+ Hits 50284 50285 +1
- Misses 13616 13624 +8
Continue to review full report at Codecov.
|
a96fd26
to
ddb51f5
Compare
@alamb @jorgecarleitao This makes Ballista depend on the version of Arrow and DataFusion in the repo and brings the code up to date. I don't have strong feelings about keeping it this way and we can move back to depending on pinned commits if you think that would be better, and we can periodically update Ballista to upgrade to the latest DataFusion. Let me know what you think. |
One thing to add to that, is that given that we're about to release DataFusion 4.0.0, it would be nice if Ballista could depend on that, so maybe it is worth keeping the hard in-repo dependency at least until 4.0.0 is released. |
[profile.release] | ||
lto = true | ||
codegen-units = 1 | ||
#[profile.release] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Dandandan This was an accidental commit. I had to comment this out so that I could build and test without really slow build times. Is there a better way for me to work around this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It would really be convenient for this to have this feature rust-lang/cargo#6988 as I agree if you just want to run it with "reasonable" performance it shouldn't take ages to compile Ballista. In my experience it roughly 2x as long for project to build with lto (maybe worse when you compare it with incremental builds).
It is possible to do via flags as well, but earlier this didn't work because of the structure of the Ballista projects (multiple binaries per crate as far as I remember), maybe we can just temporary remove this (and be a bit slower) and see if we can enable it in a different way.
Another route I saw is just appending lto = true
in a build script when creating binaries.
So the integration tests break if I move to relative paths for dependencies so have reverted to pinned commits for now. |
Java build failed, unrelated to this PR. |
@alamb @jorgecarleitao I will merge this tonight if there are no objections |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, yes, ready to ship. I agree that DataFusion and Ballista versions should be tied. Not so certain about the arrow
and parquet
crates, but let's figure out that later. :)
@andygrove do you want to include this in 4.0? |
@kszucs Yes, that would be great. Thanks! |
Thanks, merging then! |
Updates Ballista to use the most recent DataFusion version. Changes made: - Ballista overrides physical optimizer rules to remove `Repartition` - Added serde support for new `TryCast` expression - Updated DataFrame API usage to use `Vec<_>` instead of `&[_]` - Renamed some timestamp scalar variants - HashJoinExec updated to take new `CollectLeft` argument - Removed hard-coded batch size from serde code for `CsvScanExec` Closes apache#9991 from andygrove/ballista-bump-df-version Authored-by: Andy Grove <[email protected]> Signed-off-by: Krisztián Szűcs <[email protected]>
Updates Ballista to use the most recent DataFusion version. Changes made: - Ballista overrides physical optimizer rules to remove `Repartition` - Added serde support for new `TryCast` expression - Updated DataFrame API usage to use `Vec<_>` instead of `&[_]` - Renamed some timestamp scalar variants - HashJoinExec updated to take new `CollectLeft` argument - Removed hard-coded batch size from serde code for `CsvScanExec` Closes apache#9991 from andygrove/ballista-bump-df-version Authored-by: Andy Grove <[email protected]> Signed-off-by: Krisztián Szűcs <[email protected]>
Updates Ballista to use the most recent DataFusion version. Changes made: - Ballista overrides physical optimizer rules to remove `Repartition` - Added serde support for new `TryCast` expression - Updated DataFrame API usage to use `Vec<_>` instead of `&[_]` - Renamed some timestamp scalar variants - HashJoinExec updated to take new `CollectLeft` argument - Removed hard-coded batch size from serde code for `CsvScanExec` Closes apache#9991 from andygrove/ballista-bump-df-version Authored-by: Andy Grove <[email protected]> Signed-off-by: Krisztián Szűcs <[email protected]>
Updates Ballista to use the most recent DataFusion version.
Changes made:
Repartition
TryCast
expressionVec<_>
instead of&[_]
CollectLeft
argumentCsvScanExec