Ensure the row count is preserved when coalescing over empty records #3439
Codecov Report
@@            Coverage Diff             @@
##           master    #3439      +/-   ##
==========================================
+ Coverage   85.61%   85.67%   +0.06%
==========================================
  Files         297      298       +1
  Lines       54490    54647     +157
==========================================
+ Hits        46650    46821     +171
+ Misses       7840     7826      -14
Thanks @isidentical -- this looks great. I left some small suggestions but they are not required -- let me know if you want to address them in this PR or a follow on
Thanks again 🏅
let mut options = RecordBatchOptions::default();
options.row_count = Some(row_count);
As an FYI (not important), you can write this same pattern in what is likely more idiomatic Rust (avoiding mut), like:
- let mut options = RecordBatchOptions::default();
- options.row_count = Some(row_count);
+ let options = RecordBatchOptions {
+     row_count: Some(row_count),
+     ..Default::default()
+ };
This seems to fail since RecordBatchOptions is non-exhaustive and comes directly from arrow-rs. I am not really familiar with this part of Rust's semantics, but there seems to be an existing issue about it.
error[E0639]: cannot create non-exhaustive struct using struct expression
   --> /home/isidentical/projects/arrow-datafusion/datafusion/core/src/physical_plan/coalesce_batches.rs:295:19
    |
295 |       let options = RecordBatchOptions {
    |  ___________________^
296 | |         row_count: Some(row_count),
297 | |         ..Default::default()
298 | |     };
    | |_____^
For more information about this error, try `rustc --explain E0639`.
Is there any way to achieve it? The only other example of RecordBatchOptions construction present in DF is the following, which uses the same pattern as the code in this PR, but I'd be happy to refactor both if there is a more idiomatic way.
https://github.com/apache/arrow-datafusion/blob/9956f80f197550051db7debae15d5c706afc22a3/datafusion/core/src/physical_plan/file_format/mod.rs#L286-L288
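For readers following along, here is a minimal sketch of one way to keep the default-then-assign pattern while confining the `mut` binding to a small helper. The `RecordBatchOptions` struct below is a hypothetical local stand-in, not the arrow-rs type (its field defaults come from `derive(Default)` and may differ from arrow-rs), and `options_with_row_count` is an illustrative name that does not exist in either project. Note that `#[non_exhaustive]` only restricts struct expressions from other crates, so the E0639 error itself cannot be reproduced in a single file.

```rust
// Hypothetical stand-in for arrow-rs's `RecordBatchOptions`. The real struct
// lives in another crate and is marked `#[non_exhaustive]`, which is why a
// struct expression with `..Default::default()` is rejected from DataFusion.
#[derive(Debug, Default, Clone, PartialEq)]
pub struct RecordBatchOptions {
    pub match_field_names: bool,
    pub row_count: Option<usize>,
}

/// Illustrative helper (an assumption, not part of any real API): start from
/// the defaults and set only the field we care about. Field assignment is
/// still allowed on non-exhaustive structs, unlike struct expressions.
fn options_with_row_count(row_count: usize) -> RecordBatchOptions {
    let mut options = RecordBatchOptions::default();
    options.row_count = Some(row_count);
    options
}

fn main() {
    // The caller gets an immutable binding; `mut` stays inside the helper.
    let options = options_with_row_count(3);
    assert_eq!(options.row_count, Some(3));
    println!("{:?}", options);
}
```

The behavior is the same as the two-line version in the PR; the helper only limits where the mutable binding is visible.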
Thanks for trying @isidentical -- I will file a follow on in arrow-rs to make this API more ergonomic
Filed apache/arrow-rs#2728 as a follow on (I also hit the same thing in #3454 FWIW)
FYI apache/arrow-rs#2729 fixed this issue and #3483 includes using the new API.
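As a rough illustration of what a more ergonomic, builder-style API can look like, here is a self-contained sketch. The struct is again a hypothetical stand-in defined locally; the method names `new` and `with_row_count` are modeled on what I believe arrow-rs exposes after the follow-on, so please check the arrow-rs documentation for the exact signatures before relying on them.

```rust
// Hypothetical stand-in type; the names are assumptions modeled on the
// builder-style API discussed above, not copied from arrow-rs.
#[derive(Debug, Clone, PartialEq)]
pub struct RecordBatchOptions {
    pub match_field_names: bool,
    pub row_count: Option<usize>,
}

impl RecordBatchOptions {
    /// Creates options with this sketch's defaults (assumed values).
    pub fn new() -> Self {
        Self {
            match_field_names: true,
            row_count: None,
        }
    }

    /// Consumes `self` and returns it with the row count set, so calls chain
    /// without needing a `mut` binding or a struct expression.
    pub fn with_row_count(mut self, row_count: Option<usize>) -> Self {
        self.row_count = row_count;
        self
    }
}

impl Default for RecordBatchOptions {
    fn default() -> Self {
        Self::new()
    }
}

fn main() {
    let row_count = 5;
    let options = RecordBatchOptions::new().with_row_count(Some(row_count));
    assert_eq!(options.row_count, Some(5));
    println!("{:?}", options);
}
```

Because the builder methods are ordinary functions, they keep working when the struct is `#[non_exhaustive]` and gains fields later, which a struct expression from another crate cannot do.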
Benchmark runs are scheduled for baseline = 69d05aa and contender = 6de0796. 6de0796 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #3283.
Rationale for this change
The row count information is required for the validation that happens during RecordBatch construction. Luckily, we already calculate it and propagate it in the stream handler code, but we don't preserve it when creating the final record.
This PR also changes the SQL test code to never double-optimize logical plans (as it used to), since that would hide this problem (and maybe different ones in the future).
What changes are included in this PR?
This PR now creates the final record with the propagated row_count for cases like #3283, where the schema might not contain any fields at all and the validation logic in arrow-rs would fail.
Are there any user-facing changes?
This is a bug fix in general. It also includes a change in our internal tooling to never double-optimize logical plans; the reasoning is provided in #3283.
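To illustrate why the row count has to be carried explicitly, here is a small self-contained sketch of the kind of validation involved. It is not the arrow-rs implementation: `Batch` and `try_new_batch` are hypothetical names, columns are simplified to `Vec<i64>`, and the error strings are made up. The point is only that with zero columns there is nothing to infer the number of rows from.

```rust
// Hypothetical, simplified model of a record batch: plain integer columns
// plus a row count. Not the arrow-rs type.
#[derive(Debug, PartialEq)]
struct Batch {
    columns: Vec<Vec<i64>>,
    num_rows: usize,
}

/// Sketch of construction-time validation (an assumption about the general
/// shape of the check, not the library's code). The row count comes from the
/// explicit option if given, otherwise from the first column.
fn try_new_batch(columns: Vec<Vec<i64>>, row_count: Option<usize>) -> Result<Batch, String> {
    let num_rows = match (row_count, columns.first()) {
        (Some(n), _) => n,
        (None, Some(first)) => first.len(),
        // No columns and no explicit count: the row count is unknowable.
        (None, None) => {
            return Err("must specify a row count or at least one column".to_string())
        }
    };
    if columns.iter().any(|c| c.len() != num_rows) {
        return Err("all columns must have the same length as the row count".to_string());
    }
    Ok(Batch { columns, num_rows })
}

fn main() {
    // An empty schema without a row count fails validation.
    assert!(try_new_batch(vec![], None).is_err());

    // Passing the propagated row count preserves it for an empty schema.
    let batch = try_new_batch(vec![], Some(4)).unwrap();
    assert_eq!(batch.num_rows, 4);
    assert_eq!(batch.columns.len(), 0);
}
```

In this model, coalescing several zero-column batches and then constructing the result without the summed row count hits the error branch, which matches the failure described for #3283.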