[FEAT] Change Parquet file splitting logic to split all files #3454
CodSpeed Performance Report: Merging #3454 will not alter performance.
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3454 +/- ##
==========================================
+ Coverage 77.49% 77.51% +0.01%
==========================================
Files 687 687
Lines 84475 84545 +70
==========================================
+ Hits 65464 65531 +67
- Misses 19011 19014 +3
[source] => source,
_ => unreachable!(
    "SplitByRowGroupsAccumulator should only have one DataSource in its ScanTask"
),
This seems like a fairly odd constraint that I inherited from the previous logic. I think we should be able to also split ScanTasks with more than 1 DataSource, but not a priority atm.
self.num_rows += rg.num_rows();
self.row_group_indices.push(*rg_idx);

// Flush the accumulator if necessary
This logic has changed from the previous behavior:
- We use `ScanTask::estimated_size_bytes` when adding to the `self.size_bytes` accumulator, instead of the size of the file on disk. This should give us more accurate sizing than before, since it should (ideally) account for things such as column pruning.
- When `ScanTask::estimated_size_bytes` is not provided, we always flush (i.e. every row group becomes its own ScanTask). Alternatively, we could avoid splitting at all. Not sure what the intended behavior should be.
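The flushing behavior described in this comment can be sketched roughly as follows. This is a simplified illustration, not the code in this PR: `RowGroup`, `Chunk`, `Accumulator`, `split_row_groups`, and the `max_size_bytes` threshold are hypothetical names, and the sketch assumes a per-row-group size estimate is available, whereas the PR derives its sizing from the ScanTask-level estimate.

```rust
/// Hypothetical stand-in for a Parquet row group's metadata.
#[derive(Debug, Clone, Copy)]
struct RowGroup {
    num_rows: usize,
    /// Assumed per-row-group size estimate; `None` when no estimate exists.
    estimated_size_bytes: Option<usize>,
}

/// One output split: the row groups that would form a single ScanTask.
#[derive(Debug, PartialEq)]
struct Chunk {
    row_group_indices: Vec<usize>,
    num_rows: usize,
}

/// Illustrative accumulator that gathers row groups until a size threshold is hit.
#[derive(Debug, Default)]
struct Accumulator {
    size_bytes: usize,
    num_rows: usize,
    row_group_indices: Vec<usize>,
}

impl Accumulator {
    /// Adds one row group and returns a `Chunk` if the accumulator flushed.
    fn accumulate(&mut self, rg_idx: usize, rg: &RowGroup, max_size_bytes: usize) -> Option<Chunk> {
        self.num_rows += rg.num_rows;
        self.row_group_indices.push(rg_idx);

        // Flush the accumulator if necessary.
        let should_flush = match rg.estimated_size_bytes {
            Some(size) => {
                self.size_bytes += size;
                self.size_bytes >= max_size_bytes
            }
            // No estimate available: always flush, so each row group stands alone.
            None => true,
        };
        if should_flush { self.flush() } else { None }
    }

    /// Emits whatever has been accumulated so far and resets the state.
    fn flush(&mut self) -> Option<Chunk> {
        if self.row_group_indices.is_empty() {
            return None;
        }
        self.size_bytes = 0;
        Some(Chunk {
            row_group_indices: std::mem::take(&mut self.row_group_indices),
            num_rows: std::mem::take(&mut self.num_rows),
        })
    }
}

/// Splits a file's row groups into chunks of roughly `max_size_bytes` each.
fn split_row_groups(row_groups: &[RowGroup], max_size_bytes: usize) -> Vec<Chunk> {
    let mut acc = Accumulator::default();
    let mut chunks = Vec::new();
    for (idx, rg) in row_groups.iter().enumerate() {
        if let Some(chunk) = acc.accumulate(idx, rg, max_size_bytes) {
            chunks.push(chunk);
        }
    }
    // Emit any trailing row groups that never reached the threshold.
    chunks.extend(acc.flush());
    chunks
}
```

With estimates present, row groups are grouped until the threshold is reached. With no estimates, the sketch degenerates to one chunk per row group, which is the behavior the comment questions.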
Uses `ScanTask::scantask_estimated_size_bytes` as the metric for splitting instead of the DataSource's size on disk.
Changes in logic highlighted as PR comments.
I think this will actually result in performance regressions in many cases, since planning will take longer whenever we think we need to split the ScanTask, especially for something like a `.show()` of a really large dataset. We should do some benchmarking to see how this affects various workloads.
TODO: