[FEAT] Change Parquet file splitting logic to split all files #3454

jaychia · 2024-11-29T23:42:23Z

- Performs splitting of Parquet files on all files instead of just the first N
- Uses ScanTask::scantask_estimated_size_bytes as the metric for splitting instead of the DataSource's size on disk
- Refactors the code to use an accumulator struct instead

Changes in logic highlighted as PR comments.

I think this will actually result in performance regressions in many cases, as planning will take longer for cases where we think we need to split the ScanTask. Especially for something like a .show() of a really large dataset. Should do some benchmarking to see how this affects various workloads.

TODO:

Perform bulk downloads of Parquet metadata instead of one-at-a-time

codspeed-hq · 2024-11-29T23:51:01Z

CodSpeed Performance Report

Merging #3454 will not alter performance

Comparing jay/split-all-files (65d6e2b) with main (a16a045)

Summary

✅ 17 untouched benchmarks

codecov · 2024-11-30T00:04:27Z

Codecov Report

Attention: Patch coverage is 96.22642% with 6 lines in your changes missing coverage. Please review.

Project coverage is 77.51%. Comparing base (794a4fd) to head (65d6e2b).
Report is 4 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| src/daft-scan/src/scan_task_iters.rs | 96.22% | 6 Missing ⚠️ |

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3454      +/-   ##
==========================================
+ Coverage   77.49%   77.51%   +0.01%     
==========================================
  Files         687      687              
  Lines       84475    84545      +70     
==========================================
+ Hits        65464    65531      +67     
- Misses      19011    19014       +3

| Files with missing lines | Coverage Δ |
| --- | --- |
| src/daft-scan/src/scan_task_iters.rs | `96.20% <96.22%> (-0.73%)` ⬇️ |

... and 5 files with indirect coverage changes

jaychia · 2024-11-30T02:02:16Z

src/daft-scan/src/scan_task_iters.rs

+            [source] => source,
+            _ => unreachable!(
+                "SplitByRowGroupsAccumulator should only have one DataSource in its ScanTask"
+            ),


This seems like a fairly odd constraint that I inherited from the previous logic. I think we should be able to also split ScanTasks with more than 1 DataSource, but not a priority atm.
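For reference, a minimal sketch of how the single-source constraint could be expressed without panicking. The `ScanTask` struct and `single_source` helper here are hypothetical stand-ins, not the actual types in `scan_task_iters.rs`; the idea is only that a slice pattern can return `None` for anything other than exactly one source, so a caller could pass such ScanTasks through unsplit.

```rust
/// Hypothetical stand-in for a ScanTask; the real type holds DataSource values.
struct ScanTask {
    sources: Vec<String>,
}

/// Returns the only source if the task has exactly one, otherwise `None`.
/// A caller could treat `None` as "leave this ScanTask unsplit".
fn single_source(task: &ScanTask) -> Option<&str> {
    match task.sources.as_slice() {
        // Slice pattern: matches only a one-element slice.
        [source] => Some(source.as_str()),
        // Zero sources or more than one source.
        _ => None,
    }
}
```

Whether falling back to "no split" is preferable to `unreachable!` depends on whether upstream code really guarantees a single DataSource at this point.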

jaychia · 2024-11-30T02:06:37Z

src/daft-scan/src/scan_task_iters.rs

+        self.num_rows += rg.num_rows();
+        self.row_group_indices.push(*rg_idx);
+
+        // Flush the accumulator if necessary


This logic differs from the previous behavior:

1. We use ScanTask::estimated_size_bytes when adding to the self.size_bytes accumulator, instead of the size of the file on disk. This should give us a "more accurate" sizing than before, since it should (ideally) account for things such as column pruning.

2. When ScanTask::estimated_size_bytes is not provided, we always flush (i.e. every rowgroup becomes its own ScanTask). Alternatively, we could avoid splitting at all. Not sure what the intended behavior should be there.
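To make the flush rule concrete, here is a rough, self-contained sketch of the accumulate-then-flush pattern described above. Everything in it is an assumption made for illustration: `RowGroup`, `Split`, `Accumulator`, `split_row_groups`, the per-row-group `estimated_size_bytes` field, and the `min_size_bytes` threshold are hypothetical names and do not mirror the real structs in this PR.

```rust
/// Hypothetical, simplified row-group metadata (not Daft's real type).
#[derive(Debug, Clone, Copy)]
struct RowGroup {
    num_rows: usize,
    /// `None` models the case where no size estimate is available.
    estimated_size_bytes: Option<usize>,
}

/// One output split: the row groups that would go into a single ScanTask.
#[derive(Debug, Default, PartialEq)]
struct Split {
    num_rows: usize,
    row_group_indices: Vec<usize>,
}

/// Hypothetical accumulator holding the split currently being built.
#[derive(Debug, Default)]
struct Accumulator {
    size_bytes: usize,
    current: Split,
}

impl Accumulator {
    /// Adds one row group and returns a finished split if a flush happened.
    fn accumulate(&mut self, rg_idx: usize, rg: &RowGroup, min_size_bytes: usize) -> Option<Split> {
        self.current.num_rows += rg.num_rows;
        self.current.row_group_indices.push(rg_idx);

        // Flush the accumulator if necessary.
        let should_flush = match rg.estimated_size_bytes {
            Some(size) => {
                self.size_bytes += size;
                self.size_bytes >= min_size_bytes
            }
            // No estimate: always flush, so this row group ends the split.
            None => true,
        };
        if should_flush { self.flush() } else { None }
    }

    /// Resets the accumulator and returns the pending split, if non-empty.
    fn flush(&mut self) -> Option<Split> {
        self.size_bytes = 0;
        if self.current.row_group_indices.is_empty() {
            None
        } else {
            Some(std::mem::take(&mut self.current))
        }
    }
}

/// Splits a file's row groups into chunks of at least `min_size_bytes` each.
fn split_row_groups(row_groups: &[RowGroup], min_size_bytes: usize) -> Vec<Split> {
    let mut acc = Accumulator::default();
    let mut splits = Vec::new();
    for (idx, rg) in row_groups.iter().enumerate() {
        splits.extend(acc.accumulate(idx, rg, min_size_bytes));
    }
    // Emit whatever is left over after the last row group.
    splits.extend(acc.flush());
    splits
}
```

In this sketch, three row groups estimated at 60 bytes each with a 100-byte threshold produce two splits (indices 0 and 1, then 2), while row groups with no estimate each become their own split. Switching to the "avoid splitting at all" alternative would only require changing the `None` arm to `false`.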

Perform split on all files

f706cff

github-actions bot added the enhancement New feature or request label Nov 29, 2024

Jay Chia added 5 commits November 29, 2024 17:00

Refactor into accumulator struct

aa699c7

Rename

c7b7cf4

Further simplification of accumulator logic

a36fffb

Cleanup into separate accumulator and accumulator context

d17e91d

Account for potentially null TableMetadata

65d6e2b


jaychia requested review from samster25 and kevinzwang November 30, 2024 02:09
