Environment
Delta-rs version: 0.16.4
Binding: python

Bug
What happened:
Compact produces parquet files that are larger than expected: the resulting parquet files have row groups with 1024 rows instead of 8192.
What you expected to happen:
Most row groups in the compacted parquet files should contain 8192 rows.
How to reproduce it:
Call dt.optimize.compact() with max_row_group_size greater than 1024.
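A minimal reproduction sketch, assuming a local table at a hypothetical path and a deltalake release where WriterProperties(max_row_group_size=...) can be passed to optimize.compact(); pyarrow is used only to inspect the resulting row-group sizes:

```python
# Reproduction sketch: the table path is hypothetical, and this assumes a
# deltalake release where WriterProperties(max_row_group_size=...) can be
# passed to optimize.compact(); pyarrow is only used to inspect the output.
import pyarrow.parquet as pq
from deltalake import DeltaTable, WriterProperties

dt = DeltaTable("path/to/table")  # hypothetical local table

# Ask for 8192-row row groups when rewriting files during compaction.
dt.optimize.compact(writer_properties=WriterProperties(max_row_group_size=8192))

# Inspect the compacted files: most row groups come back as 1024 rows, not 8192.
for path in dt.file_uris():
    md = pq.ParquetFile(path).metadata
    print(path, [md.row_group(i).num_rows for i in range(md.num_row_groups)])
```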
More details:
This is caused by calling self.arrow_writer.flush() at the end of each batch in core/src/operations/writer.rs, introduced recently in #2318. This creates a new row group for each batch, even when the batch has fewer rows than max_row_group_size. Since we read batches using ParquetRecordBatchStreamBuilder with its default configuration (batch size 1024), we end up with row groups of at most 1024 rows, even if max_row_group_size is set to a larger value.
I don't think calling flush is necessary since ArrowWriter does that automatically when we reach max_row_group_size rows.
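For reference, a standalone sketch against the arrow-rs parquet crate (not the delta-rs writer itself) illustrating that behaviour: with max_row_group_size set to 8192, feeding 1024-row batches without any explicit flush still yields 8192-row row groups, while re-enabling the commented per-batch flush reproduces the 1024-row groups described above. The output path is hypothetical.

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let props = WriterProperties::builder()
        .set_max_row_group_size(8192)
        .build();
    let file = File::create("/tmp/row_groups.parquet")?; // hypothetical output path
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Write sixteen 1024-row batches, mirroring the default read batch size.
    for start in (0..16_384i64).step_by(1024) {
        let ids = Int64Array::from_iter_values(start..start + 1024);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids)])?;
        writer.write(&batch)?;
        // writer.flush()?; // forcing a flush here closes a 1024-row group per batch
    }

    // Without the per-batch flush, ArrowWriter only closes a row group once
    // max_row_group_size rows have been buffered (or on close()).
    let meta = writer.close()?;
    let sizes: Vec<i64> = meta.row_groups.iter().map(|rg| rg.num_rows).collect();
    println!("row group sizes: {sizes:?}"); // expected: [8192, 8192]
    Ok(())
}
```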
This negatively impacts our use cases by inflating our parquet file sizes, sometimes by up to 4x (40 MB to 160 MB).
# Description
Reverts #2318 by removing `flush` after writing each batch, since it was causing smaller-than-expected row groups to be written during compaction.
# Related Issue(s)
- closes #2386