Environment
Delta-rs version: 0.16.4
Binding: python

Bug
What happened:
Compact produces parquet files that are larger than expected: the resulting parquet files have row groups with 1024 rows instead of 8192.
What you expected to happen:
Most row groups in the compacted parquet files should contain 8192 rows.
How to reproduce it:
Call dt.optimize.compact() with max_row_group_size greater than 1024.
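A minimal reproduction sketch, assuming a local table at a hypothetical path and a deltalake release where WriterProperties(max_row_group_size=...) can be passed to optimize.compact(); pyarrow is used only to inspect the resulting row-group sizes:

```python
# Reproduction sketch: the table path is hypothetical, and this assumes a
# deltalake release where WriterProperties(max_row_group_size=...) can be
# passed to optimize.compact(); pyarrow is only used to inspect the output.
import pyarrow.parquet as pq
from deltalake import DeltaTable, WriterProperties

dt = DeltaTable("path/to/table")  # hypothetical local table

# Ask for 8192-row row groups when rewriting files during compaction.
dt.optimize.compact(writer_properties=WriterProperties(max_row_group_size=8192))

# Inspect the compacted files: most row groups come back as 1024 rows, not 8192.
for path in dt.file_uris():
    md = pq.ParquetFile(path).metadata
    print(path, [md.row_group(i).num_rows for i in range(md.num_row_groups)])
```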
More details:
This is caused by calling self.arrow_writer.flush() at the end of each batch in core/src/operations/writer.rs, introduced recently in #2318. This creates a new row group for each batch, even when the batch has fewer rows than max_row_group_size. Since we read batches using ParquetRecordBatchStreamBuilder with its default configuration (batch size 1024), we end up with row groups of at most 1024 rows, even if max_row_group_size is set to a larger value.
I don't think calling flush is necessary since ArrowWriter does that automatically when we reach max_row_group_size rows.
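For reference, a standalone sketch against the arrow-rs parquet crate (not the delta-rs writer itself) illustrating that behaviour: with max_row_group_size set to 8192, feeding 1024-row batches without any explicit flush still yields 8192-row row groups, while re-enabling the commented per-batch flush reproduces the 1024-row groups described above. The output path is hypothetical.

```rust
use std::{fs::File, sync::Arc};

use arrow::array::Int64Array;
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let schema = Arc::new(Schema::new(vec![Field::new("id", DataType::Int64, false)]));
    let props = WriterProperties::builder()
        .set_max_row_group_size(8192)
        .build();
    let file = File::create("/tmp/row_groups.parquet")?; // hypothetical output path
    let mut writer = ArrowWriter::try_new(file, schema.clone(), Some(props))?;

    // Write sixteen 1024-row batches, mirroring the default read batch size.
    for start in (0..16_384i64).step_by(1024) {
        let ids = Int64Array::from_iter_values(start..start + 1024);
        let batch = RecordBatch::try_new(schema.clone(), vec![Arc::new(ids)])?;
        writer.write(&batch)?;
        // writer.flush()?; // forcing a flush here closes a 1024-row group per batch
    }

    // Without the per-batch flush, ArrowWriter only closes a row group once
    // max_row_group_size rows have been buffered (or on close()).
    let meta = writer.close()?;
    let sizes: Vec<i64> = meta.row_groups.iter().map(|rg| rg.num_rows).collect();
    println!("row group sizes: {sizes:?}"); // expected: [8192, 8192]
    Ok(())
}
```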
This negatively impacts our use cases by inflating our parquet file sizes, sometimes by up to 4x (40 MB to 160 MB).
# Description
Reverts #2318 by removing `flush` after writing each batch, since it was causing smaller-than-expected row groups to be written during compaction.
# Related Issue(s)
- closes #2386