fix: do not write empty parquet file/add on writer close; accurately … #2123

Merged
merged 4 commits into from
Jan 27, 2024

alexwilcoxson-rel
Contributor

@alexwilcoxson-rel alexwilcoxson-rel commented Jan 26, 2024

…track unflushed row group size

Description

When a batch written with the writer triggers a flush and no further batches follow, closing the writer writes an empty parquet file (no rows). The log then also gets an Add for that file, which looks like:

    Add {
        path: "part-00002-3da49db6-e5e9-4426-8839-0092a56cc155-c000.parquet",
        partition_values: {},
        size: 346,
        modification_time: 1706297596165,
        data_change: true,
        stats: Some(
            "{\"numRecords\":0,\"minValues\":{},\"maxValues\":{},\"nullCount\":{}}",
        ),
        tags: None,
        deletion_vector: None,
        base_row_id: None,
        default_row_commit_version: None,
        clustering_provider: None,
        stats_parsed: None,
    },

The empty stats structs cause issues with scans in DataFusion.
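
To make the failure mode concrete, here is a minimal sketch of the guard, assuming a toy writer that counts rows instead of holding Arrow batches. `ToyWriter`, `FileEntry`, and `flush_threshold_rows` are hypothetical names for illustration only; they are not the delta-rs types or the code changed in this PR.

```rust
/// Hypothetical stand-in for a finished data file; not the delta-rs `Add` action.
#[derive(Debug, PartialEq)]
pub struct FileEntry {
    pub path: String,
    pub num_records: usize,
}

/// Toy writer that tracks buffered row counts rather than real Arrow batches.
pub struct ToyWriter {
    buffered_rows: usize,
    flush_threshold_rows: usize,
    files: Vec<FileEntry>,
}

impl ToyWriter {
    pub fn new(flush_threshold_rows: usize) -> Self {
        Self {
            buffered_rows: 0,
            flush_threshold_rows,
            files: Vec::new(),
        }
    }

    /// Buffer a batch and flush once the threshold is reached.
    pub fn write(&mut self, rows_in_batch: usize) {
        self.buffered_rows += rows_in_batch;
        if self.buffered_rows >= self.flush_threshold_rows {
            self.flush();
        }
    }

    /// Emit a file entry only when rows are actually buffered.
    /// Without this early return, a close right after a flush
    /// would record a file with zero records.
    fn flush(&mut self) {
        if self.buffered_rows == 0 {
            return;
        }
        let path = format!("part-{:05}.parquet", self.files.len());
        self.files.push(FileEntry {
            path,
            num_records: self.buffered_rows,
        });
        self.buffered_rows = 0;
    }

    /// Flush whatever remains and return the files that were written.
    pub fn close(mut self) -> Vec<FileEntry> {
        self.flush();
        self.files
    }
}
```

In this sketch, writing one batch of 10 rows to `ToyWriter::new(10)` and then calling `close()` yields a single entry with 10 records, rather than that entry plus a second one with zero records.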

Also changed the writer's tracking of its internal buffers so that it includes the data the parquet writer is buffering within its row group writer.
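
The buffer-tracking change can be sketched the same way, assuming a simplified model in which a parquet writer holds bytes in two places: completed row groups and the row group still in progress. `ToyParquetWriter`, `row_group_bytes`, `flushed_len`, and `buffer_len` are hypothetical names, and byte counts stand in for real encoded data.

```rust
/// Hypothetical model of a parquet writer's two internal buffers.
pub struct ToyParquetWriter {
    /// Bytes of completed row groups already encoded into the output buffer.
    flushed_bytes: usize,
    /// Bytes still held by the in-progress row group writer.
    in_progress_bytes: usize,
    /// Size at which the in-progress row group is closed (hypothetical knob).
    row_group_bytes: usize,
}

impl ToyParquetWriter {
    pub fn new(row_group_bytes: usize) -> Self {
        Self {
            flushed_bytes: 0,
            in_progress_bytes: 0,
            row_group_bytes,
        }
    }

    /// Add bytes to the in-progress row group; close it once it is full.
    pub fn write(&mut self, bytes: usize) {
        self.in_progress_bytes += bytes;
        if self.in_progress_bytes >= self.row_group_bytes {
            self.flushed_bytes += self.in_progress_bytes;
            self.in_progress_bytes = 0;
        }
    }

    /// Undercounts: ignores data held by the in-progress row group.
    pub fn flushed_len(&self) -> usize {
        self.flushed_bytes
    }

    /// Counts both the completed row groups and the in-progress one.
    pub fn buffer_len(&self) -> usize {
        self.flushed_bytes + self.in_progress_bytes
    }
}
```

With a 100-byte row group, writing 60 bytes leaves `flushed_len()` at 0 while `buffer_len()` reports 60, which is the gap a size-based flush decision would otherwise miss.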

Collaborator

@roeap roeap left a comment

LGTM 👍 - one more merge from main and good to go!

@roeap roeap enabled auto-merge (squash) January 27, 2024 00:03
@roeap roeap merged commit 7fbc02b into delta-io:main Jan 27, 2024
20 checks passed
RobinLin666 pushed a commit to RobinLin666/delta-rs that referenced this pull request Feb 2, 2024
delta-io#2123)

@alexwilcoxson-rel alexwilcoxson-rel deleted the empty-file-fix branch May 9, 2024 15:32
Labels
binding/rust Issues for the Rust crate crate/core