RecordBatch with no columns cannot be roundtripped through Parquet #6988

Closed · samgqroberts opened this issue Jan 16, 2025 · 4 comments · Fixed by #6990
Labels: bug, parquet (Changes to the parquet crate)

@samgqroberts

Describe the bug
If we create a RecordBatch with no columns (and no rows), serialize it to Parquet bytes via parquet::arrow::ArrowWriter, and attempt to deserialize it back via parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder, we get the error "Repetition level must be defined for a primitive type".

To Reproduce

use arrow::array::RecordBatch;
use arrow::array::RecordBatchOptions;
use arrow::datatypes::Field;
use arrow::datatypes::Schema;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::basic::ZstdLevel;
use parquet::file::properties::WriterProperties;
use std::sync::Arc;

#[test]
fn arrow_rs_repro() {
    // create empty record batch
    let empty_fields: Vec<Field> = vec![];
    let empty_schema = Arc::new(Schema::new(empty_fields));
    let empty_batch = RecordBatch::try_new_with_options(
        empty_schema,
        vec![],
        &RecordBatchOptions::default().with_row_count(Some(0)),
    )
    .unwrap();

    // write to parquet
    let mut parquet_bytes: Vec<u8> = Vec::new();
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .build();
    let mut writer =
        ArrowWriter::try_new(&mut parquet_bytes, empty_batch.schema(), Some(props)).unwrap();
    writer.write(&empty_batch).unwrap();
    writer.close().unwrap();
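
    // Sanity-check the exact bytes written (parquet-rs 53.3.0).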
    assert_eq!(
        String::from_utf8_lossy(&parquet_bytes),
        "PAR1\u{15}\u{2}\u{19}\u{1c}H\u{c}arrow_schema\u{15}\0\0\u{16}\0\u{19}\u{c}\u{19}\u{1c}\u{18}\u{c}ARROW:schema\u{18}L/////zAAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAAAAAA=\0\u{18}\u{19}parquet-rs version 53.3.0\u{19}\u{c}\0\0\0\0PAR1"
    );

    // read from parquet
    let bytes = Bytes::from(parquet_bytes);
    let result = ParquetRecordBatchReaderBuilder::try_new(bytes);
    // REPRODUCTION: below fails with
    // called `Result::unwrap()` on an `Err` value: General("Repetition level must be defined for a primitive type")
    result.unwrap();
}

Expected behavior
We should be able to reconstruct our original no-column RecordBatch.
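
Concretely, here is a hedged sketch of the read we would expect to succeed, reusing bytes from the reproducer above (whether a zero-row file yields zero batches or a single empty batch is left open):

let builder = ParquetRecordBatchReaderBuilder::try_new(bytes).unwrap();
// The no-column schema should round-trip intact.
assert_eq!(builder.schema().fields().len(), 0);
let reader = builder.build().unwrap();
// Any batch the reader does yield must have zero columns.
for batch in reader {
    assert_eq!(batch.unwrap().num_columns(), 0);
}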

Additional context
Using arrow / parquet crate versions 53.3.0

I tested this behavior in PyArrow v18.1.0 (which is backed by arrow-cpp) and found that:

  • The Parquet bytes produced by PyArrow via pyarrow.parquet.write_table() for a no-column pa.Table can be successfully read by pyarrow.parquet.read_table() as well as the Rust ParquetRecordBatchReaderBuilder approach above.
  • pyarrow.parquet.read_table() can successfully read the bytes produced by Rust (parquet_bytes above).

I did a little debugging and found two differences in the produced file metadata between PyArrow and arrow-rs (a footer-inspection sketch follows the list):

  1. The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes also has a single SchemaElement with num_children: 0, but its repetition_type is unset. This discrepancy leads schema::types::from_thrift_helper to throw the error for the Rust-produced bytes.
  2. The PyArrow file metadata has a single row group with 0s for total_byte_size, num_rows, etc., whereas the Rust file metadata has no row groups. I'm not sure this is important for this particular error.
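
One way to see both differences without tripping the error is to decode the footer's thrift FileMetaData directly, skipping the schema conversion that throws. A minimal sketch, assuming the parquet crate's thrift helpers (parquet::thrift::TCompactSliceInputProtocol and TSerializable) are publicly exposed in the version at hand:

use parquet::format::FileMetaData;
use parquet::thrift::{TCompactSliceInputProtocol, TSerializable};

fn dump_footer(bytes: &[u8]) {
    // Footer layout: [metadata][4-byte little-endian length]["PAR1"].
    let n = bytes.len();
    let meta_len = u32::from_le_bytes(bytes[n - 8..n - 4].try_into().unwrap()) as usize;
    let mut prot = TCompactSliceInputProtocol::new(&bytes[n - 8 - meta_len..n - 8]);
    let meta = FileMetaData::read_from_in_protocol(&mut prot).unwrap();
    // PyArrow writes one empty row group here; arrow-rs writes none.
    println!("row groups: {}", meta.row_groups.len());
    // PyArrow sets repetition_type on the root; arrow-rs leaves it None.
    for el in &meta.schema {
        println!(
            "name={} num_children={:?} repetition_type={:?}",
            el.name, el.num_children, el.repetition_type
        );
    }
}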

This perhaps points to two distinct bugs:

  1. Like PyArrow, ParquetRecordBatchReaderBuilder should be able to read the Rust-produced bytes, forgiving the lack of repetition_type in the SchemaElement at least for the case where there are 0 columns.
  2. Like PyArrow, ArrowWriter should be able to produce a Parquet file that properly specifies that repetition_type.

I'd also very much accept a "Hey you didn't see this config option over here" as a solution for my particular usage!

@etseidl (Contributor) commented Jan 17, 2025

> 1. The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes also has a single SchemaElement with num_children: 0, but its repetition_type is unset. This discrepancy leads schema::types::from_thrift_helper to throw the error for the Rust-produced bytes.

Actually, the lack of repetition_type is per the spec:

> repetition of the field. The root of the schema does not have a repetition_type. All other nodes must have one

The issue seems to be that since num_children is 0, it's assumed to be a leaf node rather than the root of the schema. I think we need to check for this case and return early from from_thrift_helper. I'll take a stab at this if no one beats me to it.
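
A rough sketch of the shape of that check (not the actual patch; root_as_empty_group is a hypothetical standalone stand-in for logic that would live inside from_thrift_helper):

use std::sync::Arc;

use parquet::errors::{ParquetError, Result};
use parquet::format::SchemaElement;
use parquet::schema::types::{Type, TypePtr};

fn root_as_empty_group(elements: &[SchemaElement]) -> Result<Option<TypePtr>> {
    match elements.first() {
        // A childless root: per the spec it carries no repetition_type,
        // so build an empty group rather than treating it as a leaf.
        Some(root) if root.num_children.unwrap_or(0) == 0 => {
            let group = Type::group_type_builder(&root.name).build()?;
            Ok(Some(Arc::new(group)))
        }
        // Non-empty schema: fall through to the normal recursive parse.
        Some(_) => Ok(None),
        None => Err(ParquetError::General("schema has no elements".to_string())),
    }
}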

@etseidl (Contributor) commented Jan 17, 2025

@samgqroberts thanks for the thorough issue description and reproducer! Made tracking this down really quick. I'll have a PR for this soon.

@samgqroberts (Author)

@etseidl Awesome! Thank you very much for the quick responses and work.

@alamb added the parquet (Changes to the parquet crate) label Jan 27, 2025
@alamb (Contributor) commented Jan 27, 2025

label_issue.py automatically added labels {'parquet'} from #6990
