RecordBatch with no columns cannot be roundtripped through Parquet #6988

Closed · samgqroberts opened this issue Jan 16, 2025 · 4 comments · Fixed by #6990
Labels: bug, parquet (Changes to the parquet crate)

@samgqroberts

Describe the bug
If we create a RecordBatch with no columns (and no rows), serialize it to Parquet bytes via parquet::arrow::ArrowWriter, and attempt to deserialize it back via parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder, we get the error "Repetition level must be defined for a primitive type".

To Reproduce

use arrow::array::RecordBatch;
use arrow::array::RecordBatchOptions;
use arrow::datatypes::Field;
use arrow::datatypes::Schema;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::basic::ZstdLevel;
use parquet::file::properties::WriterProperties;
use std::sync::Arc;

#[test]
fn arrow_rs_repro() {
    // create empty record batch
    let empty_fields: Vec<Field> = vec![];
    let empty_schema = Arc::new(Schema::new(empty_fields));
    let empty_batch = RecordBatch::try_new_with_options(
        empty_schema,
        vec![],
        &RecordBatchOptions::default().with_row_count(Some(0)),
    )
    .unwrap();

    // write to parquet
    let mut parquet_bytes: Vec<u8> = Vec::new();
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .build();
    let mut writer =
        ArrowWriter::try_new(&mut parquet_bytes, empty_batch.schema(), Some(props)).unwrap();
    writer.write(&empty_batch).unwrap();
    writer.close().unwrap();
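
    // Sanity-check the exact bytes written (parquet-rs 53.3.0).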
    assert_eq!(
        String::from_utf8_lossy(&parquet_bytes),
        "PAR1\u{15}\u{2}\u{19}\u{1c}H\u{c}arrow_schema\u{15}\0\0\u{16}\0\u{19}\u{c}\u{19}\u{1c}\u{18}\u{c}ARROW:schema\u{18}L/////zAAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAAAAAA=\0\u{18}\u{19}parquet-rs version 53.3.0\u{19}\u{c}\0\0\0\0PAR1"
    );

    // read from parquet
    let bytes = Bytes::from(parquet_bytes);
    let result = ParquetRecordBatchReaderBuilder::try_new(bytes);
    // REPRODUCTION: below fails with
    // called `Result::unwrap()` on an `Err` value: General("Repetition level must be defined for a primitive type")
    result.unwrap();
}

Expected behavior
We should be able to reconstruct our original no-column RecordBatch.
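
Concretely, here is a hedged sketch of the read we would expect to succeed, reusing bytes from the reproducer above (whether a zero-row file yields zero batches or a single empty batch is left open):

let builder = ParquetRecordBatchReaderBuilder::try_new(bytes).unwrap();
// The no-column schema should round-trip intact.
assert_eq!(builder.schema().fields().len(), 0);
let reader = builder.build().unwrap();
// Any batch the reader does yield must have zero columns.
for batch in reader {
    assert_eq!(batch.unwrap().num_columns(), 0);
}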

Additional context
Using arrow / parquet crate versions 53.3.0

I tested this behavior in PyArrow v18.1.0 (which is backed by arrow-cpp) and found that:

  • The Parquet bytes produced by PyArrow via pyarrow.parquet.write_table() for a no-column pa.Table can be successfully read by pyarrow.parquet.read_table() as well as the Rust ParquetRecordBatchReaderBuilder approach above.
  • pyarrow.parquet.read_table() can successfully read the bytes produced by Rust (parquet_bytes above).

I did a little debugging and found two differences in the produced file metadata between PyArrow and arrow-rs (a footer-inspection sketch follows the list):

  1. The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes also has a single SchemaElement with num_children: 0, but its repetition_type is unset. This discrepancy leads schema::types::from_thrift_helper to throw the error for the Rust-produced bytes.
  2. The PyArrow file metadata has a single row group with 0s for total_byte_size, num_rows, etc., whereas the Rust file metadata has no row groups. I'm not sure this is important for this particular error.
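
One way to see both differences without tripping the error is to decode the footer's thrift FileMetaData directly, skipping the schema conversion that throws. A minimal sketch, assuming the parquet crate's thrift helpers (parquet::thrift::TCompactSliceInputProtocol and TSerializable) are publicly exposed in the version at hand:

use parquet::format::FileMetaData;
use parquet::thrift::{TCompactSliceInputProtocol, TSerializable};

fn dump_footer(bytes: &[u8]) {
    // Footer layout: [metadata][4-byte little-endian length]["PAR1"].
    let n = bytes.len();
    let meta_len = u32::from_le_bytes(bytes[n - 8..n - 4].try_into().unwrap()) as usize;
    let mut prot = TCompactSliceInputProtocol::new(&bytes[n - 8 - meta_len..n - 8]);
    let meta = FileMetaData::read_from_in_protocol(&mut prot).unwrap();
    // PyArrow writes one empty row group here; arrow-rs writes none.
    println!("row groups: {}", meta.row_groups.len());
    // PyArrow sets repetition_type on the root; arrow-rs leaves it None.
    for el in &meta.schema {
        println!(
            "name={} num_children={:?} repetition_type={:?}",
            el.name, el.num_children, el.repetition_type
        );
    }
}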

This perhaps points to two distinct bugs:

  1. Like PyArrow, ParquetRecordBatchReaderBuilder should be able to read the Rust-produced bytes, forgiving the lack of repetition_type in the SchemaElement at least for the case where there are 0 columns.
  2. Like PyArrow, ArrowWriter should be able to produce a Parquet file that properly specifies that repetition_type.

I'd also very much accept a "Hey you didn't see this config option over here" as a solution for my particular usage!

@etseidl (Contributor) commented Jan 17, 2025

> 1. The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes also has a single SchemaElement with num_children: 0, but its repetition_type is unset. This discrepancy leads schema::types::from_thrift_helper to throw the error for the Rust-produced bytes.

Actually, the lack of repetition_type is per the spec:

> repetition of the field. The root of the schema does not have a repetition_type. All other nodes must have one

The issue seems to be that since num_children is 0, it's assumed to be a leaf node rather than the root of the schema. I think we need to check for this case and return early from from_thrift_helper. I'll take a stab at this if no one beats me to it.
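
A rough sketch of the shape of that check (not the actual patch; root_as_empty_group is a hypothetical standalone stand-in for logic that would live inside from_thrift_helper):

use std::sync::Arc;

use parquet::errors::{ParquetError, Result};
use parquet::format::SchemaElement;
use parquet::schema::types::{Type, TypePtr};

fn root_as_empty_group(elements: &[SchemaElement]) -> Result<Option<TypePtr>> {
    match elements.first() {
        // A childless root: per the spec it carries no repetition_type,
        // so build an empty group rather than treating it as a leaf.
        Some(root) if root.num_children.unwrap_or(0) == 0 => {
            let group = Type::group_type_builder(&root.name).build()?;
            Ok(Some(Arc::new(group)))
        }
        // Non-empty schema: fall through to the normal recursive parse.
        Some(_) => Ok(None),
        None => Err(ParquetError::General("schema has no elements".to_string())),
    }
}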

@etseidl (Contributor) commented Jan 17, 2025

@samgqroberts thanks for the thorough issue description and reproducer! Made tracking this down really quick. I'll have a PR for this soon.

@samgqroberts (Author)

@etseidl Awesome! Thank you very much for the quick responses and work.

@alamb added the parquet (Changes to the parquet crate) label Jan 27, 2025
@alamb (Contributor) commented Jan 27, 2025

label_issue.py automatically added labels {'parquet'} from #6990
