Describe the bug
If we create a RecordBatch with no columns (and no rows), serialize it to Parquet bytes via parquet::arrow::ArrowWriter, and attempt to deserialize it back via parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder, we get the error "Repetition level must be defined for a primitive type".
To Reproduce
```rust
use arrow::array::RecordBatch;
use arrow::array::RecordBatchOptions;
use arrow::datatypes::Field;
use arrow::datatypes::Schema;
use bytes::Bytes;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::basic::Compression;
use parquet::basic::ZstdLevel;
use parquet::file::properties::WriterProperties;
use std::sync::Arc;

#[test]
fn arrow_rs_repro() {
    // create empty record batch
    let empty_fields: Vec<Field> = vec![];
    let empty_schema = Arc::new(Schema::new(empty_fields));
    let empty_batch = RecordBatch::try_new_with_options(
        empty_schema,
        vec![],
        &RecordBatchOptions::default().with_row_count(Some(0)),
    )
    .unwrap();

    // write to parquet
    let mut parquet_bytes: Vec<u8> = Vec::new();
    let props = WriterProperties::builder()
        .set_compression(Compression::ZSTD(ZstdLevel::default()))
        .build();
    let mut writer =
        ArrowWriter::try_new(&mut parquet_bytes, empty_batch.schema(), Some(props)).unwrap();
    writer.write(&empty_batch).unwrap();
    writer.close().unwrap();
    assert_eq!(
        String::from_utf8_lossy(&parquet_bytes),
        "PAR1\u{15}\u{2}\u{19}\u{1c}H\u{c}arrow_schema\u{15}\0\0\u{16}\0\u{19}\u{c}\u{19}\u{1c}\u{18}\u{c}ARROW:schema\u{18}L/////zAAAAAQAAAAAAAKAAwACgAJAAQACgAAABAAAAAAAQQACAAIAAAABAAIAAAABAAAAAAAAAA=\0\u{18}\u{19}parquet-rs version 53.3.0\u{19}\u{c}\0�\0\0\0PAR1"
    );

    // read from parquet
    let bytes = Bytes::from(parquet_bytes);
    let result = ParquetRecordBatchReaderBuilder::try_new(bytes);

    // REPRODUCTION: below fails with
    // called `Result::unwrap()` on an `Err` value: General("Repetition level must be defined for a primitive type")
    result.unwrap();
}
```
Expected behavior
We should be able to reconstruct our original no-column RecordBatch.
Additional context
Using arrow / parquet crate versions 53.3.0
I tested this behavior in PyArrow v18.1.0 (which is backed by arrow-cpp) and I found that:
1. The Parquet bytes produced by PyArrow via pyarrow.parquet.write_table() for a no-column pa.Table can be successfully read by pyarrow.parquet.read_table() as well as by the Rust ParquetRecordBatchReaderBuilder approach above.
2. pyarrow.parquet.read_table() can successfully read the bytes produced by Rust (parquet_bytes above).
I did a little debugging and found two differences in the produced file metadata between PyArrow and arrow-rs:
1. The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes also has a single SchemaElement with num_children: 0, but its repetition_type is unspecified. This discrepancy leads schema::types::from_thrift_helper to throw the error for the Rust-produced bytes.
2. The PyArrow file metadata has a single row group with 0s for total_byte_size, num_rows, etc., whereas the Rust file metadata has no row groups. I'm not sure this is important for this particular error.
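To make the first difference concrete, here is a minimal, self-contained sketch of the two root SchemaElement shapes I observed. The struct below is a stand-in for the Thrift-generated SchemaElement (only the fields relevant here), and the root names are illustrative assumptions, not taken verbatim from either library's output.

```rust
// Simplified stand-in for the Thrift SchemaElement; NOT the real
// parquet-rs type, just the fields relevant to this issue.
#[derive(Debug)]
struct SchemaElement {
    name: String,
    repetition_type: Option<i32>, // Thrift enum FieldRepetitionType
    num_children: Option<i32>,
}

fn main() {
    // Shape PyArrow writes for a zero-column table: the root element
    // carries repetition_type: 0 as well as num_children: 0.
    let pyarrow_root = SchemaElement {
        name: "schema".to_string(), // illustrative name
        repetition_type: Some(0),
        num_children: Some(0),
    };

    // Shape parquet-rs 53.3.0 writes: same num_children, but
    // repetition_type is left unset.
    let parquet_rs_root = SchemaElement {
        name: "arrow_schema".to_string(), // illustrative name
        repetition_type: None,
        num_children: Some(0),
    };

    // The reader treats num_children == 0 as "leaf", and a leaf
    // without a repetition type triggers the error in question.
    for root in [&pyarrow_root, &parquet_rs_root] {
        let looks_like_leaf = root.num_children.unwrap_or(0) == 0;
        let has_repetition = root.repetition_type.is_some();
        println!(
            "{}: leaf-shaped = {}, repetition set = {}",
            root.name, looks_like_leaf, has_repetition
        );
    }
}
```

Both roots are "leaf-shaped" by the num_children test; only the repetition_type differs, which is exactly where the reader's check trips.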
This perhaps points to two distinct bugs:
1. Like PyArrow, ParquetRecordBatchReaderBuilder should be able to read the Rust-produced bytes, forgiving the lack of repetition_type in the SchemaElement at least for the case where there are 0 columns.
2. Like PyArrow, ArrowWriter should be able to produce a Parquet file that properly specifies that repetition_type.
I'd also very much accept a "Hey you didn't see this config option over here" as a solution for my particular usage!
> The file metadata in the PyArrow-produced Parquet bytes has a single SchemaElement with num_children: 0 and repetition_type: 0. The file metadata in the Rust-produced bytes does have the single SchemaElement, with the num_children: 0, but repetition_type is unspecified. This discrepancy leads to schema::types::from_thrift_helper throwing the error for the Rust-produced bytes.
Actually, the lack of repetition_type is per the spec:

> repetition of the field. The root of the schema does not have a repetition_type. All other nodes must have one
The issue seems to be that since num_children is 0, it's assumed to be a leaf node rather than the root of the schema. I think we need to check for this case and return early from from_thrift_helper. I'll take a stab at this if no one beats me to it.
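To illustrate the direction of the fix, here is a simplified, self-contained sketch of that decision. It is not the real from_thrift_helper code; classify is a hypothetical function modeling only the root-vs-leaf check, under the assumption that the first element of the flattened Thrift schema list is the root.

```rust
// Hypothetical model of the root/leaf decision in
// schema::types::from_thrift_helper (NOT the real parquet-rs code).
// Per the Parquet spec the root carries no repetition_type, so it
// must be accepted even when num_children is 0 (a zero-column file).
fn classify(
    index: usize,
    num_children: i32,
    repetition_type: Option<i32>,
) -> Result<&'static str, String> {
    if index == 0 {
        // Early return for the schema root: no repetition required,
        // even with zero children.
        return Ok("root group");
    }
    if num_children == 0 {
        // A non-root leaf (primitive) must declare its repetition.
        return repetition_type
            .map(|_| "primitive")
            .ok_or_else(|| {
                "Repetition level must be defined for a primitive type".to_string()
            });
    }
    Ok("group")
}

fn main() {
    // parquet-rs's zero-column file: root element, num_children = 0,
    // no repetition_type. With the early root check it is accepted.
    assert_eq!(classify(0, 0, None), Ok("root group"));

    // A genuine primitive column without a repetition still errors.
    assert!(classify(1, 0, None).is_err());

    println!("ok");
}
```

The point of the sketch is only the ordering: check "is this the root?" before the "num_children == 0 means leaf" assumption, rather than after.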