Error writing STRUCT to parquet in parallel: internal error: entered unreachable code: cannot downcast Int64 to byte array #8853
Similar to #8851, the parallelized parquet writer code is to blame here. There is something wrong with how that code is handling nested types.
I tried to make a minimal reproducer with just arrow-rs, but it appears to work fine. It must then be that there is an issue with the tokio implementation of this logic in DataFusion.

```rust
use std::sync::Arc;

use arrow_array::*;
use arrow_schema::*;
use parquet::arrow::arrow_to_parquet_schema;
use parquet::arrow::arrow_writer::{compute_leaves, get_column_writers, ArrowLeafColumn};
use parquet::file::properties::WriterProperties;
use parquet::file::writer::SerializedFileWriter;

fn main() {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "struct",
        DataType::Struct(
            vec![
                Field::new("b", DataType::Boolean, false),
                Field::new("c", DataType::Int32, false),
            ]
            .into(),
        ),
        false,
    )]));

    // Compute the parquet schema
    let parquet_schema = arrow_to_parquet_schema(schema.as_ref()).unwrap();
    let props = Arc::new(WriterProperties::default());

    // Create writers for each of the leaf columns
    let col_writers = get_column_writers(&parquet_schema, &props, &schema).unwrap();

    // Spawn a worker thread for each column.
    // This is for demonstration purposes; a thread pool (e.g. rayon or tokio) would be better.
    let mut workers: Vec<_> = col_writers
        .into_iter()
        .map(|mut col_writer| {
            let (send, recv) = std::sync::mpsc::channel::<ArrowLeafColumn>();
            let handle = std::thread::spawn(move || {
                for col in recv {
                    col_writer.write(&col)?;
                }
                col_writer.close()
            });
            (handle, send)
        })
        .collect();

    // Create parquet writer
    let root_schema = parquet_schema.root_schema_ptr();
    let mut out = Vec::with_capacity(1024); // This could be a File
    let mut writer = SerializedFileWriter::new(&mut out, root_schema, props.clone()).unwrap();

    // Start row group
    let mut row_group = writer.next_row_group().unwrap();

    let boolean = Arc::new(BooleanArray::from(vec![false, false, true, true]));
    let int = Arc::new(Int32Array::from(vec![42, 28, 19, 31]));

    // Columns to encode
    let to_write = vec![Arc::new(StructArray::from(vec![
        (
            Arc::new(Field::new("b", DataType::Boolean, false)),
            boolean.clone() as ArrayRef,
        ),
        (
            Arc::new(Field::new("c", DataType::Int32, false)),
            int.clone() as ArrayRef,
        ),
    ])) as _];

    // Spawn work to encode columns
    let mut worker_iter = workers.iter_mut();
    for (arr, field) in to_write.iter().zip(&schema.fields) {
        for leaves in compute_leaves(field, arr).unwrap() {
            worker_iter.next().unwrap().1.send(leaves).unwrap();
        }
    }

    // Finish up parallel column encoding
    for (handle, send) in workers {
        drop(send); // Drop send side to signal termination
        let chunk = handle.join().unwrap().unwrap();
        chunk.append_to_row_group(&mut row_group).unwrap();
    }
    row_group.close().unwrap();

    let metadata = writer.close().unwrap();
    assert_eq!(metadata.num_rows, 4);
}
```
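Note the fan-out in the example above: the schema has a single top-level struct field, but `get_column_writers` returns two writers and `compute_leaves` yields two `ArrowLeafColumn`s (one for `b`, one for `c`). Any parallelization layer built on this API has to preserve that field-to-leaves mapping for nested types.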
@tustvold is it apparent to you what the issue is within the DataFusion parallel parquet code? If not, I propose we disable the feature by default and add many more tests to cover writing nested parquet files and other data types like dictionaries (#8854), then take more time, likely across multiple PRs, to bring the parallel parquet writer in DataFusion to feature parity with the non-parallel version.
I can take some time to take a look next week; my guess is it is something in the logic that performs slicing for row group parallelism.
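To make that guess concrete (purely illustrative; this is not the actual DataFusion slicing code): with nested fields, the number of top-level fields and the number of parquet leaf columns disagree, so any row-group slicing that pairs them one-to-one will misroute arrays. A minimal sketch using the same arrow-rs APIs as the reproducer above:

```rust
// Illustrative sketch of the suspected failure mode. One top-level
// struct field fans out into two parquet leaf columns; code that zips
// top-level fields 1:1 with leaf column writers would then hand e.g.
// an Int64 child to the Utf8 (byte array) writer -- matching the
// panic text in the title.
use arrow_schema::{DataType, Field, Fields, Schema};
use parquet::arrow::arrow_to_parquet_schema;

fn main() {
    let schema = Schema::new(vec![Field::new(
        "s",
        DataType::Struct(Fields::from(vec![
            Field::new("a", DataType::Int64, false),
            Field::new("b", DataType::Utf8, false),
        ])),
        false,
    )]);
    let parquet_schema = arrow_to_parquet_schema(&schema).unwrap();
    assert_eq!(schema.fields().len(), 1); // one top-level field...
    assert_eq!(parquet_schema.num_columns(), 2); // ...two leaf columns
}
```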
I don't think this issue should block the DataFusion release. @devinjdangelo set the feature to disabled by default, and I updated this PR's description to mention that. Once we re-enable single file parallelism by default, we should verify this query still works.
Describe the bug
I can't write a struct to parquet when trying to write in parallel; instead I get the following error:
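(The original error block was lost in this capture; the message, as given in the issue title, is:)

```
internal error: entered unreachable code: cannot downcast Int64 to byte array
```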
To Reproduce
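The original reproduction steps were not preserved in this capture. As a hedged sketch only (the query, path, and API calls below are assumptions reconstructed from the issue title, not the reporter's actual code), a write of roughly this shape exercises the parallel parquet path:

```rust
// Hypothetical reproducer sketch -- not the reporter's original code.
// Writes a struct column (Int64 + Utf8 children) to parquet through
// DataFusion, which routes through the parallelized writer when
// single-file parallelism is enabled.
use datafusion::dataframe::DataFrameWriteOptions;
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();
    // The Int64/Utf8 children mirror the "cannot downcast Int64 to
    // byte array" panic; `struct(...)` builds the struct value.
    let df = ctx
        .sql("SELECT struct(CAST(1 AS BIGINT), 'a') AS s")
        .await?;
    // Expected: file written. Observed (per this issue): panic with
    // "internal error: entered unreachable code: cannot downcast
    // Int64 to byte array".
    df.write_parquet("/tmp/struct_repro.parquet", DataFrameWriteOptions::new(), None)
        .await?;
    // For comparison, the reporter notes the same write works as JSON:
    // df.write_json("/tmp/struct_repro.json", DataFrameWriteOptions::new()).await?;
    Ok(())
}
```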
Expected behavior
I expect the parquet file to be written successfully. The same write works fine with JSON.
Additional context
No response