Error reading Parquet files after schema evolution #1527

capkurmagati · 2022-01-08T12:14:28Z

Describe the bug

(I'm not sure whether this is an arrow-rs or an arrow-datafusion bug.)
Reading Parquet files with an evolved schema can fail with an error at
https://github.com/apache/arrow-rs/blob/6.0.0/parquet/src/schema/types.rs#L886-L895
It seems that the physical plan doesn't pass the desired schema to the Parquet reader
https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L422
and that ParquetFileArrowReader can only infer the schema from the file
https://github.com/apache/arrow-rs/blob/6.4.0/parquet/src/arrow/arrow_reader.rs#L86-L92

To Reproduce
Steps to reproduce the behavior:

Create a parquet file with schema col_1 int
Create another parquet file with schema col_1 int, col_2 int
Implement a TableProvider that uses ParquetExec and also specifies the schema col_1 int, col_2 int in `scan`
Register the table and run select * from the_table (* includes col_2, but one of the files doesn't have that column)

Or

Create a parquet file with schema col_1 int
Create another parquet file with schema col_1 int, col_2 int
Create an external table via the CLI and run select * from the_table

Either way, the query fails with the following error:

Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))

Expected behavior

The query gets executed without error and returns NULL for col_2 if the file doesn't contain the data.

houqp · 2022-01-10T17:49:46Z

Probably caused by #132? If so, the fix should be fairly straightforward.

tustvold · 2022-01-15T14:03:41Z

Not sure if related, but in IOx we handle this at the query layer with a thing we call SchemaAdapterStream. This is created with an output schema and then inserts null columns into the RecordBatches that pass through it as needed.

There are some IOx-specific details, but I suspect a generic version could be extracted for use by Datafusion.

@alamb might have more thoughts on this as the original author of that component

alamb · 2022-01-17T14:36:06Z

Thanks for the report @capkurmagati -- I am not sure if your use case ever worked (if it did, this is a bug).

Regardless, as @tustvold mentions, we basically have the same use case in IOx, where some parquet files have a subset of the unified schema and we pad the remaining columns with NULLs.

This picture might help https://github.com/influxdata/influxdb_iox/blob/f3f6f335a93d2910a5cc55e12662dfda82143701/query/src/provider/adapter.rs#L45-L72

We would be happy to contribute this to DataFusion / the file reader. @capkurmagati is there any chance you can write an end-to-end test (i.e. make the two parquet files you refer to above)? If so, bringing in the SchemaAdapter stream would be pretty straightforward.
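
To illustrate the NULL-padding idea described in this comment, here is a minimal sketch using only the Rust standard library. It is not the IOx SchemaAdapterStream and not DataFusion's API: the `Batch` type and the `adapt_batch` function are hypothetical stand-ins for a RecordBatch and the adapter, and every column is assumed to be a nullable 32-bit integer.

```rust
use std::collections::HashMap;

/// Toy stand-in for a record batch (hypothetical, not Arrow's RecordBatch):
/// maps a column name to its nullable i32 values.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    num_rows: usize,
    columns: HashMap<String, Vec<Option<i32>>>,
}

/// Produces the columns of `batch` in the order given by `table_schema`.
/// Any column named in the table schema but absent from the batch is
/// filled with `num_rows` NULLs (represented here as `None`).
fn adapt_batch(table_schema: &[&str], batch: &Batch) -> Vec<(String, Vec<Option<i32>>)> {
    table_schema
        .iter()
        .map(|name| {
            let values = match batch.columns.get(*name) {
                // The file has this column: pass it through unchanged.
                Some(col) => col.clone(),
                // The file predates this column: pad with NULLs.
                None => vec![None; batch.num_rows],
            };
            (name.to_string(), values)
        })
        .collect()
}
```

With a table schema of col_1, col_2 and a batch read from the older file that only contains col_1, `adapt_batch` returns col_1 as read and col_2 as a column of NULLs with the same row count, which matches the expected behavior stated in the report.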

capkurmagati · 2022-01-24T14:57:05Z

@tustvold @alamb Thanks for the pointer and sorry that I couldn't respond quickly.
I wrote some tests and verified that my problem got resolved by #1622. Let me close this issue.
Thanks again.

capkurmagati added the bug label Jan 8, 2022

capkurmagati closed this as completed Jan 24, 2022

alamb mentioned this issue Jan 24, 2022

Handle merging of evolved schemas in ParquetExec #1622

Merged
