Error reading Parquet files after schema evolution #1527
Comments
Probably caused by #132? If so, the fix should be fairly straightforward.
Not sure if related, but in IOx we handle this at the query layer with a component we call SchemaAdapterStream. It is created with an output schema and inserts null columns, as needed, into each RecordBatch that passes through it. There are some IOx-specific details, but I suspect a generic version could be extracted for use by DataFusion. @alamb might have more thoughts on this as the original author of that component.
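To make the null-padding idea concrete, here is a minimal sketch in plain Rust. It is not the IOx SchemaAdapterStream and uses no Arrow types: `Batch` and `adapt_batch` are hypothetical names invented for illustration, and columns are modeled as `Vec<Option<i32>>` with `None` standing in for NULL.

```rust
/// Hypothetical, simplified stand-in for a RecordBatch:
/// an ordered list of (column name, nullable i32 values).
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: Vec<(String, Vec<Option<i32>>)>,
}

impl Batch {
    /// Number of rows, taken from the first column (0 if there are no columns).
    fn num_rows(&self) -> usize {
        self.columns.first().map_or(0, |(_, values)| values.len())
    }
}

/// Rewrites `batch` to match `output_schema`: columns are emitted in the
/// order of the output schema, and any field missing from the batch is
/// filled with an all-NULL column of the same row count.
/// Returns an error if the batch has a column the output schema lacks.
fn adapt_batch(output_schema: &[&str], batch: &Batch) -> Result<Batch, String> {
    for (name, _) in &batch.columns {
        if !output_schema.contains(&name.as_str()) {
            return Err(format!("column {} is not in the output schema", name));
        }
    }

    let rows = batch.num_rows();
    let columns = output_schema
        .iter()
        .map(|&field| {
            let values = batch
                .columns
                .iter()
                .find(|(name, _)| name.as_str() == field)
                .map(|(_, values)| values.clone())
                // Column absent from this file: pad with NULLs.
                .unwrap_or_else(|| vec![None; rows]);
            (field.to_string(), values)
        })
        .collect();

    Ok(Batch { columns })
}
```

With an output schema of `["col_1", "col_2"]`, a batch that only carries `col_1` comes out with an extra `col_2` column made entirely of `None`, while a batch that already has both columns passes through unchanged. A real implementation would work on Arrow arrays and typed fields rather than this toy model.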
Thanks for the report @capkurmagati -- I am not sure if your use case ever worked (if it did, this is a regression). Regardless, as @tustvold mentions, we have basically the same use case in IOx, where some Parquet files have a subset of the unified schema and we pad the remaining columns with NULLs. This picture might help: https://github.com/influxdata/influxdb_iox/blob/f3f6f335a93d2910a5cc55e12662dfda82143701/query/src/provider/adapter.rs#L45-L72 We would be happy to contribute this to DataFusion / the file reader. @capkurmagati is there any chance you can write an end-to-end test (i.e. make the two Parquet files you refer to above)? If so bringing in the
Describe the bug
(I'm not sure whether this is an arrow-rs or an arrow-datafusion bug.)
Reading Parquet files with an evolved schema can fail with an error at
https://github.com/apache/arrow-rs/blob/6.0.0/parquet/src/schema/types.rs#L886-L895
It seems that the physical plan doesn't pass the desired schema to the Parquet reader
https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L422
and `ParquetFileArrowReader` can only infer the schema from the file
https://github.com/apache/arrow-rs/blob/6.4.0/parquet/src/arrow/arrow_reader.rs#L86-L92
To Reproduce
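Steps to reproduce the behavior:
1. A Parquet file with schema `col_1 int`
2. A Parquet file with schema `col_1 int, col_2 int`
3. A `TableProvider` that uses `ParquetExec` and also specifies the schema `col_1 int, col_2 int` in `scan`
4. `select * from the_table` (since `*` contains `col_2` but some files don't have that column)

Or

1. A Parquet file with schema `col_1 int`
2. A Parquet file with schema `col_1 int, col_2 int`
3. `select * from the_table`

This produces the following error: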
Expected behavior
The query executes without error and returns `NULL` for `col_2` if the file doesn't contain the data.
Additional context