
Error reading Parquet files after schema evolution #1527

Closed
capkurmagati opened this issue Jan 8, 2022 · 4 comments
Labels
bug Something isn't working

Comments

@capkurmagati
Contributor

Describe the bug

(I'm not sure if it's an arrow-rs or arrow-datafusion bug.)
Reading parquet files with an evolved schema can produce an error at
https://github.com/apache/arrow-rs/blob/6.0.0/parquet/src/schema/types.rs#L886-L895
It seems the physical plan doesn't pass the desired schema to the parquet reader
https://github.com/apache/arrow-datafusion/blob/6.0.0/datafusion/src/physical_plan/file_format/parquet.rs#L408-L422
so the ParquetFileArrowReader can only infer the schema from the file itself
https://github.com/apache/arrow-rs/blob/6.4.0/parquet/src/arrow/arrow_reader.rs#L86-L92

To Reproduce
Steps to reproduce the behavior:

  1. Create a parquet file with schema `col_1 int`
  2. Create another parquet file with schema `col_1 int, col_2 int`
  3. Implement a TableProvider that uses ParquetExec and also specifies the schema `col_1 int, col_2 int` in `scan`
  4. Register the table and run `select * from the_table` (since `*` includes col_2 but some files don't have that column)

Or

  1. Create a parquet file with schema `col_1 int`
  2. Create another parquet file with schema `col_1 int, col_2 int`
  3. Create an external table via the CLI and run `select * from the_table`
     You will get the following error:

Parquet reader thread terminated due to error: ParquetError(General("Invalid Parquet file. Corrupt footer"))
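For the CLI repro above, the steps might look roughly like the following (paths and table name are hypothetical; exact DDL may vary by DataFusion version):

```sql
-- Directory containing both parquet files: one written with only col_1,
-- one written with col_1 and col_2.
CREATE EXTERNAL TABLE the_table
STORED AS PARQUET
LOCATION '/path/to/parquet_dir/';

-- Fails because the older file's footer schema lacks col_2.
SELECT * FROM the_table;
```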

Expected behavior

The query gets executed without error and returns NULL for col_2 for files that don't contain that column.


@capkurmagati capkurmagati added the bug Something isn't working label Jan 8, 2022
@houqp
Member

houqp commented Jan 10, 2022

Probably caused by #132? If so, the fix should be fairly straightforward.

@tustvold
Contributor

Not sure if related, but in IOx we handle this at the query layer with a thing we call SchemaAdapterStream. This is created with an output schema and then inserts null columns into the RecordBatches that pass through it as needed.

There are some IOx-specific details, but I suspect a generic version could be extracted for use by DataFusion.

@alamb might have more thoughts on this as the original author of that component
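The null-padding idea behind SchemaAdapterStream can be sketched in a few lines. This is a hypothetical, dependency-free illustration (plain Python dicts stand in for RecordBatches; `adapt_batch` is a made-up name, not the IOx API):

```python
def adapt_batch(batch, output_schema):
    """Return a batch containing every column named in output_schema,
    padding columns missing from the input batch with None (NULL)."""
    # All columns in a batch have the same row count; use any of them.
    num_rows = len(next(iter(batch.values()))) if batch else 0
    return {col: batch.get(col, [None] * num_rows) for col in output_schema}

# A batch read from an old file written with only col_1 ...
old_file_batch = {"col_1": [1, 2, 3]}

# ... is padded up to the unified schema (col_1, col_2):
adapted = adapt_batch(old_file_batch, ["col_1", "col_2"])
print(adapted)  # {'col_1': [1, 2, 3], 'col_2': [None, None, None]}
```

The real component does this per RecordBatch as batches stream through, so files with a subset of the table schema still satisfy `select *`.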

@alamb
Contributor

alamb commented Jan 17, 2022

Thanks for the report @capkurmagati -- I am not sure if your use case ever worked before (in which case this is a bug).

Regardless, as @tustvold mentions, we have basically the same use case in IOx, where some parquet files have a subset of the unified schema and we pad the remaining columns with NULLs.

This picture might help https://github.com/influxdata/influxdb_iox/blob/f3f6f335a93d2910a5cc55e12662dfda82143701/query/src/provider/adapter.rs#L45-L72

We would be happy to contribute this to DataFusion / the file reader. @capkurmagati is there any chance you can write an end-to-end test (i.e. create the two parquet files you refer to above)? If so, bringing in the SchemaAdapterStream would be pretty straightforward.

@capkurmagati
Contributor Author

@tustvold @alamb Thanks for the pointer and sorry that I couldn't respond quickly.
I wrote some tests and verified that my problem got resolved by #1622. Let me close this issue.
Thanks again.
