Allow supplying a table schema to ParquetExec #12010
To give a concrete example, we might have a schema evolution that looks like: …
Where our "final"/"current" table schema is … (fwiw, in reality some of these columns, e.g. …)
It's worth noting that allowing the position of partition columns to be controlled would be useful beyond this problem: if I have …
I'm not familiar with how people interact with Parquet files, so I don't fully understand the difficulty around schema matching or why …
I'm not sure this will be too useful, but our code for creating a …
You can call …
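The comments above are truncated, but a minimal sketch of the kind of per-version `ParquetExec` creation being discussed might look like the following. This assumes a DataFusion version from around the time of this issue (~41), where `FileScanConfig`'s builder methods and `ParquetExecBuilder` exist; the function name and argument plumbing are ours, not the author's.

```rust
// A minimal sketch, assuming DataFusion ~41 era APIs (FileScanConfig's
// builder methods and ParquetExecBuilder); import paths may vary slightly
// between versions.
use std::sync::Arc;
use arrow_schema::{Field, SchemaRef};
use datafusion::datasource::listing::PartitionedFile;
use datafusion::datasource::object_store::ObjectStoreUrl;
use datafusion::datasource::physical_plan::{FileScanConfig, ParquetExecBuilder};
use datafusion::physical_plan::ExecutionPlan;

fn parquet_exec_for_version(
    file_schema: SchemaRef,
    files: Vec<PartitionedFile>,
    partition_cols: Vec<Field>,
    projection: Option<Vec<usize>>,
) -> Arc<dyn ExecutionPlan> {
    let config = FileScanConfig::new(ObjectStoreUrl::local_filesystem(), file_schema)
        .with_file_group(files)
        // Partition columns are appended after the file columns, which is
        // why their position in the output cannot be controlled today.
        .with_table_partition_cols(partition_cols)
        // The projection is expressed against file schema + partition
        // columns, not against the table provider's logical schema.
        .with_projection(projection);
    Arc::new(ParquetExecBuilder::new(config).build())
}
```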
I took a look at the code around datafusion/datafusion/core/src/datasource/physical_plan/file_scan_config.rs (lines 224 to 244 at cb1e3f0) …
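The referenced snippet is not reproduced in this excerpt. Paraphrased from memory of that projection logic (so treat the details as an assumption, not a quote), it behaves roughly like this, which is why partition columns can only ever appear after the file columns:

```rust
// Rough paraphrase of FileScanConfig's projection logic (from memory, not
// copied): indices below the file schema's field count select file
// columns; higher indices select partition columns.
use arrow_schema::Field;

fn projected_fields(
    file_fields: &[Field],
    partition_fields: &[Field],
    projection: &[usize],
) -> Vec<Field> {
    projection
        .iter()
        .map(|&idx| {
            if idx < file_fields.len() {
                file_fields[idx].clone()
            } else {
                partition_fields[idx - file_fields.len()].clone()
            }
        })
        .collect()
}
```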
However, the code in datafusion/datafusion/core/src/datasource/physical_plan/parquet/mod.rs (lines 406 to 412 at cb1e3f0) …
I think we could store the … We can see that the …
I'm not sure how you union the created `ParquetExec`s. If you have the …
I'm not familiar enough with DataFusion internals to comment on your suggestion, sorry.
We make the union by collecting the …
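This comment is also truncated, but a sketch of how such a union could be built in DataFusion (assuming each per-version `ParquetExec` has already been adapted to produce the same table schema) might look like:

```rust
// UnionExec requires all inputs to share a schema, which is exactly why
// each per-version ParquetExec must first be adapted to the common table
// schema before being unioned.
use std::sync::Arc;
use datafusion::physical_plan::union::UnionExec;
use datafusion::physical_plan::ExecutionPlan;

fn union_versions(execs: Vec<Arc<dyn ExecutionPlan>>) -> Arc<dyn ExecutionPlan> {
    Arc::new(UnionExec::new(execs))
}
```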
Is your feature request related to a problem or challenge?
We have a couple of situations where our schema evolves and we wish to read both old and new data and operate on it seamlessly. We are reading from Parquet, and in theory this should just work because of the internal `SchemaAdapter`; in practice, however, we can't make this work without doing things that feel abstraction-breaking. This has happened when we've changed regular columns into partition columns, and more generally when we've reordered or otherwise changed schemas in minor ways.

In more detail, we're implementing `TableProvider`, and we have a logical schema which we return from `TableProvider::schema`; pushed-down projections are passed to us based on this schema. In our `scan` method, we create a different `ParquetExec` for each version of the file schema using `FileScanConfig`. We then union these execs together to make our data source. To do this, their schemas must match, and we ensure they match our logical schema. However, doing so is painful: we have to create a new projection to reorder the fields and compose it with the pushed-down projection (a sketch of this composition follows below), and we have to do some manipulation of the file schema and partition columns (presumably some of this is unavoidable, but it seems unnecessary that we juggle the logical schema, a file schema that we pass in, and a file schema found from the file). This is made more difficult by the fact that you can't control where the partition columns appear in the logical schema; they always end up at the end.
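To make the composition step concrete, here is a small sketch of the index arithmetic involved (plain Rust, no DataFusion APIs; the names are ours, not the author's):

```rust
// `reorder` maps each logical column to its position in the scan output,
// and the pushed-down projection is expressed against the logical schema,
// so the projection handed to the scan is their composition.
fn compose_projection(reorder: &[usize], pushed_down: &[usize]) -> Vec<usize> {
    pushed_down.iter().map(|&logical| reorder[logical]).collect()
}

fn main() {
    // Logical column j lives at scan position reorder[j].
    let reorder = [2, 0, 1];
    // The query asks for logical columns 1 and 2 ...
    let pushed_down = [1, 2];
    // ... which are scan columns 0 and 1.
    assert_eq!(compose_projection(&reorder, &pushed_down), vec![0, 1]);
}
```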
Although `ParquetExec` has an internal schema adapter, this is not very useful in this case because its 'target' schema is always the file schema from the `FileScanConfig` (in `<ParquetExec as ExecutionPlan>::execute`).

Describe the solution you'd like
I'm not sure exactly what this should look like. I think I would like to supply a table schema which describes the output of the `ParquetExec`, is used as the 'target' schema for the `SchemaAdapter`, specifies the location of partition columns, and is automagically applied to pushed-down projections. In other words, the 'reader' would be able to fully encapsulate schema changes.
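As a purely hypothetical sketch of the requested shape (none of these methods exist in DataFusion; this is not a real API):

```rust
// Purely hypothetical: `with_table_schema` does not exist in DataFusion.
// This only sketches the shape of the feature being requested.
let exec = ParquetExecBuilder::new(file_scan_config)
    // would be the SchemaAdapter target, fix the position of partition
    // columns, and reinterpret pushed-down projections against this schema
    .with_table_schema(table_schema)
    .build();
```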
Describe alternatives you've considered

A few alternatives which would make this situation easier to handle without this 'complete' change:

- … a `Vec<usize>`, which I think would be a good thing anyway.
- … adding a `ParquetAdapter` to `FileScanConfig` (I'm not sure this would actually help at all, and I appreciate that other readers might not be able to apply the adaptation, but it feels like this could help somehow in making projection, partition handling, and schema adaptation better integrated).
Additional context

No response