feat(rust,python): cast each parquet file to delta schema (#2615)
# Description

By casting each record batch it reads to the Delta schema, DataFusion can
read tables whose underlying parquet files differ from the desired schema
but can be cast to it. This fixes:

- Errors querying data where some of the parquet files lack columns that
were added later by a schema migration. This includes nested columns for
structs that are in Maps, Lists, or children of other structs.
- Maps and lists written with different element names.
- Timestamps of different units.
- Any other cast supported by arrow-cast.
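
The missing-column case in the list above can be illustrated with a minimal, std-only sketch. It is a model of the idea, not the Arrow or DataFusion API: `Batch`, `Column`, and `adapt_to_table_schema` are hypothetical names invented for this example, and columns are simplified to nullable 64-bit integers.

```rust
use std::collections::HashMap;

/// Simplified stand-in for an Arrow column: nullable 64-bit integers.
type Column = Vec<Option<i64>>;

/// Hypothetical stand-in for a record batch read from one parquet file.
#[derive(Debug, Clone, PartialEq)]
struct Batch {
    columns: HashMap<String, Column>,
    num_rows: usize,
}

/// Project `batch` onto `table_columns`, the table's current column names
/// in order. Columns the file does not contain are filled with nulls;
/// columns the table does not list are dropped.
fn adapt_to_table_schema(batch: &Batch, table_columns: &[&str]) -> Vec<(String, Column)> {
    table_columns
        .iter()
        .map(|name| {
            let column = batch
                .columns
                .get(*name)
                .cloned()
                // The file predates this column: emit one null per row.
                .unwrap_or_else(|| vec![None; batch.num_rows]);
            (name.to_string(), column)
        })
        .collect()
}
```

Under these assumptions, a file written before a column was added still yields a batch with the full table schema, so every file in the scan produces batches of the same shape.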

This is possible now because DataFusion exposes a SchemaAdapter that can
be overridden.

Note that, with this change, delta-rs reads all timestamps with
microsecond precision to match the Delta protocol.
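
As a rough illustration of what normalizing timestamps to microseconds involves, here is a std-only sketch. The `TimeUnit` enum and `to_microseconds` function are hypothetical and written for this example; the actual conversion in this change is delegated to arrow-cast, whose overflow and rounding behavior may differ.

```rust
/// Hypothetical mirror of the timestamp units a parquet file may use.
#[derive(Debug, Clone, Copy, PartialEq)]
enum TimeUnit {
    Second,
    Millisecond,
    Microsecond,
    Nanosecond,
}

/// Convert a raw timestamp in `unit` to microseconds.
/// Returns `None` if scaling up would overflow an `i64`.
fn to_microseconds(value: i64, unit: TimeUnit) -> Option<i64> {
    match unit {
        TimeUnit::Second => value.checked_mul(1_000_000),
        TimeUnit::Millisecond => value.checked_mul(1_000),
        TimeUnit::Microsecond => Some(value),
        // Integer division: sub-microsecond digits are discarded.
        TimeUnit::Nanosecond => Some(value / 1_000),
    }
}
```

One consequence worth keeping in mind is that files storing nanosecond timestamps lose their sub-microsecond digits when read this way.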

# Related Issue(s)
- This makes solving #2478 and #2341 just a matter of adding code to
delta-rs cast.

---------

Co-authored-by: Alex Wilcoxson
HawaiianSpork and alexwilcoxson-rel authored Jul 21, 2024
1 parent 8fece10 commit d3642a6
Showing 8 changed files with 290 additions and 63 deletions.
2 changes: 1 addition & 1 deletion crates/core/src/delta_datafusion/find_files/mod.rs
@@ -148,7 +148,7 @@ async fn scan_table_by_files(
// Add path column
used_columns.push(logical_schema.index_of(scan_config.file_column_name.as_ref().unwrap())?);

- let scan = DeltaScanBuilder::new(&snapshot, log_store, &state)
+ let scan = DeltaScanBuilder::new(&snapshot, log_store)
.with_filter(Some(expression.clone()))
.with_projection(Some(&used_columns))
.with_scan_config(scan_config)