Enable schema evolution for merge write disposition with delta table format #1742
Conversation
@@ -6660,63 +6659,52 @@ files = [
 [[package]]
 name = "pyarrow"
-version = "14.0.2"
+version = "16.1.0"
Note that `pyarrow` gets upgraded.
Thanks for working on this! It looks good! Two things:

- if `arrow_ds` is empty, you do not evolve the schema. IMO that should happen; please add a test for it (`if arrow_ds.head(1).num_rows == 0:`). See the sketch after this list.
- should we update all table schemas like in other destinations, where it happens in `update_stored_schema`? If you agree, let's create a ticket for that.
- same thing for truncating tables before the load; this is actually used by the `refresh` option.
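A minimal sketch of the empty-input case from the first bullet, assuming the `deltalake` `DeltaTable.alter.add_columns` API (the method this PR bumps the dependency for) and `Schema.from_pyarrow` for field conversion; the helper name and surrounding logic are illustrative, not the PR's actual implementation:

```python
import pyarrow as pa
import pyarrow.dataset as pa_ds
from deltalake import DeltaTable
from deltalake.schema import Schema as DeltaSchema

def evolve_schema_if_empty(delta_table: DeltaTable, arrow_ds: pa_ds.Dataset) -> bool:
    """Hypothetical helper: evolve the Delta schema even when there is no data."""
    if arrow_ds.head(1).num_rows == 0:  # the suggested emptiness check
        existing = {field.name for field in delta_table.schema().fields}
        new_arrow_fields = [f for f in arrow_ds.schema if f.name not in existing]
        if new_arrow_fields:
            # Convert the new pyarrow fields to Delta fields and add them.
            new_fields = DeltaSchema.from_pyarrow(pa.schema(new_arrow_fields)).fields
            delta_table.alter.add_columns(new_fields)
        return True  # empty input: schema evolved, nothing to merge
    return False
```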
Done.
Three options:
We currently do 3. 1 is not possible yet, but might become possible when the linked tickets are done (they are already assigned, so could be soon). 2 is possible, but is a bigger burden on our side. Which has your preference?
Okay, then we should probably use it.
OK, this is top. Let me add the Windows fix.

So what I'd do:
- in truncate / drop tables
- migrating schema
    partition_by=self._partition_columns,
    storage_options=self._storage_options,
)
with self.arrow_ds.scanner().to_reader() as arrow_rbr:  # RecordBatchReader
Curious why you inserted a `with` context here. Is it because `arrow_rbr` gets exhausted and is effectively useless after the context?
It has a `close` method, so it has internal unmanaged resources that we should free ASAP; otherwise the garbage collector does it way later.
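To illustrate the pattern being discussed (a self-contained sketch with made-up in-memory data, not the PR's code): `Scanner.to_reader()` returns a `pyarrow.RecordBatchReader`, which works as a context manager, so the `with` block calls `close()` as soon as the batches are consumed rather than leaving cleanup to the garbage collector.

```python
import pyarrow as pa
import pyarrow.dataset as pa_ds

# Stand-in for self.arrow_ds: an in-memory dataset over a small table.
arrow_ds = pa_ds.dataset(pa.table({"id": [1, 2, 3], "value": ["a", "b", "c"]}))

with arrow_ds.scanner().to_reader() as arrow_rbr:  # RecordBatchReader
    for batch in arrow_rbr:
        print(batch.num_rows)  # stream record batches one at a time
# Exiting the block closes the reader, freeing its native resources immediately.
```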
Enable schema evolution for merge write disposition with delta table format (#1742)

* black format
* increase minimum deltalake version dependency
* enable schema evolution for delta table merge
* extract delta table merge logic into separate function
* remove big decimal exclusion due to upstream bugfix
* evolve delta table schema in empty source case
* refactor DeltaLoadFilesystemJob
* uses right table path format in delta lake load job
* allows to pass schema name when getting delta tables and computing table counts
* cleans up usage of remote paths and uris in filesystem load jobs
* removes tempfile from file_storage

---------

Co-authored-by: Marcin Rudolf <[email protected]>
Description

- Enables schema evolution for the `merge` write disposition with the `delta` table format.
- Bumps the minimum `deltalake` version to access the `add_columns` method.
- Fixes `get_delta_tables` for pipelines with multiple schemas (that may explain the problem with "missing" delta tables).

Related Issues

Fixes #1739
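As a hedged end-to-end illustration of what this PR enables (pipeline and resource names plus the local bucket path are made up; `write_disposition`, `primary_key`, and `table_format` are real dlt resource hints): a second run that introduces a new column should now evolve the Delta table's schema during the merge instead of failing.

```python
import dlt

@dlt.resource(primary_key="id", write_disposition="merge", table_format="delta")
def events(rows):
    yield from rows

pipeline = dlt.pipeline(
    pipeline_name="delta_merge_demo",  # hypothetical name
    destination=dlt.destinations.filesystem("file:///tmp/delta_merge_demo"),
    dataset_name="demo",
)

# First load creates the Delta table with columns id, value.
pipeline.run(events([{"id": 1, "value": "a"}]))

# Second load adds new_col; the merge now evolves the schema.
pipeline.run(events([{"id": 1, "value": "b", "new_col": 42}]))
```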