feature/databricks-delta-incremental-support #130

fivetran-joemarkiewicz · 2024-05-29T20:42:01Z

PR Overview

This PR will address the following Issue/Feature: Internal tickets and Issue #128

This PR will result in the following new package version: v1.8.0

When I tested this locally for Databricks there was actually no error when running without a full refresh. However, the table format did not change. Therefore, a breaking change should be leveraged to ensure a full refresh is ran and the delta table format is applied.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

🚨 Breaking Changes 🚨

⚠️ Since the following changes result in the table format changing, we recommend running a --full-refresh after upgrading to this version to avoid possible incremental failures.

For Databricks All Purpose clusters the fivetran_platform__audit_table model will now be materialized using the delta table format (previously parquet).

Delta tables are generally more performant than parquet and are also more widely available for Databricks users. Previously, the parquet file format was causing compilation issues on customers managed tables.

Documentation Updates

Updated the sync_start and sync_end field descriptions for the fivetran_platform__audit_table to explicitly define that these fields only represent the sync start/end times for when the connector wrote new or modified existing records to the specified table.

Under the Hood

The is_databricks_sql_warehouse macro has been renamed to is_databricks_all_purpose and has been modified to return true if the Databricks runtime being used is an all purpose cluster (previously this macro checked if a sql warehouse runtime was used).

This update was applied as there have been other Databricks runtimes discovered (ie. an endpoint and external runtime) which do not support the insert-overwrite incremental strategy used in the fivetran_platform__audit_table model.

In addition to the above, for Databricks users the fivetran_platform__audit_table model will now leverage the incremental strategy only if the Databricks runtime is all purpose. Otherwise, all other Databricks runtimes will not leverage an incremental strategy.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

dbt run –full-refresh && dbt test
dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

The appropriate issue has been linked, tagged, and properly assigned
All necessary documentation and version upgrades have been applied
docs were regenerated (unless this PR does not include any code or yml updates)
BuildKite integration tests are passing
Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

To validate these changes the validation tests were included and you can see they were successful for the following destinations:

BigQuery
Databricks All Purpose Cluster
Databricks SQL Warehouse

Additionally, I validated that the All Purpose Cluster appropriately runs an incremental strategy and the non All Purpose (SQL Warehouse in this case) does not run an incremental strategy.

Databricks All Purpose Cluster
Databricks SQL Warehouse

Finally, I confirmed that the Delta format runs as expected and without issue on the Databricks All Purpose cluster on incremental runs.

If you had to summarize this PR in an emoji, which would it be?

🌳

fivetran-joemarkiewicz · 2024-06-03T21:43:44Z

The SQL Server buildkite test is currently failing, but that is due to a permission issue which should hopefully be resolved soon. I will re-kick off the integration tests once that is resolved. I don't imagine SQL Server would be failing for any of these changes; therefore, this should be good to review even with the failing buildkite test.

integration_tests/tests/consistency/consistency__audit_table.sql

fivetran-catfritz

Changes lgtm, and I was able to give it a full refresh and incremental run in Databricks. All looks good there, so approved!

CHANGELOG.md

README.md

models/fivetran_platform.yml

CHANGELOG.md

Co-authored-by: Avinash Kunnath <[email protected]>

fivetran-avinash

@fivetran-joemarkiewicz Looks like you handled all the minor tweaks already, so just one question on the new audit table config before approving.

fivetran-avinash · 2024-06-06T21:04:09Z

models/fivetran_platform__audit_table.sql

@@ -1,13 +1,13 @@
 {{ config(
-    materialized='table' if is_databricks_sql_warehouse(target) else 'incremental',
+    materialized='incremental' if is_databricks_all_purpose(target) else 'table',


Just double checking this logic, since the conditions have been flipped:

if is_databricks_all_purpose(target) is true, then it'll be materialized as incremental.
If it's false, then it'll be a table. Which makes sense for databricks.

However, what would this entail for the other warehouses? I'm looking at the macro loop and it seems that it would only be true for the case when it's a databricks runtime is all-purpose. But that would be false for the other warehouses, so they would now be materialized as tables instead of incremental. Is that the intention?

@fivetran-avinash BRILLIANT catch! You are exactly correct and this was not the intention. This will now set the databricks all-purpose cluster to use the incremental strategy and also turn it off for non all-purpose clusters.... BUT ALSO turns the incremental strategy off for all other warehouses 😱.

Extremely thankfully you reviewed this and caught this gap. Let me revisit the code and account for all other warehouses.

I just made some code updates to account for the above issue. @fivetran-avinash @fivetran-catfritz would you be able to review and let me know if you have any questions or there are any other considerations to take into account?

See below for validations that the materializations are working as expected on each platform.

✅ BigQuery (subsequent runs use incremental strategy)

✅ Snowflake (subsequent runs use incremental strategy)

✅ Redshift (subsequent runs use incremental strategy)

✅ Postgres (subsequent runs use incremental strategy)

✅ Databricks All-Purpose Cluster (subsequent runs use incremental strategy)

✅ Databricks SQL Warehouse (no runs use the incremental strategy)

CHANGELOG.md

fivetran-avinash

@fivetran-joemarkiewicz New updates look good!

Only small call-out is if we feel there are any warehouses we set up that might not support incremental materialization in the future, we might want to explicitly do an elif target.type in ('bigquery', 'snowflake', etc.) --> true, else false, just for full coverage. But that is not a present concern and can be revisited if we add more destinations.

A few additional recommended edits in the Changelog but otherwise lgtm.

fivetran-catfritz · 2024-06-06T22:39:26Z

@fivetran-avinash really really good catch. I ran this only in Databricks so definitely didn't catch that then!

@fivetran-joemarkiewicz Tagging on to Avinash's comments, a more future-proof way to handle the logic update might be:

Put the macro back to just check if it's runtime.
Set the config like:

...
materialized='table' if target.type == 'databricks' 
  and not is_databricks_runtime(or whatever the old version name was)() else 'incremental'
...

That way we don't have to list out the other warehouses. What do you think?

fivetran-joemarkiewicz · 2024-06-06T22:44:56Z

@fivetran-catfritz I like that idea, but the benefit of listing out each of the warehouses is we are explicitly only using the incremental strategy if we know the destination is supported. If it is not in our supported list, then we use the table materialization. This likely will provide the greatest opportunity for success if for some reason a destination not supported is used to run this model.

fivetran-catfritz · 2024-06-06T22:53:49Z

@fivetran-joemarkiewicz Makes sense--in that case approved on my end!

fivetran-joemarkiewicz added 3 commits May 29, 2024 15:39

feature/databricks-delta-incremental-support

3306c55

changelog and integration test updates

bc4216e

pre review modifications

37fff66

fivetran-joemarkiewicz self-assigned this May 29, 2024

fivetran-joemarkiewicz marked this pull request as ready for review June 3, 2024 21:14

fivetran-joemarkiewicz requested a review from fivetran-catfritz June 5, 2024 13:52

fivetran-catfritz reviewed Jun 5, 2024

View reviewed changes

integration_tests/tests/consistency/consistency__audit_table.sql Outdated Show resolved Hide resolved

Update consistency__audit_table.sql

e91deb2

fivetran-catfritz approved these changes Jun 6, 2024

View reviewed changes

fivetran-joemarkiewicz requested a review from fivetran-avinash June 6, 2024 20:13