
feature/databricks-delta-incremental-support #130

Merged

Conversation

@fivetran-joemarkiewicz (Contributor) commented May 29, 2024

PR Overview

This PR will address the following Issue/Feature: Internal tickets and Issue #128

This PR will result in the following new package version: v1.8.0

When I tested this locally for Databricks, there was actually no error when running without a full refresh; however, the table format did not change. Therefore, this should be treated as a breaking change to ensure a full refresh is run and the delta table format is applied.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

🚨 Breaking Changes 🚨

⚠️ Since the following changes result in the table format changing, we recommend running a --full-refresh after upgrading to this version to avoid possible incremental failures (see the example command after this list).

  • For Databricks All Purpose clusters, the fivetran_platform__audit_table model will now be materialized using the delta table format (previously parquet).
    • Delta tables are generally more performant than parquet and are more widely available for Databricks users. Previously, the parquet file format was causing compilation issues on customers' managed tables.
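
For example, a targeted full refresh of just the affected model could look like the following; selecting only this model is an illustrative suggestion rather than a step mandated by the PR:

    dbt run --full-refresh --select fivetran_platform__audit_table
    dbt test --select fivetran_platform__audit_table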

Documentation Updates

  • Updated the sync_start and sync_end field descriptions for the fivetran_platform__audit_table to explicitly define that these fields only represent the sync start/end times for when the connector wrote new or modified existing records to the specified table.

Under the Hood

  • The is_databricks_sql_warehouse macro has been renamed to is_databricks_all_purpose and modified to return true if the Databricks runtime being used is an all-purpose cluster (previously this macro checked whether a SQL Warehouse runtime was used).
    • This update was applied because other Databricks runtimes have been discovered (i.e. the endpoint and external runtimes) which do not support the insert-overwrite incremental strategy used in the fivetran_platform__audit_table model.
  • In addition to the above, for Databricks users the fivetran_platform__audit_table model will now leverage an incremental strategy only if the Databricks runtime is an all-purpose cluster; all other Databricks runtimes will not use an incremental strategy. See the sketch following this list.
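
For illustration, a minimal sketch of what the renamed macro could look like, assuming it distinguishes runtimes by pattern-matching target.http_path; the regex and exact structure here are assumptions, not the package's verbatim implementation:

    {% macro is_databricks_all_purpose(target) %}
        {# Assumption: all-purpose cluster HTTP paths look like
           sql/protocolv1/o/<workspace-id>/<cluster-id>, while SQL Warehouses
           use sql/1.0/warehouses/<warehouse-id>. #}
        {% if target.type == 'databricks' and target.http_path %}
            {% if modules.re.search('sql/protocolv1/', target.http_path) %}
                {{ return(true) }}
            {% endif %}
        {% endif %}
        {{ return(false) }}
    {% endmacro %}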

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • Buildkite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

To validate these changes, the validation tests were included, and you can see they completed successfully on the following destinations:

  • BigQuery (screenshot)

  • Databricks All Purpose Cluster (screenshot)

  • Databricks SQL Warehouse (screenshot)

Additionally, I validated that the All Purpose Cluster appropriately runs an incremental strategy and that the non All Purpose runtime (SQL Warehouse in this case) does not:

  • Databricks All Purpose Cluster (screenshot)

  • Databricks SQL Warehouse (screenshot)

Finally, I confirmed that the delta format runs as expected and without issue on the Databricks All Purpose cluster on incremental runs. (screenshot)

If you had to summarize this PR in an emoji, which would it be?

🌳

@fivetran-joemarkiewicz marked this pull request as ready for review June 3, 2024 21:14

@fivetran-joemarkiewicz (Contributor, Author) commented:

The SQL Server Buildkite test is currently failing, but that is due to a permission issue which should hopefully be resolved soon. I will re-kick off the integration tests once that is resolved. I don't imagine SQL Server is failing because of any of these changes; therefore, this should be good to review even with the failing Buildkite test.

@fivetran-catfritz (Contributor) left a comment:

Changes lgtm, and I was able to give it a full refresh and incremental run in Databricks. All looks good there, so approved!

@fivetran-avinash (Contributor) left a comment:

@fivetran-joemarkiewicz Looks like you handled all the minor tweaks already, so just one question on the new audit table config before approving.

@@ -1,13 +1,13 @@
 {{ config(
-    materialized='table' if is_databricks_sql_warehouse(target) else 'incremental',
+    materialized='incremental' if is_databricks_all_purpose(target) else 'table',
@fivetran-avinash (Contributor) commented on this diff:

Just double checking this logic, since the conditions have been flipped:

If is_databricks_all_purpose(target) is true, then it'll be materialized as incremental.
If it's false, then it'll be a table. Which makes sense for Databricks.

However, what would this entail for the other warehouses? I'm looking at the macro, and it seems it would only return true when the Databricks runtime is all-purpose. But it would be false for the other warehouses, so they would now be materialized as tables instead of incremental. Is that the intention?
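
To make the concern concrete, here is roughly how the flipped condition would evaluate for a non-Databricks target (an illustrative walk-through, not code from the PR):

    {# For a BigQuery target:
       is_databricks_all_purpose(target)    ->  false  (target.type != 'databricks')
       'incremental' if false else 'table'  ->  'table'
       So every non-Databricks warehouse would silently lose its
       incremental materialization. #}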

@fivetran-joemarkiewicz (Contributor, Author) replied:

@fivetran-avinash BRILLIANT catch! You are exactly correct, and this was not the intention. As written, this sets the Databricks all-purpose cluster to use the incremental strategy and turns it off for non all-purpose clusters... BUT ALSO turns the incremental strategy off for all other warehouses 😱.

I'm extremely thankful you reviewed this and caught this gap. Let me revisit the code and account for all the other warehouses.

@fivetran-joemarkiewicz (Contributor, Author) commented:

I just made some code updates to account for the above issue. @fivetran-avinash @fivetran-catfritz would you be able to review and let me know if you have any questions, or if there are any other considerations to take into account?

See below for validation that the materializations are working as expected on each platform (a sketch of the updated logic follows the list):

  • ✅ BigQuery (subsequent runs use the incremental strategy) (screenshot)

  • ✅ Snowflake (subsequent runs use the incremental strategy) (screenshot)

  • ✅ Redshift (subsequent runs use the incremental strategy) (screenshot)

  • ✅ Postgres (subsequent runs use the incremental strategy) (screenshot)

  • ✅ Databricks All-Purpose Cluster (subsequent runs use the incremental strategy) (screenshot)

  • ✅ Databricks SQL Warehouse (no runs use the incremental strategy) (screenshot)
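
For illustration, logic along these lines would produce the behavior validated above; the macro name is_incremental_compatible and the exact target list are assumptions inferred from the destinations shown, not necessarily the package's verbatim code:

    {% macro is_incremental_compatible() %}
        {# Only use the incremental strategy on destinations known to support it. #}
        {% if target.type == 'databricks' %}
            {{ return(is_databricks_all_purpose(target)) }}
        {% elif target.type in ('bigquery', 'snowflake', 'redshift', 'postgres') %}
            {{ return(true) }}
        {% else %}
            {# Unknown destinations fall back to a plain table materialization. #}
            {{ return(false) }}
        {% endif %}
    {% endmacro %}

    {{ config(
        materialized='incremental' if is_incremental_compatible() else 'table'
    ) }}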

@fivetran-avinash (Contributor) left a comment:

@fivetran-joemarkiewicz New updates look good!

Only small call-out: if we feel any warehouses we set up in the future might not support incremental materialization, we might want to explicitly do an elif target.type in ('bigquery', 'snowflake', etc.) --> true, else false, just for full coverage. But that is not a present concern and can be revisited if we add more destinations.

A few additional recommended edits in the Changelog but otherwise lgtm.

@fivetran-catfritz (Contributor) commented:

@fivetran-avinash really really good catch. I ran this only in Databricks so definitely didn't catch that then!

@fivetran-joemarkiewicz Tagging on to Avinash's comments, a more future-proof way to handle the logic update might be:

  1. Put the macro back to just checking whether it's a runtime.
  2. Set the config like:

     ...
     materialized='table' if target.type == 'databricks'
         and not is_databricks_runtime()  {# or whatever the old macro name was #}
         else 'incremental'
     ...

That way we don't have to list out the other warehouses. What do you think?

@fivetran-joemarkiewicz (Contributor, Author) commented:

@fivetran-catfritz I like that idea, but the benefit of listing out each of the warehouses is that we explicitly use the incremental strategy only if we know the destination is supported. If it is not in our supported list, then we use the table materialization. This likely provides the greatest chance of success if, for some reason, an unsupported destination is used to run this model.

@fivetran-catfritz (Contributor) commented:

@fivetran-joemarkiewicz Makes sense--in that case approved on my end!

@fivetran-joemarkiewicz merged commit ce41a02 into main Jun 11, 2024
10 checks passed