
feature/databricks-sql-warehouse-compatibility #121

Merged

Conversation

fivetran-joemarkiewicz
Contributor

@fivetran-joemarkiewicz commented Apr 2, 2024

PR Overview

This PR will address the following Issue/Feature: Issue #120

This PR will result in the following new package version: v1.7.1

This will not impact existing users who are not using the Databricks SQL Warehouse runtime. Users on a SQL Warehouse runtime were previously unable to run the package at all, so this fix is not breaking and simply enables those users to run it successfully.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

Bug Fixes

  • Users leveraging the Databricks SQL Warehouse runtime were previously unable to run the fivetran_platform__audit_table model due to an incompatible incremental strategy. As such, the following updates have been made:
    • A new macro, is_databricks_sql_warehouse(), has been added to determine whether a Databricks runtime is a SQL Warehouse. The macro returns true if the runtime is a SQL Warehouse and false for any other runtime or destination.
    • This macro is used to determine the incremental strategy within fivetran_platform__audit_table. For Databricks SQL Warehouses, no incremental strategy is used; all other destination runtimes are not impacted by this change. A hedged sketch of this kind of config branch follows this list.
      • For the SQL Warehouse runtime, the best incremental strategy we could elect to use is merge. However, we do not have full confidence in the data integrity of the resulting output model when leveraging that strategy, so we opted for the model to replicate full create-or-replace behavior for the time being.
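
For illustration only, here is a minimal sketch of how a model config could branch on such a macro. The exact config keys and the unique_key column name are assumptions for this sketch, not the package's actual configuration:

{# Hedged sketch, not the package's exact config: materialize as a full table on
   Databricks SQL Warehouse and keep the incremental materialization everywhere else.
   'unique_table_sync_key' is a hypothetical column name. #}
{{
    config(
        materialized = 'table' if is_databricks_sql_warehouse(target) else 'incremental',
        unique_key = 'unique_table_sync_key'
    )
}}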

Under the Hood

  • Added integration testing pipeline for Databricks SQL Warehouse.
  • Applied modifications to the integration testing pipeline to account for jobs being run on both Databricks All Purpose Cluster and SQL Warehouse runtimes.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • BuildKite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

To validate these changes, I wanted to ensure the following for the fivetran_platform__audit_table model:

  1. No incremental strategy is being used for SQL Warehouse destinations
  2. The incremental strategy created for all other warehouses is unchanged

See the validations below:

  • SQL Warehouse doesn't have any incremental strategy associated with it. The compiled code for the incremental run does not include the where clause that would indicate an incremental strategy is being applied (see the hypothetical illustration after this list).

    • image
  • I was then able to verify that the incremental strategy is unchanged for all other warehouses. You can see the compiled code for each warehouse returns the expected results.

    • BigQuery
      • image
    • Databricks (All Purpose Cluster)
      • I have been unable to run this locally. However, we can see from Buildkite that it is using the incremental strategy as expected: the finished statement notes that 1 incremental model was executed.
      • image
    • Postgres
      • image
    • Snowflake
      • image
    • Redshift
      • image
    • SQL Server
      • I am unable to run this locally, but, as with Databricks (All Purpose Cluster), we can see in the Buildkite run that the incremental model is being captured.
      • image
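
For context, this is the kind of incremental filter referenced above. It is a hypothetical sketch only; the column name and filter are illustrative and not the model's actual code:

{# Hypothetical sketch: when an incremental strategy is active, the compiled code
   of an incremental run contains a filter like the one below; it is absent when
   no incremental strategy is applied (as with SQL Warehouse in this PR). #}
{% if is_incremental() %}
where sync_start_day >= (select max(sync_start_day) from {{ this }})
{% endif %}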

If you had to summarize this PR in an emoji, which would it be?

🧱

@fivetran-joemarkiewicz self-assigned this Apr 2, 2024
Comment on lines 1 to 15
{% macro is_databricks_sql_warehouse(target) %}
    {% if target.type in ('databricks','spark') %}
        {% set re = modules.re %}
        {% set path_match = target.http_path %}
        {% set regex_pattern = "/sql/.+/warehouses/" %}
        {% set match_result = re.search(regex_pattern, path_match) %}
        {% if match_result %}
            {{ return(True) }}
        {% else %}
            {{ return(False) }}
        {% endif %}
    {% else %}
        {{ return(False) }}
    {% endif %}
{% endmacro %}
Contributor Author

I also have an open question on dbt Slack which asks if there is a better way to do this natively using the dbt-databricks adapter.

Contributor Author

It seems per dbt Slack that this is the preferred route.
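
For reference, a quick sketch of how the regex check distinguishes the two http_path shapes. Both paths below are made-up examples for illustration, not values from this repository:

{# Hypothetical http_path values for illustration only #}
{% set sqlw_path = "/sql/1.0/warehouses/abc123def456" %}
{% set cluster_path = "sql/protocolv1/o/1234567890123456/0123-456789-abcdefgh" %}

{# The pattern matches only the warehouse-style path #}
{{ modules.re.search("/sql/.+/warehouses/", sqlw_path) is not none }}     {# renders True #}
{{ modules.re.search("/sql/.+/warehouses/", cluster_path) is not none }}  {# renders False #}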

Comment on lines +39 to +54
if [ "$db" = "databricks-sql" ]; then
dbt seed --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db" --full-refresh
dbt compile --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db"
dbt run --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db" --full-refresh
dbt run --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db"
dbt test --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db"
dbt run --vars '{fivetran_platform_schema: sqlw_tests, fivetran_platform__usage_pricing: true}' --target "$db" --full-refresh
dbt run --vars '{fivetran_platform_schema: sqlw_tests, fivetran_platform__usage_pricing: true}' --target "$db"
dbt test --target "$db"
dbt run --vars '{fivetran_platform_schema: sqlw_tests, fivetran_platform__credits_pricing: false, fivetran_platform__usage_pricing: true}' --target "$db" --full-refresh
dbt run --vars '{fivetran_platform__credits_pricing: false, fivetran_platform__usage_pricing: true}' --target "$db"
dbt test --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db"
dbt run --vars '{fivetran_platform_schema: sqlw_tests, fivetran_platform__usage_pricing: false, fivetran_platform_using_destination_membership: false, fivetran_platform_using_user: false}' --target "$db" --full-refresh
dbt run --vars '{fivetran_platform_schema: sqlw_tests, fivetran_platform__usage_pricing: false, fivetran_platform_using_destination_membership: false, fivetran_platform_using_user: false}' --target "$db"
dbt test --vars '{fivetran_platform_schema: sqlw_tests}' --target "$db"
else
Contributor Author

This is all necessary as the SQL Warehouse still uses the same hive_metastore and comes into conflict with the All Purpose Cluster when they both run in parallel with the same schema.

Therefore, we need to explicitly use a different schema for the two.

@@ -24,7 +24,7 @@ vars:

models:
  fivetran_log:
-   +schema: fivetran_platform
+   +schema: "{{ 'sqlw_tests' if target.name == 'databricks-sql' else 'fivetran_platform' }}"
Contributor Author

This is an artifact of needing to use two different schemas for the Databricks jobs

@fivetran-joemarkiewicz marked this pull request as ready for review April 3, 2024 01:26
Contributor

@fivetran-catfritz left a comment

I added the changes to the readme and changelog, and also had one small suggestion for the audit table config, but otherwise this is looking great!

I confirm I was able to reproduce the issue locally on Databricks SQL Warehouse, and it was resolved by this fix, with the audit model materialized as a table.

For good measure, I also ran this locally on SQL Server since it was mentioned in your review notes, and all looks good there!

models/fivetran_platform__audit_table.sql (outdated review comments resolved)
README.md (outdated review comments resolved)
@fivetran-joemarkiewicz
Contributor Author

Thanks @fivetran-catfritz! I agree with your suggestion and committed it to the branch. I also really appreciate your README updates 🙏. I made one small wording change to be consistent with the terminology of the runtime name (Databricks All Purpose Cluster). Lastly, I regenerated the docs so the incremental file format change is picked up in the docs.

Let me know if there are any other comments needed before approving.

Contributor

@fivetran-catfritz left a comment

Looks great--approved!

Contributor

@fivetran-jamie left a comment

left some mostly doc-related comments!

CHANGELOG.md (outdated review comments resolved)
CHANGELOG.md (outdated review comments resolved)
CHANGELOG.md (outdated review comments resolved)
README.md (outdated review comments resolved)
README.md (outdated review comments resolved)
@@ -0,0 +1,15 @@
{% macro is_databricks_sql_warehouse(target) %}
Contributor

we may want to add this to fivetran_utils in the future since Databricks SQL Warehouse folks using other packages with incremental models will have the same issue

Contributor Author

I agree. This is something we will migrate there in the future.

CHANGELOG.md (outdated review comments resolved)
Jamie review notes

Co-authored-by: Jamie Rodriguez <[email protected]>
Contributor

@fivetran-reneeli left a comment

no other comments!

@fivetran-joemarkiewicz merged commit acb8b61 into main Apr 4, 2024
10 checks passed
@fivetran-joemarkiewicz deleted the feature/databricks-sql-warehouse-compatibility branch April 4, 2024 13:50