bugfix/too-many-partitions #165

Merged
merged 3 commits into from
Aug 29, 2024

Conversation

fivetran-joemarkiewicz
Contributor

@fivetran-joemarkiewicz fivetran-joemarkiewicz commented Aug 23, 2024

PR Overview

This PR will address the following Issue/Feature: Issue #39

This PR will result in the following new package version: v0.17.0

This PR adjusts the partition granularity for all incremental models. It should only impact BigQuery users, but it still requires a full refresh, so it is treated as a breaking change.

Please provide the finalized CHANGELOG entry which details the relevant changes included in this PR:

Breaking Changes (Full refresh required after upgrading)

  • Incremental models have had the partition_by logic adjusted to include a granularity of a month. This change should only impact BigQuery warehouses and was applied to avoid the common too many partitions error some users have experienced due to over-partitioning by day. Adjusting the partition to a month granularity widens the partition windows and allows for more performant querying and incremental loads. This change was applied to the following models:
    • int_zendesk__field_calendar_spine
    • int_zendesk__field_history_pivot
    • zendesk__ticket_field_history
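
As a rough illustration of the change described above, a month-grain partition in a dbt incremental model on BigQuery might look like the sketch below. The `partition_by` keys (`field`, `data_type`, `granularity`) are dbt-bigquery configuration options; the `date_day` column, the `unique_key` value, the upstream ref, and the incremental filter are assumptions for illustration and may not match the package's actual models.

```sql
-- Sketch only: the date_day column, the unique_key value, and the upstream
-- ref are assumptions for illustration, not copied from this PR.
-- A month granularity creates one BigQuery partition per calendar month
-- instead of one per day.
{{
    config(
        materialized='incremental',
        partition_by={
            'field': 'date_day',
            'data_type': 'date',
            'granularity': 'month'
        },
        unique_key='ticket_day_id'
    )
}}

select *
from {{ ref('int_zendesk__field_history_pivot') }}

{% if is_incremental() %}
-- Reprocess only days at or after the latest day already loaded.
where date_day >= (select max(date_day) from {{ this }})
{% endif %}
```

For a sense of scale, ten years of history is roughly 3,650 daily partitions but only 120 monthly ones. Because the partition definition of an existing table changes, a full refresh is needed after upgrading.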

Under the Hood

  • Updated seed files to reflect a real world ticket field history update scenario.
  • Modified the consistency_sla_policy_count validation test to group by ticket_id for more accurate testing.

PR Checklist

Basic Validation

Please acknowledge that you have successfully performed the following commands locally:

  • dbt run --full-refresh && dbt test
  • dbt run (if incremental models are present) && dbt test

Before marking this PR as "ready for review" the following have been applied:

  • The appropriate issue has been linked, tagged, and properly assigned
  • All necessary documentation and version upgrades have been applied
  • docs were regenerated (unless this PR does not include any code or yml updates)
  • BuildKite integration tests are passing
  • Detailed validation steps have been provided below

Detailed Validation

Please share any and all of your validation steps:

For basic validation efforts, we can see that the validation tests succeed. I did need to add a ticket to the exclusion list, but PR #164 should address this issue.

image

For additional validation, I confirmed the incremental logic works by stress testing the ticket field history model and artificially limiting the calendar date on incremental runs using the seed data. See validation screenshots below:

Below is fictional ticket 11071 and the field history changes we have in the seed data. You can see this ticket had changes to the status, priority, and assignee_id fields over the course of its open lifetime.
image

Let's explore how the incremental logic holds up with the adjusted partition logic. For this test case the partition logic should only affect BigQuery users. However, I also wanted to test for Snowflake, Redshift, Postgres, and Databricks to make sure there were no unexpected changes.

For each of these warehouses, I followed the same steps:

  1. Locally filter the int_zendesk__calendar_spine model to artificially limit the data used in the ticket field history models to be on or before 2020-08-30. Execute dbt run --full-refresh and see the expected results where the assignee_id, status, and priority are changing based on the field changes pre August 30th, 2020.
    image
  2. Adjust the filter in the int_zendesk__calendar_spine model to be on or before 2020-11-01. Execute dbt run and see the expected incremental results where the assignee_id, status, and priority are changing based on the field changes pre November 1st, 2020.
    image
  3. Finally, adjust the filter in the int_zendesk__calendar_spine model to be on or before 2020-11-16. Execute dbt run and see the expected incremental results where the assignee_id, status, and priority are changing based on the field changes pre November 16th, 2020.
    image
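
The three steps above can be sketched as a temporary local filter at the end of the calendar spine model. This is only an illustrative fragment: the `spine` CTE name and the `date_day` column are assumptions, and the real model's SQL may be structured differently.

```sql
-- Hypothetical tail of int_zendesk__calendar_spine with a temporary local filter.
-- The names spine and date_day are assumed for illustration.
select *
from spine
-- Step 1: cutoff 2020-08-30, then run: dbt run --full-refresh
-- Step 2: change the cutoff to 2020-11-01, then run: dbt run
-- Step 3: change the cutoff to 2020-11-16, then run: dbt run
where date_day <= cast('2020-08-30' as date)
```

Moving the cutoff forward between runs makes each plain `dbt run` behave like an incremental load that picks up newly visible dates, including loads that cross a month boundary. The filter should be removed once validation is done.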

The above steps include incremental loads that span more than one month, which should stress test the new month-grain partition logic and confirm it doesn't result in any unexpected incremental loads. We can see that this was successful for the warehouse tests below:

✅ BigQuery

  1. image

(1/3) image
(2/3) image
(3/3) image
5. Only showing the new fields after step 2. image

I was able to verify this for all remaining supported warehouses as well. However, in an effort to not be exhaustive in this PR I have opted to keep these validations in an internal Hex notebook. As such, I am reasonably confident this update will not impact the incremental logic and will ensure more performant query times for BigQuery and address the original issue.

If you had to summarize this PR in an emoji, which would it be?

@fivetran-joemarkiewicz fivetran-joemarkiewicz marked this pull request as ready for review August 26, 2024 21:15
Contributor

@fivetran-catfritz fivetran-catfritz left a comment


Left a suggestion for the changelog, but otherwise lgtm!

CHANGELOG.md (outdated, resolved)
Co-authored-by: fivetran-catfritz <[email protected]>
@fivetran-catfritz fivetran-catfritz mentioned this pull request Aug 27, 2024
7 tasks
@fivetran-joemarkiewicz fivetran-joemarkiewicz changed the base branch from main to release/v0.17.0 August 29, 2024 18:26
@fivetran-joemarkiewicz
Contributor Author

Will merge this into the upcoming release branch to be batched with the other changes from this sprint.

@fivetran-joemarkiewicz fivetran-joemarkiewicz merged commit 031e845 into release/v0.17.0 Aug 29, 2024
8 checks passed
fivetran-catfritz added a commit that referenced this pull request Sep 4, 2024
* initial

* feature/unstructured-data

* add coalesce_cast

* update filters

* update and consolidate models

* model revisions

* restructure

* documentation

* remove extra comma

* regen docs

* formatting

* update max token docs

* Update CHANGELOG.md

* bug/missing-sla-policies

* update changelog and add integrity test

* update test

* update changelog, readme and tests

* update test

* bug/intercepted-period-joins

* adjustmnt

* update weeks

* update weeks

* add integrity test

* update weeks

* update changelog

* bugfix/too-many-partitions (#165)

* bugfix/too-many-partitions

* docs regen

* Update CHANGELOG.md

Co-authored-by: fivetran-catfritz <[email protected]>

---------

Co-authored-by: fivetran-catfritz <[email protected]>

* update changelog

* revert docs to main

* Documentation Standard Updates (#166)

* MagicBot/documentation-updates

* Apply suggestions from code review

* Update README.md

Co-authored-by: fivetran-catfritz <[email protected]>

---------

Co-authored-by: fivetran-catfritz <[email protected]>

* update default max_tokens

* update changelog

* Apply suggestions from code review

Co-authored-by: Joe Markiewicz <[email protected]>

* update readme

* regen docs

* update yml

* Apply suggestions from code review

Co-authored-by: Renee Li <[email protected]>

* add comments and update changelog

* update changelog

* Update packages.yml

---------

Co-authored-by: Renee Li <[email protected]>
Co-authored-by: Joe Markiewicz <[email protected]>
Co-authored-by: Renee Li <[email protected]>
2 participants