
Rework insert_overwrite incremental strategy #2198

Merged 4 commits into dev/barbara-gittings on Mar 18, 2020

Conversation

@jtcohen6 (Contributor)

resolves #2196

Description

Reimplement the insert_overwrite incremental strategy on BigQuery, with two options: static (user-supplied) partitions and dynamically computed partitions.

Benchmarks calculated in this sheet. TL;DR:

  • insert_overwrite with static (user-supplied) partitions is always cheapest and usually fastest
  • among the other approaches:
    • merge (the default) with a cluster key is faster and cheaper at small data volumes
    • insert_overwrite with dynamic partitions is always slowest, but its cost scales better than either merge approach at larger data volumes (> 50 GB)
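For reference, a model using the static-partitions option might be configured as follows. This is a sketch only: the model name, column names, source, and partition values are illustrative, not taken from this PR.

```sql
-- Hypothetical model file; column/source names are made up for illustration.
{% set partitions_to_replace = ['date(2020, 3, 1)', 'date(2020, 3, 2)'] %}

{{
  config(
    materialized = 'incremental',
    incremental_strategy = 'insert_overwrite',
    partition_by = {'field': 'session_date', 'data_type': 'date'},
    partitions = partitions_to_replace
  )
}}

select
    id,
    ts,
    date(ts) as session_date
from {{ ref('stg_events') }}
{% if is_incremental() %}
-- only scan and replace the user-supplied partitions
where date(ts) in ({{ partitions_to_replace | join(', ') }})
{% endif %}
```

Because the partition list is supplied statically, BigQuery can prune both the source scan and the delete/insert to exactly those partitions, which is why this option benchmarks as the cheapest above.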

Checklist

  • I have signed the CLA
  • I have run this code in development and it appears to resolve the stated issue
  • This PR includes tests, or tests are not required/relevant for this PR
  • I have updated the CHANGELOG.md and added information about my change to the "dbt next" section.

@jtcohen6 requested a review from drewbanin March 12, 2020 02:30
@cla-bot added the cla:yes label Mar 12, 2020

@drewbanin (Contributor) left a comment


Some quick comments here - let me know what you think, @jtcohen6!

@jtcohen6 requested a review from drewbanin March 16, 2020 18:04
@drewbanin (Contributor)

@jtcohen6 I played around with this some more today. Are you opposed to updating the merge statement to also filter on the source records that get inserted in the insert_overwrite flow?

Right now, I'm seeing a generated merge statement like this:

....
    when not matched by source
         and DBT_INTERNAL_DEST.id in (
              1, 2, 3
          ) 
        then delete

    when not matched then insert (`id`, `ts`)
    values (`id`, `ts`)

but we could instead make this merge statement run:

....
    when not matched by source
         and DBT_INTERNAL_DEST.id in (
              1, 2, 3
          ) 
        then delete

    when not matched
      -- this is new:
      and DBT_INTERNAL_SOURCE.id in (
              1, 2, 3
          ) 

    then insert (`id`, `ts`) values (`id`, `ts`)

This just has the benefit of avoiding a weird failure mode where the user specifies a list of partitions to overwrite, but their model's select statement returns an incongruent set of partitions.

What do you think?

@jtcohen6 (Contributor, Author) commented Mar 18, 2020

Are you opposed to updating the merge statement to also filter on the source records that get inserted in the insert_overwrite flow?

Not opposed! I just spent a few minutes trying to figure out what would happen to that incongruous data. I don't think it would go... anywhere? Better to make that explicit.

@drewbanin As for the specific implementation: given that we define DBT_INTERNAL_DEST as the alias for partition_by.render inside the bq_insert_overwrite macro, and we would now need to pass two different predicates to default__get_insert_overwrite_merge_sql... I'm not sure of the cleanest way to encode this change.

@drewbanin (Contributor)

@jtcohen6 can you make a separate issue to add that check? We should ship this PR as-is for 0.16.0, but:

  1. be really clear about usage instructions
  2. consider adding more guardrails as we get more feedback once this is live


@drewbanin (Contributor) left a comment


Ship it

@jtcohen6 (Contributor, Author)

Opened #2210

I can't merge until the Azure pipeline completes. Is it stuck, or did it ever kick off? I pushed the most recent commit two days ago.

@beckjake (Contributor)

I logged in and manually kicked it, I think it is running now.

@beckjake (Contributor)

OK, lesson learned: kicking it made GitHub act weird, and now it only wants tests on the merge commit instead of the final commit.
When the Azure tests pass, I'll just merge this manually.

@beckjake merged commit f4c6272 into dev/barbara-gittings Mar 18, 2020
@beckjake deleted the rework/incremental-overwrite branch March 18, 2020 19:11