-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/archive blocks #1361
Feature/archive blocks #1361
Conversation
30466b2
to
78a260a
Compare
a54f6e9
to
1030c9a
Compare
/azp run |
f1a5bce
to
84b785c
Compare
43bfb3b
to
cdcd654
Compare
f3759b4
to
ff94d87
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some initial comments! I plan on coming back here specifically to look at the Archive materialization implementation. Really nice stuff and glad that you snuck flake8 in here
@@ -303,3 +303,91 @@ def is_cte(self): | |||
@property | |||
def is_view(self): | |||
return self.type == self.View | |||
|
|||
|
|||
class Column(object): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think there's any merit to putting this in its own file?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it really matters either way.
@@ -1,6 +1,6 @@ | |||
import copy | |||
from collections import Mapping | |||
from jsonschema import Draft4Validator | |||
from jsonschema import Draft7Validator |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's this all about?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I changed this because I wanted to use some features that changed between draft 4 and draft 7 (jsonschema has its own ref mechanism involving schema IDs), but then I reverted those because it got out of hand on the contracts side. I can change it back but I figured it wouldn't hurt to leave it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also: It would be cool to annotate our contracts the "proper" jsonschema way. It will be a big PR, but then we could use refs properly. The reason I reverted it was because our current way of combining contracts (deep_merge
, etc) doesn't play well with the schema id/ref model jsonschema provides, and changing all that would massively increase the size and risk of this PR, which is already kind of out of hand for my taste.
{% macro default__archive_scd_hash() %} | ||
md5("dbt_pk" || '|' || "dbt_updated_at") | ||
{% macro default__archive_hash_arguments(args) %} | ||
md5({% for arg in args %}{{ arg }}{% if not loop.last %} || '|' || {% endif %}{% endfor %}) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is cool :)
Two things:
||
isn't implemented in BQ (and conceivably other dbs)- You need to be careful when concatenating columns, as nulls will propagate. A single null column will make the whole expression
null
!
We'll basically want to implement the logic shown here: https://github.com/fishtown-analytics/dbt-utils/blob/master/macros/sql/surrogate_key.sql
What do you think would be a good mechanism for sharing this code between dbt-core and dbt-utils?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
- Well, yeah, that's why it's an adapter macro and bigquery overrides it to use
concat
... it's basically the same as the oldarchive_scd_hash
behavior but a bit more generic. - Ok, that is a problem. We should probably just pull that
dbt-utils
code intodbt-core
and letdbt-utils
call it for backwards compatibility.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Update: actually, the only thing missing from this code is the coalesce
call, and it slots in nicely to the bq/default divide. That seems a lot easier than pulling that macro and its dependencies into dbt, so I'm just going to add the coalesce call.
{# | ||
Cross-db compatible archival implementation | ||
#} | ||
{% macro archive_select(source_relation, target_relation, source_columns, unique_key, updated_at) %} | ||
|
||
{% macro archive_select_timestamp(source_sql, target_relation, source_columns, unique_key, updated_at) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I need to come back and take a deeper look at this
{{ archive_select_generic(source_sql, target_relation, transforms, scd_hash) }} | ||
{%- endmacro %} | ||
|
||
{# this is gross #} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, let me have a think about how we can do this better....
We create the empty table if it doesn't exist, then we insert into it later in the invocation. Can we just change this to use a single create table as ( .... )
statement for the first run as we do with incremental models?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe? I assumed archives worked the way they do for a good reason, but I don't know what it is, so maybe they don't!
{%- set source_database = config.get('source_database') -%} | ||
{%- set source_schema = config.get('source_schema') -%} | ||
{%- set source_table = config.get('source_table') -%} | ||
{%- set target_table = model.get('alias', model.get('name')) -%} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is clever, but i hesitate to make archivals work too much like models. We don't want them to participate in generate_schema_name
, for instance. I think it might make more sense to specify the target table name here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have to think about this some more, but I think we really do want archives to look a lot more like models when possible, because that makes it easier to both implement and reason about their behavior. I do agree on using target_schema
/target_database
, but I think the table name matching the block name makes a lot of sense... models will do that in the future when we have model blocks.
@@ -526,7 +524,7 @@ def __init__(self, config, adapter, node, node_index, num_nodes): | |||
|
|||
def handle_exception(self, e, ctx): | |||
if isinstance(e, dbt.exceptions.Exception): | |||
if hasattr(e, 'node'): | |||
if isinstance(e, dbt.exceptions.RuntimeException): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is this a flake8 thing or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember if it was flake8 or just hygiene but I saw an opportunity to replace a hasattr
with an isinstance
check, and I think it's useful to always take those if you can.
|
||
for n in nodes: | ||
node_path, node_parsed = self.parse_sql_node(n, tags) | ||
|
||
# Ignore disabled nodes | ||
if not node_parsed.config['enabled']: | ||
disabled.append(node_parsed) | ||
results.disable(node_parsed) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is slick
raise on non-archive during parsing break archive materialization
tests fix event tracking test Fix print statements make archives not inherit configs from models archive now uses the name/alias properly for everything instead of target_table skip non-archive blocks in archive parsing instead of raising make archives ref-able - test for archive ref, test for archive selects raise a more useful message on incorrect archive targets add "--models" and "--exclude" arguments to archives - pass them through to selection - change get_fqn to take a full node object, have archives use that so selection behaves well - added tests Improve error handling on invalid archive configs Added a special archive-only node that has extra config restrictions add tests for invalid archive config
Contracts: some anyOf shenanigans to add support for check_cols Macros: split apart archive selection, probably too much copy+paste Legacy: Archive configs now include a "timestamp" strategy when parsed from dbt_project.yml Add integration tests fix aliases test Unquote columns in archives handle null columns attr -> use_profile
b00c824
to
416cc72
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM - ship it when the tests pass!
Fixes #1175
Fixes #251
Fixes #1333
Fixes #706
Fixes #1167
Fixes #1066
Changes in this branch:
archive-paths
that defaults to["archives"]
. The contents ofarchive-paths
should be.sql
files that containarchive
blocksvalid_to
instead ofdbt_valid_to
, etc)On missing config parameters to a dbt archive named
{% archive foo %}...
:On an invalid (old archive style) target for the archive:
Currently, the
target_table
field is not supported - instead you can use the name of your archive block or set an alias in your config() call. I think that's what we should do in general but I'm more than happy to discuss/change it.