Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/archive blocks #1361

Merged
merged 9 commits into from
Apr 26, 2019
Merged

Conversation

beckjake
Copy link
Contributor

@beckjake beckjake commented Mar 21, 2019

Fixes #1175
Fixes #251
Fixes #1333
Fixes #706
Fixes #1167
Fixes #1066

Changes in this branch:

  • implements an archive block parser
  • slightly changes how the GraphLoader works for sql nodes
  • passes flake8
  • passes at all integration tests
  • supports both old-style archive blocks in dbt_project.yml and a new archive-paths that defaults to ["archives"]. The contents of archive-paths should be .sql files that contain archive blocks
  • it seems to me like the target database/schema should be optional, and default to the connection settings. But maybe that's wrong. For now it's required, though see above bullet.
  • I guess I technically have not tested macros but since archives use macros and aren't special anymore, I am pretty confident that calling them will work!
  • Added ref'ing support for archives + tests
  • Added explicit tests of the 'archives as a query' behavior
  • Added support for model selection syntax on archives + tests
  • Added a more helpful error when an old-style existing archive table is found (valid_to instead of dbt_valid_to, etc)

On missing config parameters to a dbt archive named {% archive foo %}...:

$ cat archives/test.sql
{% archive foo %}
    {{ config(target_schema=schema, unique_key='order_id', strategy='timestamp', updated_at='order_date') }}
    select * from {{ ref('fct_orders') }} where status = 'completed'
{% endarchive %}
$
$ dbt compile
Running with dbt=0.13.0
Encountered an error:
Compilation Error in archive foo (archives/test.sql)
  Runtime Error
    Invalid arguments passed to "ParsedArchiveNode" instance: config.'target_database' is a required property

On an invalid (old archive style) target for the archive:

$ dbt archive
Running with dbt=0.13.0
Found 8 models, 20 tests, 1 archives, 0 analyses, 98 macros, 0 operations, 3 seed files, 0 sources

11:41:54 | Concurrency: 2 threads (target='default')
11:41:54 |
11:41:54 | 1 of 1 START archive foo --> postgres.dbt_postgres.foo............... [RUN]
11:41:54 | 1 of 1 ERROR archiving foo --> postgres.dbt_postgres.foo............. [ERROR in 0.14s]
11:41:55 |
11:41:55 | Finished running 1 archives in 0.53s.

Completed with 1 errors:

Compilation Error in archive foo (archives/test.sql)
  Archive target has ("scd_id", "valid_from", "valid_to") but not ("dbt_scd_id", "dbt_valid_from", "dbt_valid_to") - is it an unmigrated previous version archive?

  > in macro materialization_archive_default (macros/materializations/archive/archive.sql)
  > called by archive foo (archives/test.sql)

Done. PASS=0 ERROR=1 SKIP=0 TOTAL=1

Currently, the target_table field is not supported - instead you can use the name of your archive block or set an alias in your config() call. I think that's what we should do in general but I'm more than happy to discuss/change it.

@beckjake beckjake force-pushed the feature/archive-blocks-as-regex branch 2 times, most recently from 30466b2 to 78a260a Compare March 25, 2019 15:52
@beckjake beckjake force-pushed the feature/archive-blocks-as-regex branch 5 times, most recently from a54f6e9 to 1030c9a Compare April 9, 2019 15:47
@beckjake
Copy link
Contributor Author

beckjake commented Apr 9, 2019

/azp run

@beckjake beckjake force-pushed the feature/archive-blocks-as-regex branch 3 times, most recently from f1a5bce to 84b785c Compare April 9, 2019 19:42
@beckjake beckjake marked this pull request as ready for review April 9, 2019 19:47
@beckjake beckjake force-pushed the feature/archive-blocks-as-regex branch from 43bfb3b to cdcd654 Compare April 10, 2019 17:36
@beckjake beckjake requested a review from drewbanin April 12, 2019 14:36
@beckjake beckjake force-pushed the feature/archive-blocks-as-regex branch 2 times, most recently from f3759b4 to ff94d87 Compare April 12, 2019 16:23
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments! I plan on coming back here specifically to look at the Archive materialization implementation. Really nice stuff and glad that you snuck flake8 in here

@@ -303,3 +303,91 @@ def is_cte(self):
@property
def is_view(self):
return self.type == self.View


class Column(object):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think there's any merit to putting this in its own file?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think it really matters either way.

@@ -1,6 +1,6 @@
import copy
from collections import Mapping
from jsonschema import Draft4Validator
from jsonschema import Draft7Validator
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's this all about?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I changed this because I wanted to use some features that changed between draft 4 and draft 7 (jsonschema has its own ref mechanism involving schema IDs), but then I reverted those because it got out of hand on the contracts side. I can change it back but I figured it wouldn't hurt to leave it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also: It would be cool to annotate our contracts the "proper" jsonschema way. It will be a big PR, but then we could use refs properly. The reason I reverted it was because our current way of combining contracts (deep_merge, etc) doesn't play well with the schema id/ref model jsonschema provides, and changing all that would massively increase the size and risk of this PR, which is already kind of out of hand for my taste.

{% macro default__archive_scd_hash() %}
md5("dbt_pk" || '|' || "dbt_updated_at")
{% macro default__archive_hash_arguments(args) %}
md5({% for arg in args %}{{ arg }}{% if not loop.last %} || '|' || {% endif %}{% endfor %})
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is cool :)

Two things:

  1. || isn't implemented in BQ (and conceivably other dbs)
  2. You need to be careful when concatenating columns, as nulls will propagate. A single null column will make the whole expression null!

We'll basically want to implement the logic shown here: https://github.com/fishtown-analytics/dbt-utils/blob/master/macros/sql/surrogate_key.sql

What do you think would be a good mechanism for sharing this code between dbt-core and dbt-utils?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  1. Well, yeah, that's why it's an adapter macro and bigquery overrides it to use concat... it's basically the same as the old archive_scd_hash behavior but a bit more generic.
  2. Ok, that is a problem. We should probably just pull that dbt-utils code into dbt-core and let dbt-utils call it for backwards compatibility.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Update: actually, the only thing missing from this code is the coalesce call, and it slots in nicely to the bq/default divide. That seems a lot easier than pulling that macro and its dependencies into dbt, so I'm just going to add the coalesce call.

{#
Cross-db compatible archival implementation
#}
{% macro archive_select(source_relation, target_relation, source_columns, unique_key, updated_at) %}

{% macro archive_select_timestamp(source_sql, target_relation, source_columns, unique_key, updated_at) -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I need to come back and take a deeper look at this

{{ archive_select_generic(source_sql, target_relation, transforms, scd_hash) }}
{%- endmacro %}

{# this is gross #}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agree, let me have a think about how we can do this better....

We create the empty table if it doesn't exist, then we insert into it later in the invocation. Can we just change this to use a single create table as ( .... ) statement for the first run as we do with incremental models?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe? I assumed archives worked the way they do for a good reason, but I don't know what it is, so maybe they don't!

{%- set source_database = config.get('source_database') -%}
{%- set source_schema = config.get('source_schema') -%}
{%- set source_table = config.get('source_table') -%}
{%- set target_table = model.get('alias', model.get('name')) -%}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is clever, but i hesitate to make archivals work too much like models. We don't want them to participate in generate_schema_name, for instance. I think it might make more sense to specify the target table name here

Copy link
Contributor Author

@beckjake beckjake Apr 15, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have to think about this some more, but I think we really do want archives to look a lot more like models when possible, because that makes it easier to both implement and reason about their behavior. I do agree on using target_schema/target_database, but I think the table name matching the block name makes a lot of sense... models will do that in the future when we have model blocks.

@@ -526,7 +524,7 @@ def __init__(self, config, adapter, node, node_index, num_nodes):

def handle_exception(self, e, ctx):
if isinstance(e, dbt.exceptions.Exception):
if hasattr(e, 'node'):
if isinstance(e, dbt.exceptions.RuntimeException):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this a flake8 thing or something else?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember if it was flake8 or just hygiene but I saw an opportunity to replace a hasattr with an isinstance check, and I think it's useful to always take those if you can.


for n in nodes:
node_path, node_parsed = self.parse_sql_node(n, tags)

# Ignore disabled nodes
if not node_parsed.config['enabled']:
disabled.append(node_parsed)
results.disable(node_parsed)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is slick

Jacob Beck added 7 commits April 23, 2019 14:30
raise on non-archive during parsing
break archive materialization
tests
fix event tracking test
Fix print statements
make archives not inherit configs from models
archive now uses the name/alias properly for everything instead of target_table
skip non-archive blocks in archive parsing instead of raising
make archives ref-able
 - test for archive ref, test for archive selects
raise a more useful message on incorrect archive targets
add "--models" and "--exclude" arguments to archives
 - pass them through to selection
 - change get_fqn to take a full node object, have archives use that so selection behaves well
 - added tests

Improve error handling on invalid archive configs

Added a special archive-only node that has extra config restrictions
add tests for invalid archive config
Contracts: some anyOf shenanigans to add support for check_cols
Macros: split apart archive selection, probably too much copy+paste
Legacy: Archive configs now include a "timestamp" strategy when parsed from dbt_project.yml
Add integration tests
fix aliases test
Unquote columns in archives
handle null columns
attr -> use_profile
@beckjake beckjake force-pushed the feature/archive-blocks-as-regex branch from b00c824 to 416cc72 Compare April 23, 2019 20:49
Copy link
Contributor

@drewbanin drewbanin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM - ship it when the tests pass!

@beckjake beckjake merged commit 96cb056 into dev/wilt-chamberlain Apr 26, 2019
@beckjake beckjake deleted the feature/archive-blocks-as-regex branch May 10, 2019 20:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment