This release adds support for auto-generated dbt documentation, adds a new syntax for schema.yml
files, and fixes a number of minor bugs. With the exception of planned changes to Snowflake's default quoting strategy, this release should not contain any breaking changes. Check out the blog post for more information about this release.
- Add autogenerated dbt project documentation (docs) (#375, #863, #941, #815)
- Version 2 of schema.yml, which allows users to create table and column comments that end up in the manifest (docs) (#880)
- Extend catalog and manifest to also support Snowflake, BigQuery, and Redshift, in addition to existing Postgres support (#866, #857, #849)
- Add a 'generated_at' field to both the manifest and the catalog. (#887)
- Add
docs
blocks that users can put into.md
files anddoc()
value for schema v2 description fields (#888) - Write out a 'run_results.json' after dbt invocations. (#904)
- Type inference for interpreting CSV data is now less aggressive (#905)
- Remove distinction between
this.table
andthis.schema
by refactoring materialization SQL (#940)
- Fix for identifier clashes in BigQuery merge statements (#914)
- Fix for unneccessary downloads of
bumpversion.cfg
, handle failures gracefully (#907) - Fix for incompatible
boto3
requirements (#959) - Fix for invalid
relationships
test when the parent column contains null values (#921)
Thanks for contributing!
- @rsmichaeldunn (#799)
- @lewish (#915)
- @MartinLue (#872)
This release makes it possible to alias relation names, rounds out support for BigQuery with incremental, archival, and hook support, adds the IAM Auth method for Redshift, and builds the foundation for autogenerated dbt project documentation, to come in the next release.
Additionally, a number of bugs have been fixed including intermittent BigQuery 404 errors, Redshift "table dropped by concurrent query" errors, and a probable fix for Redshift connection timeout issues.
We want to extend a big thank you to our outside contributors for this release! You all are amazing.
- @danielchalef (#818)
- @mjumbewu (#796)
- @abelsonlive (#800)
- @jon-rtr (#800)
- @mturzanska (#797)
- @cpdean (#780)
- BigQuery
- Redshift: IAM Auth (#818) (docs)
- Model aliases (#800)(docs)
- Write JSON manifest file to disk during compilation (#761)
- Add forward and backward graph edges to the JSON manifest file (#762)
- Add a 'dbt docs generate' command to generate a JSON catalog file (#774, #808)
- BigQuery: fix concurrent relation loads (#835)
- BigQuery: support external relations (#828)
- Redshift: set TCP keepalive on connections (#826)
- Redshift: fix "table dropped by concurrent query" (#825)
- Fix the error handling for profiles.yml validation (#820)
- Make the
--threads
parameter actually change the number of threads used (#819) - Ensure that numeric precision of a column is not
None
(#796) - Allow for more complex version comparison (#797)
- Use a subselect instead of CTE when building incremental models (#787)
- Internals
- Improved dependency selection, rip out some unused dependencies (#848)
- Stop tracking
run_error
in tracking code (#817) - Use Mapping instead of dict as the base class for APIObject (#756)
- Split out parsers (#809)
- Fix
__all__
parameter in submodules (#780) - Switch to CircleCI 2.0 (#843, #850)
- Added tox environments that have the user specify what tests should be run (#837)
This release focuses on achieving functional parity between all of dbt's adapters. With this release, most dbt functionality should work on every adapter except where noted here.
- Configure model schema and name quoting in your
dbt_project.yml
file (Docs) - Add a
Relation
object to the context to simplify model quoting Docs - Implement BigQuery materializations using new
create table as (...)
syntax, supportpartition by
clause (Docs) - Override seed column types (Docs)
- Add
get_columns_in_table
context function for BigQuery (Docs)
- Consistent schema and identifier quoting (#727)
- Use the new
create table as (...)
syntax on BigQuery (#717)- Support
partition by
clause
- Support
- CSV Updates:
- Improve
get_columns_in_table
context function (#709)- Support numeric types on Redshift, Postgres
- Support BigQuery (including nested columns in
struct
types) - Support cross-database
information_schema
queries for Snowflake - Retain column ordinal positions
- Fix for incorrect var precendence when using
--vars
on the CLI (#739) - Fix for closed connections in
on-run-end
hooks for long-running dbt invocations (#693) - Fix: don't try to run empty hooks (#620, #693)
- Fix: Prevent seed data from being serialized into
graph.gpickle
file (#720) - Fix: Disallow seed and model files with the same name (#737)
This release overhauls dbt's package management functionality, makes seeding csv files work across all adapters, and adds date partitioning support for BigQuery.
- Check out full installation and upgrading instructions here
- Transition the
repositories:
section of yourdbt_project.yml
file to apackages.yml
file as described here - You may need to clear out your
dbt_modules
directory if you use packages like dbt-utils. Depending how your project is configured, you can do this by runningdbt clean
. - We're using a new CSV parsing library,
agate
, so be sure to check that all of your seed tables are parsed as you would expect!
- Support for variables defined on the CLI with
--vars
(#640) (docs) - Improvements to
dbt seed
(docs)- Support seeding csv files on all adapters (#618)
- Make seed csv's
ref()
-able in models (#668) - Support seed file configuration (custom schemas, enabled / disabled) in the
dbt_project.yml
file (#561) - Support
--full-refresh
instead of--drop-existing
(deprecated) for seed files (#515) - Add
--show
argument todbt seed
to display a sample of data in the CLI (#74)
- Improvements to package management (docs)
- Support date partitioning in BigQuery (#641) (docs)
- Move schema creation to after
on-run-start
hooks (#652) - Replace
csvkit
dependency withagate
(#598) - Switch snowplow endpoint to pipe directly to Fishtown Analytics (#682)
- Throw a compilation exception if a required test macro is not present in the context (#655)
- Make the
adapter_macro
use thereturn()
function (#635) - Fix bug for introspective query on late binding views (redshift) (#647)
- Disable any non-dbt log output on the CLI (#663)
This release fixes bugs and adds supports for late binding views on Redshift.
- Support late binding views on Redshift (#614) (docs)
- Make
run_started_at
timezone-aware (#553) (Contributed by @mturzanska) (docs)
- Include hook run time in reported model run time (#607)
- Add warning for missing test constraints (#600)
- Fix for schema tests used or defined in packages (#599)
- Run hooks in defined order (#601)
- Skip tests that depend on nonexistent models (#617)
- Fix for
adapter_macro
called within a package (#630)
This release focuses on improvements to macros, materializations, and package management. Check out the blog post to learn more about what's possible in this new version of dbt.
Full installation instructions for macOS, Windows, and Linux can be found here. If you use Windows or Linux, installation works the same as with previous versions of dbt. If you use macOS and Homebrew to install dbt, note that installation instructions have changed:
brew update
brew tap fishtown-analytics/dbt
brew install dbt
- More powerful macros and materializations
- Custom model schemas
- BigQuery improvements
- Bugfixes
- Documentation (0.9.0 docs can be found here)
adapter
functions must be namespaced to theadapter
context variable. To fix this error, useadapter.already_exists
instead of justalready_exists
, or similar for other adapter functions.
- use
adapter
,ref
, andvar
inside of macros (#466) - Build custom tests and materializations in dbt packages (#466)
- Support pre- and post- hooks that run outside of a transaction (#510)
- Support table materializations for BigQuery (#507)
- Support querying external data sources in BigQuery (#507)
- Override which schema models are materialized in (#522) (docs)
- Make
{{ ref(...) }}
return the same type of object as{{ this }}
(#530) - Replace schema test CTEs with subqueries to speed them up for Postgres (#536) (@ronnyli)
- Bump Snowflake dependency, remove pyasn1 (#570)
- Document how to create a package
- Document how to make a materialization
- Document how to make custom schema tests
- Document how to use hooks to vacuum
- Document all context variables
- Bump Snowflake dependency, remove pyasn1 (#570)
- Fix for federated queries on BigQuery with Service Account json credentials (#547)
- Bugfixes
- Faster schema tests on Postgres
- Fix for broken environment variables
- Fix broken integration tests (#539)
- Fix for
--non-destructive
on views (#539) - Fix for package models materialized in the wrong schema (#538)
- Fix for broken environment variables (#543)
- Custom model schemas
- BigQuery updates
ref
improvements
- Parity for
statement
interface on BigQuery (#526)
- Override which schema models are materialized in (#522) (docs)
- Make
{{ ref(...) }}
return the same type of object as{{ this }}
(#530)
- More powerful macros
- BigQuery improvements
- Bugfixes
- Documentation (0.9.0 docs can be found here)
dbt 0.9.0 Alpha 1 introduces a number of new features intended to help dbt-ers write flexible, reusable code. The majority of these changes involve the macro
and materialization
Jinja blocks. As this is an alpha release, there may exist bugs or incompatibilites, particularly surrounding these two blocks. A list of known breaking changes is provided below. If you find new bugs, or have questions about dbt 0.9.0, please don't hesitate to reach out in slack or open a new issue.
This will manifest as a compilation error that looks like:
Compilation Error in model {your_model} (models/path/to/your_model.sql)
'already_exists' is undefined
To fix this error, use adapter.already_exists
instead of just already_exists
, or similar for other adapter functions.
- use
adapter
,ref
, andvar
inside of macros (#466) - Build custom tests and materializations in dbt packages (#466)
- Support pre- and post- hooks that run outside of a transaction (#510)
- Support table materializations for BigQuery (#507)
- Support querying external data sources in BigQuery (#507)
- Document how to create a package
- Document how to make a materialization
- Document how to make custom schema tests
- Add suppport for Google BigQuery
- Significant exit codes
- Load credentials from environment variables
- Fix errant warning for
dbt archive
commands (#476) - Show error (instead of backtrace) for failed hook statements (#478)
dbt init
no longer leaves the repo in an invalid state (#487)- Fix bug which ignored git tag specs for package repos (#463)
- Support BigQuery as a target (#437) (#438)
- Make dbt exit codes significant (0 = success, 1/2 = error) (#297)
- Add context function to pull in environment variables (#450)
- Document target configuration for BigQuery here
- Document dbt exit codes here
- Document environment variable usage here
- UI/UX improvements (colorized output, failures summary, better error messages)
- Cancel running queries on ctrl+c
- Bugfixes
- Docs
- Fix bug for interleaved sort keys on Redshift (#430)
- Don't try to create schema if it already exists (#446)
- Summarize failures for dbt invocations (#443)
- Colorized dbt output (#441)
- Cancel running queries on ctrl-c (#444)
- Better error messages for common failure modes (#445)
- Upgrade dependencies (#431)
- Improvements to
dbt init
and first time dbt usage experience (#439)
- Document full-refresh requirements for incremental models (#417)
- Document archival (#433)
- Document the two-version variant of
ref
(#432)
- Bugfixes
- Reintroduce
compile
command - Moved docs to readme.io
- Fix bug preventing overriding a disabled package model in the current project (#391)
- Fix bug which prevented multiple sort keys (provided as an array) on Redshift (#397)
- Fix race condition while compiling schema tests in an empty
target
directory (#398)
- Bugfixes
- True concurrency
- More control over "advanced" incremental model configurations more info
- Fix ephemeral load order bug (#292, #285)
- Support composite unique key in archivals (#324)
- Fix target paths (#331, #329)
- Ignore commented-out schema tests (#330, #328)
- Fix run levels (#343, #340, #338)
- Fix concurrency, open a unique transaction per model (#345, #336)
- Handle concurrent
DROP ... CASCADE
s in Redshift (#349) - Always release connections (use
try .. finally
) (#354)
- Changed: different syntax for "relationships" schema tests (#339)
- Added:
already_exists
context function (#372) - Graph refactor: fix common issues with load order (#292)
- Graph refactor: multiple references to an ephemeral models should share a CTE (#316)
- Graph refactor: macros in flat graph (#332)
- Refactor: factor out jinja interactions (#309)
- Speedup: detect cycles at the end of compilation (#307)
- Speedup: write graph file with gpickle instead of yaml (#306)
- Clone dependencies with
--depth 1
to make them more compact (#277, #342) - Rewrite materializations as macros (#356)
- Improved graph selection
- A new home for dbt
- Snowflake improvements
- improved graph selection for
dbt run
anddbt test
(more information) (#279) - profiles.yml now supports Snowflake
role
as an option (#291)
In v0.7.1, dbt was moved from the analyst-collective org to the fishtown-analytics org (#300)
- nicer error if
run-target
was not changed totarget
during upgrade to dbt>=0.7.0
- Snowflake Support
- Deprecations
dbt now supports Snowflake as a target in addition to Postgres and Redshift! All dbt functionality is supported in this new warehouse. There is a sample snowflake profile in sample.profiles.yml -- you can start using it right away.
There are a few deprecations in 0.7:
run-target
in profiles.yml is no longer supported. Usetarget
instead.- Project names (
name
in dbt_project.yml) can now only contain letters, numbers, and underscores, and must start with a letter. Previously they could contain any character. --dry-run
is no longer supported.
- use adapter for sort/dist (#274)
- fixed a typo in the docs related to post-run hooks (#271)
- refactored tracking code to refresh invocation id in a multi-run context (#273)
- added unit tests for the graph (#270)
- condense error output when
--debug
is not set (#265)
- respect
config
options in profiles.yml (#255) - use correct
on-run-end
option for post-run hooks (#261)
- add
--debug
flag, replace calls toprint()
with a global logger (#256) - add pep8 check to continuous integration tests and bring codebase into compliance (#257)
- Macros
- More control over how models are materialized
- Minor improvements
- Bugfixes
- Connor McArthur
Macros are snippets of SQL that can be called like functions in models. Macros make it possible to re-use SQL between models in keeping with the engineering principle of DRY (Dont Repeat Yourself). Moreover, packages can expose Macros that you can use in your own dbt project.
For detailed information on how to use Macros, check out the pull request here
DBT Version 0.6.0 introduces two new ways to control the materialization of models:
Non-destructive dbt run more info
If you provide the --non-destructive
argument to dbt run
, dbt will minimize the amount of time during which your models are unavailable. Specfically, dbt
will
- Ignore models materialized as
views
- Truncate tables and re-insert data instead of dropping and re-creating
This flag is useful for recurring jobs which only need to update table models and incremental models.
dbt run --non-destructive
Incremental Model Full Refresh more info
If you provide the --full-refresh
argument to dbt run
, dbt will treat incremental models as table models. This is useful when
- An incremental model schema changes and you need to recreate the table accordingly
- You want to reprocess the entirety of the incremental model because of new logic in the model code
dbt run --full-refresh
Note that --full-refresh
and --non-destructive
can be used together!
For more information, run
dbt run --help
Minor improvements more info
Add a {{ target }}
variable to the dbt runtime more info
Use {{ target }}
to interpolate profile variables into your model definitions. For example:
-- only use the last week of data in development
select * from events
{% if target.name == 'dev' %}
where created_at > getdate() - interval '1 week'
{% endif %}
User-specified profiles.yml
dir more info
DBT looks for a file called profiles.yml
in the ~/.dbt/
directory. You can now overide this directory with
$ dbt run --profiles-dir /path/to/my/dir
Add timestamp to console output more info
Informative and pretty
Run dbt from subdirectory of project root more info
A story in three parts:
cd models/snowplow/sessions
vim sessions.sql
dbt run # it works!
Pre and post run hooks more info
# dbt_project.yml
name: ...
version: ...
...
# supply either a string, or a list of strings
on-run-start: "create table public.cool_table (id int)"
on-run-end:
- insert into public.cool_table (id) values (1), (2), (3)
- insert into public.cool_table (id) values (4), (5), (6)
We fixed 10 bugs in this release! See the full list here
- added support for custom SQL data tests
- SQL returns 0 results --> pass
- SQL returns > 0 results --> fail
- dbt-core integration tests
- running in Continuous Integration environments
- with code coverage
Schema tests have proven to be an essential part of a modern analytical workflow. These schema tests validate basic constraints about your data. Namely: not null, unique, accepted value, and foreign key relationship properties can be asserted using schema tests.
With dbt v0.5.4, you can now write your own custom "data tests". These data tests are SQL SELECT statements that return 0 rows on success, or > 0 rows on failure. A typical data test might look like:
-- tests/assert_less_than_5_pct_event_cookie_ids_are_null.sql
-- If >= 5% of cookie_ids are null, then the test returns 1 row (failure).
-- If < 5% of cookie_ids are null, then the test returns 0 rows (success)
with calc as (
select
sum(case when cookie_id is null then 1 else 0 end)::float / count(*)::float as fraction
from {{ ref('events') }}
)
select * from calc where fraction < 0.05
To enable data tests, add the test-paths
config to your dbt_project.yml
file:
name: 'Vandelay Industries`
version: '1.0'
source-paths: ["models"]
target-path: "target"
test-paths: ["tests"] # look for *.sql files in the "tests" directory
....
Any .sql
file found in the test-paths
director(y|ies) will be evaluated as data tests. These tests can be run with:
dbt test # run schema + data tests
dbt test --schema # run only schema tests
dbt test --data # run only data tests
dbt test --data --schema # run schema + data tests
# For more information, try
dbt test -h
With the dbt 0.5.4 release, dbt now features a robust integration test suite. These integration tests will help mitigate the risk of software regressions, and in so doing, will help us develop dbt more quickly. You can check out the tests here, and the test results here (linux/osx) and here (windows).
You can check out the DBT roadmap here. In the next few weeks, we'll be working on bugfixes, minor features, improved macro support, and expanded control over runtime materialization configs.
As always, feel free to reach out to us on Slack with any questions or comments!
Bugfix release.
Fixes regressions introduced in 0.5.1 and 0.5.2.
Incremental models were broken by the new column expansion feature. Column expansion is implemented as
alter table ... add column tmp_col varchar({new_size});
update ... set tmp_col = existing_col
alter table ... drop column existing_col
alter table ... rename tmp_col to existing_col
This has the side-effect of moving the existing_col
to the "end" of the table. When an incremental model tries to
insert into {table} (
select * from tmp_table
)
suddenly the columns in {table}
are incongruent with the columns in tmp_table
. This insert subsequently fails.
The fix for this issue is twofold:
- If the incremental model table DOES NOT already exist, avoid inserts altogether. Instead, run a
create table as (...)
statement - If the incremental model table DOES already exist, query for the columns in the existing table and use those to build the insert statement, eg:
insert into "dbt_dbanin"."sessions" ("session_end_tstamp", "session_start_tstamp", ...)
(
select "session_end_tstamp", "session_start_tstamp", ...
from "sessions__dbt_incremental_tmp"
);
In this way, the source and destination columns are guaranteed to be in the same order!
We attempted to refactor the way profiles work in dbt. Previously, a default user
profile was loaded, and the profiles specified in dbt_project.yml
or on the command line (with --profile
) would be applied on top of the user
config. This implementation is some of the earliest code that was committed to dbt.
As dbt
has grown, we found this implementation to be a little unwieldy and hard to maintain. The 0.5.2 release made it so that only one profile could be loaded at a time. This profile needed to be specified in either dbt_project.yml
or on the command line with --profile
. A bug was errantly introduced during this change which broke the handling of dependency projects.
The additions of automated testing and a more comprehensive manual testing process will go a long way to ensuring the future stability of dbt. We're going to get started on these tasks soon, and you can follow our progress here: https://github.com/fishtown-analytics/dbt/milestone/16 .
As always, feel free to reach out to us on Slack with any questions or concerns:
Patch release fixing a bug that arises when profiles are overridden on the command line with the --profile
flag.
See https://github.com/fishtown-analytics/dbt/releases/tag/v0.5.1
- Raiders of the Lost Archive -- version your raw data to make historical queries more accurate
- Column type resolution for incremental models (no more
Value too long for character type
errors) - Postgres support
- Top-level configs applied to your project + all dependencies
- --threads CLI option + better multithreaded output
1. Source table archival #183
Commonly, analysts need to "look back in time" at some previous state of data in their mutable tables. Imagine a users
table which is synced to your data warehouse from a production database. This users
table is a representation of what your users look like now. Consider what happens if you need to look at revenue by city for each of your users trended over time. Specifically, what happens if a user moved from, say, Philadelphia to New York? To do this correctly, you need to archive snapshots of the users
table on a recurring basis. With this release, dbt now provides an easy mechanism to store such snapshots.
To use this new feature, declare the tables you want to archive in your dbt_project.yml
file:
archive:
- source_schema: synced_production_data # schema to look for tables in (declared below)
target_schema: dbt_archive # where to archive the data to
tables: # list of tables to archive
- source_table: users # table to archive
target_table: users_archived # table to insert archived data into
updated_at: updated_at # used to determine when data has changed
unique_key: id # used to generate archival query
- source_table: some_other_table
target_table: some_other_table_archive
updated_at: "updatedAt"
unique_key: "expressions || work || LOWER(too)"
- source_schema: some_other_schema
....
The archived tables will mirror the schema of the source tables they're generated from. In addition, three fields are added to the archive table:
valid_from
: The timestamp when this archived row was inserted (and first considered valid)valid_to
: The timestamp when this archived row became invalidated. The first archived record for a givenunique_key
hasvalid_to = NULL
. When newer data is archived for thatunique_key
, thevalid_to
field of the old record is set to thevalid_from
field of the new record!scd_id
: A unique key generated for each archive record. Scd = Slowly Changing Dimension.
dbt models can be built on top of these archived tables. The most recent record for a given unique_key
is the one where valid_to
is null
.
To run this archive process, use the command dbt archive
. After testing and confirming that the archival works, you should schedule this process through cron (or similar).
2. Incremental column expansion #175
Incremental tables are a powerful dbt feature, but there was at least one edge case which makes working with them difficult. During the first run of an incremental model, Redshift will infer a type for every column in the table. Subsequent runs can insert new data which does not conform to the expected type. One example is a varchar(16)
field which is inserted into a varchar(8)
field.
In practice, this error looks like:
Value too long for character type
DETAIL:
-----------------------------------------------
error: Value too long for character type
code: 8001
context: Value too long for type character varying(8)
query: 3743263
location: funcs_string.hpp:392
process: query4_35 [pid=18194]
-----------------------------------------------
With this release, dbt will detect when column types are incongruent and will attempt to reconcile these different types if possible. Specifically, dbt will alter the incremental model table schema from character varying(x)
to character varying(y)
for some y > x
. This should drastically reduce the occurrence of this class of error.
3. First-class Postgres support #183
With this release, Postgres became a first-class dbt target. You can configure a postgres database target in your ~/.dbt/profiles.yml
file:
warehouse:
outputs:
dev:
type: postgres # configure a target for Postgres
host: localhost
user: Drew
....
run-target: dev
While Redshift is built on top of Postgres, the two are subtly different. For instance, Redshift supports sort and dist keys, while Postgres does not! dbt will use the database target type
parameter to generate the appropriate SQL for the target database.
4. Root-level configs #161
Configurations in dbt_project.yml
can now be declared at the models:
level. These configurations will apply to the primary project, as well as any dependency projects. This feature is particularly useful for setting pre- or post- hooks that run for every model. In practice, this looks like:
name: 'My DBT Project'
models:
post-hook:
- "grant select on {{this}} to looker_user" # Applied to 'My DBT Project' and 'Snowplow' dependency
'My DBT Project':
enabled: true
'Snowplow':
enabled: true
5. --threads CLI option #143
The number of threads that DBT uses can now be overridden with a CLI argument. The number of threads used must be between 1 and 8.
dbt run --threads 1 # fine
# or
dbt run --threads 4 # great
# or
dbt run --threads 42 # too many!
In addition to this new CLI argument, the output from multi-threaded dbt runs should be a little more orderly now. Models won't show as START
ed until they're actually queued to run. Previously, the output here was a little confusing. Happy threading!
To upgrade to version 0.5.1 of dbt, run:
pip install --upgrade dbt
- Join us on slack with questions or comments
Made with
- use a temp table when executing incremental models
- arbitrary configuration (using config variables)
- specify branches for dependencies
- more & better docs
1. new incremental model generation #138
In previous versions of dbt, an edge case existed which caused the sql_where
query to select different rows in the delete
and insert
steps. As a result, it was possible to construct incremental models which would insert duplicate records into the specified table. With this release, DBT uses a temp table which will 1) circumvent this issue and 2) improve query performance. For more information, check out the GitHub issue: #138
2. Arbitrary configuration #146
Configuration in dbt is incredibly powerful: it is what allows models to change their behavior without changing their code. Previously, all configuration was done using built-in parameters, but that actually limits the user in the power of configuration.
With this release, you can inject variables from dbt_project.yml
into your top-level and dependency models. In practice, variables work like this:
# dbt_project.yml
models:
my_project:
vars:
exclude_ip: '192.168.1.1'
-- filtered_events.sql
-- source code
select * from public.events where ip_address != '{{ var("exclude_ip") }}'
-- compiles to
select * from public.events where ip_address != '192.168.1.1'
The vars
parameter in dbt_project.yml
is compiled, so you can use jinja templating there as well! The primary use case for this is specifying "input" models to a dependency.
Previously, dependencies used ref(...)
to select from a project's base models. That interface was brittle, and the idea that dependency code had unbridled access to all of your top-level models made us a little uneasy. As of this release, we're deprecating the ability for dependencies to ref(...)
top-level models. Instead, the recommended way for this to work is with vars! An example:
-- dbt_modules/snowplow/models/events.sql
select * from {{ var('snowplow_events_table') }}
and
models:
Snowplow:
vars:
snowplow_events_table: "{{ ref('base_events') }}"
This effectively mirrors the previous behavior, but it much more explicit about what's happening under the hood!
3. specify a dependency branch #165
With this release, you can point DBT to a specific branch of a dependency repo. The syntax looks like this:
repositories:
- https://github.com/fishtown-analytics/dbt-audit.git@development # use the "development" branch
Check em out! And let us know if there's anything you think we can improve upon!
To upgrade to version 0.5.0 of dbt, run:
pip install --upgrade dbt
--version
command- pre- and post- run hooks
- windows support
- event tracking
1. --version #135
The --version
command was added to help aid debugging. Further, organizations can use it to ensure that everyone in their org is up-to-date with dbt.
$ dbt --version
installed version: 0.4.7
latest version: 0.4.7
Up to date!
2. pre-and-post-hooks #147
With this release, you can now specify pre-
and post-
hooks that are run before and after a model is run, respectively. Hooks are useful for running grant
statements, inserting a log of runs into an audit table, and more! Here's an example of a grant statement implemented using a post-hook:
models:
my_project:
post-hook: "grant select on table {{this}} to looker_user"
my_model:
materialized: view
some_model:
materialized: table
post-hook: "insert into my_audit_table (model_name, run_at) values ({{this.name}}, getdate())"
Hooks are recursively appended, so the my_model
model will only receive the grant select...
hook, whereas the some_model
model will receive both the grant select...
and insert into...
hooks.
Finally, note that the grant
statement uses the (hopefully familiar) {{this}}
syntax whereas the insert
statement uses the {{this.name}}
syntax. When DBT creates a model:
- A temp table is created
- The original model is dropped
- The temp table is renamed to the final model name
DBT will intelligently uses the right table/view name when you invoke {{this}}
, but you have a couple of more specific options available if you need them:
{{this}} : "schema"."table__dbt_tmp"
{{this.schema}}: "schema"
{{this.table}}: "table__dbt_tmp"
{{this.name}}: "table"
3. Event tracking #89
We want to build the best version of DBT possible, and a crucial part of that is understanding how users work with DBT. To this end, we've added some really simple event tracking to DBT (using Snowplow). We do not track credentials, model contents or model names (we consider these private, and frankly none of our business). This release includes basic event tracking that reports 1) when dbt is invoked 2) when models are run, and 3) basic platform information (OS + python version). The schemas for these events can be seen here
You can opt out of event tracking at any time by adding the following to the top of you ~/.dbt/profiles.yml
file:
config:
send_anonymous_usage_stats: False
4. Windows support #154
dbt v0.4.1 provides improvements to incremental models, performance improvements, and ssh support for db connections.
- slightly modified dbt command structure
unique_key
setting for incremental models- connect to your db over ssh
- no more model-defaults
- multithreaded schema tests
If you encounter an SSL/cryptography error while upgrading to this version of dbt, check that your version of pip is up-to-date
pip install -U pip
pip install -U dbt
1. new dbt command structure #109
# To run models
dbt run # same as before
# to dry-run models
dbt run --dry # previously dbt test
# to run schema tests
dbt test # previously dbt test --validate
2. Incremental model improvements #101
Previously, dbt calculated "new" incremental records to insert by querying for rows which matched some sql_where
condition defined in the model configuration. This works really well for atomic datasets like a clickstream event log -- once inserted, these records will never change. Other datasets, like a sessions table comprised of many pageviews for many users, can change over time. Consider the following scenario:
User 1 Session 1 Event 1 @ 12:00 User 1 Session 1 Event 2 @ 12:01 -- dbt run -- User 1 Session 1 Event 3 @ 12:02
In this scenario, there are two possible outcomes depending on the sql_where
chosen: 1) Event 3 does not get included in the Session 1 record for User 1 (bad), or 2) Session 1 is duplicated in the sessions table (bad). Both of these outcomes are inadequate!
With this release, you can now add a unique_key
expression to an incremental model config. Records matching the unique_key
will be delete
d from the incremental table, then insert
ed as usual. This makes it possible to maintain data accuracy without recalculating the entire table on every run.
The unique_key
can be any expression which uniquely defines the row, eg:
sessions:
materialized: incremental
sql_where: "session_end_tstamp > (select max(session_end_tstamp) from {{this}})"
unique_key: user_id || session_index
3. Run schema validations concurrently #100
The threads
run-target config now applies to schema validations too. Try it with dbt test
4. Connect to database over ssh #93
Add an ssh-host
parameter to a run-target to connect to a database over ssh. The ssh-host
parameter should be the name of a Host
in your ~/.ssh/config
file more info
warehouse:
outputs:
dev:
type: redshift
host: my-redshift.amazonaws.com
port: 5439
user: my-user
pass: my-pass
dbname: my-db
schema: dbt_dbanin
threads: 8
ssh-host: ssh-host-name # <------ Add this line
run-target: dev
Remove the model-defaults config #111
The model-defaults
config doesn't make sense in a dbt world with dependencies. To apply default configs to your package, add the configs immediately under the package definition:
models:
My_Package:
enabled: true
materialized: table
snowplow:
...
dbt v0.4.0 provides new ways to materialize models in your database.
- new types of materializations:
incremental
andephemeral
- if upgrading, change
materialized: true|false
tomaterialized: table|view|incremental|ephemeral
- optionally specify model configs within the SQL file
1. Feature: {{this}}
template variable #81
The {{this}}
template variable expands to the name of the model being compiled. For example:
-- my_model.sql
select 'the fully qualified name of this model is {{ this }}'
-- compiles to
select 'the fully qualified name of this model is "the_schema"."my_model"'
2. Feature: materialized: incremental
#90
After initially creating a table, incremental models will insert
new records into the table on subsequent runs. This drastically speeds up execution time for large, append-only datasets.
Each execution of dbt run will:
- create the model table if it doesn't exist
- insert new records into the table
New records are identified by a sql_where
model configuration option. In practice, this looks like:
sessions:
materialized: incremental
sql_where: "session_start_time > (select max(session_start_time) from {{this}})"
There are a couple of new things here. Previously, materialized
could either be set to true
or false
. Now, the valid options include view
, table,
incremental
, and ephemeral
(more on this last one below). Also note that incremental models generally require use of the {{this}} template variable to identify new records.
The sql_where
field is supplied as a where
condition on a subquery containing the model definition. This resultset is then inserted into the target model. This looks something like:
insert into schema.model (
select * from (
-- compiled model definition
) where {{sql_where}}
)
3. Feature: materialized: ephemeral
#78
Ephemeral models are injected as CTEs (with
statements) into any model that ref
erences them. Ephemeral models are part of the dependency graph and generally function like any other model, except ephemeral models are not compiled to their own files or directly created in the database. This is useful for intermediary models which are shared by other downstream models, but shouldn't be queried directly from outside of dbt.
To make a model ephemeral:
employees:
materialized: ephemeral
Suppose you wanted to exclude employees
from your users
table, but you don't want to clutter your analytics schema with an employees
table.
-- employees.sql
select * from public.employees where is_deleted = false
-- users.sql
select *
from {{ref('users')}}
where email not in (select email from {{ref('employees')}})
The compiled SQL would look something like:
with __dbt__CTE__employees as (
select * from public.employees where is_deleted = false
)
select *
from users
where email not in (select email from __dbt__CTE__employees)
Ephemeral models play nice with other ephemeral models, incremental models, and regular table/view models. Feel free to mix and match different materialization options to optimize for performance and simplicity.
4. Feature: In-model configs #88
Configurations can now be specified directly inside of models. These in-model configs work exactly the same as configs inside of the dbt_project.yml file.
An in-model-config looks like this:
-- users.sql
-- python function syntax
{{ config(materialized="incremental", sql_where="id > (select max(id) from {{this}})") }}
-- OR json syntax
{{
config({"materialized:" "incremental", "sql_where" : "id > (select max(id) from {{this}})"})
}}
select * from public.users
The config resolution order is:
- dbt_project.yml
model-defaults
- in-model config
- dbt_project.yml
models
config
5. Fix: dbt seed null values #102
Previously, dbt seed
would insert empty CSV cells as "None"
, whereas they should have been NULL
. Not anymore!
Version 0.3.0 comes with the following updates:
1. Parallel model creation #83
dbt will analyze the model dependency graph and can create models in parallel if possible. In practice, this can significantly speed up the amount of time it takes to complete dbt run
. The number of threads dbt uses must be between 1 and 8. To configure the number of threads dbt uses, add the threads
key to your dbt target in ~/.dbt/profiles.yml
, eg:
user:
outputs:
my-redshift:
type: redshift
threads: 4 # execute up to 4 models concurrently
host: localhost
...
run-target: my-redshift
For a complete example, check out a sample profiles.yml file
2. Fail only within a single dependency chain #63
If a model cannot be created, it won't crash the entire dbt run
process. The errant model will fail and all of its descendants will be "skipped". Other models which do not depend on the failing model (or its descendants) will still be created.
dbt will log output from the dbt run
and dbt test
commands to a configurable logging directory. By default, this directory is called logs/
. The log filename is dbt.log
and it is rotated on a daily basic. Logs are kept for 7 days.
To change the name of the logging directory, add the following line to your dbt_project.yml
file:
log-path: "my-logging-directory" # will write logs to my-logging-directory/dbt.log
4. Minimize time models are unavailable in the database #68
Previously, dbt would create models by:
- dropping the existing model
- creating the new model
This resulted in a significant amount of time in which the model was inaccessible to the outside world. Now, dbt creates models by:
- creating a temporary model
{model-name}__dbt_tmp
- dropping the existing model
- renaming the tmp model name to the actual model name
5. Arbitrarily deep nesting #50
Previously, all models had to be located in a directory matching models/{model group}/{model_name}.sql
. Now, these models can be nested arbitrarily deeply within a given dbt project. For instance, models/snowplow/sessions/transformed/transformed_sessions.sql
is a totally valid model location with this release.
To configure these deeply-nested models, just nest the config options within the dbt_project.yml
file. The only caveat is that you need to specify the dbt project name as the first key under the models
object, ie:
models:
'Your Project Name':
snowplow:
sessions:
transformed:
transformed_sessions:
enabled: true
More information is available on the issue and in the sample dbt_project.yml file
6. don't try to create a schema if it already exists #66
dbt run would execute create schema if not exists {schema}
. This would fail if the dbt user didn't have sufficient permissions to create the schema, even if the schema already existed! Now, dbt checks for the schema existence and only attempts to create the schema if it doesn't already exist.
The previous release of dbt was v0.2.3.0 which isn't a semantic version. This and all future dbt releases will conform to semantic version in the format {major}.{minor}.{patch}
.
Version 0.2.3.0 of dbt comes with the following updates:
Referential integrity validations in a schema.yml
file were previously defined relative to the parent table:
account:
constraints:
relationships:
- {from: id, to: people, field: account_id}
Now, these validations are specified relative to the child table
people:
constraints:
relationships:
- {from: account_id, to: accounts, field: id}
For more information, run dbt test -h
Previously, auxiliary data needed to be shoehorned into a view comprised of union statements, eg.
select 22 as "type", 'Chat Transcript' as type_name, 'chatted via olark' as event_name union all
select 21, 'Custom Redirect', 'clicked a custom redirect' union all
select 6, 'Email', 'email sent' union all
...
That's not a scalable solution. Now you can load CSV files into your data warehouse:
- Add a CSV file (with a header) to the
data/
directory - Run
dbt seed
to create a table from the CSV file! - The table name with be the filename (sans
.csv
) and it will be placed in yourrun-target
's schema
Subsequent calls to dbt seed
will truncate the seeded tables (if they exist) and re-insert the data. If the table schema changes, you can run dbt seed --drop-existing
to drop the table and recreate it.
For more information, run dbt seed -h
Versioning your SQL models with dbt is a great practice, but did you know that you can also version your analyses? Any SQL files in the analysis/
dir will be compiled (ie. table names will be interpolated) and placed in the target/build-analysis/
directory. These analytical queries will not be run against your data warehouse with dbt run
-- you should copy/paste them into the data analysis tool of your choice.
In your schema.yml
file, you can now add accepted-values
validations:
accounts:
constraints:
accepted-values:
- {field: type, values: ['paid', 'free']}
This test will determine how many records in the accounts
model have a type
other than paid
or free
.
Switch between profiles with --profile [profile-name]
and switch between run-targets with --target [target-name]
.
Targets should be something like "prod" or "dev" and profiles should be something like "my-org" or "my-side-project"
side-project:
outputs:
prod:
type: redshift
host: localhost
port: 5439
user: Drew
pass:
dbname: data_generator
schema: ac_drew
dev:
type: redshift
host: localhost
port: 5439
user: Drew
pass:
dbname: data_generator
schema: ac_drew_dev
run-target: dev
To compile models using the dev
environment of my side-project
profile:
$ dbt compile --profile side-project --target dev
or for prod
:
$ dbt compile --profile side-project --target prod
You can also add a "profile' config to the dbt_config.yml
file to fix a dbt project to a specific profile:
...
test-paths: ["test"]
data-paths: ["data"]
# Fix this project to the "side-project" profile
# You can still use --target to switch between environments!
profile: "side-project"
model-defaults:
....