-
I think you would want to run all of them: a project owner may be focused on the latest version, but running this means all versions of the model stay up to date. If you want to specifically run a single version, the selection syntax should allow that. […]
-
From conversation with @aranke:
In all our documentation, guides, best practices, everywhere, we can treat […]
-
Let's say I'm a SQL developer building reports on top of a warehouse managed by dbt, or I'm maintaining a Power BI model on top of it. I'm living in a scenario where I'm at the frontier of the dbt workflow: I'm neither in another dbt project, nor in a dbt semantic model. I have a fleet of reports, which is basically a collection of SQL queries, that I need to test against the new table. The intent here is to "[...] communicate and facilitate an intended breaking change, while allowing downstream consumers a migration window." Of the overall dbt-managed schema I'm consuming from, one model has a breaking change:

models:
  - name: dim_customers
    version: 3
    columns: ...
  - name: dim_customers
    version: 2
    is_latest: true
    columns: ...

I can test my reports by changing my queries to point to the new version's relation. So I run my tests the old way, but I detect that 5 reports will need to be updated.

Now let's say I realize I won't have the time to update 2 of the reports that were failing. The business hasn't signed off on the changes, whatever. I can update these reports to specifically target the old version when the switch is made (pointing them at version 2's relation):

models:
  - name: dim_customers
    version: 3
    columns: ...
  - name: dim_customers
    version: 2
    deprecation_date: '2023-07-01'
    columns: ...

Ok, I retract the objections I made during our last conversation. I like that ;)
-
Good stuff! Would versioning allow even the latest version to be deprecated? We sometimes have models that really should have a sunset date defined already at creation time. Think of things like 2022 targets, or a model running for the duration of a campaign, which might be useless the following year or quarter, and we'd know it when adding them. Or perhaps we never want to commit to anything forever, and want to have a default deprecation date two years from the start. Other times we'll want to completely deprecate an old model as unnecessary, or in favor of some other model. (Also, linking to the successor model could be neat...) Perhaps there's currently a way to communicate model deprecation in dbt that I'm not aware of...
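For what it's worth, under the deprecation_date property being discussed elsewhere in this thread, that might be as simple as declaring the sunset date up front. A sketch only: the model name is made up, and it isn't settled whether a sole/latest version can carry a deprecation date:

models:
  - name: campaign_2022_results   # hypothetical model with a known sunset date
    description: Only relevant for the duration of the 2022 campaign.
    deprecation_date: '2023-01-31'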
-
Throwing in a very plain-jane, unmagical, but explicit option for 'Defining versions' here:

Option D: version property in schema.yml

Multiple versions of a model can have any file or directory name. They're just models! Version metadata is only configurable in a schema.yml file (or potentially a new […]).
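For concreteness, a minimal sketch of what that could look like, reusing the property names floated earlier in this thread (none of these names are settled):

models:
  - name: dim_customers
    version: 2
    is_latest: true              # some explicit marker for the current version
    columns: ...
  - name: dim_customers
    version: 1
    deprecation_date: '2023-07-01'
    columns: ...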
-
The following is authored collectively by the DX team at dbt Labs.
Our findings follow.

Recommendations for model version syntax

Ok, here we are. The main event - the big kahuna. There was surprisingly strong consensus here. I think we expected to all end up in different places after pre-registering our thoughts, but the strong response was that Option D is the most dbtonic method, because configuration is done in what feels like the most sensible location, and in a way that is explicit while leaving space open for a sprinkle of magic if you follow a common convention. Every member placed Option D in the top spot or as their second choice.

There was also a strong second choice - Option B, which provides strong benefits of convenience in naming, strong alignment with existing AE workflows, and a clearly-lighted path to developer ergonomics while retaining an escape hatch. Ultimately our proposal combines the best of these two strategies, but if we had to select from the proposed options A through D, then D is our selection.

Tweaks to the proposed YAML syntax

Even if it's legal YAML, it feels very strange to have multiple nodes with the same name and different versions. It also means that theoretically you could have version definitions spilling across multiple YAML files (or even spread all over a single very long file). Instead, we propose bringing the additional versioning metadata into the one primary model, which looks something like this:

models:
  - name: dim_customers
    # the variant outside of the versions block is always `latest`.
    # If you have a prerelease version that's not ready for prime time yet, it still goes into the versions block below.
    # `current_version` / `latest_version` / `primary_version` depending on which one we pick as referring to this
    current_version: 2
    config:
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: int
      - name: ...
    versions:
      - version: 1
        implemented_in: dim_customers_OLD
        deprecation_date: '2022-01-01'
        config:
          meta: {} # remove the meta from v2
        columns:
          include: ...
          exclude: ...
      - version: 3
        description: A dramatic reimagining of our customers model with no column overlap? What are they thinking?!
        implemented_in: dim_customers_NEW_BUT_NOT_READY
        # deprecation_date is optional and not included for this version
        columns:
          - name: totally_new_column_1
          - name: totally_new_column_2
          - name: totally_new_column_3
Making it easier to roll out versions in a convention-driven way

As Florian notes, the new […]

One important difference from Option B as originally described: the original Option B would have required the latest/primary version be called […]. The current version of a model does not have an […].

Materializing versions in the warehouse (aka how does this impact data consumers?)

It is incredibly important that, when considering the impact of versioned models, we don't just consider the ergonomics of creating and maintaining versioned models, but also of querying them in the warehouse. As an analyst, the idea of having to remember whether to query […]. Just like […]
This does break away from the 1 model = 1 object principle, but in the opposite direction to what people normally ask of us! If we're not comfortable making that change, then we could instead do: […]
To be clear: this is not great! If you know that v3 is going to break your dashboard when it's released, then you can't proactively pin to v2; you have to wait for it to be demoted and then do a speedy swap.

How to handle building versions when dbt is run? What gets run?

We propose that by default dbt runs all models whose deprecation date has not yet passed (side note: date math is hard; we recommend that run_started_at should be the arbiter of truth here). This means that deprecated models are excluded from future dbt runs, but no other action happens automatically (like dropping the underlying tables/views). We lean towards saying that they should be disabled (i.e. the same as […]).

There was discussion about a desire to automatically delete models after their deprecation date, along with potential data preservation measures such as a final snapshot to a separate archive schema before deletion, but ultimately that is over-prescriptive for the level of this construct. Instead, we'd propose future tooling that allows handling of deprecated models, as well as the DX team working with the Community to create best practices and guides for how to manage deprecated models.

We also discussed the difference between building in development vs production environments; in development it's probably desirable, for speed reasons, to only build the current version. This shouldn't be built into dbt itself per se, but could be achieved via a default YAML selector (see the sketch below). This means that dbt just needs to expose version data for selection purposes.
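For illustration, a default YAML selector along those lines might look like the sketch below. The selectors.yml structure itself is existing dbt functionality, but the version method shown here is a placeholder for whatever version-aware selection dbt ends up exposing:

selectors:
  - name: current_versions_only
    description: In development, build only the current version of each versioned model
    default: true            # applied automatically when no explicit selection is passed
    definition:
      method: version        # hypothetical selection method; assumes dbt exposes version data
      value: latest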
Terminology for configurations in different lifecycle stages

We are hoping for an intuitive way to describe the status(es) of each versioned model. Here's terminology that largely aligns with language used widely in software: […]

Alternatively, we could have a […]
One important potential failure mode of too-liberal versioning is that every subteam ultimately relies on a different version of the data, and we regress to a place where different teams have different numbers because they're using outdated tables but can't/won't migrate.
-
Really digging the idea of nesting the versioning metadata within one primary model. @jtcohen6 and I added a few modifications to the proposed spec from the DX group: […]
We took this double-tweaked spec for a spin and put together this 'flipbook' to walk through a versioned model's evolution lifecycle. We start with a single, unversioned model:
models:
  - name: dim_customers
    config:
      materialized: table
      contract: true
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: int
      - name: country_name
        description: Where this customer lives
        data_type: string

# Manifest representation
# model.project_name.dim_customers
# * path: dim_customers.sql
# * alias: dim_customers

Then, a breaking change comes along and it's time to create a versioned model - preserving the existing model as version 1 and creating a new + latest version 2:

models:
  - name: dim_customers
    latest_version: 2
    config:
      materialized: table
      contract: true
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: float # this is the breaking change!
      - name: country_name
        description: Where this customer lives
        data_type: string
    versions:
      - name: 2
      - name: 1
        config:
          alias: dim_customers # keep old relation location
        deprecation_date: '2022-01-01'
        columns:
          - include: '*'
            exclude: customer_id # should this be optional if we're redefining customer_id below? think so
          - name: customer_id
            description: This is the primary key
            data_type: int

# Manifest representation
# model.project_name.dim_customers.v1
# * name: dim_customers
# * version: 1
# * path: dim_customers_v1.sql (defined_in is not provided, defaulting to conventional naming, implies user has renamed .sql file)
# * alias: dim_customers
# model.project_name.dim_customers.v2
# * name: dim_customers
# * version: 2
# * path: dim_customers_v2.sql
# * alias: dim_customers_v2

Just for fun, another breaking change comes along (deleting the country_name column):

models:
  - name: dim_customers
    latest_version: 3
    config:
      contract: true
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: float
      - name: country_name
        description: Where this customer lives
        data_type: string
    versions:
      - name: 2
      - name: 1
        config:
          alias: dim_customers
        deprecation_date: '2022-01-01'
        columns:
          - include: '*'
            exclude: customer_id
          - name: customer_id
            data_type: int
      - name: 3
        defined_in: my_new_customers_definition # matches SQL or python file name
        columns:
          - include: '*'
            exclude:
              - country_name # this is the breaking change between versions 2 and 3

# Manifest representation
# model.project_name.dim_customers.v1
# * name: dim_customers
# * version: 1
# * path: dim_customers_v1.sql
# * alias: dim_customers
# model.project_name.dim_customers.v2
# * name: dim_customers
# * version: 2
# * path: dim_customers_v2.sql
# * alias: dim_customers_v2
# model.project_name.dim_customers.v3
# * name: dim_customers
# * version: 3
# * path: my_new_customers_definition.sql
# * alias: dim_customers_v3
-
I've heard a lot of valuable feedback on model versions during the RC period, during yesterday's internal training, and in the awesome comments @joellabes left on my docs PR. If there are ways for us to clarify intent, or to make this feature more ergonomic, by putting more of this into […]

As the producer of a versioned model, I want to privilege the model's latest version as the source of truth. The other versions are there to facilitate migrations; they are not equal players. So, as a producer: […]
As the consumer of a versioned model: […]
-
Part of the larger initiative for Multi-project collaboration (#6725)
The big idea
[Aside] Versioned models vs. versioned deployments
The concept of model versioning that I'm proposing here is distinct from versioned deployments. What's the difference?
Many organizations using dbt follow a "blue/green deployment" pattern, which requires multiple versions of a model/schema/database to exist simultaneously. The goal of this pattern is to catch unexpected breaking changes before they're pushed live into production, and to enable quick rollback to an older stable version if those changes are accidentally pushed live.
The purpose of a model version is to communicate and facilitate an intended breaking change, while allowing downstream consumers a migration window.
Premises: recommended patterns
For each of the following, our goal is to make the recommended option easy to implement. We may still want other options to be possible. Our goal, here & everywhere, is to define a standard pattern that's opinionated enough to mean something; works out-of-the-box for 80% of users; and is flexible enough to be adopted/adapted for the remaining 20%.
Behind each of these decisions is several weeks of discussion & debate, with lengthy rationale behind each one. I'm leaving out the full details here, because these discussions are already long enough. If you're interested, let's chat about it in the threads below!
Versioning scheme
- dim_customers_v1 (recommended: simple, whole-number versions)
- dim_customers_v1.2.3 (full semantic versioning: not recommended)
- dim_customers_v20220118 (date-based versioning: not recommended)

(Still, we may choose to add just enough logic that it's possible to use other versioning schemes if desired.)

"Bump version" workflow

[…]

Model vs. project level

[…]

The rest of this discussion will proceed with the premise that versions are simple whole numbers (v1, v2, v3). […]

Goals
For end users, how should this feel different from just defining two separate models, customers_v1 + customers_v2? The goal is to offer enough out-of-the-box functionality that this feels like a first-class feature, and a standard pattern for 80+% of users. […] ref calls to those older versions raise a warning whenever the ref'ing model is compiled or run.

What can be versioned?
The initial focus is on models. Any model may define a version. The version would be included in the node's unique_id, following the format: model.<project_name>.<model_name>.<version>. We may want to extend versioning to other resource types.
[…] unique_id. If you have defined a unique test on the customer_id column of dim_customers, and you have both v2 and v3 of dim_customers, you have two tests.

When there are multiple enabled versions of a resource:
- […] one version can declare a latest: true boolean attribute. This is useful in cases where a newer version (v3) is under active development, and needs to be beta tested before replacing the current default (v2). (See the sketch below.)
- […] (access: public), for referencing in other groups.
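A sketch of that beta-testing situation, reusing the duplicate-entry syntax floated earlier in this thread (exact property names are not settled):

models:
  - name: dim_customers
    version: 3          # under active development, being beta tested
    columns: ...
  - name: dim_customers
    version: 2
    latest: true        # explicitly keeps v2 as the default target of unpinned refs
    columns: ...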
Defining versions

Version should not be a configuration that is set + inherited from dbt_project.yml for multiple models at once. (We may want it to be settable + inheritable at the group level, if extracting from FQN is not an option. See below.)

If/when we extend version support to "yaml-only" resources (e.g. exposures, metrics), the version attribute would simply be specified within yaml configuration, alongside the rest of their definition. But models are a trickier case, because their primary definition is in .sql and .py files—their very name is derived from the file name.

Option A: In-file config
Multiple versions of a model must have the same file name, and live in separate directories. Within each file: […]
Option B: File name
Each model has a unique file name, with a standard suffix, which we strip off to derive the actual name:
- dim_customers_v1.sql vs dim_customers_v2.sql
- dim_customers.v1.sql vs dim_customers.v2.sql
Option C: FQN (derived from file path)
This approach feels the most magical by far (not in a good way). The most promising thing about it is that it makes it very easy to bump the version for an entire folder/group of models at once: […]
Materializing versioned models
A model's version will be used when calculating the alias for that model in the database. For example, version 2 of the dim_customers model would materialize a table called dim_customers_v2.

We would do this by updating the default implementation of the generate_alias_name macro. Users can override that macro to customize the behavior.

Referencing versioned models
When a versioned model is referenced, the ref call may include a keyword argument specifying the version. If the version argument isn't specified, it resolves to the latest version. (Models without versions cannot be ref'd at a specific version.)

This would also hold for referencing models within exposures (and in the future, entities), as in the sketch below:
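A sketch of the exposure case, assuming the version keyword ends up spelled v (the exact spelling is not settled here), with an illustrative exposure name:

exposures:
  - name: weekly_revenue_dashboard      # illustrative name
    type: dashboard
    owner:
      email: analytics@example.com
    depends_on:
      - ref('dim_customers', v=2)       # pinned to version 2 during the migration window
      - ref('orders')                   # unpinned refs resolve to the latest version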
Within a project, one model version should also be able to reference another version of the same model. In this case, we're materializing the latest version (v3) of dim_customers as a table, and then providing a thin view on top that renames and recalculates for backward compatibility.

Configuring versioned models
Let's keep it DRY! By default, all versions of a model will use the same yaml attributes & configurations defined for that model. Specific older versions can override specific attributes that need to be different, to provide backwards compatibility. Of course, in-file configs can also override configs defined in yaml.
Example: We start by defining the configuration for the latest version of dim_customers. Then, we redefine just the diffs for specific older versions, as sketched below.
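A sketch of that override pattern, using the nested versions block proposed by the DX team above (property names are still in flux; the meta value here is purely illustrative):

models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        team: core-analytics          # illustrative config inherited by all versions
    versions:
      - version: 2                    # inherits everything defined above
      - version: 1
        config:
          alias: dim_customers        # override just the diffs for the older version
        deprecation_date: '2023-07-01'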
Selecting versioned models

If I have multiple versions of dim_customers, should this select all of them to run, or just the latest version?

Potential syntax: […]
Lifecycle
Detecting need for version bumps
Owners of contracted public models need to know if they've made a breaking change, and bump their model's version accordingly.

dbt can help: the state:modified comparison (in CI) can detect if the contract has undergone breaking changes without a corresponding version bump, and raise an error. (Adding new columns is not a breaking change. Removing existing columns, modifying their data types, or removing constraints would be.)

Should we look to incorporate other signals of potentially breaking changes (e.g. metrics […])?
Deprecating old versions
All models may declare a deprecation date, as a way of signalling maturity, and communicating plans for long-term maintenance.
For models with multiple versions, older versions MUST declare a deprecation date. The producer of that model MAY explicitly set the deprecation date to a special string value (perhaps "never" or "unplanned"), to signal that there is no target date for deprecation.
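For example (a sketch only, reusing the duplicate-entry syntax from earlier in the thread; the special string value is one of the spellings floated above):

models:
  - name: dim_customers
    version: 3
    columns: ...
  - name: dim_customers
    version: 2
    deprecation_date: 'unplanned'   # explicitly: no target date for deprecation
    columns: ...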
dbt should raise a warning any time it is compiling or running a resource that includes a ref to a model with a set deprecation date.

[Future] After a model's deprecation date has passed, should dbt also…? […]

[Future] When cross-project references are supported, dbt would raise this warning in downstream projects that are ref'ing an older version of a public model from an upstream project.

Visualizing versioned models
In an ideal world...
Each model (whether one version or several) would be presented as a single object in metadata and in documentation. Users would be able to view information about the latest version by default, and older versions when they want to; it should be easy to identify differences between versions.
In a DAG visualization, by default, we should show only the latest version of each model. In addition, it would be desirable to have the option of viewing: […]
I'm not sure if we'll be able to make it all happen just like that, with the existing dbt-docs site. Even so, it's worth agreeing on what "good" would look like.

Postscripts
Let's get rid of version: 2 in yaml files

As in, no longer require that line at the top of every yaml file. Otherwise, this is just going to get confusing. The version: 1 yaml file spec hasn't been supported for years! And version: 3 is unlikely to arrive anytime soon.

Let's do the same for dbt_project.yml: make config-version: 2 optional, and make the version field optional, too. It does something entirely different (specifically, it does nothing).

Optimizing models with multiple versions
The simplest way to create a new model version requires duplicating its source code and its storage in the database—and doing the same, potentially, for every model downstream of it in the DAG.
When possible, users would be able to redefine an older model version in terms of the latest one, e.g. a view that renames or recalculates columns for backward compatibility. Users could also minimize data platform cost/impact by materializing older versions as views (see the sketch at the end of this post)—or, where the performance differences are significant, by cloning one version and applying diffs to arrive at another one. (The former approach is more in keeping with dbt's philosophy; the latter approach goes against it: write declarative select statements, and not imperative DML.)

This does add friction, but I don't believe that friction is inappropriate. This is not an easy problem, neither for data APIs nor for software APIs. I would be happy if the result of this initiative is to encourage more dbt developers to make non-breaking changes, by adding new columns instead of removing/redefining existing ones—and then to provide them with a mechanism to eventually deprecate those no-longer-used columns, in major version bumps, on a regular & predictable cadence.
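As a sketch of that cheaper "older versions as views" path, using the nested versions block proposed in the comments above (property names are not final):

models:
  - name: dim_customers
    config:
      materialized: table          # latest version: full table build
    versions:
      - version: 2
      - version: 1
        config:
          materialized: view       # older version: a thin backward-compatible view over v2
        deprecation_date: '2023-07-01'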