-
I think you would want to run all of them: a project owner may be focused on the latest version, but running this means all versions of the model stay up to date. If you want to specifically run a single version, the selection syntax should allow that. […]
-
From conversation with @aranke:
In all our documentation, guides, best practices, everywhere, we can treat […]
-
Let's say I'm a SQL developer building reports on top of a warehouse managed by dbt, or I'm maintaining a Power BI model on top of it. I'm living in a scenario where I'm at the frontier of the dbt workflow: I'm neither in another dbt project, nor in a dbt semantic model. I have a fleet of reports, which is basically a collection of SQL queries, that I need to test against the new table. The intent here is to "[...] communicate and facilitate an intended breaking change, while allowing downstream consumers a migration window." Of the overall dbt-managed schema I'm consuming from, one model has a breaking change:

models:
  - name: dim_customers
    version: 3
    columns: ...
  - name: dim_customers
    version: 2
    is_latest: true
    columns: ...

I can test my reports by changing my queries to point to the new version's relation. So I run my tests the old way, but I detect that 5 reports will need to be updated.

Now let's say I realize I won't have the time to update 2 of the reports that were failing. The business hasn't signed off on the changes, whatever. I can update these reports to specifically target the old version when the switch is made (pointing them at version 2's relation):

models:
  - name: dim_customers
    version: 3
    columns: ...
  - name: dim_customers
    version: 2
    deprecation_date: '2023-07-01'
    columns: ...

Ok, I retract the objections I made during our last conversation. I like that ;)
-
Good stuff! Would versioning allow even the latest version to be deprecated? We sometimes have models that really should have a sunset date defined already at creation time. Think of things like 2022 targets, or a model running for the duration of a campaign, which might be useless the following year or quarter, and we'd know it when adding them. Or perhaps we never want to commit to anything forever, and want to have a default deprecation date two years from the start. Other times we'll want to completely deprecate an old model as unnecessary, or in favor of some other model. (Also, linking to the successor model could be neat...) Perhaps there's currently a way to communicate model deprecation in dbt that I'm not aware of...
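For what it's worth, under the deprecation_date property being discussed elsewhere in this thread, that might be as simple as declaring the sunset date up front. A sketch only: the model name is made up, and it isn't settled whether a sole/latest version can carry a deprecation date:

models:
  - name: campaign_2022_results   # hypothetical model with a known sunset date
    description: Only relevant for the duration of the 2022 campaign.
    deprecation_date: '2023-01-31'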
-
Throwing in a very plain-jane, unmagical, but explicit option for 'Defining versions' here:

Option D: version property in schema.yml

Multiple versions of a model can have any file or directory name. They're just models! Version metadata is only configurable in a schema.yml file (or potentially a new […]).
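For concreteness, a minimal sketch of what that could look like, reusing the property names floated earlier in this thread (none of these names are settled):

models:
  - name: dim_customers
    version: 2
    is_latest: true              # some explicit marker for the current version
    columns: ...
  - name: dim_customers
    version: 1
    deprecation_date: '2023-07-01'
    columns: ...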
-
The following is authored collectively by the DX team at dbt Labs.
Our findings follow.

Recommendations for model version syntax

Ok, here we are. The main event - the big kahuna. There was surprisingly strong consensus here. I think we expected to all end up in different places after pre-registering our thoughts, but the strong response was that Option D is the most dbtonic method, because configuration is done in what feels like the most sensible location, and in a way that is explicit while leaving space open for a sprinkle of magic if you follow a common convention. Every member placed Option D in the top spot or as their second choice.

There was also a strong second choice - Option B, which provides strong benefits of convenience in naming, strong alignment with existing AE workflows, and a clearly-lighted path to developer ergonomics while retaining an escape hatch. Ultimately our proposal combines the best of these two strategies, but if we had to select from the proposed options A through D, then D is our selection.

Tweaks to the proposed YAML syntax

Even if it's legal YAML, it feels very strange to have multiple nodes with the same name and different versions. It also means that theoretically you could have version definitions spilling across multiple YAML files (or even spread all over a single very long file). Instead, we propose bringing the additional versioning metadata into the one primary model, which looks something like this:

models:
  - name: dim_customers
    # the variant outside of the versions block is always `latest`.
    # If you have a prerelease version that's not ready for prime time yet, it still goes into the versions block below.
    # `current_version` / `latest_version` / `primary_version` depending on which one we pick as referring to this
    current_version: 2
    config:
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: int
      - name: ...
    versions:
      - version: 1
        implemented_in: dim_customers_OLD
        deprecation_date: '2022-01-01'
        config:
          meta: {} # remove the meta from v2
        columns:
          include: ...
          exclude: ...
      - version: 3
        description: A dramatic reimagining of our customers model with no column overlap? What are they thinking?!
        implemented_in: dim_customers_NEW_BUT_NOT_READY
        # deprecation_date is optional and not included for this version
        columns:
          - name: totally_new_column_1
          - name: totally_new_column_2
          - name: totally_new_column_3
Making it easier to roll out versions in a convention-driven way

As Florian notes, the new […]

One important difference from Option B as originally described: the original Option B would have required the latest/primary version be called […]. The current version of a model does not have an […].

Materializing versions in the warehouse (aka how does this impact data consumers?)

It is incredibly important that, when considering the impact of versioned models, we don't just consider the ergonomics of creating and maintaining versioned models, but also of querying them in the warehouse. As an analyst, the idea of having to remember whether to query […]. Just like […]
This does break away from the 1 model = 1 object principle, but in the opposite direction to what people normally ask of us! If we're not comfortable making that change, then we could instead do: […]
To be clear: this is not great! If you know that v3 is going to break your dashboard when it's released, then you can't proactively pin to v2; you have to wait for it to be demoted and then do a speedy swap.

How to handle building versions when dbt is run? What gets run?

We propose that by default dbt runs all models whose deprecation date has not yet passed (side note: date math is hard; we recommend that run_started_at should be the arbiter of truth here). This means that deprecated models are excluded from future dbt runs, but no other action happens automatically (like dropping the underlying tables/views). We lean towards saying that they should be disabled (i.e. the same as […]).

There was discussion about a desire to automatically delete models after their deprecation date, along with potential data preservation measures such as a final snapshot to a separate archive schema before deletion, but ultimately that is over-prescriptive for the level of this construct. Instead, we'd propose future tooling that allows handling of deprecated models, as well as the DX team working with the Community to create best practices and guides for how to manage deprecated models.

We also discussed the difference between building in development vs production environments; in development it's probably desirable, for speed reasons, to only build the current version. This shouldn't be built into dbt itself per se, but could be achieved via a default YAML selector (see the sketch below). This means that dbt just needs to expose version data for selection purposes.
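For illustration, a default YAML selector along those lines might look like the sketch below. The selectors.yml structure itself is existing dbt functionality, but the version method shown here is a placeholder for whatever version-aware selection dbt ends up exposing:

selectors:
  - name: current_versions_only
    description: In development, build only the current version of each versioned model
    default: true            # applied automatically when no explicit selection is passed
    definition:
      method: version        # hypothetical selection method; assumes dbt exposes version data
      value: latest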
Terminology for configurations in different lifecycle stages

We are hoping for an intuitive way to describe the status(es) of each versioned model. Here's terminology that largely aligns with language used widely in software: […]

Alternatively, we could have a […]
One important potential failure mode of too-liberal versioning is that every subteam ultimately relies on a different version of the data, and we regress to a place where different teams have different numbers because they're using outdated tables but can't/won't migrate.
-
Really digging the idea of nesting the versioning metadata within one primary model. @jtcohen6 and I added a few modifications to the proposed spec from the DX group: […]
We took this double-tweaked spec for a spin and put together this 'flipbook' to walk through a versioned model's evolution lifecycle. We start with a single, unversioned model:
models:
  - name: dim_customers
    config:
      materialized: table
      contract: true
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: int
      - name: country_name
        description: Where this customer lives
        data_type: string

# Manifest representation
# model.project_name.dim_customers
# * path: dim_customers.sql
# * alias: dim_customers

Then, a breaking change comes along and it's time to create a versioned model - preserving the existing model as version 1 and creating a new + latest version 2:

models:
  - name: dim_customers
    latest_version: 2
    config:
      materialized: table
      contract: true
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: float # this is the breaking change!
      - name: country_name
        description: Where this customer lives
        data_type: string
    versions:
      - name: 2
      - name: 1
        config:
          alias: dim_customers # keep old relation location
        deprecation_date: '2022-01-01'
        columns:
          - include: '*'
            exclude: customer_id # should this be optional if we're redefining customer_id below? think so
          - name: customer_id
            description: This is the primary key
            data_type: int

# Manifest representation
# model.project_name.dim_customers.v1
# * name: dim_customers
# * version: 1
# * path: dim_customers_v1.sql (defined_in is not provided, defaulting to conventional naming, implies user has renamed .sql file)
# * alias: dim_customers
# model.project_name.dim_customers.v2
# * name: dim_customers
# * version: 2
# * path: dim_customers_v2.sql
# * alias: dim_customers_v2

Just for fun, another breaking change comes along (deleting the country_name column):

models:
  - name: dim_customers
    latest_version: 3
    config:
      contract: true
      materialized: table
      meta:
        key: value
    columns:
      - name: customer_id
        description: This is the primary key
        data_type: float
      - name: country_name
        description: Where this customer lives
        data_type: string
    versions:
      - name: 2
      - name: 1
        config:
          alias: dim_customers
        deprecation_date: '2022-01-01'
        columns:
          - include: '*'
            exclude: customer_id
          - name: customer_id
            data_type: int
      - name: 3
        defined_in: my_new_customers_definition # matches SQL or python file name
        columns:
          - include: '*'
            exclude:
              - country_name # this is the breaking change between versions 2 and 3

# Manifest representation
# model.project_name.dim_customers.v1
# * name: dim_customers
# * version: 1
# * path: dim_customers_v1.sql
# * alias: dim_customers
# model.project_name.dim_customers.v2
# * name: dim_customers
# * version: 2
# * path: dim_customers_v2.sql
# * alias: dim_customers_v2
# model.project_name.dim_customers.v3
# * name: dim_customers
# * version: 3
# * path: my_new_customers_definition.sql
# * alias: dim_customers_v3
-
I've heard a lot of valuable feedback on model versions during the RC period, during yesterday's internal training, and in the awesome comments @joellabes left on my docs PR. If there are ways for us to clarify intent, or to make this feature more ergonomic, by putting more of this into […]

As the producer of a versioned model, I want to privilege the model's latest version as the source of truth. The other versions are there to facilitate migrations; they are not equal players. So, as a producer: […]
As the consumer of a versioned model: […]
-
Part of the larger initiative for Multi-project collaboration (#6725)
The big idea
[Aside] Versioned models vs. versioned deployments
The concept of model versioning that I'm proposing here is distinct from versioned deployments. What's the difference?
Many organizations using dbt follow a "blue/green deployment" pattern, which requires multiple versions of a model/schema/database to exist simultaneously. The goal of this pattern is to catch unexpected breaking changes before they're pushed live into production, and to enable quick rollback to an older stable version if those changes are accidentally pushed live.
The purpose of a model version is to communicate and facilitate an intended breaking change, while allowing downstream consumers a migration window.
Premises: recommended patterns
For each of the following, our goal is to make the recommended option easy to implement. We may still want other options to be possible. Our goal, here & everywhere, is to define a standard pattern that's opinionated enough to mean something; works out-of-the-box for 80% of users; and is flexible enough to be adopted/adapted for the remaining 20%.
Behind each of these decisions is several weeks of discussion & debate, with lengthy rationale behind each one. I'm leaving out the full details here, because these discussions are already long enough. If you're interested, let's chat about it in the threads below!
Versioning scheme
- dim_customers_v1 (recommended: simple, whole-number versions)
- dim_customers_v1.2.3 (full semantic versioning: not recommended)
- dim_customers_v20220118 (date-based versioning: not recommended)

(Still, we may choose to add just enough logic that it's possible to use other versioning schemes if desired.)

"Bump version" workflow

[…]

Model vs. project level

[…]

The rest of this discussion will proceed with the premise that versions are simple whole numbers (v1, v2, v3). […]

Goals
For end users, how should this feel different from just defining two separate models, customers_v1 + customers_v2? The goal is to offer enough out-of-the-box functionality that this feels like a first-class feature, and a standard pattern for 80+% of users. […] ref calls to those older versions raise a warning whenever the ref'ing model is compiled or run.

What can be versioned?
The initial focus is on models. Any model may define a version. The version would be included in the node's unique_id, following the format: model.<project_name>.<model_name>.<version>. We may want to extend versioning to other resource types.
[…] unique_id. If you have defined a unique test on the customer_id column of dim_customers, and you have both v2 and v3 of dim_customers, you have two tests.

When there are multiple enabled versions of a resource:
- […] one version can declare a latest: true boolean attribute. This is useful in cases where a newer version (v3) is under active development, and needs to be beta tested before replacing the current default (v2). (See the sketch below.)
- […] (access: public), for referencing in other groups.
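A sketch of that beta-testing situation, reusing the duplicate-entry syntax floated earlier in this thread (exact property names are not settled):

models:
  - name: dim_customers
    version: 3          # under active development, being beta tested
    columns: ...
  - name: dim_customers
    version: 2
    latest: true        # explicitly keeps v2 as the default target of unpinned refs
    columns: ...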
Defining versions

Version should not be a configuration that is set + inherited from dbt_project.yml for multiple models at once. (We may want it to be settable + inheritable at the group level, if extracting from FQN is not an option. See below.)

If/when we extend version support to "yaml-only" resources (e.g. exposures, metrics), the version attribute would simply be specified within yaml configuration, alongside the rest of their definition. But models are a trickier case, because their primary definition is in .sql and .py files—their very name is derived from the file name.

Option A: In-file config
Multiple versions of a model must have the same file name, and live in separate directories. Within each file: […]
Option B: File name
Each model has a unique file name, with a standard suffix, which we strip off to derive the actual name:
- dim_customers_v1.sql vs dim_customers_v2.sql
- dim_customers.v1.sql vs dim_customers.v2.sql
Option C: FQN (derived from file path)
This approach feels the most magical by far (not in a good way). The most promising thing about it is that it makes it very easy to bump the version for an entire folder/group of models at once: […]
Materializing versioned models
A model's version will be used when calculating the alias for that model in the database. For example, version 2 of the dim_customers model would materialize a table called dim_customers_v2.

We would do this by updating the default implementation of the generate_alias_name macro. Users can override that macro to customize the behavior.

Referencing versioned models
When a versioned model is referenced, the ref call may include a keyword argument specifying the version. If the version argument isn't specified, it resolves to the latest version. (Models without versions cannot be ref'd at a specific version.)

This would also hold for referencing models within exposures (and in the future, entities), as in the sketch below:
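A sketch of the exposure case, assuming the version keyword ends up spelled v (the exact spelling is not settled here), with an illustrative exposure name:

exposures:
  - name: weekly_revenue_dashboard      # illustrative name
    type: dashboard
    owner:
      email: analytics@example.com
    depends_on:
      - ref('dim_customers', v=2)       # pinned to version 2 during the migration window
      - ref('orders')                   # unpinned refs resolve to the latest version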
Within a project, one model version should also be able to reference another version of the same model. In this case, we're materializing the latest version (v3) of dim_customers as a table, and then providing a thin view on top that renames and recalculates for backward compatibility.

Configuring versioned models
Let's keep it DRY! By default, all versions of a model will use the same yaml attributes & configurations defined for that model. Specific older versions can override specific attributes that need to be different, to provide backwards compatibility. Of course, in-file configs can also override configs defined in yaml.
Example: We start by defining the configuration for the latest version of dim_customers. Then, we redefine just the diffs for specific older versions, as sketched below.
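A sketch of that override pattern, using the nested versions block proposed by the DX team above (property names are still in flux; the meta value here is purely illustrative):

models:
  - name: dim_customers
    config:
      materialized: table
      meta:
        team: core-analytics          # illustrative config inherited by all versions
    versions:
      - version: 2                    # inherits everything defined above
      - version: 1
        config:
          alias: dim_customers        # override just the diffs for the older version
        deprecation_date: '2023-07-01'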
Selecting versioned models

If I have multiple versions of dim_customers, should this select all of them to run, or just the latest version?

Potential syntax: […]
Lifecycle
Detecting need for version bumps
Owners of contracted public models need to know if they've made a breaking change, and bump their model's version accordingly.

dbt can help: the state:modified comparison (in CI) can detect if the contract has undergone breaking changes without a corresponding version bump, and raise an error. (Adding new columns is not a breaking change. Removing existing columns, modifying their data types, or removing constraints would be.)

Should we look to incorporate other signals of potentially breaking changes (e.g. metrics […])?
Deprecating old versions
All models may declare a deprecation date, as a way of signalling maturity, and communicating plans for long-term maintenance.
For models with multiple versions, older versions MUST declare a deprecation date. The producer of that model MAY explicitly set the deprecation date to a special string value (perhaps "never" or "unplanned"), to signal that there is no target date for deprecation.
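For example (a sketch only, reusing the duplicate-entry syntax from earlier in the thread; the special string value is one of the spellings floated above):

models:
  - name: dim_customers
    version: 3
    columns: ...
  - name: dim_customers
    version: 2
    deprecation_date: 'unplanned'   # explicitly: no target date for deprecation
    columns: ...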
dbt should raise a warning any time it is compiling or running a resource that includes a ref to a model with a set deprecation date.

[Future] After a model's deprecation date has passed, should dbt also…? […]

[Future] When cross-project references are supported, dbt would raise this warning in downstream projects that are ref'ing an older version of a public model from an upstream project.

Visualizing versioned models
In an ideal world...
Each model (whether one version or several) would be presented as a single object in metadata and in documentation. Users would be able to view information about the latest version by default, and older versions when they want to; it should be easy to identify differences between versions.
In a DAG visualization, by default, we should show only the latest version of each model. In addition, it would be desirable to have the option of viewing: […]
I'm not sure if we'll be able to make it all happen just like that, with the existing dbt-docs site. Even so, it's worth agreeing on what "good" would look like.

Postscripts
Let's get rid of version: 2 in yaml files

As in, no longer require that line at the top of every yaml file. Otherwise, this is just going to get confusing. The version: 1 yaml file spec hasn't been supported for years! And version: 3 is unlikely to arrive anytime soon.

Let's do the same for dbt_project.yml: make config-version: 2 optional, and make the version field optional, too. It does something entirely different (specifically, it does nothing).

Optimizing models with multiple versions
The simplest way to create a new model version requires duplicating its source code and its storage in the database—and doing the same, potentially, for every model downstream of it in the DAG.
When possible, users would be able to redefine an older model version in terms of the latest one, e.g. a view that renames or recalculates columns for backward compatibility. Users could also minimize data platform cost/impact by materializing older versions as views (see the sketch at the end of this post)—or, where the performance differences are significant, by cloning one version and applying diffs to arrive at another one. (The former approach is more in keeping with dbt's philosophy; the latter approach goes against it: write declarative select statements, and not imperative DML.)

This does add friction, but I don't believe that friction is inappropriate. This is not an easy problem, neither for data APIs nor for software APIs. I would be happy if the result of this initiative is to encourage more dbt developers to make non-breaking changes, by adding new columns instead of removing/redefining existing ones—and then to provide them with a mechanism to eventually deprecate those no-longer-used columns, in major version bumps, on a regular & predictable cadence.
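As a sketch of that cheaper "older versions as views" path, using the nested versions block proposed in the comments above (property names are not final):

models:
  - name: dim_customers
    config:
      materialized: table          # latest version: full table build
    versions:
      - version: 2
      - version: 1
        config:
          materialized: view       # older version: a thin backward-compatible view over v2
        deprecation_date: '2023-07-01'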