Improve Application of DRY Principle to Source Freshness Definitions in .yml files in large dbt projects #3397

Closed
codigo-ergo-sum opened this issue May 26, 2021 · 6 comments
Labels
enhancement New feature or request stale Issues that have gone stale

Comments

@codigo-ergo-sum

Describe the feature

Provide a better method for unified, nonrepetitive management of source freshness definition (and potentially other source-related settings) in dbt.

dbt currently allows for the hierarchical definition of source freshness directives at the source level which will cascade down to individual tables under the source: https://docs.getdbt.com/reference/resource-properties/freshness
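For reference, a minimal sketch of the cascading behavior described above, modeled on the pattern shown in the linked docs. The file name, source name, table names, and the `_etl_loaded_at` column are placeholders and not taken from any real project.

```yaml
# models/sources/jaffle_shop.yml  (placeholder file name)
version: 2

sources:
  - name: jaffle_shop
    # Source-level defaults: every table listed below inherits these.
    loaded_at_field: _etl_loaded_at
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders            # inherits the source-level freshness as-is
      - name: customers
        # Table-level override of the inherited thresholds
        freshness:
          warn_after: {count: 6, period: hour}
          error_after: {count: 12, period: hour}
      - name: product_skus
        freshness: null         # opts this table out of freshness checks
```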

However, this only works when the source and the tables defined underneath it are in the same .yml file. In larger dbt projects we often have tens or hundreds of tables loaded from a single source, along with a significant number of tests and other metadata for each table. Putting all the tests, descriptions, and source freshness directives for every table of a source into one file tends not to work well: a single multi-thousand-line .yml file causes problems with versioning, change detection and merge conflicts, and code review (it is much easier for a reviewer to see from the file name alone which source table definitions were edited).

Source freshness inheritance does not work if you define the source-level freshness directives in one .yml file containing only source-level info, and then define each source table in its own separate file. There is no notion of implicit inheritance across .yml files.

In practice, this means we end up repeating the source freshness definitions in every source .yml file. That repetition (and the errors or misconfigurations it is likely to introduce somewhere along the line) goes against one of dbt's core principles: Don't Repeat Yourself.
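A rough sketch of what that repetition looks like when one source is split across per-table files. The file paths, source name, and column name are hypothetical; the `---` separator is only there to show two separate files in one listing.

```yaml
# models/sources/jaffle_shop/orders.yml  (hypothetical file)
version: 2

sources:
  - name: jaffle_shop
    loaded_at_field: _etl_loaded_at        # repeated in every file
    freshness:                             # repeated in every file
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders

---
# models/sources/jaffle_shop/customers.yml  (hypothetical file)
version: 2

sources:
  - name: jaffle_shop
    loaded_at_field: _etl_loaded_at        # same block again
    freshness:
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: customers
```

Changing a threshold then means editing every one of these files by hand, which is where the misconfiguration risk comes from.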

Describe alternatives you've considered

  1. One alternative is for dbt to allow implicit inheritance across different .yml files. But this can be tricky to code and to maintain (e.g. what happens if there are conflicting source-level directives in different places?)
  2. Maintaining centralized source-level inheritance in dbt_project.yml somehow. But the problem here is that dbt_project.yml can already get pretty big and we don't necessarily want it to balloon even more.
  3. Some other project-level configuration file like dbt_project_source.yml which explicitly handles global source configuration.

Additional context

All databases would be relevant.

Who will this benefit?

Anyone whose dbt project is large enough, with enough tables per source, that they don't want to be limited to one .yml file per source and want to take full advantage of many smaller, well-named files for source table management.

Are you interested in contributing this feature?

I can try :).

@codigo-ergo-sum added the enhancement and triage labels on May 26, 2021
@jtcohen6 removed the triage label on May 29, 2021
@jtcohen6
Contributor

@codigo-ergo-sum Thanks for an excellent write-up of the problem! I see this as one use case motivating some changes that we've long had in mind for how resources are configured, and properties defined, in dbt projects.

I think you've proposed some solid solutions, which I'll take one by one:

  1. Peer-to-peer property inheritance: I agree that this shouldn't be implicit. If a source is defined in one file with loaded_at_field A, and defined in another file with loaded_at_field B, there's no good way for a third file defining that source to know what its loaded_at_field should be. Instead, I think explicit property inheritance could have a role to play here, as discussed in "Doc (and potentially, Test) Inheritance" (#2995), whether that's via macro or souped-up YAML anchor. In that world, I think it would make a lot of sense to define the top-level source properties all in one "base" file, and then the several files that each define one source table can explicitly extend/inherit its properties via macro or anchor. I don't think we have consensus on the exact implementation details for that extension/inheritance, and I don't think it's something we can prioritize in the short term, but it's a conversation I'm following with a lot of interest.
  2. Top-down property inheritance: I badly want to reconcile node configs and resource properties (as proposed in "Set configs in schema.yml files", #2401), ahead of dbt v1.0 if possible. Today, you can configure models' database and schema from dbt_project.yml because they're configs, but you can't define sources' database and schema because they're properties. That's needlessly confusing and inconsistent. In a future where that's possible, I do see that as a good answer here. Top-down hierarchical inheritance feels much clearer to define and override.
  3. More files the merrier: You've got a good point that the second solution risks ballooning an already massive file, dbt_project.yml. I think this is a separate problem, warranting a separate solution, and that solution looks a lot like the proposal in "Defining vars, folder-level configs outside dbt_project.yml" (#2955): supporting multiple "project-level" files, which can define vars and configs, with the understanding that dbt will merge them into the same Project superstructure that is exclusively defined by dbt_project.yml today. I think we'd raise aggressive errors if the same var or config is set in multiple places across those files.
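To illustrate why point 1 talks about a "souped-up" YAML anchor: a standard YAML anchor can already remove repetition, but only within a single file, since anchors and aliases are resolved per YAML document. A minimal sketch, with placeholder source and table names and an anchor name (`default_freshness`) chosen purely for illustration:

```yaml
version: 2

sources:
  - name: jaffle_shop
    loaded_at_field: _etl_loaded_at
    # Define the anchor at its first use...
    freshness: &default_freshness
      warn_after: {count: 12, period: hour}
      error_after: {count: 24, period: hour}
    tables:
      - name: orders

  - name: stripe
    loaded_at_field: _etl_loaded_at
    # ...and reuse it through an alias later in the same file.
    freshness: *default_freshness
    tables:
      - name: payments
```

An alias such as `*default_freshness` placed in a different .yml file would not resolve, which is the gap that any explicit cross-file inheritance mechanism would need to fill.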

@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions bot added the stale label on Nov 26, 2021
@codigo-ergo-sum
Author

I think this issue is still relevant and will remain so, particularly on larger dbt projects, so I'm commenting here to keep it open.

@github-actions bot removed the stale label on Nov 30, 2021
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please remove the stale label or comment on the issue, or it will be closed in 7 days.

@github-actions bot added the stale label on Jun 11, 2022
@leahwicz removed the stale label on Jun 13, 2022
@github-actions
Contributor

This issue has been marked as Stale because it has been open for 180 days with no activity. If you would like the issue to remain open, please comment on the issue or else it will be closed in 7 days.

@github-actions bot added the stale label on Dec 11, 2022
@github-actions
Contributor

Although we are closing this issue as stale, it's not gone forever. Issues can be reopened if there is renewed community interest. Just add a comment to notify the maintainers.

@github-actions bot closed this as not planned on Dec 19, 2022