Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dbt extensions / dbt jinja extensions #3830

Closed
jpeak-intellify opened this issue Aug 30, 2021 · 4 comments
Closed

dbt extensions / dbt jinja extensions #3830

jpeak-intellify opened this issue Aug 30, 2021 · 4 comments
Labels
discussion enhancement New feature or request

Comments

@jpeak-intellify
Copy link

Describe the feature

As a dbt user I should be able to install extensions through pip such that I can:

  • pip install dbt-ext-myextension OR
  • pip install dbt-jinja-ext-myextension

And the dbt cli autoloads these extensions similar to pytest allowing control of the event hooks and the Jinja environment for custom behaviour.

Describe alternatives you've considered

I am trying to automate overriding the generate_database_name and generate_schema_name macros but even if I install a dbt-package with my custom macro logic I still need to manually shim these macros in each repo.

Additional context

N/A

Who will this benefit?

Power users that have an architecture or naming conventions dictated by business users that has outrun what dbt Out-Of-Box (OOB) can achieve.

Ideally allowing the Jinja environment to be extended would be an end goal here. It looks like the do command is already being used from jinja2.ext.do. https://github.com/dbt-labs/dbt/blob/develop/core/dbt/clients/jinja.py#L489

My ideal UX/DX would be:

  • to develop a plugin with naming convention dbt-jinja-ext-myextension.
  • dbt then uses the import_metalib.distributions() functionality like pluggy does for autodiscovery.
  • From that list of discovered plugins we have the array of strings to pass to the Jinja Environment.
  • From there it is standard Jinja Extensions

This leverages prior art for plugin design.
It is minimally invasive (leaving plugins for dbt event hooks as a separate scope of work).
It leverages the Jinja Extension community as the scope of what can be done.
Authoring a python extension better leverages python tooling to maintain that code than Jinja macros in a dbt package.

Are you interested in contributing this feature?

Happy to help contribute this feature. Wanted to discuss scope of work before starting work though to fit in the roadmap.

For reference I was a maintainer of vcrpy API testing library for 6 months in 2019.

@jpeak-intellify jpeak-intellify added enhancement New feature or request triage labels Aug 30, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 2, 2021

@jpeak-intellify Thanks for opening! It's a really solid thought, and I hope we can kick off some cool discussion.

Big idea

As we continue to solidify the foundations of dbt-core, I'm also interested in making it much more pluggable. We already have some thinking in this vein, ranging from the quite broad to the pretty specific:

So, why not the Jinja context, too?

I think there are two different ways to plug into / extend the syntax and functionality available in the dbt-Jinja templating context:

Are you thinking more about turning on extensions that are installed along with jinja2, or installing custom extensions? If the extensions are all built-ins, I'm thinking that the right "on/off" switch could even be an environment variable (!), rather than another python program/plugin. If you're talking about custom user-defined extensions, though, the need for a python plugin makes sense.

My concerns:

  • dbt-core is using jinja2==2.11.3; the latest version available is 3.0.1. I can't guarantee that we're going to eagerly upgrade to new Jinja versions. We expose a lot of Jinja syntax to users, and we'd risk breaking a lot of user-space code.
  • If we're serious about statically analyzing Jinja for better parse-time performance, something we've started on (https://github.com/dbt-labs/tree-sitter-jinja2, https://github.com/dbt-labs/dbt-extractor), supporting arbitrarily many syntaxx extensions makes this harder (or perhaps impossible). That's not the end of the world, for now, if the only trade-off for custom syntax is slower parsing; in the long run, though, I'm hopeful that this static analysis will enable much more than just speedier parsing and extraction.
  • For security reasons, I don't think dbt Cloud or other hosted deployment platforms would be able to support python modules such as os or requests, or custom Jinja extensions to perform system/networking functionality. That's not a reason not to make changes in dbt-core, but it's definitely something I'm mindful of—as we think about the products we want to build into the future, our aim is to keep dbt-core support (and more broadly, dbt language support) as consistent and deployment-agnostic as possible.

Specific use case

I am trying to automate overriding the generate_database_name and generate_schema_name macros but even if I install a dbt-package with my custom macro logic I still need to manually shim these macros in each repo.

This is a more specific to dbt functionality—I totally hear you, and I think there's a pretty neat way we can enable this!

There's a detailed proposal (#3456) to specify dbt as the macro_namespace of all globally defined macros. This would allow you to do things like set this one-time config in each project:

# dbt_project.yml
dispatch:
  - macro_namespace: dbt
    search_order: ['current_root_project', 'my_package_with_custom_macro_logic', 'dbt']

Whenever dbt goes to resolve a "global" macro, it would look for candidate implementations (prefixed adapter__ or default__) first in current_root_project, then in my_package_with_custom_macro_logic, then (finally) in the dbt repo.

Your issue prompted me to actually give this a go, which I did over in #3851. I think it might... just... work :)

Now, to work for a specific global macro, that macro needs to be dispatched, and the generate_x_name macros aren't today. (Historically, we've tended to dispatch macros that require different syntax across database adapter plugins.) If we dispatch those, with the explicit macro_namespace='dbt', and your package reimplements default__generate_schema_name, and you specify that config in your root project file, it should just work—it did when I tested locally just now.

Having said all that, I'm actually quite curious to hear about how you were planning to accomplish the above with Jinja extensions.

@jtcohen6 jtcohen6 added discussion and removed triage labels Sep 2, 2021
@jpeak-intellify
Copy link
Author

Thanks @jtcohen6 for the comprehensive context catchup.

Big Ideas

It seems like overall we are on the same page about a larger scope of work about the extensibility of dbt-core which is why I referenced pytest as prior art. dbt (in my mind) sits in a similar space that codifies the backbone of a common process.

dbt seems to have solved a problem for 80-90% of people, but you've given the enterprises that have been datawarehousing for decades a teaser of a better future. A future where the gap between the sophistication of the tool and the complexity of the data modelling are not matched and in honesty that gap is a bit too enterprise specific.

Really awesome reading through the list of points of extension (that look like to hooks I was trying to find) that allow enterprises to do their too specific quirks.

Specific use case

My specific problem is:

How do I codify our custom naming conventions for databases and schemas? a reusable dbt-package
How do I reliably get our macros preferred over the default global macros? 🤷

My poor assumption

I think I naively went down the route thinking I am constrained to the Jinja2 environment and blindly suggested trying to gain access to modify the Jinja2 environment.

After looking at some Jinja extensions I am not confident it would achieve what I am after, and if it did, I'm not confident the person following me would be able to maintain that extension. So that was a bad idea in hindsight...

Custom macro namespace precedence:

The following dispatch & search_order issue and PRs seem more in line with the specific problem I am trying to solve for.

This would allow us to refactor our database and schema naming conventions into a dbt package and evolve them independently. Our consumer dbt projects would know that our flavour of macros have higher precedence.

Minor concern:

My only concern is that it is using adapter.dispatch(...), and it is now dispatching for more reasons than just adaptors. That can be a separate scope of work for that refactor-rename though.

Ideal outcome:

My ideal outcome is that we have our own BigCorp flavoured dbt package dbt-bigcorp-macros that we reference internally like:

packages.yml

packages:
   - git: "[email protected]:bigcorp/dbt-bigcorp-macros.git"
     revision: vX.Y.Z

Then it sounds like we would just create a dispatch namespace for them:

bigcorp-macros.sql

{% macro bigcorp__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

 {% macro bigcorp__generate_database_name(custom_database_name=none, node=none) -%}
    ...
{% endmacro %}

Then if I understand correctly, opting into custom ordering would add the following to dbt_project.yml:

dbt_project.yml

dispatch:
  - macro_namespace: dbt
   search_order: ['my_root_project', 'bigcorp', 'dbt']

This would solve a whole class of problems for us, getting predictable behaviour when adopting our own flavour of conventions that deviate from the default dbt training modules.

Next steps?

  • I think we can safely abandon this issue in favour of Allow dispatch & overrides of global macros via packages #3456
  • I can pip install https://github.com/dbt-labs/dbt/archive/refs/heads/feat/allow-dispatch-global-macros.zip the draft branch and POC our variant and provide feedback on the above issue and PR to move this forward.

How does that sound?

@jtcohen6
Copy link
Contributor

jtcohen6 commented Sep 3, 2021

@jpeak-intellify Thanks for the detailed follow-up, and for engaging with the multiplicity of thoughts I threw your way.

I think you've understood the dispatch strategy quite well; I just want to clarify a few points. You're right that the primary historical purpose of adapter.dispatch has been to handle cross-database functionality. The advent of dispatch in v0.18 (replacing adapter_macro) added a critical capability: dispatching across packages (macro namespaces). We initially intended this to solve a highly specific problem: extending dbt_utils macros to arbitrarily many database adapters, without requiring a PR into the main package for each one. But the capability has really expanded and matured since then: You define a "routing" macro that users call. Then, statically at parse time (!), on the basis of your project-level dispatch config and your current database adapter, dbt will figure out which macro they actually want to use.

So the syntax of dispatch retains both parts of the equation:

  1. The project-level dispatch config that defines search_order, allowing you to rank installed packages above other installed packages, or above the global project
  2. Adapter-specific viability, in the form of the macro prefix and ranked preference order (e.g. redshift__postgres__default__)

The workflow we've been describing primarily depends on the first piece, but you'll still need to prefix your macros with an adapter signifier (almost certainly default__). So it's exactly the code you shared above, except:

-- dbt_bigcorp_macros/macros.sql

{% macro default__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

 {% macro default__generate_database_name(custom_database_name=none, node=none) -%}
    ...
{% endmacro %}

The bigcorp__ prefix would make sense if you were using a custom dbt-bigcorp database adapter. At the point where it's a preference for the candidate macros defined in the dbt-bigcorp-macros package, that's the role of the project config's search_order.

I get that the default__ prefix may feel unnecessary in this case. But imagine if dbt-bigcorp-macros wanted to be usable by projects running on Apache Spark or Trino or UnknownAdapter, and do different things on each. Then you could define:

-- dbt_bigcorp_macros/macros.sql

{% macro spark__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

{% macro trino__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

{% macro default__generate_schema_name(custom_schema_name, node) -%}
    ...
{% endmacro %}

Following your search_order, dbt would always prefer a macro in your package. It would pick dbt_bigcorp_macros.spark__generate_schema_name on Spark, dbt_bigcorp_macros.trino__generate_schema_name on Trino, and dbt_bigcorp_macros.default__generate_schema_name when running anywhere else.

So, if that all still sounds okay to you, I'm going to:

@jpeak-intellify
Copy link
Author

Thanks for the clarifications about the use of dispatched namespaces.

That makes sense that the macros are resolved statically at parse time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

2 participants