
Add support for sql as a header to create or replace #1967

Merged Dec 20, 2019 (10 commits)

Conversation

@kconvey (Contributor) commented Dec 2, 2019

#1879

Any SQL that does not work within create or replace table can be extracted and injected ahead of create or replace table by using a call block like:

{%- call sql_header(config) -%}
CREATE TEMPORARY FUNCTION yes_no_to_boolean(answer STRING)
RETURNS BOOLEAN AS (
  CASE
  WHEN LOWER(answer) = 'yes' THEN True
  WHEN LOWER(answer) = 'no' THEN False
  ELSE NULL
  END
);
{%- endcall -%}

The resulting SQL will look like:

CREATE TEMPORARY FUNCTION yes_no_to_boolean(answer STRING)
RETURNS BOOLEAN AS (
  CASE
  WHEN LOWER(answer) = 'yes' THEN True
  WHEN LOWER(answer) = 'no' THEN False
  ELSE NULL
  END
);

create or replace table `project.dataset.table`
...

This is most obviously useful for BigQuery's temporary UDFs, which do not have a perfect dbt analogue. dbt's support for persistent UDFs or use of macros can likely serve as a stand-in in most cases, but this feature facilitates dbt execution of SQL that was written (or generated) to BQ's SQL standard, rather than Jinja-templated SQL.

cla-bot added the cla:yes label Dec 2, 2019
drewbanin self-requested a review December 3, 2019 16:53
@drewbanin (Contributor) left a comment

This is a really slick implementation! I like it!

I think we should also make sql_header a "known" config - that will allow the config to be set for whole groups of models from the dbt_project.yml file. I actually don't have a good use-case in mind, but this isn't necessarily BigQuery specific. I think it makes sense to add this as a "clobber field" over here:
https://github.com/fishtown-analytics/dbt/blob/e51c942e91a94936f68f2965963d3b46f1257658/core/dbt/source_config.py#L11-L18

We'd also probably want to add the {{ sql_header if sql_header is not none }} to each of postgres/redshift/snowflake too, but that should be pretty straightforward.

For BigQuery (and really every plugin we implement this for), let's definitely include the query header in both bigquery__create_table_as and bigquery__create_view_as -- BQ views support temporary UDFs, right?

Last, I'd call the sql_header macro something more explicit, like set_sql_header. I think that will help clarify the difference between:

{% call sql_header() %}
... 
{% endcall %}

and

{{ config(sql_header=my_temp_udf()) }}

both of which are supported in this implementation.

Really cool PR - thanks for opening it!!

@kconvey (Contributor, Author) commented Dec 3, 2019

@drewbanin Think I got to all of your suggestions. Let me know if 'common.sql' is a reasonable place to put the set_sql_header() macro across adapters.

Attempted to add an integration test for BigQuery that creates an innocuous table as a header alongside a create or replace table; it should fail if the header is not extracted properly (create or replace table ( create or replace table() ) vs. create or replace table() followed by create or replace table()). I think a similar pattern should work with the other adapters, but I don't know enough about how they work / how their integration tests are set up to best implement them.

kconvey requested a review from drewbanin December 3, 2019 23:09
@sdepablos commented

Would this work for anything database-specific, like BigQuery SQL scripting?

@drewbanin (Contributor) left a comment

Ok, this is looking really good! One quick note on the test. Can you also update create_table_as and create_view_as for the Snowflake plugin?

@kconvey (Contributor, Author) commented Dec 4, 2019

@drewbanin Can't believe I would forget Snowflake :/

Better test + Snowflake should be added now.

@sdepablos Tried it out and it seems like this works for BigQuery SQL scripting! Originally only had temporary UDFs as a clear use case, but glad there are others.

The following should work, as an example:

{% call set_sql_header(config) %}
  -- Declare a variable to hold names as an array.
  DECLARE top_names ARRAY<STRING>;
  -- Build an array of the top 100 names from the year 2017.
  SET top_names = (
    SELECT ARRAY_AGG(name ORDER BY number DESC LIMIT 100)
    FROM `bigquery-public-data`.usa_names.usa_1910_current
    WHERE year = 2017
  );
{% endcall %}

kconvey requested a review from drewbanin December 4, 2019 16:34
@drewbanin (Contributor) left a comment

d'oh @kconvey I think there are a couple more macros to update:

https://github.com/fishtown-analytics/dbt/blob/2dd604b0397880664f2d25c50a1d9906054aa559/core/dbt/include/global_project/macros/adapters/common.sql#L71-L77

and

https://github.com/fishtown-analytics/dbt/blob/2dd604b0397880664f2d25c50a1d9906054aa559/core/dbt/include/global_project/macros/adapters/common.sql#L83-L87

These are the "default" materializations which are used if a plugin does not supply its own materialization. I think that, slightly longer term, we're going to want a better answer here than copy/pasting this snippet around everywhere. For now though, if you don't mind making that change, we'll be able to get this one merged :)

kconvey changed the base branch from dev/louisa-may-alcott to dev/0.15.1 December 9, 2019 22:13
kconvey requested a review from drewbanin December 9, 2019 22:13
kconvey requested a review from drewbanin December 16, 2019 18:03
@drewbanin (Contributor) commented

Ok - thanks! I just kicked off the tests, but it does look like there's a small merge conflict introduced by the addition of BigQuery labels. Can you take a quick look at that?

@kconvey (Contributor, Author) commented Dec 17, 2019

Pulled the latest changes and labeled the column in the integration test (so hopefully that passes). May need you to kick off tests again.

Thanks!

@drewbanin (Contributor) commented

Meep! I'm seeing

Database Error in model sql_header_model (models/sql_header_model.sql)
  Creating logical views with temporary user-defined functions is not supported
  compiled SQL at target/run/test/sql_header_model.sql

in the integration tests. Should we configure that model to be materialized as a table instead of a view? I didn't realize this was a limitation of temporary UDFs.

I think you can ignore the postgres error - that seems like a network blip to me

@kconvey (Contributor, Author) commented Dec 18, 2019

Should be configured as a table now.

@drewbanin (Contributor) commented

Gosh, lots of little tweaks to make around the test suite here. FYI, we plan on making CI builds run for community-contributed PRs to help tighten this loop.

I think we need to bump the number of expected tests from 7 to 8 here. We do this to make sure that all of the specified models in a test actually run, and this PR adds a new model. Check out the code here:

https://github.com/fishtown-analytics/dbt/blob/6be4ee00aaf5e794005835d0144e4e6a3e87834e/test/integration/022_bigquery_test/test_simple_bigquery_view.py#L56-L69

And an error from the CI build here:

______________ TestUnderscoreBigQueryRun.test_bigquery_run_twice _______________
[gw2] linux -- Python 3.8.0 /home/dbt_test_user/project/.tox/integration-bigquery-py38/bin/python

self = <test_simple_bigquery_view.TestUnderscoreBigQueryRun testMethod=test_bigquery_run_twice>

    @use_profile('bigquery')
    def test_bigquery_run_twice(self):
        self.run_dbt(['seed'])
        results = self.run_dbt()
>       self.assertEqual(len(results), 7)
E       AssertionError: 8 != 7

We'll want to change each instance of 7 in the snippet above to an 8.

@kconvey (Contributor, Author) commented Dec 19, 2019

Should have remembered that one from the labels PR. Thanks for catching!

@drewbanin (Contributor) commented

/azp run

@drewbanin (Contributor) commented

@beckjake can you try kicking off the tests here? Azure doesn't seem to want to run this with env vars for me :/

@beckjake (Contributor) commented

@drewbanin I tried to kick it off manually - we'll see if it works, I guess.

@beckjake (Contributor) commented

The tests are successfully running here: https://dev.azure.com/fishtown-analytics/dbt/_build/results?buildId=1260&view=results
I have no idea how to make them show up in github though 🤷‍♂

@drewbanin (Contributor) left a comment

LGTM - thanks for your contribution @kconvey!

We've already cut a release candidate for 0.15.1, but this is pretty tightly scoped and well tested so I feel comfortable merging it. It will ship in 0.15.1 after the holidays :D

@zestyping commented Dec 1, 2021

For future visitors to this issue: I tried a few ways of defining temporary BigQuery functions for use in model definitions, and eventually arrived at the following setup, which you might also enjoy.

1. Define the functions in a macro, and inside the macro call set_sql_header like this:
{% macro define_bigquery_functions() %}
{% call set_sql_header(config) %}
create temporary function int(value any type)
returns int64 as (
  cast(cast(value as decimal) as int64)
);
{% endcall %}
{% endmacro %}
2. Define a shorthand macro that wraps config like this:
{% macro table() %}
{{ define_bigquery_functions() }}
{{ config(materialized='table', **kwargs) }}
{% endmacro %}
3. Now you can just invoke table() instead of config(materialized='table') at the top of your model files, and you'll have access to your temporary functions! (See the usage sketch below.)

Advantages of this method:

  • You don't have to specify the schema name in front of every function call
  • You don't have to wait for an extra on-run-start operation to run before every dbt run
  • Actually reduces boilerplate instead of adding more of it
  • You'll never again wonder whether config() takes materialize= or materialized=

Disadvantages of this method:

  • You can't call these functions inside view definitions (BigQuery doesn't support temporary functions in views)
