Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Propagate docs to the database #1031

Closed
jakebiesinger opened this issue Sep 27, 2018 · 26 comments
Closed

Propagate docs to the database #1031

jakebiesinger opened this issue Sep 27, 2018 · 26 comments
Labels
enhancement New feature or request

Comments

@jakebiesinger
Copy link

Feature

We should propagate docs to BQ table and column descriptions.

Feature description

Cool that dbt generates a web page and docs for your project. Unfortunately, it's a bit removed from where our org is used to looking for documentation. We use BQ extensively and rely on the table and column descriptions provided in the BQ UI.

Who will this benefit?

Anyone using BQ would benefit.

@jakebiesinger
Copy link
Author

Since the BQ description field is just a giant text blob, I'd suggest putting the raw markdown description in there. I'd also love to see the raw (uncompiled) query appended after the description in the same field. Then, folks would have some documentation even if there were no description provided for the table.

Plus, devs tend to document their SQL in the .sql file itself. If we included both, we'd be getting decent documentation in the native BQ UI, with minimal documentation effort.

@drewbanin
Copy link
Contributor

Got it, thanks for the report @jakebiesinger! The big challenge here is that we need to make this work for BigQuery, Redshift, Postgres, and Snowflake. Table and column comments are supported for all of these databases, so I don't view that as a problem, but it's a decent amount of work!

When do you see these descriptions being persisted? Is it when the models are built? Or via some other process, like dbt docs --persist?

@jakebiesinger
Copy link
Author

It'd be awesome if this happened as part of the normal dbt run IMHO, so that documentation (always an afterthought) isn't left off in the normal case.

dbt docs --persist would be nice as a way to update them w/out rerunning the world though.

@drewbanin drewbanin changed the title Propagate docs to BQ table and column descriptions Propagate docs to the database Jan 2, 2019
@drewbanin drewbanin added the enhancement New feature or request label Jan 2, 2019
@darrenhaken
Copy link
Contributor

This feature would be very helpful.
It would also be nice to add additional comments like "made by DBT" with a link to the Git repo.

I'd be more than happy to contribute to this if we can decide on a solution 👍

@drewbanin
Copy link
Contributor

drewbanin commented Jan 2, 2019

Cool idea @darrenhaken. To summarize, there's a matrix of comments that dbt should support to the extent supported by the underlying databases:

schema/dataset view table column
bigquery
snowflake
redshift
postgres

Doing this correctly will require knowledge of and access to each of these databases. For the more SQL compatible ones (snowflake, redshift, postgres), we can issue an alter statement:

alter table schema.table add comment '...';

And for BigQuery, I think we can supply a value for the object description.

I'm not sure how best to surface schema-level descriptions, as these aren't currently supported in dbt's documentation. Let's leave that out for the first cut, but we should consider ways to support them in dbt in the near future.

@darrenhaken you mentioned using labels in dbt slack. Do you think those map to dbt's tags? I haven't used them before, and I'm curious what kind of information you'd be interested in putting into a label.

@darrenhaken
Copy link
Contributor

@drewbanin I think the dbt's tags could be used for labels. I wanted to have labels like:
created_by: dbt,
github_project_url: https://github.com/project-id`,

@darrenhaken
Copy link
Contributor

Re BigQuery - I would propose focusing on object description as a first iteration of expanding out the matrix.

@drewbanin
Copy link
Contributor

drewbanin commented Jan 2, 2019

cool, i agree, let's start with bigquery table/view descriptions. I think those can be embedded into the create or replace table/view statements that dbt uses, right?

In practice, that looks something like:

create or replace table `dbt_dbanin`.`my_table`
options (
  description="my description goes here"
)
as (

  select 1 as id

)

The macro that generates create table|view as ... statements can be found here: https://github.com/fishtown-analytics/dbt/blob/b0a4f5c9814629d48230898a19eadb9c53938109/plugins/bigquery/dbt/include/bigquery/macros/adapters.sql#L30

You can see where dbt actually calls this create_table_as macro here: https://github.com/fishtown-analytics/dbt/blob/b0a4f5c9814629d48230898a19eadb9c53938109/plugins/bigquery/dbt/include/bigquery/macros/materializations/table.sql#L68

We do something similar for views and incremental models in the corresponding materializations as well. This materialization code is kind of a mess and in need of a cleanup, but I think this is where all of the magic will likely need to happen.

@darrenhaken
Copy link
Contributor

@drewbanin the links you shared create 404 pages, is that because you shared your branch and not master?

I'd be keen to start development on this if we agree on an approach, thoughts?
It doesn't sound like a tonne of work.

@drewbanin
Copy link
Contributor

Thanks @darrenhaken - just updated my comment to use permalinks. Database-specific implementation questions aside, I think you're right that we just need to agree upon a syntax for specifying docs persistence.

Let's add a config option to models called persist_docs which accepts a dictionary or None. If the value is None, dbt should not persist docs. If the value is a dictionary, then it may contain the keys:

{
  "relation": true/false,
  "columns": true/false,
  "schema": true/false
}

Let's definitely only implement this for relations initially (i think columns is going to be more involved for sure). I like the idea of using a dict from the outset though, as it will afford us the option to persist docs to other types of object in the future.

I think that for the first cut of this, dbt should only update docs when models are rebuilt. While I do think there's merit to updating documentation without re-running models, that's also a lot more involved.

Lmk if you buy all of this. I imagine you'll need some more info (like how to get documentation for a relation inside of a materialization), but very happy to help you with that in turn once we're all set on the implementation.

@darrenhaken
Copy link
Contributor

darrenhaken commented Jan 8, 2019

@drewbanin trying to get up to speed with all the dbt lingo 👍
What would relation contain in your example?

Could you show a dummy model file containing the relation with a description for the table?
So I can see how it would look for the user.

Or is the idea to use the description from the schema.yml file for the model?

@drewbanin
Copy link
Contributor

hey @darrenhaken - relation is a database term for objects like "tables" and "views". Both of these things are types of "relations", and it's a convenient way to refer to them! In dbt, using table to refer to both "tables" and "views" is confusing IMO.

I'm picturing something like:

-- models/my_model.sql
{{ config(materialized='table', persist_docs={"relation": true}) }}

Because dbt's configs are hierarchical, you can also configure all of your model to use this setting in your dbt_project.yml file, eg:

models:
  my_project:
    persist_docs:
      relation: true

I absolutely think dbt should use the description from a schema.yml file! If relation is true (as in the above example), then dbt should grab the description (if any) specified from schema.yml and use that in the create table as statement to supply a description for a relation

@darrenhaken
Copy link
Contributor

Thanks for all the info! That all makes complete sense and the approach sounds good to me.

Can you point me in the right direction which files in the codebase need to be modified to support it, I assume Python?

@drewbanin
Copy link
Contributor

@darrenhaken just out the links to the code above. I don't know that any python changes will be required here (at least for the first cut).

You can find the description for a model in the materialization context with: {{ model.description }}

You can find the value for a config (like persist_docs) with:

{%- set persist_docs = config.get('persist_docs', {}) -%}
{%- set persist_relation_docs = persist_docs.get('relation', false) -%}

Give the BigQuery links above a once-over and lmk if you have any other questions!

@darrenhaken
Copy link
Contributor

I’ll try and take a look at this week in more detail

@darrenhaken
Copy link
Contributor

I'm picking this up now, sorry for the delay.

@drewbanin could you keep an eye on this issue as I'd love some feedback on it?

@darrenhaken
Copy link
Contributor

@drewbanin what's the correct way to set the persist_docs in a model with the dict like you suggested:

{
  "relation": true/false,
  "columns": true/false,
  "schema": true/false
}

Got any example?

@darrenhaken
Copy link
Contributor

@drewbanin I worked this out, see PR

@ggiill
Copy link

ggiill commented Feb 28, 2019

@drewbanin @darrenhaken @jakebiesinger

Super psyched about this PR: #1285

BigQuery's DDL also supports:

  • field descriptions
  • tags

This means these could be inserted via a macro from the schema.yml as well: https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language

This looks something like:

create or replace table `test-project.test_dataset.test_table`

(
  column_int INT64 NOT NULL OPTIONS(description='This is an INT64 field'),
  column_str STRING NOT NULL OPTIONS(description='This is a STRING field'),
  column_array ARRAY <
    STRUCT <
      column_struct_str STRING NOT NULL OPTIONS(description='This is an ARRAY-STRUCT-STRING field'),
      column_struct_int INT64 NOT NULL OPTIONS(description='This is a ARRAY-STRUCT-INT64 field')
    >
  > OPTIONS(description='This is an ARRAY of STRUCTS field'),
  column_bool BOOL NOT NULL OPTIONS(description='This is a BOOL field')
)

OPTIONS(
  description='Test DDL Schema \n - https://docs.getdbt.com/docs/schemayml-files \n - https://cloud.google.com/bigquery/docs/reference/standard-sql/data-definition-language',
  labels=[('key1', 'value1'),('key2', 'value2')]
)

I think this is on your radar, but we have a very similar use case to @jakebiesinger and having these docs persist through to BQ would be really great.

Thanks,
Gil

@darrenhaken
Copy link
Contributor

Great to hear! Yes, I also wanted the labels/tags and column descriptions. If you have time feel free to contribute beyond this work to add them in! If not I’m going to try and get the time over the next month to do it but help would be ideal.

@drewbanin
Copy link
Contributor

@gturetsky thanks for the writeup! Yeah - this should be totally possible. Should be straightforward-ish to extend #1285 for this use case :)

@drewbanin drewbanin added this to the Wilt Chamberlain milestone Jun 3, 2019
beckjake added a commit that referenced this issue Jun 12, 2019
@drewbanin
Copy link
Contributor

This was (finally) merged in #1532 :D

Thanks for your hard work on the original PR @darrenhaken!

@darrenhaken
Copy link
Contributor

darrenhaken commented Jun 13, 2019 via email

@Jaxing
Copy link

Jaxing commented Feb 2, 2022

Hey, this seems awesome! I'm having some trouble to find any documentation about this though and just writing documentation in a yaml file doesn't seem to work. Anyone have some pointers 🙂

@jtcohen6
Copy link
Contributor

jtcohen6 commented Feb 2, 2022

@Jaxing
Copy link

Jaxing commented Feb 2, 2022

@Jaxing Check out these docs: https://docs.getdbt.com/reference/resource-configs/persist_docs

Thanks for the quick reply! It worked 🙂

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants