Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

inline documentation rendered as eg. Markdown #375

Closed
jthandy opened this issue Apr 12, 2017 · 17 comments
Closed

inline documentation rendered as eg. Markdown #375

jthandy opened this issue Apr 12, 2017 · 17 comments
Labels
dbt-docs [dbt feature] documentation site, powered by metadata artifacts

Comments

@jthandy
Copy link
Member

jthandy commented Apr 12, 2017

This is from @pedromachado in slack:


I've been thinking that it would be good to have a documentation feature built into dbt. Perhaps something that could leverage 1) especially formatted comments embedded in the SQL and 2) the model dependency graph to generate some documentation or at least a diagram with all the tables/columns.

@drewbanin
Copy link
Contributor

drewbanin commented Apr 12, 2017

This is absolutely where my head is at. I think this could look something like Javadoc.

That, coupled with some dbt queries to infer the schemas of the models, plus constraint info provided by schema tests, would be incredibly cool.

@drewbanin drewbanin changed the title Documentation inline documentation rendered as eg. Markdown May 10, 2017
@drewbanin
Copy link
Contributor

@dwallace0723 I understand that you've been doing some documentation work within a dbt project -- what kinds of things do you find yourself documenting? We've thought about

  • interactions between models
  • high-level information about specific models
  • information about the SQL within a model (eg. unusual sql, odd business logic, etc)

Is that about congruent with your experiences?

@dwallace0723
Copy link

@drewbanin So far, I've been using documentation mostly for the third bullet point you listed. That being said, I can definitely see myself needing proper documentation for the first two bullets as well.

The way I've been doing this for now is having a comment above a line of SQL that links to the corresponding piece of documentation located in a .md file within the dbt project directory.

In practice, the flow looks like this:
data_doc_md_flow

Not the most seamless experience, but it gets the job done for now.

@jthandy
Copy link
Member Author

jthandy commented May 10, 2017

that is cool

@HarlanH
Copy link

HarlanH commented Aug 2, 2017

I'd be interested in working with someone on a minimal initial version of this. I'd be perfectly happy with the following minimal features:

  • extraction of a block of markdown for each model
  • some sort of column-level syntax for documenting columns
  • ideally, an auto-generated list of models/tables used by the given model

Something like:

{{ usual dbt stuff goes here }}

--@ multiline block markdown goes here
--@ and here...
--@ don't really know if @ is the best character

SELECT 
  cat,  --@col strays are not included
  dog1 AS dog --@col not the same as dog2!
FROM {{ ref('whatever') }}

Stripping these out via regex and dropping them into a single markdown or similar file shouldn't be a lot of work. Dealing with references wouldn't be too bad either. Thoughts? Tons of room for enhancements, but even this would be fantastic.

@drewbanin
Copy link
Contributor

@HarlanH I'm super happy to see some pseudo code for this :)

I like the idea of using --@ for multi-line block comments, but regex parsing the raw SQL is a little scary to me. I've given this some thought, and I think the SQL implementation might have to look something like:

select
  {{ column('cat', comment='strays are not included') }},
  {{ column('dog1', as='dog', comment='not the same as dog2!') }}
from {{ ref('whatever') }}

^ I think that's super gross, and would prefer something that looks a little more like regular SQL.

Another option is to repurpose the schema.yml file that's currently used for testing. I can imagine something like this:

my_model:
  # same as before
  constraints:
    not_null:
      - field_1
      - field_2

  # also provide for block comments, etc
  documentation:
    - field: field_1
      comment: This field is made of grass

    - field: field_2
      comment: This field is made of Athenry 

While this is obviously more readily parseable, i do really like the idea of colocating the docs with the code. There's also the question of where these markdown docs get compiled to, if they're versioned, etc etc. So, still some fundamental and unanswered questions here I think.

I'd love to keep pulling on this thread, so please let us know if you have more thoughts here!

@HarlanH
Copy link

HarlanH commented Aug 3, 2017

I guess I'm OK with convention over code for the column docs, and agree that jinja's pretty ugly for this sort of thing. I also am unsure about how to handle the first of these:

(CASE WHEN campaign_id LIKE 'job-notification-email - weekly %' THEN 'weekly'
		  WHEN campaign_id LIKE 'job-notification-email - daily %' THEN 'daily'
		  WHEN campaign_id LIKE 'MPC%' THEN 'MPC'
		  WHEN campaign_id LIKE '%every_time%' then 'every_time' 
		  ELSE NULL END)::varchar(25) AS campaign_type, -- @col one of daily, weekly, every_time, MPC
rcpt_to::varchar(254) AS user_email, -- @col envelope recipient

I definitely would not want to embed that in a jinja block... On the other hand, I'd expect the output as written to look something like:

* `campaign_type` (complex) -- one of daily, weekly, every_time, MPC
* `user_email` (`varchar(254)`, from `rcpt_to`) -- envelope recipient

At least that'd be perfectly adequate initially. If the line can't be parsed into <from_col>(::<cast>) (AS <to_col>)(,) -- @col <comment>, then do nothing.

I would prioritize in-line docs over a solution like leveraging the schema, even if it reduces functionality -- too much sad experience trying to keep separate files in sync!

On compilation, I was thinking of a dbt doc step that would compile them to a docs target directory, maybe initially to a single markdown file (a ToC would be cool). That could be run as part of dbt run, or not, I dunno.

On versioning, what do you have in mind?

@HarlanH
Copy link

HarlanH commented Aug 3, 2017

I wrote a quick-and-dirty standalone script that seems to do a fair amount of what I want. It's regex-heavy, so a bit fragile, but seems to be a decent start.

https://gist.github.com/HarlanH/277006989774372515c7130e63809315

@drewbanin
Copy link
Contributor

@HarlanH cool! Unsure if you're aware, but dbt generates a Networkx graph file in target/graph.gpickle when dbt compile or dbt run are invoked. That file contains all of the SQL models, as well as info about model materialization and macro/node dependencies. That file isn't currently documented anywhere, but you can read it with nx.read_gpickle(...) and take a look at the contents.

@pedromachados
Copy link

I also would prefer to define the comments in the model file, close to the query it documents. If the issue is parsing sql reliably, can we not define the commens in a block that is easy to isolate from the query, similar to the way configuration code is inserted? I would prefer a syntax that allows me to define start and end of comment block and to put comments in a very readable format. Something like yml could work but something less verbose would be even better.

As far a output, I would love to see an interactive, graphical representation of the network chart that shows the model doc and compiled sql upon you clicking a on a node. It would also be good to have a printable format but I don't know how to make this work with large graphs.

@HarlanH
Copy link

HarlanH commented Aug 7, 2017

Pedro, yeah, the biggest design trade-off seems to me to be whether you want to wrap the SELECT columns in non-SQL syntax, ala Drew's suggestion, or if you want to try to parse the SQL (either buggily with regexes, or with a real parser), or if you want to force the author to repeat herself.

Drew, does the existing dbt implementation actually run a SQL parser at any point?

@drewbanin
Copy link
Contributor

@HarlanH dbt does do an iota of sql parsing, but the plan is to get rid of that soon. I definitely think some sort of jinja function is the way to go here. For me, the core unanswered questions are:

  1. What's a sane interface for defining these comments?
  2. How much should dbt know about documentation?

For 1) I think we'll just need to play around with some different options. It could be jinja functions, or some sort of jinja block.... tbd

Point 2) is a little more open-ended. It's tempting to just write a ton of python code to generate markdown files, but I really like the idea of making dbt mostly unaware of things like documentation. Instead, dbt would provide all of the relevant information, and we could encode the documentation logic into something like a package which could very well be shipped along with dbt.

dbt already has a notion of "analysis" files which are compiled, but not executed. I think in a similar way, these docs would be "compiled" to markdown, but of course not executed. We'd just need to figure out templating, expose the list of models / comments in the context, and figure out how to change the output directory for the compiled markdown.

^ That's kind of where my head is at. Does that seem reasonable?

@HarlanH
Copy link

HarlanH commented Aug 7, 2017

Hm, I don't really like this syntax for SELECT clauses, especially for really complex column definitions:

{{ column('cat', comment='strays are not included') }},
{{ column('dog1', as='dog', comment='not the same as dog2!') }}

Does the contents of a {{ }} block have to be Python? And/or is there some other way to use jinja to extract information out of the SQL?

Could something like this be made to work, or does it violate dbt design assumptions?

{% col  one of daily, weekly, every_time, MPC %}
(CASE WHEN campaign_id LIKE 'job-notification-email - weekly %' THEN 'weekly'
		  WHEN campaign_id LIKE 'job-notification-email - daily %' THEN 'daily'
		  WHEN campaign_id LIKE 'MPC%' THEN 'MPC'
		  WHEN campaign_id LIKE '%every_time%' then 'every_time' 
		  ELSE NULL END)::varchar(25) {% AS %} campaign_type, 
{% endcol %}
{% col envelope recipient %}
rcpt_to::varchar(254) {% AS %} user_email, 
{% endcol %}
{% col user ID %}
id,
{% endcol %}

@clrcrl
Copy link
Contributor

clrcrl commented Nov 16, 2017

+1 to this feature! Ultimately, I want to build a databook (like this!) that is tightly coupled to my code - one thing I've thought about is if I could store these definitions in another table, similar to pg_table_def or information_schema.columns. That way I can pull out definitions pretty easily.

Something that I find interesting about this approach:

{{ column('cat', comment='strays are not included') }},
{{ column('dog1', as='dog', comment='not the same as dog2!') }}

is that you could then potentially open up the door to referencing previous columns when building fields (I miss this feature of LookML and find myself repeating code or having to write lots of cascading CTEs).

e.g. (running with cats and dogs, let's pretend this is a list of people in my office and the pets they own, and that the only pets you can have are cat or dogs)

{{ column('employee_id') }}
{{ column('number_of_cats', comment='strays are not included') }},
{{ column(col('number_of_cats') > 0 , as 'has_cat', comment='TRUE when number_of_cats > 0') }},
{{ column(col('number_of_cats') >= 3 , as 'is_crazy_cat_person', comment='TRUE when number_of_cats > 3') }},
{{ column('number_of_dogs', as='dog', comment='not the same as dog2!') }},
{{ column(col('number_of_dogs') > 0 , as 'has_dog', comment='TRUE when number_of_dogs > 0') }},
{{ column(col('has_cat') or col ('has_dog), as 'has_pet', comment='TRUE when has_cat or has_dog') }}

Agree that then gets a hard to read, and that probably strays too far from regular SQL, and also is not the point of this issue, but just a thought 😸

@norton120
Copy link

norton120 commented Nov 17, 2017

Funny I was thinking something exactly along the same lines as @HarlanH mentioned here parsing sql comments. What felt good about that was that it still serves some value as raw sql:

SELECT 
        number_of_cats, --@ strays not included
       number_of_dogs AS dog_number --@ also includes large hamsters

is still valid sql and even serves as local documentation if nothing else.
right now at the bottom of each analytics model we drop a config block with all the comments as post hooks; it is not perfect but at least they all live in the same file.

{{ config({
     "post-hook":[
          "COMMENT IF EXISTS ON COLUMN {{this}}.number_of_cats IS 'strays not included'",
          "COMMENT IF EXISTS ON COLUMN {{this}}.number_of_dogs IS 'also includes large hamsters'"]
})}}

Could probably clean it up further with a macro, but maybe doing something similar to what Drew said inside the models?

Another thing that seemed interesting would be a config value to "inherit" comments from ephemeral models - ie define the comments as far up the pipe as possible (or as close to the defining transforms as possible).

@HarlanH
Copy link

HarlanH commented Nov 18, 2017

Just a note that I've got a lot of columns now that are defined by macros, so column comments often can't extract a useful name from the pre-macro SQL (although I suppose if the comment-extractor worked on the compiled versions? hm...)

...
{{ my_macro('thing', 'that', 'constructs', 'one or more', 'columns') }}, -- @col what am I supposed to do with this?
...

@drewbanin drewbanin added dbt-docs [dbt feature] documentation site, powered by metadata artifacts and removed auto-generated docs labels Jun 25, 2018
@drewbanin drewbanin mentioned this issue Jun 26, 2018
@drewbanin
Copy link
Contributor

drewbanin commented Sep 4, 2018

Closing this - docs site is going out in 0.11.0 on 9/6 :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dbt-docs [dbt feature] documentation site, powered by metadata artifacts
Projects
None yet
Development

No branches or pull requests

7 participants