New node type: reports #2730

Closed
jtcohen6 opened this issue Aug 28, 2020 · 7 comments

@jtcohen6
Contributor

jtcohen6 commented Aug 28, 2020

Edit: I previously called these exposures. I've changed the name to "report" to keep it more tangible for this first version.

dbt Core needs to have a semantic understanding of a report. A "report" is a generalization of a "dashboard", though it could also be an ML model, Jupyter notebook, R Shiny app, in-app data viz, etc.

The report node serves two purposes:

  1. Define a set of models as dependencies
  2. Define metadata to populate an embed tile and special dbt-docs "landing page"

Why?

  • We want to be able to push information about data quality and lineage from dbt into external tools. Exposures should be the Core side of that contract.
  • The same reason we added sources: there's a big piece missing from the DAG today. Once downstream uses of dbt models are registered as reports, dbt project maintainers can start asking questions like:
    • Which models are our most critical dependencies?
    • Which final models are only used by exposures with very few/irregular views?
    • Which final models aren't used at all?

Tentative spec

Edit: updated based on comments below

models/anything.yml

version: 2

exposures:
  - name: orders_dash

    type: dashboard # one of: {dashboard, notebook, analysis, ml, application}
    url: https://fishtown.looker.com/dashboards/1
    maturity: high # i.e. importance/usage/QA/SLA/etc. one of: {low, medium, high}

    description: >
      This is a dashboard of monthly orders, with customer
      attributes as context.

    depends_on:
      - ref('fct_orders')
      - ref('dim_customer')
      - source('marketplace', 'currency_conversions')

    owner:
      email: [email protected] # required
      name: "Jeremy from FP&A" # optional

Core functionality

By way of depends_on, I'd expect dbt run -m +report:orders_dash to run all upstream models, and dbt test -m +report:orders_dash to test all upstream nodes. dbt run -m +report:* would run all models upstream of all reports. A report cannot itself be run or have tests defined.

Edit: I changed this syntax to feel more like the source: model selector method. Rationale: dbt run -m orders_dash has no effect; it's worth calling out that this is a special thing.
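
To make the selection behaviour above concrete, here is a minimal sketch of how `+report:orders_dash` could resolve to a set of runnable models. It is not dbt's implementation: the `PARENTS` map, the node ids, and both helper functions are hypothetical and exist only for illustration.

```python
from collections import deque

# Hypothetical parent map: node id -> nodes it directly depends on.
PARENTS = {
    "report.orders_dash": [
        "model.fct_orders",
        "model.dim_customer",
        "source.marketplace.currency_conversions",
    ],
    "model.fct_orders": ["model.stg_orders"],
    "model.dim_customer": ["model.stg_customers"],
    "model.stg_orders": [],
    "model.stg_customers": [],
    "source.marketplace.currency_conversions": [],
}


def upstream(node_id, parents):
    """Return every ancestor of node_id, excluding node_id itself."""
    seen = set()
    queue = deque(parents.get(node_id, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        queue.extend(parents.get(current, []))
    return seen


def select_runnable(node_id, parents):
    """Keep only models: the report itself and its sources cannot be run."""
    return sorted(n for n in upstream(node_id, parents) if n.startswith("model."))


print(select_runnable("report.orders_dash", PARENTS))
```

The report node only contributes edges: it is never part of the result, which matches the rule that a report cannot itself be run.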

Open questions

  1. What should be the docs "landing page" for a report? I'll write up a related dbt-docs issue
  2. Crucially, we'll need a mechanism that can parse a report's depends_on from manifest.json, compile the list of all upstream nodes, and then search in run_results.json and sources.json for all associated tests and freshness checks. Where should that mechanism live, exactly?
  3. owner: There is added benefit (and constraint) to tying this to a dbt Cloud user ID. Should we try to make a mapping via email only, instead?
  4. type + maturity: Should we constrain the set of options here? Offer them as free text fields? I like the eventual idea of summarizing things like:
fct_orders is directly exposed in:
3 dashboards, of varying maturity (high: 2, medium: 1)
1 low-maturity ML pipeline
2 medium-maturity apps

fct_orders indirectly powers:
2 medium-maturity dashboards

Maybe that's in dbt-docs, maybe that's an ls command, maybe it's a pipe dream. This is the piece that feels least critical for the first version.
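
A rough sketch of how such a summary could be computed, assuming `type` and `maturity` are constrained to known values. The `reports` list and the `summarize` and `render` helpers are hypothetical, and the output format only approximates the example above.

```python
from collections import Counter

# Hypothetical reports that depend directly on fct_orders.
reports = [
    {"name": "orders_dash", "type": "dashboard", "maturity": "high"},
    {"name": "revenue_dash", "type": "dashboard", "maturity": "high"},
    {"name": "ops_dash", "type": "dashboard", "maturity": "medium"},
    {"name": "churn_model", "type": "ml", "maturity": "low"},
    {"name": "admin_app", "type": "application", "maturity": "medium"},
    {"name": "partner_app", "type": "application", "maturity": "medium"},
]


def summarize(reports):
    """Group reports by type, counting how many fall under each maturity."""
    by_type = {}
    for report in reports:
        by_type.setdefault(report["type"], Counter())[report["maturity"]] += 1
    return by_type


def render(model, reports):
    """Build a plain-text summary, one line per report type."""
    lines = [f"{model} is directly exposed in:"]
    for type_, counts in sorted(summarize(reports).items()):
        total = sum(counts.values())
        detail = ", ".join(f"{m}: {n}" for m, n in sorted(counts.items()))
        lines.append(f"{total} x {type_} ({detail})")
    return "\n".join(lines)


print(render("fct_orders", reports))
```

Grouping like this only works reliably if the two fields are enumerated; free-text values would fragment the counts.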

@drewbanin to help suss out some answers

Future work (not v1)

  • Can reports declare non-model nodes in their depends_on? (Could reports depend on other reports?)
  • Can reports modify expectations of upstream tests? I could imagine overriding a test severity, or defining a set of tests to exclude from consideration.
  • Can reports be tagged?
  • Can they accept arbitrary additional fields (i.e. meta)?
  • Can we closely tie owner to a dbt Cloud user by email? Purpose: configurable notification channels, asset rendering in dbt-docs
@jtcohen6 jtcohen6 added the enhancement New feature or request label Aug 28, 2020
@jtcohen6 jtcohen6 modified the milestones: Marian Anderson, 0.18.1 Aug 28, 2020
@drewbanin
Contributor

What should be the docs "landing page" for an exposure? I'll write up a related dbt-docs issue

Awesome! I think we can keep this pretty simple to start. Would be good to think about how they participate in the DAG view and what information we render on the Exposure details pages.

Crucially, we'll need a mechanism that can parse an exposure's depends_on from manifest.json, compile the list of all upstream nodes, and then search in run_results.json and sources.json for all associated tests and freshness checks

I think this can be out-of-scope for dbt Core. Just producing a well-defined node in the manifest that contains edges out to specific nodes is going to be a great starting point. In this conversation, can we also consider what an exposure can state dependencies on? Is it only models? Or are seeds/snapshots/sources also supported? I think you could make a convincing argument for either case!
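
If that lookup does live outside dbt Core, an external consumer might do something like the sketch below. The artifact shapes are simplified assumptions, not the real schemas: it assumes the manifest exposes a `parent_map`, that the report appears in it as a node, and that each run result carries a `unique_id` and a `status`. All node ids are made up.

```python
# Assumed, simplified artifact shapes; the real files may differ.
manifest = {
    "parent_map": {
        "report.shop.orders_dash": ["model.shop.fct_orders"],
        "model.shop.fct_orders": ["model.shop.stg_orders"],
        "model.shop.stg_orders": [],
        "model.shop.other": [],
        "test.shop.not_null_fct_orders_id": ["model.shop.fct_orders"],
        "test.shop.unique_stg_orders_id": ["model.shop.stg_orders"],
        "test.shop.unique_other_id": ["model.shop.other"],
    }
}
run_results = {
    "results": [
        {"unique_id": "test.shop.not_null_fct_orders_id", "status": "pass"},
        {"unique_id": "test.shop.unique_stg_orders_id", "status": "fail"},
        {"unique_id": "test.shop.unique_other_id", "status": "pass"},
    ]
}


def ancestors(node_id, parent_map):
    """Collect all transitive parents of node_id."""
    found = set()
    stack = list(parent_map.get(node_id, []))
    while stack:
        node = stack.pop()
        if node not in found:
            found.add(node)
            stack.extend(parent_map.get(node, []))
    return found


def upstream_test_statuses(report_id, manifest, run_results):
    """Map each test attached to an upstream node to its latest status."""
    parent_map = manifest["parent_map"]
    upstream_nodes = ancestors(report_id, parent_map)
    relevant_tests = {
        node
        for node, parents in parent_map.items()
        if node.startswith("test.") and upstream_nodes.intersection(parents)
    }
    return {
        result["unique_id"]: result["status"]
        for result in run_results["results"]
        if result["unique_id"] in relevant_tests
    }


print(upstream_test_statuses("report.shop.orders_dash", manifest, run_results))
```

The same join could be repeated against source freshness results; the only thing Core has to guarantee is the well-defined node and its edges.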

owner: There is added benefit (and constraint) to tying this to a dbt Cloud user ID. Should we try to make a mapping via email only, instead?

I think an optionality-preserving move would be to make the owner config a dict. Maybe something like:

owner:
  email: [email protected]

owner:
  name: Alice Dashmacher
  email: [email protected]

owner:
  name: Alice Dashmacher
  email: [email protected]
  slack: @aliced

Maybe we just make email required and leave room for other fields to be added in the future? You buy it?
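
A minimal sketch of that rule, with `email` required and every other key passed through untouched so new fields can be added later. `validate_owner` and the example address are hypothetical; the email check is deliberately loose.

```python
def validate_owner(owner):
    """Require a mapping with an email; keep any additional keys as-is."""
    if not isinstance(owner, dict):
        raise ValueError("owner must be a mapping")
    email = owner.get("email")
    if not isinstance(email, str) or "@" not in email:
        raise ValueError("owner.email is required")
    return dict(owner)


# Placeholder values for illustration only.
print(validate_owner({"email": "alice@example.com"}))
print(validate_owner({"name": "Alice Dashmacher", "email": "alice@example.com", "slack": "@aliced"}))
```

Because unknown keys are preserved rather than rejected, adding `slack` or similar later would not invalidate existing files.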

type + maturity: Should we constrain the set of options here? Offer them as free text fields?

If we're going to want to use this type info in structured ways, it will probably benefit us to restrict these to an enumerated set of values up-front. That said, I don't feel like we can close over the set of possible exposures that people are creating out there in the wild! My instinct is that we should start with a small set of type values ({dashboard, notebook, analysis, ml, application}) and plan to add more entries or make it free-text in the future if desired!
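
One way to express that starting point is as enumerations that reject unknown values with a helpful message. This is only a sketch: `ReportType`, `Maturity`, and `parse_enum` are hypothetical names, and the members come from the values listed in this thread.

```python
from enum import Enum


class ReportType(str, Enum):
    DASHBOARD = "dashboard"
    NOTEBOOK = "notebook"
    ANALYSIS = "analysis"
    ML = "ml"
    APPLICATION = "application"


class Maturity(str, Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


def parse_enum(enum_cls, raw):
    """Convert a raw YAML string into an enum member, or explain what is allowed."""
    try:
        return enum_cls(raw)
    except ValueError:
        allowed = ", ".join(member.value for member in enum_cls)
        raise ValueError(f"{raw!r} is not one of: {allowed}") from None


print(parse_enum(ReportType, "dashboard"))
print(parse_enum(Maturity, "high"))
```

Adding an entry later is a one-line change, and relaxing to free text would mean replacing `parse_enum` with a pass-through.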

Do we love the label maturity? It's the name that seems most sensible to me, but I am unsure if it encapsulates the entirety of importance/usage/SLAs/etc. that we want it to convey. My fear would be overloading one label and trying to make it do too many things. If you've got other ideas, I'd love to discuss them; otherwise I think we should just pick something sensible and roll with it!

@beckjake
Contributor

if you want to support refables and sources (which I think is a good idea!), what about just using ref and source?

    depends_on:
      - ref('fct_orders')
      - ref('dim_customer')
      - source('raw', 'countrycodes')

Users should already be at least somewhat comfortable with the behavior of ref/source.

Allowing sources will probably be nice for testing, too (zero database operations required).
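
As a rough illustration of reading those `depends_on` entries, the sketch below parses each string into a tagged tuple. It is an assumption-laden stand-in, not how dbt resolves `ref` and `source`: the regular expressions and `parse_dependency` are hypothetical and only handle simple quoted literals.

```python
import re

# Accept single- or double-quoted names; anything fancier is rejected.
REF = re.compile(r"""^ref\(\s*['"]([^'"]+)['"]\s*\)$""")
SOURCE = re.compile(r"""^source\(\s*['"]([^'"]+)['"]\s*,\s*['"]([^'"]+)['"]\s*\)$""")


def parse_dependency(expr):
    """Turn "ref('x')" or "source('a', 'b')" into a tagged tuple."""
    expr = expr.strip()
    match = REF.match(expr)
    if match:
        return ("ref", match.group(1))
    match = SOURCE.match(expr)
    if match:
        return ("source", match.group(1), match.group(2))
    raise ValueError(f"unsupported dependency: {expr!r}")


depends_on = [
    "ref('fct_orders')",
    "ref('dim_customer')",
    "source('raw', 'countrycodes')",
]
print([parse_dependency(entry) for entry in depends_on])
```

Since nothing here touches a database, a check like this could run purely at parse time.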

@jtcohen6
Contributor Author

jtcohen6 commented Aug 28, 2020

In this conversation, can we also consider what an exposure can state dependencies on? Is it only models? Or are seeds/snapshots/sources also supported? I think you could make a convincing argument for either case!

I was thinking just models to start, since I wasn't thinking that we'd lose out on anything big with that constraint. But I think using refable syntax (as @beckjake recommends) is better, so I'm on board with supporting anything that can be ref'd or source'd.

I think an optionality-preserving move would be to make the owner config a dict... You buy it?

I do! I like email (required) and name (optional) to start out.

My instinct is that we should start with a small set of type values ({dashboard, notebook, analysis, ml, application}) and plan to add more entries or make it free-text in the future if desired!

Sounds great to me. It feels powerful to have this in a structured form, even if we don't know the exact intent.

I am unsure if it encapsulates the entirety of importance/usage/SLAs/etc. that we want it to convey.

You've got a fair point. I see this label as three things:

  • Subjective signal for data quality, whereas objective signal will be source freshness + tests passing. Based on the user-supplied value of maturity, we could display a separate color in the tile / docs, or combine subjective + objective together.
  • Subtext alongside owner: Should a data consumer email the owner right away when they see something that doesn't make sense? Is the owner staking their reputation behind it? It's a more structured version of the owner putting **WARNING: WIP** in the exposure description.
  • Context for dbt project maintainers: If a to-be-deprecated model has downstream exposures that are all low-maturity, they'd feel more comfortable moving forward with the deprecation.

That's just what I'm thinking for now, though. I can't delineate between those use cases so strongly as to want different structured entry fields.

@jtcohen6 jtcohen6 changed the title New node type: exposures New node type: reports Sep 1, 2020
@beckjake beckjake self-assigned this Sep 2, 2020
@beckjake beckjake mentioned this issue Sep 14, 2020
4 tasks
@bashyroger

bashyroger commented Sep 16, 2020

I am not sure about this approach. IMO there should be a full decoupling between what is created in DBT and how it is consumed in any tool, so I see the work of manually defining consumers / exposures / reports via additional metadata as work without end. Before you know it, someone has created a new report / lookml model / cognos framework / etc. that you must first be alerted to before you even know it exists. And then you need to add the linking metadata in DBT in hindsight...

Instead, what I suggest is to leverage the metadata collected by the various databases that run end-user queries:
Use the database query log to automate the missing link between database objects created by DBT and how they are consumed by end-user tools.

As an example, my team and I built this for the combination of BigQuery and Looker at a previous client.
Looker queries are easy to detect in the query log, as Looker injects a piece of JSON before each query containing various Looker-internal metadata.

For Snowflake, the database I currently work with most, the go-to metadata source for this would be this ACCOUNT_USAGE view: https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html

Obviously, taking this route is a lot of work too / you'll need to start small and make an adapter per target database.
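
To give a feel for the "start small" version of this idea, here is a sketch that counts how often known relations appear in logged query text. Everything in it is hypothetical: the `query_log` rows, the `query_text` key, and the relation names are invented, and plain name matching is a crude stand-in for real SQL parsing or tool-specific metadata.

```python
import re
from collections import Counter


def tables_referenced(query_text, known_relations):
    """Return the known relations whose full name appears in the query text."""
    text = query_text.lower()
    found = set()
    for relation in known_relations:
        # Reject matches that are part of a longer identifier.
        pattern = r"(?<![\w.])" + re.escape(relation.lower()) + r"(?!\w)"
        if re.search(pattern, text):
            found.add(relation)
    return found


def usage_counts(query_log, known_relations):
    """Count the number of logged queries touching each known relation."""
    counts = Counter()
    for entry in query_log:
        for relation in tables_referenced(entry["query_text"], known_relations):
            counts[relation] += 1
    return counts


# Invented sample data for illustration only.
known = ["analytics.fct_orders", "analytics.dim_customer", "analytics.fct_refunds"]
query_log = [
    {"query_text": "SELECT * FROM analytics.fct_orders o JOIN analytics.dim_customer c ON o.customer_id = c.id"},
    {"query_text": "select count(*) from analytics.fct_orders"},
    {"query_text": "select * from analytics.fct_orders_daily"},
]

counts = usage_counts(query_log, known)
unused = [relation for relation in known if counts[relation] == 0]
print(counts, unused)
```

A real adapter would read from the warehouse's own query history and would need to handle quoting, aliases, and views.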

An alternative would be to make DBT link / integrate with a solution like Manta (https://getmanta.com/about-the-manta-platform/). I think something like this could solve the problem too, though not 'for free', I am afraid...

@jtcohen6
Contributor Author

jtcohen6 commented Sep 16, 2020

@bashyroger I agree with a lot of what you said above. If you'll humor me, I think this is a yes/and rather than an either/or.

dbt could, and someday should, pull information from database query logs to get a sense of how models are being used in production. There would be tremendous benefit in identifying unused final models that are good candidates for deprecation, or shaky models that serve as cornerstones to crucial reports yet lack the commensurate testing. This tooling would be immensely valuable to dbt developers, and I completely agree that to such an end there should be a full decoupling between dbt resources and our ability to see how they're queried in the wild. dbt maintainers should see usage information for dashboards they know about and (especially) dashboards they don't.

The matrix of tools here is a challenge, but not an insurmountable one. It sounds like you've made some significant progress on this for BigQuery + Looker. Any chance there are code snippets or approaches you'd be interested in contributing back? :D

At the same time, there's a lot of useful information already stored in dbt artifacts today—source freshness, test success—that we can and should put in front of data consumers and other downstream beneficiaries of dbt models. We think that means embedding it in the BI/application/usage layer. This is a subtle but significant distinction IMO:

  • The target audience is data consumers, not dbt developers. To be honest, this is a much broader audience, and it's a role I have a harder time empathizing with, since I've never had it. We're especially open to feedback here, and will likely be looking for beta testers...
  • The goal here is not to enumerate an exhaustive registry of all downstream usage, or to keep up with every ad hoc report/lookml model/cognos framework/etc; I agree that's an impossible task. Rather, we want to give dbt maintainers a "seal of approval" that they can put on trusted reports. We're making a claim here that the most impactful, established, reliable dashboards/notebooks/apps/analyses should offer their viewers a set of expectations about data quality, a status check on those expectations, a link back to a specialized landing page in dbt-docs, and an action step (contacting the owner) if those expectations are unmet.

To my mind, this roughly maps to the distinction, mentioned in Tristan's and Jia's "Are dashboards dead?" conversation, between ad hoc exploration on the one hand + dashboards with service-level guarantees on the other. We may see different, more precise tooling emerge to support each use case, which I really do believe to be quite different. Ultimately, I think dbt developers will want to know about both categories—in-the-wild querying and sanctioned usage—and we'll want to find compelling ways to integrate that information in the long run.

@beckjake
Contributor

This was fixed in #2752

@bashyroger

Hi @jtcohen6, I surely would want to help with what you are asking.

First, there is this code I shared a while ago on using the BigQuery audit log to look at costs in a granular way:
https://github.com/RogerDataNL/BigQuerySQL/tree/master/QueryLog
Changing this to look for the link between the end of a current DBT DAG (resulting in a view / table being deployed) and the end-user reports / queries / dashboards built on them would boil down to 'doing more of the same' work.

As we have now automated the generation of sources.yml and schema in read model files, I can imagine that we will take a similar approach, using the Snowflake query log (Snowflake being the analytical database at my current client) to generate yml code that is compatible with the new reports node: #2752
