New node type: reports #2730
Comments
Awesome! I think we can keep this pretty simple to start. Would be good to think about how they participate in the DAG view and what information we render on the Exposure details pages.
I think this can be out-of-scope for dbt Core. Just producing a well-defined node in the manifest that contains edges out to specific nodes is going to be a great starting point. In this conversation, can we also consider what an exposure can state dependencies on? Is it only models? Or are seeds/snapshots/sources also supported? I think you could make a convincing argument for either case!
I think an optionality-preserving move would be to make the
Maybe we just make
If we're going to want to use this, do we love the label?
if you want to support refables and sources (which I think is a good idea!), what about just using
Users should already have at least some level of comfort with the behavior of ref/source. Allowing sources will probably be nice for testing, too (zero database operations required).
I was thinking just models to start, since I didn't think we'd lose out on anything big with that constraint. But I think using refable syntax (as @beckjake recommends) is better, so I'm on board with supporting anything that can be
I do! I like
Sounds great to me. It feels powerful to have this in a structured form, even if we don't know the exact intent.
You've got a fair point. I see this label as three things:
That's just what I'm thinking for now, though. I can't delineate between those use cases so strongly as to want different structured entry fields.
I am not sure about this approach: since IMO there should be a full decoupling between what is created in dbt and how it is consumed in any tool, I see the work of manually defining consumers / exposures / reports via additional metadata as work without end. Before you know it, someone has created a new report / LookML model / Cognos framework / etc. that you first must be alerted even exists, and then you need to add the linking metadata in dbt in hindsight...

Instead, what I suggest is to leverage the metadata collected by the various end-user databases that run end-user queries. As an example, my team and I built this for the combination of BigQuery and Looker at a previous client. For Snowflake, the database I work with most these days, the go-to metadata source for this would be this ACCOUNT_USAGE view: https://docs.snowflake.com/en/sql-reference/account-usage/query_history.html

Obviously, taking this route is a lot of work too / you'd need to start small and build an adapter per target database. An alternative would be to make dbt link / integrate with a solution like this: https://getmanta.com/about-the-manta-platform/ . I think something like this could solve the problem too. Not 'for free', I am afraid, though...
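The query-log mining idea above can be sketched minimally: given raw query text pulled from something like Snowflake's `QUERY_HISTORY` view, extract the relations each query reads and match them against known dbt models to infer which consumers depend on which models. Everything here (the model names, the regex, the `usage_by_consumer` helper) is a hypothetical illustration of the matching step, not the actual implementation described in this comment — real query logs need a proper SQL parser.

```python
import re
from collections import defaultdict

# Hypothetical set of dbt model relation names (schema.table) to match against.
DBT_MODELS = {"analytics.orders", "analytics.customers", "analytics.fct_sessions"}

# Naive extraction of relations mentioned after FROM/JOIN; illustration only.
RELATION_RE = re.compile(r"\b(?:from|join)\s+([a-z_]+\.[a-z_]+)", re.IGNORECASE)

def models_used(query_text):
    """Return the known dbt models a logged query reads from."""
    found = {m.lower() for m in RELATION_RE.findall(query_text)}
    return found & DBT_MODELS

def usage_by_consumer(query_log):
    """query_log: iterable of (consumer_tag, query_text) rows, e.g. pulled
    from a query-history view and tagged by the issuing tool."""
    usage = defaultdict(set)
    for consumer, query in query_log:
        usage[consumer] |= models_used(query)
    return dict(usage)

log = [
    ("looker", "SELECT * FROM analytics.orders o JOIN analytics.customers c ON o.id = c.id"),
    ("notebook", "select count(*) from analytics.fct_sessions"),
]
print(usage_by_consumer(log))
```

The interesting part is upstream of this sketch: attributing each logged query to a consumer (a Looker dashboard, a notebook) is tool-specific, which is exactly the per-adapter work mentioned above.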
@bashyroger I agree with a lot of what you said above. If you'll humor me, I think this is a yes/and rather than an either/or.

dbt could, and someday should, pull information from database query logs to get a sense of how models are being used in production. There would be tremendous benefit in identifying unused final models that are good candidates for deprecation, or shaky models that serve as cornerstones to crucial reports yet lack the commensurate testing. This tooling would be immensely valuable to dbt developers, and I completely agree that to such an end there should be a full decoupling between dbt resources and our ability to see how they're queried in the wild. dbt maintainers should see usage information for dashboards they know about and (especially) dashboards they don't. The matrix of tools here is a challenge, but not an insurmountable one. It sounds like you've made some significant progress on this for BigQuery + Looker. Any chance there are code snippets or approaches you'd be interested in contributing back? :D

At the same time, there's a lot of useful information already stored in dbt artifacts today—source freshness, test success—that we can and should put in front of data consumers and other downstream beneficiaries of dbt models. We think that means embedding it in the BI/application/usage layer. This is a subtle but significant distinction IMO:
To my mind, this roughly maps to the distinction, mentioned in Tristan's and Jia's "Are dashboards dead?" conversation, between ad hoc exploration on the one hand + dashboards with service-level guarantees on the other. We may see different, more precise tooling emerge to support each use case, which I really do believe to be quite different. Ultimately, I think dbt developers will want to know about both categories—in-the-wild querying and sanctioned usage—and we'll want to find compelling ways to integrate that information in the long run.
This was fixed in #2752
Hi @jtcohen6, I would certainly like to help with what you are asking. First, there is this code I shared a while ago on using the BigQuery audit log to look at costs in a granular way. As we have now automated the generation of sources.yml and schema in read-model files, I can imagine that we will use a similar approach, using the Snowflake query log (the analytical database used at my current client) to generate yml code that is compatible with the new reports node: #2752
**Edit: I previously called these `exposures`. I've changed it to `report` to keep it more tangible for this first version.**

dbt Core needs to have a semantic understanding of a `report`. A "report" is a generalization of a "dashboard", though it could also be an ML model, Jupyter notebook, R Shiny app, in-app data viz, etc.

The `report` node serves two purposes:

### Why?
`sources`: there's a big piece missing from the DAG today. Once downstream uses of dbt models are registered as `reports`, dbt project maintainers can start asking questions like:

### Tentative spec
_Edit: updated based on comments below_

`models/anything.yml`:
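The original yml spec block did not survive extraction. As a hedged sketch only — using the field names discussed in this thread (`depends_on`, `owner`, `type`, `maturity`), with the exact keys, nesting, and allowed values being my assumption rather than the final spec — it might look something like:

```yaml
version: 2

reports:
  - name: orders_dash
    type: dashboard          # assumed enum: dashboard, notebook, ml, application, ...
    maturity: high           # assumed free text or small enum
    owner:
      email: someone@example.com   # hypothetical; mapping to dbt Cloud users is an open question
    depends_on:
      # refable syntax, per the discussion above: models, seeds, snapshots, and sources
      - ref('orders')
      - source('erp', 'orders_raw')
```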
### Core functionality
By way of `depends_on`, I'd expect `dbt run -m +report:orders_dash` to run all upstream models, and `dbt test -m +report:orders_dash` to test all upstream nodes. `dbt run -m +report:*` would run all models upstream of all reports. A report cannot itself be run or have tests defined.

_Edit: I changed this syntax to feel more like the `source:` model selector method. Rationale: `dbt run -m orders_dash` has no effect; it's worth calling out that this is a special thing._

### Open questions
- `report`? I'll write up a related dbt-docs issue
- `depends_on` from `manifest.json`, compile the list of all upstream nodes, and then search in `run_results.json` and `sources.json` for all associated tests and freshness checks. Where should that mechanism live, exactly?
- `owner`: There is added benefit (and constraint) to tying this to a dbt Cloud user ID. Should we try to make a mapping via email only, instead?
- `type` + `maturity`: Should we constrain the set of options here? Offer them as free text fields? I like the eventual idea of summarizing things like: Maybe that's in dbt-docs, maybe that's an `ls` command, maybe it's a pipe dream. This is the piece that feels least critical for the first version.

@drewbanin to help suss out some answers
### Future work (not v1)

- `depends_on`? (Could reports depend on other reports?)
- `meta`)?
- `owner` to a dbt Cloud user by email? Purpose: configurable notification channels, asset rendering in dbt-docs