Skip to content

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Column level lineage #3226

Closed
sou-joshi opened this issue Apr 5, 2021 · 5 comments
Closed

Column level lineage #3226

sou-joshi opened this issue Apr 5, 2021 · 5 comments
Labels
discussion enhancement New feature or request

Comments

@sou-joshi
Copy link

Describe the feature

Currently have table/view level lineage captured, can this be extended to field/column based?

Additional context

Not specific to a database, applies to the product.

Who will this benefit?

There could be multiple renames to a field and with multiple joins the complexity to track field level changes and source of a particular column becomes difficult. From an Ops perspective it always good to know where a field is coming from to quickly solve data issues.

@sou-joshi sou-joshi added enhancement New feature or request triage labels Apr 5, 2021
@jtcohen6 jtcohen6 added discussion and removed triage labels Apr 8, 2021
@jtcohen6
Copy link
Contributor

jtcohen6 commented Apr 8, 2021

@dataexpertz-blr Thanks for opening, I'm surprised there wasn't already an issue for this :) It's something we're hearing and talking about a lot these days.

Mechanisms

I view column-level lineage as existing in two orders of complexity:

  • Extending existing constructs. Today, dbt developers have to duplicate a lot of resource properties (descriptions, tags, meta, tests) across models, even when model Y is just select * from model X. Over in Doc (and potentially, Test) Inheritance #2995, and dbt doc blocks #1158 before then, we've been discussing ways that YAML anchors make a version of this possible today, and how it could be better in the future (cross-file anchors, souped-up docs blocks, hierarchical properties-as-config).
  • Massive new capabilities. To capture column-level lineage for real for real, we'd need a validating SQL grammar—same as would, incidentally, for a decent linter / auto-formatter (Automatic formatter for SQL #2356). In a world where we had this, and built it into dbt, we'd also have an AST representation of every column name, from relation, and SQL function. This is definitely on our minds; it's still a ways away.

Use cases

As with any compelling feature, column-level lineage feels both immensely valuable and a bit vague. If dbt could produce an EXPLAIN-style plan, of every single SQL function performed to produce a single column, that would be very cool, and also tricky to read and reason about as a human being.

So I do find it useful to think concretely about the kinds of things we'd hope to enable here:

  • Property (tag/meta/etc) inheritance. E.g. If a source PII/PHI column is the indirect input to a column in a downstream model, being able to mark the latter model as sensitive.
  • Saving code (+ time). Extending what I described above, if a column has not transformed from model X to model Y—no renames, no aggregations—dbt could natively inherit its properties, such as description + tests.
    • Or, better yet, dbt would understand that the column has been tested upstream, has not changed, and so does not need to run the same tests again.
  • "dbt column advisor." If dbt has a full picture of how a column is produced—which input columns, which transformations—it could flag when there are potentially duplicative columns across models, and help avoid the repeating of business logic.
  • Slim CI to the max: If you've only changed one column across a few models, rather than running the changed models and all their children, you'd only need to run + test downstream models that use the affected column.

I'm curious to hear what other things come to mind!

@bashyroger
Copy link

+4 for the use case you've mentioned @jtcohen6 , I'd like to add the following, IMO important use case:

  • A variant of Slim CI to the max: being able to do an impact analysis at development time when you're changing a source / model's column name. After all, the fact that a column changes does not necessarily mean that the column also used in any dependent models. For that, you really need column level lineage.

Instead of building this yourself, you could also think about integrating DBT with the only data lineage focused SAAS product I have heard of: https://getmanta.com/integrations/

@tufanrakshit
Copy link

we are planning to use DBT Cloud for our project and this table by table lineage is really a killer feature which we would like to have as This would make data lineage and debugging much much easier

@devstein
Copy link

Hi 👋 , does anyone know how Datafold + DBT claims to provide this?

@jaypeedevlin
Copy link
Contributor

Datafold use their own lineage capabilities to do this — while they do read in your dbt project, the column level lineage is part of their platform (it's an awesome feature though!)

@dbt-labs dbt-labs locked and limited conversation to collaborators Dec 8, 2021
@jtcohen6 jtcohen6 converted this issue into discussion #4458 Dec 8, 2021

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →

Labels
discussion enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

6 participants