Column level lineage #3226

sou-joshi · 2021-04-05T17:38:06Z

Describe the feature

Currently have table/view level lineage captured, can this be extended to field/column based?

Additional context

Not specific to a database, applies to the product.

Who will this benefit?

There could be multiple renames to a field and with multiple joins the complexity to track field level changes and source of a particular column becomes difficult. From an Ops perspective it always good to know where a field is coming from to quickly solve data issues.

jtcohen6 · 2021-04-08T18:01:02Z

@dataexpertz-blr Thanks for opening, I'm surprised there wasn't already an issue for this :) It's something we're hearing and talking about a lot these days.

Mechanisms

I view column-level lineage as existing in two orders of complexity:

Extending existing constructs. Today, dbt developers have to duplicate a lot of resource properties (descriptions, tags, meta, tests) across models, even when model Y is just select * from model X. Over in Doc (and potentially, Test) Inheritance #2995, and dbt doc blocks #1158 before then, we've been discussing ways that YAML anchors make a version of this possible today, and how it could be better in the future (cross-file anchors, souped-up docs blocks, hierarchical properties-as-config).
Massive new capabilities. To capture column-level lineage for real for real, we'd need a validating SQL grammar—same as would, incidentally, for a decent linter / auto-formatter (Automatic formatter for SQL #2356). In a world where we had this, and built it into dbt, we'd also have an AST representation of every column name, from relation, and SQL function. This is definitely on our minds; it's still a ways away.

Use cases

As with any compelling feature, column-level lineage feels both immensely valuable and a bit vague. If dbt could produce an EXPLAIN-style plan, of every single SQL function performed to produce a single column, that would be very cool, and also tricky to read and reason about as a human being.

So I do find it useful to think concretely about the kinds of things we'd hope to enable here:

Property (tag/meta/etc) inheritance. E.g. If a source PII/PHI column is the indirect input to a column in a downstream model, being able to mark the latter model as sensitive.
Saving code (+ time). Extending what I described above, if a column has not transformed from model X to model Y—no renames, no aggregations—dbt could natively inherit its properties, such as description + tests.
- Or, better yet, dbt would understand that the column has been tested upstream, has not changed, and so does not need to run the same tests again.
"dbt column advisor." If dbt has a full picture of how a column is produced—which input columns, which transformations—it could flag when there are potentially duplicative columns across models, and help avoid the repeating of business logic.
Slim CI to the max: If you've only changed one column across a few models, rather than running the changed models and all their children, you'd only need to run + test downstream models that use the affected column.

I'm curious to hear what other things come to mind!

bashyroger · 2021-04-26T12:47:19Z

+4 for the use case you've mentioned @jtcohen6 , I'd like to add the following, IMO important use case:

A variant of Slim CI to the max: being able to do an impact analysis at development time when you're changing a source / model's column name. After all, the fact that a column changes does not necessarily mean that the column also used in any dependent models. For that, you really need column level lineage.

Instead of building this yourself, you could also think about integrating DBT with the only data lineage focused SAAS product I have heard of: https://getmanta.com/integrations/

tufanrakshit · 2021-07-22T13:10:06Z

we are planning to use DBT Cloud for our project and this table by table lineage is really a killer feature which we would like to have as This would make data lineage and debugging much much easier

devstein · 2021-09-23T00:22:40Z

Hi 👋 , does anyone know how Datafold + DBT claims to provide this?

jaypeedevlin · 2021-09-23T00:26:22Z

Datafold use their own lineage capabilities to do this — while they do read in your dbt project, the column level lineage is part of their platform (it's an awesome feature though!)

sou-joshi added enhancement New feature or request triage labels Apr 5, 2021

jtcohen6 added discussion and removed triage labels Apr 8, 2021

jsnb-devoted mentioned this issue Nov 3, 2021

[Feature] "Slim CI" but with LIMIT 0 #4201

Closed

1 task

dbt-labs locked and limited conversation to collaborators Dec 8, 2021

jtcohen6 converted this issue into discussion #4458 Dec 8, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

This issue was moved to a discussion.

Column level lineage #3226

Column level lineage #3226

sou-joshi commented Apr 5, 2021

jtcohen6 commented Apr 8, 2021

bashyroger commented Apr 26, 2021

tufanrakshit commented Jul 22, 2021

devstein commented Sep 23, 2021

jaypeedevlin commented Sep 23, 2021

This issue was moved to a discussion.

This issue was moved to a discussion.

Column level lineage #3226

Column level lineage #3226

Comments

sou-joshi commented Apr 5, 2021

Describe the feature

Additional context

Who will this benefit?

jtcohen6 commented Apr 8, 2021

Mechanisms

Use cases

bashyroger commented Apr 26, 2021

tufanrakshit commented Jul 22, 2021

devstein commented Sep 23, 2021

jaypeedevlin commented Sep 23, 2021

This issue was moved to a discussion.