Adds an RFC to implement lineage backend #32

Merged
merged 2 commits into from
Apr 16, 2021

Member

@verdan verdan commented Mar 31, 2021

Signed-off-by: verdan [email protected]

@verdan verdan requested a review from a team as a code owner March 31, 2021 09:10
@verdan verdan changed the title from "Draft: Adds an RFC to implement lineage backend" to "Adds an RFC to implement lineage backend" Apr 1, 2021
Contributor

@dorianj dorianj left a comment

This is somewhat under-specified on its own (it could specify the exact models, etc.). However, given the prior RFC that went out and the existing code, I think figuring out the details during implementation should be fine.

No new concepts/definitions are being introduced as a part of this RFC.

Databuilder already has the table lineage model, which creates an upstream/downstream relation to add to the Neo4j graph.
The column lineage model, however, still needs to be developed as part of this RFC.
Contributor

It would be reasonable to define this in the RFC. That said, given there's already table lineage, and the metadata response formats are defined, I think the solution space is small enough it's probably OK to figure it out in the implementation

I agree, because we have already started to define Neo4j queries based on the previous RFC and on the lineage model available in amundsen-common. These queries are neither optimized nor complete, but I was planning to write queries along these lines:

    def get_lineage(self, *,
                    id: str,
                    resource_type: ResourceType, direction: str, depth: int) -> Lineage:

        get_both_lineage_query = textwrap.dedent(u"""
        MATCH (down_parent:Table)<-[downstream_len:DOWNSTREAM*..{depth_key}]-(child:Table {{key: $query_key }})-[upstream_len:UPSTREAM*..{depth_key}]->(up_parent:Table)
        WITH
        child.key as child_key
        ,collect(distinct{{level:LENGTH(upstream_len),source:'hive',key:up_parent.key}}) AS upstream_entities
        ,collect(distinct{{level:LENGTH(downstream_len),source:'hive',key:down_parent.key}}) AS downstream_entities
        RETURN
        collect({{
        key:child_key,direction:"both",depth:1
        ,upstream_entities:upstream_entities
        ,downstream_entities:downstream_entities
        }}) AS lineageOutput
        """).format(depth_key=depth)

        get_upstream_lineage_query = textwrap.dedent(u"""
        MATCH (child:Table {{key: $query_key }})-[upstream_len:UPSTREAM*..{depth_key}]->(up_parent:Table)
        WITH
        child.key as child_key
        ,collect(distinct{{level:LENGTH(upstream_len),source:'hive',key:up_parent.key}}) AS upstream_entities
        RETURN
        collect({{
        key:child_key,direction:"upstream",depth:1
        ,upstream_entities:upstream_entities
        }}) AS lineageOutput
        """).format(depth_key=depth)

        get_downstream_lineage_query = textwrap.dedent(u"""
        MATCH (down_parent:Table)<-[downstream_len:DOWNSTREAM*..{depth_key}]-(child:Table {{key: $query_key }})
        WITH
        child.key as child_key
        ,collect(distinct{{level:LENGTH(downstream_len),source:'hive',key:down_parent.key}}) AS downstream_entities
        RETURN
        collect({{
        key:child_key,direction:"downstream",depth:1
        ,downstream_entities:downstream_entities
        }}) AS lineageOutput
        """).format(depth_key=depth)

        if direction == 'upstream':
            records = self._execute_cypher_query(statement=get_upstream_lineage_query,
                                                 param_dict={'query_key': id})
        elif direction == 'downstream':
            records = self._execute_cypher_query(statement=get_downstream_lineage_query,
                                                 param_dict={'query_key': id})
        else:
            records = self._execute_cypher_query(statement=get_both_lineage_query,
                                                 param_dict={'query_key': id})

        result = records.single()['lineageOutput'][0]
        return result
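
For illustration only: the upstream and downstream statements above differ just in the relationship type and the output field name, so they could be produced by one helper. The sketch below is a hedged rewrite, not code from this PR. `build_lineage_query` and `VALID_DIRECTIONS` are hypothetical names, the "both" case and the hard-coded `source:'hive'` field are left out, and reporting the requested depth instead of the literal `1` is an assumption about the intent.

```python
import textwrap

# Hypothetical constant: the single directions this sketch supports.
VALID_DIRECTIONS = ("upstream", "downstream")


def build_lineage_query(direction: str, depth: int) -> str:
    """Return a Cypher statement for one lineage direction (illustrative only)."""
    if direction not in VALID_DIRECTIONS:
        raise ValueError(f"unsupported direction: {direction!r}")
    # depth is formatted into the query text, as in the snippet above,
    # so accept only a positive integer before interpolating it.
    if isinstance(depth, bool) or not isinstance(depth, int) or depth < 1:
        raise ValueError(f"depth must be a positive integer, got {depth!r}")

    # UPSTREAM and DOWNSTREAM are the relationship types used in the snippet.
    rel_type = direction.upper()
    return textwrap.dedent(f"""\
        MATCH (child:Table {{key: $query_key}})-[path:{rel_type}*..{depth}]->(parent:Table)
        WITH child.key AS child_key,
             collect(DISTINCT {{level: length(path), key: parent.key}}) AS entities
        RETURN {{key: child_key, direction: "{direction}", depth: {depth},
                {direction}_entities: entities}} AS lineageOutput
        """)
```

The table key would still be passed as the `$query_key` parameter when the statement is executed, as in the original `param_dict={'query_key': id}` calls. Note that this form returns no row when the table has no lineage in the requested direction, so a caller would need to handle an empty result.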


ref: https://github.com/amundsen-io/rfcs/blob/master/rfcs/025-lineage-stage-0.md

## Guide-level Explanation (aka Product Details)
Member

Could you describe whether the backend is only for graph, or also for Atlas or MySQL? Which ones does it not plan to support?

Contributor

added


No new concepts/definitions are being introduced as a part of this RFC.

Databuilder already has the table lineage model, which creates an upstream/downstream relation to add to the Neo4j graph.
Member

Do you plan to change the existing model?
If so, what will the new model interface be?

The current one doesn't account for the jobs/applications that generate the lineage in between. Do you plan to add those?

Contributor

clarified

Signed-off-by: Dorian Johnson <[email protected]>
Contributor

@dorianj dorianj left a comment

@feng-tao Added some clarifications to answer your questions, thanks for the feedback!


@dorianj dorianj merged commit 849181e into master Apr 16, 2021
@dorianj dorianj deleted the lineage-backend branch April 16, 2021 22:51
@verdan verdan added Status: Active and removed Status: Final Comment Period (FCP) On final comment period (seven days) labels Apr 19, 2021
allisonsuarez pushed a commit that referenced this pull request May 5, 2021
* Adds an rfc to implement lineage backend

Signed-off-by: verdan <[email protected]>

* lineage rfc: review feedback

Signed-off-by: Dorian Johnson <[email protected]>

Co-authored-by: Dorian Johnson <[email protected]>
Signed-off-by: Allison Suarez Miranda <[email protected]>