Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor flowchart models for modular pipelines change #1937

Conversation

ravi-kumar-pilla
Copy link
Contributor

@ravi-kumar-pilla ravi-kumar-pilla commented Jun 10, 2024

Description

Partially resolves - #1899

Development notes

  • Removes namespace property from GraphNode and retain the property only for TaskNode following Kedro Framework
  • The data type of modular pipelines is changed to Optional[Set(str)]
  • Updated all the create_node class methods like (create_task_node, create_data_node, create_parameters_node etc) to accept modular_pipelines parameter as we are populating modular pipelines before creating the nodes and associating each node with the associated modular pipeline during node creation.
  • Remove internal_inputs/outputs and external_inputs/outputs from ModularPipelineNode. We will only have inputs/outputs for a ModularPipelineNode (i.e., the datasets which are free inputs/outputs as per Kedro)
  • Update tests

QA notes

  • This PR is part of a bigger refactor (Refactor Namespace Pipelines  #1897) of modular pipelines and is created to ease review process.
  • The CI build might fail as this PR is not self-sufficient

Checklist

  • Read the contributing guidelines
  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added new entries to the RELEASE.md file
  • Added tests to cover my changes

@ravi-kumar-pilla ravi-kumar-pilla mentioned this pull request Jun 10, 2024
9 tasks
Signed-off-by: ravi-kumar-pilla <[email protected]>
Signed-off-by: ravi-kumar-pilla <[email protected]>
Copy link
Member

@merelcht merelcht left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the description you mention

Removes namespace property from GraphNode and retain the property only for TaskNode following Kedro Framework

and I see a lot of methods related to namespaces have been removed. Can you explain why that change was needed and how namespaces are now set and used in Viz?

@ravi-kumar-pilla
Copy link
Contributor Author

In the description you mention

Removes namespace property from GraphNode and retain the property only for TaskNode following Kedro Framework

and I see a lot of methods related to namespaces have been removed. Can you explain why that change was needed and how namespaces are now set and used in Viz?

So earlier we used to have namespaces at the GraphNode level (which includes TaskNode, DataNode and others). After having discussion with Ivan, we realized datasets were never a part of a namespace and to have the same schema as Kedro, we removed the namespace from GraphNode and retained it to TaskNode.

For a TaskNode namespace is set as before via kedro object like -

    @field_validator("namespace")
    @classmethod
    def set_namespace(cls, _, info: ValidationInfo):
        return info.data["kedro_obj"].namespace

For other nodes, we do not need namespace. Lot of methods related to namespace like _expand_namespaces, _get_namespace in GraphNode are not required as they were used to calculate the modular_pipelines each node belongs to . Since we shifted the logic ( as you mentioned here ) before creating the nodes, we do not need them now.

Thank you

@ravi-kumar-pilla ravi-kumar-pilla marked this pull request as ready for review June 24, 2024 15:02
Copy link
Member

@idanov idanov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice :)

# https://kedro.readthedocs.io/en/latest/06_nodes_and_pipelines/03_modular_pipelines.html#how-to-connect-existing-pipelines
internal_inputs: Set[str] = Field(
set(), description="The dataset inputs within the modular pipeline node"
inputs: Set[str] = Field(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much clearer now 👌

description="""The dataset outputs connecting the modular
pipeline node with other modular pipelines""",

outputs: Set[str] = Field(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So much clearer now 👌

name=dataset_name,
tags=tags,
layer=layer,
kedro_obj=dataset,
is_free_input=is_free_input,
stats=stats,
modular_pipelines=modular_pipelines,
)

@classmethod
def create_parameters_node(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These factory methods look more and more useless to me. They seem to be simply indirections to the constructors and making the creating of the specific nodes more verbose without any benefit from having them. We should revisit them at some point.

Returns:
An instance of TaskNode.
"""
node_name = node._name or node._func_name
return TaskNode(
id=cls._hash(str(node)),
id=node_id,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By not deriving the node_id automatically, we make these create methods even more useless.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants