[Draft] Node grouping #4427
base: main
Conversation
Signed-off-by: Ankita Katiyar <[email protected]>
Thanks for the PR, @ankatiyar! That’s a great algorithm!
I have a few questions and comments:
- I think the `Pipeline` class is an excellent place for this API.
- I agree with you that we should not return just the names but also include object types. This would allow the plugin to understand what exactly needs to be executed (e.g., a node or a namespace).
- Additionally, I think it would be more beneficial to return a topologically sorted list (similar to how `pipeline.nodes` currently works) instead of a dictionary. This way, the list can be executed in order. Specifically, the format could be `[object_name, object_type, full_list_of_nodes]`, where:
  - `object_name`: the name of the object, such as a namespace or a node.
  - `object_type`: the type of the object, either a `namespace` or a `node`.
  - `full_list_of_nodes`: a list of all nodes included under the `object_name`. For example, if the `object_name` is a namespace, it will contain all the nodes within that namespace; if the `object_name` is a node, this list would just contain the node itself.
  - `list_of_dependencies` (to consider): perhaps we should include the list of dependencies in the same list, rather than separating it into another method. This way, all the information needed for deployment would be available in a single call.

Example structure:

```
[
    [ns1, namespace, [n1, n2, n3]],  # Namespace containing nodes n1, n2, n3
    [n4, node, [n4]]                 # Single node n4
]
```

Each element in the list would represent one deployment step that the plugin should create. The `full_list_of_nodes` is included for informational purposes and isn't required for execution.
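As a Python sketch of how a deployment plugin might consume such a list (the entries and the `make_task` helper are illustrative, with plain strings standing in for real node objects):

```python
# Topologically sorted grouping: one entry per deployment step, in the
# proposed [object_name, object_type, full_list_of_nodes] format.
grouped = [
    ["ns1", "namespace", ["n1", "n2", "n3"]],  # namespace with nodes n1, n2, n3
    ["n4", "node", ["n4"]],                    # single node n4
]

def make_task(object_name, object_type):
    # Illustrative stand-in for plugin-specific conversion, e.g. rendering
    # a task that runs `kedro run --namespace=...` or `kedro run --node=...`.
    flag = "--namespace" if object_type == "namespace" else "--node"
    return f"kedro run {flag}={object_name}"

# The node list is informational, so it is unused when building the tasks.
tasks = [make_task(name, type_) for name, type_, _nodes in grouped]
print(tasks)  # ['kedro run --namespace=ns1', 'kedro run --node=n4']
```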
We could also consider whether it would be beneficial to avoid coding the logic for handling node grouping separately in each plugin. Instead, we could provide a generic API in Kedro, allowing plugins to query Kedro for what should be deployed. This API could accept optional parameters, such as the type of node grouping, and return a list of objects to be deployed. The plugin’s responsibility would then simply be to take this list and handle the pipeline conversion, streamlining the process.
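To make the idea concrete, here is a minimal sketch of what such a generic grouping API could look like. `grouped_nodes` and its `group_by` parameter are hypothetical, not an existing Kedro API, and nodes are faked as `(name, namespace)` tuples for illustration:

```python
from collections import defaultdict

def grouped_nodes(nodes, group_by="namespace"):
    """Hypothetical generic API: group nodes into deployable units.

    `nodes` is assumed to be topologically sorted already, and that order
    is preserved in the output. Only the "namespace" strategy is sketched.
    """
    if group_by != "namespace":
        raise NotImplementedError(group_by)
    groups = defaultdict(list)
    order = []  # first-seen order of groups, preserving topological order
    for name, namespace in nodes:
        key = (namespace, "namespace") if namespace else (name, "node")
        if key not in groups:
            order.append(key)
        groups[key].append(name)
    return [[obj_name, obj_type, groups[(obj_name, obj_type)]]
            for obj_name, obj_type in order]

nodes = [("n1", "ns1"), ("n2", "ns1"), ("n3", "ns1"), ("n4", None)]
print(grouped_nodes(nodes))
# [['ns1', 'namespace', ['n1', 'n2', 'n3']], ['n4', 'node', ['n4']]]
```

A plugin would then only iterate over the returned list and convert each entry into one task, instead of re-implementing the grouping itself.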
Lastly, a small question: can `node.name` be equal to `namespace.name`? If so, how would the algorithm handle this scenario?
kedro/pipeline/pipeline.py (outdated)

        Returns:
            The pipeline nodes dependencies grouped by namespace.
        """
        node_dependencies_by_namespace = defaultdict(dict)
minor: maybe this should be `defaultdict(set)` to avoid lines 407-409
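For illustration, a set-valued `defaultdict` deduplicates on insert, which is what makes the explicit membership check unnecessary (the data here is made up; the real code operates on node and dependency objects):

```python
from collections import defaultdict

# With defaultdict(set), adding the same dependency twice is a no-op,
# so no "if dep not in ..." guard is needed before inserting.
node_dependencies_by_namespace = defaultdict(set)
for namespace, dep in [("ns1", "d1"), ("ns1", "d1"), ("ns1", "d2")]:
    node_dependencies_by_namespace[namespace].add(dep)

assert node_dependencies_by_namespace["ns1"] == {"d1", "d2"}
```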
Signed-off-by: Ankita Katiyar <[email protected]>
Thanks for addressing the comments, @ankatiyar! It looks good to me - let's see how it works with Airflow!
Description
Draft PR to demo - #4376
Development notes
Add attributes to `pipeline` to get dependencies and nodes grouped by namespace. (The property names are just for the prototype.)

This PR offers a version of the information from the `group_by_namespace()` method added in `kedro-airflow` in kedro-org/kedro-plugins#981.

Questions to be considered:

- Should this live in the `Pipeline` class?
- `kedro run --node=<nodename>` or `kedro run --namespace=<namespace>` for each "task". How do we make this easier?

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist

- `RELEASE.md` file