[Draft] Node grouping #4427
base: main
Conversation
Signed-off-by: Ankita Katiyar <[email protected]>
Thanks for the PR, @ankatiyar! That’s a great algorithm!
I have a few questions and comments:
- I think the `Pipeline` class is an excellent place for this API.
- I agree with you that we should not return just the names but also include object types. This would allow the plugin to understand what exactly needs to be executed (e.g., a node or a namespace).
- Additionally, I think it would be more beneficial to return a topologically sorted list (similar to how `pipeline.nodes` currently works) instead of a dictionary. This way, the list can be executed in order. Specifically, the format could be `[object_name, object_type, full_list_of_nodes]`, where:
  - `object_name`: the name of the object, such as a namespace or a node.
  - `object_type`: the type of the object, either a `namespace` or a `node`.
  - `full_list_of_nodes`: a list of all nodes included under the `object_name`. For example, if the `object_name` is a namespace, it will contain all the nodes within that namespace; if the `object_name` is a node, this list would just contain the node itself.
  - `list_of_dependencies` (to consider): perhaps we should include the list of dependencies in the same list, rather than separating it into another method. This way, all the information needed for deployment would be available in a single call.

Example structure:

```
[
    [ns1, namespace, [n1, n2, n3]],  # Namespace containing nodes n1, n2, n3
    [n4, node, [n4]]                 # Single node n4
]
```

Each element in the list would represent one deployment step that the plugin should create. The `full_list_of_nodes` is included for informational purposes and isn't required for execution.
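As a Python sketch of how a deployment plugin might consume such a list (the entries and the `make_task` helper are illustrative, with plain strings standing in for real node objects):

```python
# Topologically sorted grouping: one entry per deployment step, in the
# proposed [object_name, object_type, full_list_of_nodes] format.
grouped = [
    ["ns1", "namespace", ["n1", "n2", "n3"]],  # namespace with nodes n1, n2, n3
    ["n4", "node", ["n4"]],                    # single node n4
]

def make_task(object_name, object_type):
    # Illustrative stand-in for plugin-specific conversion, e.g. rendering
    # a task that runs `kedro run --namespace=...` or `kedro run --node=...`.
    flag = "--namespace" if object_type == "namespace" else "--node"
    return f"kedro run {flag}={object_name}"

# The node list is informational, so it is unused when building the tasks.
tasks = [make_task(name, type_) for name, type_, _nodes in grouped]
print(tasks)  # ['kedro run --namespace=ns1', 'kedro run --node=n4']
```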
We could also consider whether it would be beneficial to avoid coding the logic for handling node grouping separately in each plugin. Instead, we could provide a generic API in Kedro, allowing plugins to query Kedro for what should be deployed. This API could accept optional parameters, such as the type of node grouping, and return a list of objects to be deployed. The plugin’s responsibility would then simply be to take this list and handle the pipeline conversion, streamlining the process.
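To make the idea concrete, here is a minimal sketch of what such a generic grouping API could look like. `grouped_nodes` and its `group_by` parameter are hypothetical, not an existing Kedro API, and nodes are faked as `(name, namespace)` tuples for illustration:

```python
from collections import defaultdict

def grouped_nodes(nodes, group_by="namespace"):
    """Hypothetical generic API: group nodes into deployable units.

    `nodes` is assumed to be topologically sorted already, and that order
    is preserved in the output. Only the "namespace" strategy is sketched.
    """
    if group_by != "namespace":
        raise NotImplementedError(group_by)
    groups = defaultdict(list)
    order = []  # first-seen order of groups, preserving topological order
    for name, namespace in nodes:
        key = (namespace, "namespace") if namespace else (name, "node")
        if key not in groups:
            order.append(key)
        groups[key].append(name)
    return [[obj_name, obj_type, groups[(obj_name, obj_type)]]
            for obj_name, obj_type in order]

nodes = [("n1", "ns1"), ("n2", "ns1"), ("n3", "ns1"), ("n4", None)]
print(grouped_nodes(nodes))
# [['ns1', 'namespace', ['n1', 'n2', 'n3']], ['n4', 'node', ['n4']]]
```

A plugin would then only iterate over the returned list and convert each entry into one task, instead of re-implementing the grouping itself.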
Lastly, a small question: can `node.name` be equal to `namespace.name`? If so, how would the algorithm handle this scenario?
kedro/pipeline/pipeline.py (outdated)

        Returns:
            The pipeline nodes dependencies grouped by namespace.
        """
        node_dependencies_by_namespace = defaultdict(dict)
minor: maybe this should be `defaultdict(set)` to avoid lines 407-409
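For illustration, a set-valued `defaultdict` deduplicates on insert, which is what makes the explicit membership check unnecessary (the data here is made up; the real code operates on node and dependency objects):

```python
from collections import defaultdict

# With defaultdict(set), adding the same dependency twice is a no-op,
# so no "if dep not in ..." guard is needed before inserting.
node_dependencies_by_namespace = defaultdict(set)
for namespace, dep in [("ns1", "d1"), ("ns1", "d1"), ("ns1", "d2")]:
    node_dependencies_by_namespace[namespace].add(dep)

assert node_dependencies_by_namespace["ns1"] == {"d1", "d2"}
```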
Signed-off-by: Ankita Katiyar <[email protected]>
Thanks for addressing the comments, @ankatiyar! It looks good to me - let's see how it works with Airflow!
Description
Draft PR to demo - #4376
Development notes
Add attributes to `pipeline` to get dependencies and nodes grouped by namespace. (The property names are just for the prototype.)

This PR offers a version of the information from the `group_by_namespace()` method added in `kedro-airflow` in kedro-org/kedro-plugins#981.

Questions to be considered:

- Should this live in the `Pipeline` class?
- `kedro run --node=<nodename>` or `kedro run --namespace=<namespace>` for each "task". How do we make this easier?

Developer Certificate of Origin
We need all contributions to comply with the Developer Certificate of Origin (DCO). All commits must be signed off by including a `Signed-off-by` line in the commit message. See our wiki for guidance.

If your PR is blocked due to unsigned commits, then you must follow the instructions under "Rebase the branch" on the GitHub Checks page for your PR. This will retroactively add the sign-off to all unsigned commits and allow the DCO check to pass.
Checklist

- `RELEASE.md` file