Rectify "modular pipelines" terminology #2723
Comments
From #1147 |
We had extensive discussions about how to refer to pipelines and did some user research. I've looked for the notes but cannot find them: it was a couple of years ago and I think they were on the internal GitHub repo. @yetudada and @idanov may have them, or @merelcht. Either way, I think we should revisit the discussion given that you've found the usage misleading as it currently stands. |
I'm happy to have a look at those notes, but regardless, I think this terminology is unnecessarily complicated as it stands today. It gives the impression that there are 3 kinds of pipelines:
When in fact, there's only one ("pipelines", which under the hood in Kedro are built with the
Maybe let's chat about this next week. |
I would suggest reviewing the modular pipeline as a whole.
The example also uses a new pipeline built around a cooking analogy, which is nice, but the problem is that this pipeline does not exist anywhere. This is an advanced feature and one of the more complicated ones; playing with the pipeline and seeing it in kedro-viz helps a lot in understanding it.
|
I agree with @noklam here in that we should review the modular pipeline as a whole. For smaller pipelines and projects (where there are fewer pipelines in general), there is no actual issue other than the confusing terminology. But for projects with lots of pipelines (and pipelines with lots of nodes), I think there is room for improvement in the concept of a Kedro pipeline itself. In my view, there are 3 points of view to take into account when designing a solution:
Anyway, these are just my thoughts on the topic. |
Thanks @MatthiasRoels for the writeup! About (1), indeed @noklam has some thoughts about this; the granularity issue when deploying Kedro projects is something we want to look into (we have another issue about it but I don't remember which one it is). For (2), I've seen what Kedro Viz looks like for huge projects and it indeed needs more work. And for (3), what do you mean by sub-pipelines without namespaces? |
Cool, I am curious about @noklam's thoughts on this!
This is not what I meant. What I wanted to say was that the concept of namespaces might be complex for some users when you just want to make a subset of nodes reusable as a whole. But I might be wrong on this too! |
For the record (because I keep losing this link): issue in the private repository that collected research around terminology https://github.com/quantumblacklabs/private-kedro/issues/806 |
I need to get better at GitHub notifications, I only saw this in an email today 😅
I guess this is what you mean by using sub-pipelines without namespaces? |
IMO, we need to clarify what should be done from The 1-1 node mapping is a topic that comes up repeatedly, and at this point I think we can agree it is bad in most cases. The logical first step is 1 pipeline = 1 node; of course it varies a lot for deployment, and it also depends on how you structure your pipeline and how granular it is. The serialisation/deserialisation cost goes up with the number of nodes, so reducing the number of nodes should be the first thing to do. Some take the approach of serialising the intermediate data to S3 (or equivalent) for cross-node communication. https://pypi.org/project/vineyard-kedro/ takes this to the next level and optimises it for K8s. The challenge here for Kedro is that, in a single Kedro run, the KedroSession orchestrates the whole run, but in deployment the pieces run separately. So this orchestration step needs to happen before they are sent to the orchestrator. Essentially, when you collapse a pipeline into a node, you want everything to be in-memory and only persist the data that is necessary for communication with other orchestrator nodes. |
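To make the "persist only what crosses orchestrator nodes" idea concrete, here is a minimal sketch in plain Python. It does not use the Kedro API: the `Node` tuple, the `datasets_to_persist` helper, and the pipeline and dataset names are all hypothetical, chosen only for illustration. It assumes the default of 1 pipeline = 1 orchestrator node and computes the minimal set of datasets produced in one group and consumed in another.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical representation: a node is (name, inputs, outputs),
# and each group of nodes is collapsed into one orchestrator node.
Node = Tuple[str, Set[str], Set[str]]


def datasets_to_persist(groups: Dict[str, List[Node]]) -> Set[str]:
    """Return datasets produced in one group and consumed in a different one."""
    # Map every produced dataset to the group that produces it.
    produced_by: Dict[str, str] = {}
    for group, nodes in groups.items():
        for _, _, outputs in nodes:
            for dataset in outputs:
                produced_by[dataset] = group

    # A dataset crosses a boundary when its consumer lives in another group.
    persist: Set[str] = set()
    for group, nodes in groups.items():
        for _, inputs, _ in nodes:
            for dataset in inputs:
                producer = produced_by.get(dataset)
                if producer is not None and producer != group:
                    persist.add(dataset)
    return persist


# Hypothetical project with two pipelines, each one orchestrator node.
groups = {
    "data_processing": [
        ("clean", {"raw"}, {"cleaned"}),
        ("featurise", {"cleaned"}, {"features"}),
    ],
    "data_science": [
        ("train", {"features"}, {"model"}),
    ],
}
boundary = datasets_to_persist(groups)
```

In this toy project only `features` crosses a group boundary, so `cleaned` could stay in memory inside its orchestrator node. This is the minimal set only; in practice one would likely want to persist additional datasets on top of it.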
That’s exactly what I meant! |
Absolutely agree! But on the kedro side, some prep work can definitely be done that can be re-used in different plugins
Assuming you are talking about orchestrator nodes, that’s exactly what you want to do. IMO, an object store (S3, GCS, MinIO, …) should work fine for the majority of use cases!
That’s not necessarily true. You need to persist at least all datasets required in other orchestration nodes. But that doesn’t mean you don’t need to persist other datasets! I would imagine some sort of |
I have always wanted to specify which data to persist (or keep in memory) at runtime without touching the catalog; that's for interactive workflows.
True, I focused on the minimal data that is required; of course, in practice you want to customise. This is consistent with defaulting to 1 pipeline = 1 orchestrator node, where you may want to further collapse pipelines or you may need more granularity. So this should be the default if no config is given. |
A bit of a braindump here, but if I think of an easy example where I have a kedro project consisting of 2 pipelines
Actually, there is a fourth option but that should result in a "compile" error: the scenario where So kedro core (not a plugin) needs to figure out the correct order of execution as well as the exact kedro cmd required to run pipeline I see two potential starting points:
|
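As a rough illustration of what "figuring out the correct order of execution" between pipelines could look like, here is a minimal sketch using the standard library's `graphlib`. It is not how Kedro does or would do it: the `execution_order` helper and the pipeline and dataset names are hypothetical, and each pipeline is reduced to the datasets it consumes and produces.

```python
from graphlib import TopologicalSorter

# Hypothetical pipelines: name -> (datasets consumed, datasets produced).
pipelines = {
    "reporting": ({"model"}, {"report"}),
    "data_science": ({"features"}, {"model"}),
    "data_processing": ({"raw"}, {"features"}),
}


def execution_order(pipelines):
    """Order pipelines so that every producer runs before its consumers."""
    # Map each dataset to the pipeline that produces it.
    producer = {
        dataset: name
        for name, (_, outputs) in pipelines.items()
        for dataset in outputs
    }
    # Each pipeline depends on the producers of the datasets it consumes.
    graph = {
        name: {
            producer[dataset]
            for dataset in inputs
            if dataset in producer and producer[dataset] != name
        }
        for name, (inputs, _) in pipelines.items()
    }
    # static_order() raises graphlib.CycleError on circular dependencies.
    return list(TopologicalSorter(graph).static_order())


order = execution_order(pipelines)
```

A circular dependency between pipelines makes `static_order()` raise `graphlib.CycleError`, which is one way such a check could fail early instead of at run time.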
This conversation branched off quite a bit, so I'll try to center the main question again: can somebody explain to me like I'm 5 years old what makes a "modular pipeline" different from a "pipeline"? |
And more:
So, if I'm correct, "a pipeline" and "a modular pipeline", depending on context, might be two entirely different categories of things: the former a Python class, the latter a directory structure. Furthermore: a modular pipeline contains a pipeline ( And this is where this terminology, in my opinion, falls apart: a "modular pipeline" is not a Not that I have better ideas now (and also I don't want to boil the ocean), but I wanted to at least give my interpretation. |
A bit more insight on modular pipelines https://github.com/quantumblacklabs/private-kedro/issues/752#issuecomment-736680109 (private link) (@idanov if you consent, you could copy-paste that comment here) |
I'm removing the documentation label from this as we have a docs task (#1998) to cover improvement of docs about modular pipelines. This ticket (to my mind) covers the philosophy of how we talk about modular pipelines and the language we want to use in communicating to users. It needs to happen ahead of the docs and then, when all is agreed, the docs can be overhauled. So #1998 is dependent on this (a "child" if you like) but this isn't a docs ticket. |
After we merge #3948, I think the only things left are doing one last pass on the Kedro Frameworks docs and reviewing the Kedro-Viz ones. As far as I understand (after 1 year of chewing on this issue), Kedro-Viz mostly cares about 2 things:
Since Kedro-Viz doesn't really have a user guide, there is not much to review. The word "modular" appears exactly once in the docs: The codebase is another thing though. kedro-org/kedro-viz#1941 refers to "modular pipelines" and so do all the Python classes, but it's actually talking about namespaces. I reckon that doing a Search & Replace might have big, unintended consequences (cc @rashidakanchwala) so it's probably not worth the effort, but at least user-facing documentation should make the concepts crystal clear. |
And change our tutorial too: https://docs.kedro.org/en/stable/tutorial/add_another_pipeline.html#modular-pipelines |
So, long story short:
|
Moving this back to our Inbox so we can re-prioritise. |
Opened a child issue exclusively about deciding on a better name: #4016 |
Description
We're making various distinctions in our documentation about "Pipelines" and "Modular pipelines", for example in the TOC:
And in our wording:
To the point that I believed namespaces were the same as modular pipelines.
However, it turns out that Pipelines and Modular Pipelines are mostly the same thing, and that kedro.pipeline.modular_pipeline.pipeline is not a wrapper over kedro.pipeline.pipeline: they're the same function.
kedro/kedro/pipeline/__init__.py
Line 5 in 160fd6b
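The difference between "a wrapper over" and "the same function" can be checked with an identity comparison. The sketch below does not import Kedro: `modular_pipeline` and `package` are hypothetical stand-in modules built with `types.ModuleType`, and the toy `pipeline` factory is an assumption that only mimics the re-export pattern described above.

```python
import types

# Hypothetical stand-in for the module that defines the factory function.
modular_pipeline = types.ModuleType("modular_pipeline")


def pipeline(*nodes, namespace=None):
    """Toy factory: returns node names, prefixed when a namespace is given."""
    prefix = f"{namespace}." if namespace else ""
    return [f"{prefix}{node}" for node in nodes]


modular_pipeline.pipeline = pipeline

# Re-export: the package-level name is bound to the very same object.
package = types.ModuleType("package")
package.pipeline = modular_pipeline.pipeline


def wrapped_pipeline(*nodes, **kwargs):
    """What a genuine wrapper would look like: a distinct function that delegates."""
    return modular_pipeline.pipeline(*nodes, **kwargs)


same_object = package.pipeline is modular_pipeline.pipeline      # re-export
wrapper_is_same = wrapped_pipeline is modular_pipeline.pipeline  # wrapper
```

Here `same_object` is `True` and `wrapper_is_same` is `False`. Running the equivalent `is` check against the two real import paths would be a quick way to confirm the claim for a given Kedro version.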
This is also related to this comment, which I didn't fully understand back then: #2402 (comment)
Context
It's a key concept for reusability that many users rely on.
Possible Implementation
pipeline_registry.register_pipelines) and some of which aren't.
kedro.pipeline.modular_pipeline.pipeline and just use kedro.pipeline.pipeline everywhere. (xref Simplify api hierarchy #712)
Possible Alternatives
There may be less disruptive paths, but I can't think of alternative ways of rectifying the current terminology.