Users cannot install specific components of Kedro separately #3659
Comments
Thanks for the writeup! Addressing this was also on my list. Would be interesting to be pragmatic and compare this proposal (elegant) to feature-gating functionality as is being done with jupyter. The biggest con in my opinion is the lack of backwards compatibility in |
Thanks for the validation @sbrugman !
On the contrary, my idea is that this is completely backwards compatible:
Does that address your concerns?
Yep that's something to be discussed and evaluated. |
@astrojuanlu Sorry for the misunderstanding. "the biggest con ..." was referring to the alternative solution of using |
Oh sorry I didn't read your sentence:
|
The other point here is that as much as we'd like everyone to upgrade to 0.19.x ASAP, that doesn't work in practice and in many cases people like to keep production deployments static. The situation we find ourselves in today is that lots of people still pull 0.18.x and that now has outdated dependencies which (1) crop up on enterprise scanning tools like Sonarqube (2) We don't retroactively patch. The great unbundling of Kedro will allow us to be much more dynamic and potentially support individual components better and for longer. |
Kedro is a rather small framework by all standards, e.g. running Most issues with dependencies as always have been coming from the datasets and now that is largely solved by splitting them out of Kedro as a separate package. And just to illustrate that, before the move, our Snyk score was in the low 80s, check our Snyk score now: https://snyk.io/advisor/python/kedro (on par with Prefect and MLFlow and better than ZenML). |
I wanted to link this conversation #1758 where we considered turning Kedro into a meta-package when separating out I personally still feel that the meta package structure introduces more burden than it offers benefits to our users. I think the points brought up are very valid, but they all seem to be coming from a very specific user group: those that have developed Kedro plugins and/or are very advanced users that use Kedro as a library. From my understanding, this is a vocal but also a minority group of our users and I would like to hear other perspectives before betting on splitting Kedro into sub components. I also hesitate to commit to an overhaul like this, because it would be a significant effort that doesn't introduce any new functionality to Kedro. It would just be a restructure of what we have and I question whether that's enough to gain new users and remain relevant as a tool. |
Thanks a lot for this, will have a look.
I'm not so sure about that. There might be some survivorship bias at play here - IOW, people who see Kedro and are discouraged by its approach to dependencies but never complain to us. By its own nature, this would be very difficult to detect. We need a leap of imagination. Note that this is not an all-or-nothing approach. We can start by spinning off Splitting this into 7 packages is a proposed solution, but it's not necessarily the only one. I feel the discussion is too centered on that proposed solution rather than the user problem, which I argue has plenty of supporting evidence. |
@datajoely gave some feedback and I reworked the top comment a bit. Changes:
|
We introduced this topic in yesterday's Tech Design session: I gave some views on the broader context, described the user problem, and collected some feedback.
Summary
Some ideas were proposed, broadly falling under 3 categories:
Some concerns were raised:
Key concerns
Kedro is not heavy
I presented abundant qualitative evidence in #3659 (comment) that our users perceive Kedro as heavyweight. Some of those comments are old, but have been unaddressed. Some other comments are recent. 📣 We don't get to decide how our users perceive our product 📣. We can only influence their perception.1
I also presented quantitative evidence that makes it crystal clear that Kedro is too heavyweight in terms of dependencies. I purposely did not discuss SLOC (Source Lines Of Code) or byte size because those are uninteresting metrics. Re-stating the quantitative evidence here:
`pipdeptree -p kedro`
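For readers without `pipdeptree` at hand, a rough count of transitive dependencies can be sketched with the standard library alone. This is a minimal sketch, not the method used to produce any numbers in this thread: the function names are made up for illustration, requirements behind an `extra` are skipped, and other environment markers (such as `python_version`) are not evaluated, so the result can differ from what `pipdeptree` reports.

```python
import re
from importlib import metadata

# Matches the distribution name at the start of a requirement string.
_NAME = re.compile(r"^\s*([A-Za-z0-9][A-Za-z0-9._-]*)")


def _normalize(name):
    """Normalize a distribution name (lowercase, runs of -_. become '-')."""
    return re.sub(r"[-_.]+", "-", name).lower()


def direct_requirements(dist_name, requires=metadata.requires):
    """Return the names of direct requirements that are not behind an extra."""
    try:
        specs = requires(dist_name) or []
    except metadata.PackageNotFoundError:
        return set()
    names = set()
    for spec in specs:
        requirement, _, marker = spec.partition(";")
        if "extra" in marker:
            continue  # optional dependency group, not installed by default
        match = _NAME.match(requirement)
        if match:
            names.add(_normalize(match.group(1)))
    return names


def transitive_requirements(dist_name, requires=metadata.requires):
    """Walk the requirement graph and return every reachable distribution."""
    root = _normalize(dist_name)
    seen = set()
    stack = [root]
    while stack:
        current = stack.pop()
        for dep in direct_requirements(current, requires):
            if dep not in seen:
                seen.add(dep)
                stack.append(dep)
    seen.discard(root)
    return seen


if __name__ == "__main__":
    # Prints 0 if kedro is not installed in the current environment.
    print(len(transitive_requirements("kedro")))
```

The `requires` parameter exists only so the walk can be exercised against a fake dependency graph without installing anything.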
Users can ignore what they don't want
I have presented abundant qualitative evidence in #3659 (comment) that this is not how people vet their open source dependencies. There is also quantitative evidence that our monolithic approach to dependency management causes extensive user pains. For example: #1807, #1752, #1733, #681, and large parts of #3094, just to name a few. Kedro is not a monopoly: there are plenty of adjacent open source frameworks whose functionality intersects or directly competes with Kedro. If we force users to make decisions against their will, they might as well use something else.
We have already done enough
We have already done very important things. Spinning off But, ironically,
This is difficult to maintain
It's easy to grasp the idea that more repositories or more subprojects in monorepos somewhat increase the complexity. There are two sources of complexity:
My proposal to move forward with this is:
Footnotes
|
For me there are some really critical parts of this, which I've ranked by 'unbundling' effort/impact
ChatGPT generated the following and I (read possible dum dum) think it looks neat:
import subprocess

def create_project_from_template(template_url, project_name):
    # Install Cookiecutter if not already installed (assumes pipx is on PATH)
    subprocess.run(["pipx", "install", "cookiecutter"])
    # "-o"/"--output-dir" only sets the target directory, so the project name goes in as
    # extra context (assumes the template defines a "project_name" variable)
    subprocess.run(["cookiecutter", template_url, "--no-input", "-f", "--output-dir", ".", f"project_name={project_name}"], check=True)

# Example usage
create_project_from_template("https://github.com/audreyr/cookiecutter-pypackage.git", "my_project")
|
@astrojuanlu Thanks for the writeup and the tech design session from last week. Let me comment on some of the arguments you brought forward in favour of splitting Kedro into smaller packages.
Kedro is (not) heavy
I followed all links and I am still unconvinced by the evidence, a lot of it has subjective opinions on what we should do. What I'd rather see is what actual problem we are causing by not doing the suggested actions.
I would really love to have a compilation of "whys" in order to have a conversation grounded in facts and not opinions.
Yes, and that's a communication challenge mostly, which we still haven't addressed. It will not be solved by splitting or not splitting Kedro, but with proper comms. Including the orchestrator confusion. If we ourselves label Kedro as heavyweight, which imo is fully undeserved, then we only add to the problem instead of solving it.
It is not crystal clear, to me at least. Why are those uninteresting? This is quite an important metric for determining if something is heavyweight. When saying that "kedro is a huge monolith" and "heavyweight", we need to define many words here, like huge, monolith and heavyweight. Because I, for one, certainly measure the size of software projects in terms of how many lines of code they are, having worked on projects from a couple of hundred lines to a couple of hundred thousand lines and knowing the difference in complexity at either extreme. By that measure, Kedro is tiny, not huge. Here's a table to illustrate that:
As you can see, Kedro is definitely not an outlier from comparable frameworks, and I'd argue it's the most lightweight one (bar The only valid discussion arising from this table (and your research) is if it has too many direct dependencies, and the answer probably is maybe and certainly not a definitive yes. However, this issue can very easily be solved without breaking Kedro up into tiny packages, by carefully examining what we really need as a mandatory dependency and what can be optional. Here's just my list of dependencies that can either be removed or made optional easily:
That's already 11 removed, which will make us second only to
Users can(not) ignore what they don't want
As I've pointed out, the evidence you've shown only beats around the bush of the actual problem, not revealing the problem itself. We need a double-click on all of the evidence you have provided and this is what we are lacking, although we can all suppose that the problem people face is indeed the list of direct dependencies, but that'd be an educated guess, not a fact.
I don't see the connection between those issues to dependency management. There's a lot about CLI (
I am unsure what you mean by that. How can you force someone to ignore something? A package or module in Python is unused if not imported. I am pulling a number out of thin air here, but I won't be surprised that most packages have no more than 20% of their API being utilised by any given application. Does that mean that each package forces the users to ignore the other 80% of their API footprint? People have agency over importing modules, we can't automagically force them to import the
Sure it isn't, but this comment is not very constructive. By looking at the table above - comparable tools don't seem to be too bothered by being huge and monolithic either...
True, but we are not really forcing anything on them. Our users can make a good judgement of the trade-offs of different packages, including Kedro. We can only try to convince them that Kedro is suitable for their use case, but that happens through a good community with lots of examples and education materials. And, obviously, try to improve their experience where we can, but first we need to fully understand the failure modes.
We have (not) already done enough
As far as I am aware, you can install whatever you want fairly independently: https://github.com/kedro-org/kedro-plugins/blob/main/kedro-datasets/pyproject.toml
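To check which optional groups an installed distribution actually declares, the `Provides-Extra` metadata field can be read with the standard library. A minimal sketch, assuming the distribution of interest (for example `kedro-datasets`) is installed in the current environment; `declared_extras` is a hypothetical helper name, not part of Kedro.

```python
from importlib import metadata


def declared_extras(dist_name):
    """Return the sorted extras a distribution declares, or [] if not installed."""
    try:
        meta = metadata.metadata(dist_name)
    except metadata.PackageNotFoundError:
        return []
    # "Provides-Extra" appears once per optional dependency group.
    return sorted(meta.get_all("Provides-Extra") or [])


if __name__ == "__main__":
    # Each printed name can be requested as `pip install "kedro-datasets[<name>]"`.
    for extra in declared_extras("kedro-datasets"):
        print(extra)
```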
Again, less huge monolith of 7K lines of Python code doesn't sit right with me. I suspect we have different definitions of huge and monolith. A monolith in software engineering is not defined only by its dependencies, so the usage of those words from where I stand is incorrect and causes confusion and unnecessary disagreement, apart from leading us towards potentially unsuitable solutions. Finally, I don't recall anyone claiming we've done enough. Absolutely, we can do more! We should always look out for small improvements that amount to a bigger change at the end. One such small improvement is removing unneeded dependencies.
This is difficult to maintain
Even with the best design and great automation, over the years entropy is increasing, team members rotate and forget, make automated things less automated and so on. Coordinating the releases of 2 things is easier than coordinating the releases of 10 things. Currently the team experiences only a taster of that by releasing And that's not surprising, the probability of something going wrong is Let's agree to include this in the equation when deciding if splitting is the right solution for the problem, when we find an agreement on what the actual problem is. As a conclusion, let's start with the problem and then define a solution and not the other way around. So in that spirit, what is preventing users from installing Kedro as is and use only the |
TIL: MLFlow publishes a separate package |
From a related discussion at FastAPI fastapi/fastapi#11525 (reply in thread)
Top upvoted comment in the thread. |
Sequence of events in FastAPI:
And yet, Kedro is even bigger than the new big FastAPI:
About
@idanov I don't know how to say this in a way that it doesn't sound bad or confrontational, but this was wrong on April 14th, and is still wrong today. Everyone is entitled to their opinion but I'd rather not mix opinions with facts in this way. I feel that the push to conduct a full-fledged research stream on why developers don't want fat dependencies is just delaying the inevitable. In any case, #3884 is already exploring options to make |
With a purely 80/20 hat on: Big wins:
Stuff we can easily kill:
Already on chopping block:
|
Description
https://linen-slack.kedro.org/t/16626961/hello-i-m-currently-building-a-python-package-that-derives-s#3e2cc457-1184-4b2d-b7b3-7cab59fa1fa0
Over the years, some Kedro power users and also folks evaluating the project have made it clear that they really like specific parts of Kedro, such as the catalog and the configuration loading, and they don't care much about the rest. See also #2741 (2021), #2898 (comment), https://linen-slack.kedro.org/t/16593946/is-there-a-way-of-installing-only-the-data-catalog-part-of-k#b6d532c4-2d7f-4add-b0ee-b0bfcffbdd5e, #2409 (comment) ("kedro is scary") and many more.
What has been done so far
We have already done a lot to make Kedro leaner and simpler. For example, Kedro 0.19 dropped datasets, which now have to be installed separately as `kedro-datasets` #2126. This has been unanimously celebrated.
In addition, we also dropped some CLI commands that were rarely used #1616. There were some users that raised concerns, but so far we haven't heard more concerns.
We also improved our packaging infrastructure to avoid having explicit dependencies on `pip` and `setuptools` #2350; again, this has been received with silent approval.
Why is this still a problem?
And yet, this is not enough.
In principle nothing prevents these users from doing `pip install kedro` and using only the parts they need. This was the argument @idanov and I defended in #2409 (comment)
However, this is not how most folks vet open source dependencies. In this absolutely fantastic survey from @/simonw, of Django fame, on Mastodon, people generally favour lean packages over bloated ones, and minimal dependencies over a large number of them. Some quotes:
For users that would like to use only our Data Catalog and/or only our Configuration Loader, they would be forced to:
- `kedro new` (which is only used once per project!), including `cookiecutter`, `gitpython`, and `pre-commit-hooks` (which we flagged as a "weird" dependency in 52d5189#r135274526). I made the case of separating it into a new package already in "Run `kedro new` without creating a new directory" #681 (comment)
- `click` and `rich`. We are already tracking making `rich` optional in "Make rich optional / not a core dependency of `kedro`" #2928, but there are no plans to remove `click` [1].
- `build` and `rope`. We are already considering what to do with micropackaging https://github.com/kedro-org/kedro/milestone/21, but for the moment it will stay there, idling.
- `importlib_resources`, `toposort` (which could be going away after we drop 3.8 support thanks to `graphlib`), possibly `pluggy`.
Proposed solution
The solution I propose is making `kedro` a meta-package that would depend on smaller packages.
Specifically, we could start with 2: `kedro-catalog` and `kedro-new`. Why?
The case for splitting `kedro-catalog`
From the links above:
#2741
#2898 (comment)
https://linen-slack.kedro.org/t/16593946/is-there-a-way-of-installing-only-the-data-catalog-part-of-k#b6d532c4-2d7f-4add-b0ee-b0bfcffbdd5e
The case for splitting `kedro-new`
`kedro new` is a weird command for at least 2 reasons:
- `kedro new` in every deployment target, production setting, cloud platform etc for projects that are already created.
  From the `kedro-boot` docs:
  Why should an app like this carry all the weight of unneeded commands?
- `kedro new` without creating a new directory #681 (comment),
Future developments
If we are happy with the initial iterations we could take this idea further and make more smaller packages. Something like:
Advantages
Considerations
`pip install kedro` would get regular users exactly the same.
Implementation idea
We could even retain imports by leveraging PEP 420 implicit namespace packages https://packaging.python.org/en/latest/guides/packaging-namespace-packages/
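How implicit namespace packages would let separately installed distributions share one import root can be illustrated with a self-contained sketch. Everything here is an assumption for illustration: the namespace is called `demo_ns` rather than `kedro`, the two "distributions" are just temporary directories, and nothing below reflects how Kedro is actually packaged. The key point is that neither directory contains a `demo_ns/__init__.py`, so Python merges both into a single namespace.

```python
import importlib
import sys
import tempfile
from pathlib import Path


def make_distribution(root, subpackage, body):
    """Create <root>/demo_ns/<subpackage>/__init__.py without demo_ns/__init__.py."""
    package_dir = Path(root) / "demo_ns" / subpackage
    package_dir.mkdir(parents=True)
    (package_dir / "__init__.py").write_text(body)


def demo():
    """Import two subpackages that live in different directories under one namespace."""
    with tempfile.TemporaryDirectory() as dist_a, tempfile.TemporaryDirectory() as dist_b:
        make_distribution(dist_a, "catalog", "NAME = 'catalog'\n")
        make_distribution(dist_b, "config", "NAME = 'config'\n")
        sys.path[:0] = [dist_a, dist_b]
        importlib.invalidate_caches()
        try:
            catalog = importlib.import_module("demo_ns.catalog")
            config = importlib.import_module("demo_ns.config")
            return catalog.NAME, config.NAME
        finally:
            # Undo the path and module cache changes so the demo leaves no trace.
            sys.path.remove(dist_a)
            sys.path.remove(dist_b)
            for name in [m for m in sys.modules if m == "demo_ns" or m.startswith("demo_ns.")]:
                del sys.modules[name]


if __name__ == "__main__":
    print(demo())
```

One design consequence worth noting: a namespace root cannot ship its own `__init__.py`, so anything currently exposed at the top level of an existing package would have to move into a subpackage for this layout to work.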
- `pip install kedro-framework` (does not carry the CLI at all!)
- `pip install kedro-catalog kedro-config`. They only use the part they like and forget about the rest.
- `kedro` metapackage might as well do more strict pinning (and if folks don't like it, they can install the individual components themselves!)
Alternative solutions
One alternative solution is to reject the metapackage approach, tell users to keep installing the full `kedro`, and continue gradually reducing the number of dependencies. The disadvantage is that `pip install kedro` will keep containing parts some users will not want, hence not addressing the core issue.
Footnotes
Although quite honestly, Packaged kedro pipeline does not work well on Databricks #1807 and https://github.com/pallets/click/issues/2249 made me question some Click design choices, and https://github.com/kedro-org/kedro-plugins/pull/552 made me think that definitely I'd love to find a modern alternative. ↩