Tracking dependencies by artifact's creation time like GNU make #528
Comments
Hi @sergun, thank you for opening the feature request. Could you please elaborate a bit more on what kind of feature you are looking for? There's a template for feature requests, and it would be great if you could fill it in. |
Description: Currently kedro can only compute pipelines from scratch or with manually specified settings (you can specify the nodes from which to start computing, or you can ask kedro to compute only missing datasets via Runner.run_only_missing). Context: In many use cases the GNU make model works better. As you know, make automatically tracks dependencies between artifacts (as kedro does), but it also tracks each artifact's mtime and works out which artifacts must be re-created by comparing the mtime of the output artifact with the mtimes of its dependencies. Possible Implementation: No concrete ideas. But I think we could integrate mtime into the Dataset class and add the ability to select which nodes to run based on the mtimes of each node's input/output Datasets. Possible Alternatives: I do not see any. |
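For illustration, the make-style rule described above could be sketched roughly as follows. This is a minimal sketch under stated assumptions, not kedro's API: it assumes every dataset is a local file whose mtime can be read with `os.path.getmtime`, that nodes are given in topological order, and that all input files exist. The names `is_stale` and `nodes_to_run` are hypothetical helpers introduced only for this example.

```python
import os
from typing import Iterable, List, Sequence, Tuple

# Hypothetical node description: (name, input paths, output paths).
Node = Tuple[str, Sequence[str], Sequence[str]]


def is_stale(inputs: Iterable[str], outputs: Iterable[str]) -> bool:
    """Return True if a node must be re-run under a make-like rule."""
    outputs = list(outputs)
    # A node with no outputs, or with a missing output, always runs.
    if not outputs or any(not os.path.exists(p) for p in outputs):
        return True
    oldest_output = min(os.path.getmtime(p) for p in outputs)
    # Re-run if any input was modified after the oldest output was written.
    return any(os.path.getmtime(p) > oldest_output for p in inputs)


def nodes_to_run(nodes: Sequence[Node]) -> List[str]:
    """Select nodes to run; `nodes` is assumed to be in topological order."""
    dirty = set()  # outputs that will be rewritten during this run
    selected = []
    for name, inputs, outputs in nodes:
        # A node runs if it is stale itself or consumes a rewritten output.
        if dirty.intersection(inputs) or is_stale(inputs, outputs):
            selected.append(name)
            dirty.update(outputs)
    return selected
```

Propagating the `dirty` set is what makes downstream nodes re-run when an upstream node is re-run, even though the downstream outputs on disk still look newer than their inputs at selection time.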
For me, it would be really great to add such functionality! |
I am not 100% sure that I follow what you are looking for, but my feeling so far is that this feature can be achieved through hooks. All of the kedro DataSets are currently backed by fsspec; a quick scan of its API revealed a .modified() method that returns the modification timestamp of a path. I am not exactly sure how you would get to this inside of a hook, though. You can access the kedro dataset instance dynamically by using Once you have the time, what do you need to do with it? I am not familiar with GNU make and how it uses mtime to avoid recreating artifacts that do not need to be rebuilt. This would be another great application for #400: if I could set a default update frequency (daily) and override that frequency in my catalog to tell kedro that a given dataset is only refreshed on another schedule (weekly/monthly/cron expression?), then kedro could figure out whether it is time to update or not. |
Thanks @WaylonWalker! But I do not see how to skip execution of a node's processing function from a hook. |
I've quickly hacked together (with emphasis on hack) a prototype of what a hook would look like that enables something like this: https://gist.github.com/mzjp2/076bfd73b0215bda01ee71186966389d |
That is a really cool hook, @mzjp2! If we could tag nodes with a run frequency and combine that with this hook, it would be easy to blindly run everything and only update out-of-date nodes. |
I see now. What if you grabbed the AST or bytecode of each node's function and cached it? Then you could check whether the function itself has changed since the last run, or whether the input data has changed since the last run. |
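One way the bytecode idea might look is sketched below. It is a rough illustration only, not part of kedro: `code_fingerprint` and `has_changed` are hypothetical names, the cache is a plain dict that a real setup would have to persist between runs, and hashing bytecode only detects changes in the function's own body (not in helpers it calls, and not reliably across different Python versions).

```python
import hashlib
import types


def _feed(code: types.CodeType, digest) -> None:
    """Feed a code object's bytecode, names and constants into the digest."""
    digest.update(code.co_code)
    digest.update(repr(code.co_names).encode())
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            # Nested functions/lambdas: recurse instead of using repr(),
            # because repr() of a code object embeds a memory address.
            _feed(const, digest)
        else:
            digest.update(repr(const).encode())


def code_fingerprint(func) -> str:
    """Return a hex digest that changes when the function body changes."""
    digest = hashlib.sha256()
    _feed(func.__code__, digest)
    return digest.hexdigest()


def has_changed(node_name: str, func, cache: dict) -> bool:
    """Compare against the cached fingerprint, then update the cache."""
    new = code_fingerprint(func)
    old = cache.get(node_name)
    cache[node_name] = new
    # A node with no cached fingerprint counts as changed.
    return old != new
```

A node would then be re-run when either `has_changed` reports a new fingerprint or its input data is newer than its outputs.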
I found that a parallel conversation is happening over on kedro.community: https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90 |
@WaylonWalker thanks for the link! |
@mzjp2 thanks a lot! Great job! |
I think I'd like to get my hands dirty with this one. I'll look into this in the context of Hacktoberfest. I'll make a draft PR where I reference this issue and the discussion in kedro.community. Should be a fun one. |
@pascalwhoop I think it would be one of the most significant contributions to kedro 🥇 From my experience with Makefile-based ML workflows, I can say it is really cool when you do not need to think about which nodes should be executed after some change (of params or data, or maybe code). BTW, in the Makefile-based solution the parameters were also files, and they were included in the make recipes as dependencies. |
This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions. |
I'm sure we really need this.
It would provide the ability to resume a computation based on the availability of artifacts and their creation times.
Without this feature kedro computes the entire pipeline from scratch, and that is not suitable in many cases.