
Tracking dependencies by artifact's creation time like GNU make #528

Closed
sergun opened this issue Sep 27, 2020 · 14 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@sergun commented Sep 27, 2020

I'm sure we really need this.
It would provide the ability to resume computation based on the availability of artifacts and their creation time. Without this feature Kedro recalculates the entire pipeline from scratch, and that is not suitable in many cases.

@sergun sergun added the Issue: Feature Request New feature or improvement to existing feature label Sep 27, 2020
@921kiyo (Contributor) commented Sep 28, 2020

Hi @sergun, thank you for opening the feature request. Could you please elaborate a bit more on what kind of feature you are looking for? There's a template for feature requests, and it would be great if you could fill it in.
Which calculations would you like to track: the runtime of pipeline execution, IO operations, etc.?

@sergun (Author) commented Sep 28, 2020

Description

Currently Kedro can only calculate pipelines from scratch or with some manually specified settings (you can specify the nodes from which to calculate everything, or ask Kedro to calculate only the missing datasets via Runner.run_only_missing).

Context

In many use cases the GNU make model works better. As you know, make automatically tracks dependencies between artifacts (as Kedro does), but it also tracks the artifacts' mtime and understands which artifact should be re-created by comparing the mtime of the output artifact with the mtimes of its dependencies.
Such a model is suitable e.g. for feature engineering, when you have a lot of SQL scripts, each of which creates some temporary table. The tables are joined, and finally you get a resulting table with feature values. It is nice that a data scientist can modify some script and make will automatically understand that the table created by this script should be re-created, and all tables that depend on it should also be recalculated, while the other tables stay untouched. This use case is completely unsupported by Kedro.

Possible Implementation

No concrete ideas yet, but I think we could integrate mtime into the Dataset class and add the ability to select the nodes to be calculated based on the mtimes of each node's input/output Datasets.
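The make-style rule described here can be sketched in a few lines of plain Python. This is a sketch only: `node_is_stale` is a hypothetical helper, not Kedro API, and a real implementation would presumably live on the Dataset/Runner side.

```python
import os

def node_is_stale(input_paths, output_paths):
    """Make-style check: a node must re-run if any of its outputs is
    missing, or if any input is newer than the oldest output."""
    if not all(os.path.exists(p) for p in output_paths):
        return True
    newest_input = max(os.path.getmtime(p) for p in input_paths)
    oldest_output = min(os.path.getmtime(p) for p in output_paths)
    return newest_input > oldest_output
```

A runner could walk the DAG in topological order, run only the nodes for which this returns True, and treat every re-created output as "new" for its downstream nodes, which is exactly how make cascades rebuilds.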

Possible Alternatives

I do not see any.

@DChulok commented Sep 29, 2020

For me it would be really great to add such functionality!

@WaylonWalker (Contributor) commented Sep 30, 2020

I am not 100% sure that I follow what you are looking for, but my personal feeling so far is that this is a feature that could be achieved through the use of hooks.

All of the Kedro DataSets are currently backed by fsspec; a quick scan of its API revealed a .modified() method that returns a timestamp for a given path. I am not exactly sure how you would get to this inside a hook, though.

You can access a Kedro dataset instance dynamically via getattr(catalog.datasets, 'dataset_name'). Inside the dataset instance you will find all sorts of information about your dataset. I was able to get to the modified time of my datasets, but it did not appear to leverage fsspec's filesystem-agnostic methods; instead it seemed specific to the filesystem type I was using.

Once you have the time, what do you need to do with it? I am not familiar with GNU make or with how it uses mtime to avoid re-creating artifacts that do not need to be remade.

This would be another great application for #400: I could set a default update frequency (daily) and override it in my catalog to say that a given dataset is only refreshed weekly/monthly/on a cron expression. Then Kedro could figure out whether it is time to update or not.
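The frequency idea might look something like this. A sketch only: `needs_refresh` and the `FREQ_SECONDS` table are hypothetical helpers, not part of Kedro or of #400.

```python
import os
import time

# Hypothetical mapping from a catalog-level frequency tag to seconds.
FREQ_SECONDS = {"daily": 86400, "weekly": 7 * 86400, "monthly": 30 * 86400}

def needs_refresh(output_path, frequency="daily", now=None):
    """True when the output is missing or older than its refresh window."""
    if not os.path.exists(output_path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(output_path) > FREQ_SECONDS[frequency]
```

A catalog entry could then carry something like a `refresh: weekly` field that gets looked up before the node runs.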

@sergun (Author) commented Oct 1, 2020

Thanks @WaylonWalker !

But I do not see how to skip execution of a node's processing function from a hook.
The idea is not to call this function if the mtimes of a node's input datasets are earlier than the mtimes of its output datasets and those outputs exist.
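The rule described here can be expressed as a plain-Python wrapper around a node's function. This is a sketch under stated assumptions: `make_skippable` is hypothetical, and whether a hook can install such a wrapper is exactly the open question in this thread.

```python
import os

def make_skippable(func, input_paths, output_paths):
    """Wrap a node function: skip the call when every output exists and
    no input is newer than the oldest output. The caller is expected to
    fall back to the persisted outputs when the wrapper returns None."""
    def wrapper(*args, **kwargs):
        if all(os.path.exists(p) for p in output_paths):
            newest_in = max(os.path.getmtime(p) for p in input_paths)
            oldest_out = min(os.path.getmtime(p) for p in output_paths)
            if newest_in <= oldest_out:
                return None  # outputs are fresh: signal "load, don't compute"
        return func(*args, **kwargs)
    return wrapper
```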

@mzjp2 (Contributor) commented Oct 1, 2020

I've quickly hacked together (with emphasis on hack) a prototype of what a hook would look like that enables something like this: https://gist.github.com/mzjp2/076bfd73b0215bda01ee71186966389d

@WaylonWalker (Contributor):
That is a really cool hook @mzjp2! If we could tag nodes with a run frequency and combine it with this, it would be easy to blindly run everything and only update the out-of-date nodes.

@WaylonWalker (Contributor):
> Thanks @WaylonWalker !
> But I do not see how to skip execution of a node's processing function from a hook. The idea is not to call this function if the mtimes of a node's input datasets are earlier than the mtimes of its output datasets and those outputs exist.

I see now. What if you grabbed the AST or bytecode of each node's function and cached it? Then you could check whether the function itself has changed since the last run, as well as whether the input data has changed.
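One way to sketch the bytecode idea (`func_fingerprint` is a hypothetical helper; note that CPython bytecode is version-specific, so cached fingerprints are only comparable within one interpreter version):

```python
import hashlib

def func_fingerprint(func):
    """Fingerprint a node function by hashing its compiled bytecode and
    constants; a changed fingerprint means the code itself changed."""
    code = func.__code__
    h = hashlib.sha256(code.co_code)
    h.update(repr(code.co_consts).encode())
    return h.hexdigest()
```

Storing the fingerprint alongside a node's outputs would let a runner re-run a node whose logic changed even when its input data did not.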

@WaylonWalker (Contributor):
I found that a parallel conversation is happening over on kedro.community.

https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90

@sergun (Author) commented Oct 3, 2020

@WaylonWalker thanks for the link!
It is interesting that in our company we created a very similar in-house solution based on old-school Makefiles :-)
With change detection based on the mtime of data files and processing scripts.
It works well, but we decided to try Kedro and were surprised that nothing similar is implemented here.

@sergun (Author) commented Oct 3, 2020

@mzjp2 thanks a lot! Great job!
It seems that in the future this should be a core concept rather than hook-based. What do you think?

@pascalwhoop (Contributor):
I think I'd like to get my hands dirty with this one. I'll look into this in the context of Hacktoberfest. I'll make a draft PR where I reference this issue and the discussion in kedro.community. Should be a fun one.
@dataengineerone @sergun FYI

@sergun (Author) commented Oct 6, 2020

@pascalwhoop I think it would be one of the most significant contributions to Kedro 🥇
I think it makes sense to consider both approaches: hash-based and time-based tracking of changes.

From my experience with Makefiles for ML, I can say it is really cool when you do not need to think about which nodes should be executed after some change (of params, data, or maybe code). BTW, in the Makefile-based solution, parameters were also files, and they were included in make recipes as dependencies.
The only problematic place with time-based tracking is when you add a whitespace or line-end symbol to a parameters file or a source script, and make wants to recalculate everything that depends on it :-)
I also like make because you identify a task by its artifact (filename), not by a node id. I find this more intuitive.
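Hash-based tracking sidesteps exactly this whitespace problem: instead of comparing mtimes, compare digests of (optionally normalized) file contents. A minimal sketch, with `content_digest` as a hypothetical helper:

```python
import hashlib

def content_digest(path, normalize_whitespace=True):
    """Digest of a text file; optionally collapse whitespace so that
    cosmetic edits (trailing spaces, newline style) do not register
    as changes."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    if normalize_whitespace:
        text = " ".join(text.split())
    return hashlib.sha256(text.encode()).hexdigest()
```

Collapsing whitespace changes the meaning of whitespace-significant formats (YAML, Python), so hashing the raw bytes is the safer default and normalization an opt-in per file.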

@stale bot commented Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 12, 2021
@stale stale bot closed this as completed Apr 19, 2021

6 participants