
Tracking dependencies by artifact's creation time like GNU make #528

Closed
sergun opened this issue Sep 27, 2020 · 14 comments
Labels
Issue: Feature Request New feature or improvement to existing feature

Comments

@sergun commented Sep 27, 2020

I'm sure we really need this.
It would provide the ability to resume computation based on the availability of artifacts and their creation time. Without this feature Kedro recalculates the entire pipeline from scratch, and that is not suitable in many cases.

@sergun sergun added the Issue: Feature Request New feature or improvement to existing feature label Sep 27, 2020
@921kiyo (Contributor) commented Sep 28, 2020

Hi @sergun, thank you for opening the feature request. Could you please elaborate a bit more on what kind of feature you are looking for? There's a template for feature requests, and it would be great if you could fill it in.
Which calculations would you like to track: the runtime of pipeline execution, IO operations, etc.?

@sergun (Author) commented Sep 28, 2020

Description

Currently Kedro can only calculate pipelines from scratch or with some manually specified settings (you can specify the nodes from which to calculate everything, or ask Kedro to calculate only the missing datasets via Runner.run_only_missing).

Context

In many use cases the GNU make model works better. As you know, make automatically tracks dependencies between artifacts (as Kedro does), but it also tracks the artifacts' mtime and understands which artifact should be re-created by comparing the mtime of the output artifact with the mtimes of its dependencies.
Such a model is suitable e.g. for feature engineering, when you have a lot of SQL scripts, each of which creates some temporary table. The tables are joined, and finally you get a resulting table with feature values. It is nice that a data scientist can modify some script and make will automatically understand that the table created by this script should be re-created, and all tables that depend on it should also be recalculated, while the other tables stay untouched. This use case is completely unsupported by Kedro.

Possible Implementation

No concrete ideas yet, but I think we could integrate mtime into the Dataset class and add the ability to select the nodes to be calculated based on the mtimes of each node's input/output Datasets.
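The make-style rule described here can be sketched in a few lines of plain Python. This is a sketch only: `node_is_stale` is a hypothetical helper, not Kedro API, and a real implementation would presumably live on the Dataset/Runner side.

```python
import os

def node_is_stale(input_paths, output_paths):
    """Make-style check: a node must re-run if any of its outputs is
    missing, or if any input is newer than the oldest output."""
    if not all(os.path.exists(p) for p in output_paths):
        return True
    newest_input = max(os.path.getmtime(p) for p in input_paths)
    oldest_output = min(os.path.getmtime(p) for p in output_paths)
    return newest_input > oldest_output
```

A runner could walk the DAG in topological order, run only the nodes for which this returns True, and treat every re-created output as "new" for its downstream nodes, which is exactly how make cascades rebuilds.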

Possible Alternatives

I do not see any.

@DChulok commented Sep 29, 2020

For me it would be really great to add such functionality!

@WaylonWalker (Contributor) commented Sep 30, 2020

I am not 100% sure that I follow what you are looking for, but my personal feeling so far is that this is a feature that could be achieved through the use of hooks.

All of the Kedro DataSets are currently backed by fsspec; a quick scan of its API revealed a .modified() method that returns a timestamp for a given path. I am not exactly sure how you would get to this inside a hook, though.

You can access a Kedro dataset instance dynamically via getattr(catalog.datasets, 'dataset_name'). Inside the dataset instance you will find all sorts of information about your dataset. I was able to get to the modified time of my datasets, but it did not appear to leverage fsspec's filesystem-agnostic methods; instead it seemed specific to the filesystem type I was using.

Once you have the time, what do you need to do with it? I am not familiar with GNU make or with how it uses mtime to avoid re-creating artifacts that do not need to be remade.

This would be another great application for #400: I could set a default update frequency (daily) and override it in my catalog to say that a given dataset is only refreshed weekly/monthly/on a cron expression. Then Kedro could figure out whether it is time to update or not.
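The frequency idea might look something like this. A sketch only: `needs_refresh` and the `FREQ_SECONDS` table are hypothetical helpers, not part of Kedro or of #400.

```python
import os
import time

# Hypothetical mapping from a catalog-level frequency tag to seconds.
FREQ_SECONDS = {"daily": 86400, "weekly": 7 * 86400, "monthly": 30 * 86400}

def needs_refresh(output_path, frequency="daily", now=None):
    """True when the output is missing or older than its refresh window."""
    if not os.path.exists(output_path):
        return True
    now = time.time() if now is None else now
    return now - os.path.getmtime(output_path) > FREQ_SECONDS[frequency]
```

A catalog entry could then carry something like a `refresh: weekly` field that gets looked up before the node runs.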

@sergun (Author) commented Oct 1, 2020

Thanks @WaylonWalker !

But I do not see how to skip execution of a node's processing function from a hook.
The idea is not to call this function if the mtimes of a node's input datasets are earlier than the mtimes of its output datasets and those outputs exist.
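The rule described here can be expressed as a plain-Python wrapper around a node's function. This is a sketch under stated assumptions: `make_skippable` is hypothetical, and whether a hook can install such a wrapper is exactly the open question in this thread.

```python
import os

def make_skippable(func, input_paths, output_paths):
    """Wrap a node function: skip the call when every output exists and
    no input is newer than the oldest output. The caller is expected to
    fall back to the persisted outputs when the wrapper returns None."""
    def wrapper(*args, **kwargs):
        if all(os.path.exists(p) for p in output_paths):
            newest_in = max(os.path.getmtime(p) for p in input_paths)
            oldest_out = min(os.path.getmtime(p) for p in output_paths)
            if newest_in <= oldest_out:
                return None  # outputs are fresh: signal "load, don't compute"
        return func(*args, **kwargs)
    return wrapper
```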

@mzjp2 (Contributor) commented Oct 1, 2020

I've quickly hacked together (with emphasis on hack) a prototype of what a hook would look like that enables something like this: https://gist.github.com/mzjp2/076bfd73b0215bda01ee71186966389d

@WaylonWalker (Contributor):
That is a really cool hook @mzjp2! If we could tag nodes with a run frequency and combine it with this, it would be easy to blindly run everything and only update the out-of-date nodes.

@WaylonWalker (Contributor):
> Thanks @WaylonWalker !
> But I do not see how to skip execution of a node's processing function from a hook. The idea is not to call this function if the mtimes of a node's input datasets are earlier than the mtimes of its output datasets and those outputs exist.

I see now. What if you grabbed the AST or bytecode of each node's function and cached it? Then you could check whether the function itself has changed since the last run, as well as whether the input data has changed.
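One way to sketch the bytecode idea (`func_fingerprint` is a hypothetical helper; note that CPython bytecode is version-specific, so cached fingerprints are only comparable within one interpreter version):

```python
import hashlib

def func_fingerprint(func):
    """Fingerprint a node function by hashing its compiled bytecode and
    constants; a changed fingerprint means the code itself changed."""
    code = func.__code__
    h = hashlib.sha256(code.co_code)
    h.update(repr(code.co_consts).encode())
    return h.hexdigest()
```

Storing the fingerprint alongside a node's outputs would let a runner re-run a node whose logic changed even when its input data did not.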

@WaylonWalker (Contributor):
I found that a parallel conversation is happening over on kedro.community.

https://discourse.kedro.community/t/speeding-up-pipeline-processing-with-change-detection/90

@sergun (Author) commented Oct 3, 2020

@WaylonWalker thanks for the link!
It is interesting that in our company we created a very similar in-house solution based on old-school Makefiles :-)
With change detection based on the mtime of data files and processing scripts.
It works well, but we decided to try Kedro and were surprised that nothing similar is implemented here.

@sergun (Author) commented Oct 3, 2020

@mzjp2 thanks a lot! Great job!
It seems that in the future this should be a core concept rather than hook-based. What do you think?

@pascalwhoop (Contributor):
I think I'd like to get my hands dirty with this one. I'll look into this in the context of Hacktoberfest. I'll make a draft PR where I reference this issue and the discussion in kedro.community. Should be a fun one.
@dataengineerone @sergun FYI

@sergun (Author) commented Oct 6, 2020

@pascalwhoop I think it would be one of the most significant contributions to Kedro 🥇
I think it makes sense to consider both approaches: hash-based and time-based tracking of changes.

From my experience with Makefiles for ML, I can say it is really cool when you do not need to think about which nodes should be executed after some change (of params, data, or maybe code). BTW, in the Makefile-based solution, parameters were also files, and they were included in make recipes as dependencies.
The only problematic place with time-based tracking is when you add a whitespace or line-end symbol to a parameters file or a source script, and make wants to recalculate everything that depends on it :-)
I also like make because you identify a task by its artifact (filename), not by a node id. I find this more intuitive.
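Hash-based tracking sidesteps exactly this whitespace problem: instead of comparing mtimes, compare digests of (optionally normalized) file contents. A minimal sketch, with `content_digest` as a hypothetical helper:

```python
import hashlib

def content_digest(path, normalize_whitespace=True):
    """Digest of a text file; optionally collapse whitespace so that
    cosmetic edits (trailing spaces, newline style) do not register
    as changes."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    if normalize_whitespace:
        text = " ".join(text.split())
    return hashlib.sha256(text.encode()).hexdigest()
```

Collapsing whitespace changes the meaning of whitespace-significant formats (YAML, Python), so hashing the raw bytes is the safer default and normalization an opt-in per file.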

@stale bot commented Apr 12, 2021

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Apr 12, 2021
@stale stale bot closed this as completed Apr 19, 2021

6 participants