-
Relevant discussions/resources:
I hadn't thought of using DVC, but on the face of it, that sounds like an awesome way to implement the functionality. Funnily enough, my gist above got a comment from @bvancil (https://github.com/bvancil) that also asked about the code-versioning functionality. I thought along the same lines as you, that comparing the AST of the node would be a cool way to do this, but your point about module-level things like imports changing is something I hadn't considered. Potentially it could be node-code + module-code (minus other functions/nodes)?
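To make the "node-code + module-code (minus other functions/nodes)" idea concrete, here is a minimal sketch using only the standard library. It is not an existing Kedro or DVC feature; the function name `node_fingerprint` is made up for illustration, and it assumes nodes are plain top-level functions in a module whose source text is available.

```python
import ast
import hashlib


def node_fingerprint(module_source: str, func_name: str) -> str:
    """Hash one function plus all module-level code that is not another function.

    Hypothetical helper: comments and formatting do not affect the result
    because ast.dump() omits line/column attributes by default.
    """
    tree = ast.parse(module_source)
    kept = []
    found = False
    for stmt in tree.body:
        is_func = isinstance(stmt, (ast.FunctionDef, ast.AsyncFunctionDef))
        if is_func and stmt.name != func_name:
            continue  # skip the other functions/nodes in the module
        if is_func:
            found = True
        # imports, constants, classes and the target function all end up here
        kept.append(ast.dump(stmt))
    if not found:
        raise ValueError(f"no top-level function named {func_name!r}")
    return hashlib.sha256("\n".join(kept).encode("utf-8")).hexdigest()
```

With this, editing a sibling node or adding a comment leaves the fingerprint unchanged, while changing an import or the node body changes it. Docstring edits also change it, since docstrings are part of the AST; classes are treated as module-level code here, which may or may not be what you want.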
-
The flip side of this discussion on partial rebuilds -- suppose I'm using argo-workflows to run my pipeline. I want to get all the artifacts into DVC so that I can later use them for partial rebuilds. Should I write wrappers that check in outputs? What about logs and error outputs? Also, in a distributed environment I can't just rely on the local work tree -- it would seem that I need a versioning policy to guide how the commits are merged. Has any of this work been done already? Any suggestions on where to look, or where to put it if I want to build something myself? Update -- DVC checkpoints seem to be the thing to integrate with. I guess the approach would be to make Kedro the source of authority and automate generation of dvc.yaml, etc. (unless there is some reason to think about doing it the other way around?)
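As a rough sketch of "Kedro as the source of authority, dvc.yaml generated from it": the snippet below builds the `stages` mapping that a dvc.yaml file holds (`cmd`, `deps`, `outs` per stage) from a list of node descriptions. `NodeSpec` and `build_dvc_stages` are hypothetical names, the file paths are invented, and the `kedro run --node=...` command template is an assumption that depends on your Kedro version. Nothing here touches checkpoints.

```python
import json
from typing import Dict, List, NamedTuple


class NodeSpec(NamedTuple):
    """Hypothetical description of a pipeline node with file-backed datasets."""
    name: str
    inputs: List[str]   # paths of the files the node reads
    outputs: List[str]  # paths of the files the node writes


def build_dvc_stages(
    nodes: List[NodeSpec],
    cmd_template: str = "kedro run --node={name}",  # assumed CLI form
) -> Dict[str, dict]:
    """Return a dict shaped like the contents of a dvc.yaml file."""
    stages = {}
    for node in nodes:
        stages[node.name] = {
            "cmd": cmd_template.format(name=node.name),
            "deps": sorted(node.inputs),
            "outs": sorted(node.outputs),
        }
    return {"stages": stages}


example = build_dvc_stages(
    [
        NodeSpec("split", ["data/raw.csv"], ["data/train.csv", "data/test.csv"]),
        NodeSpec("train", ["data/train.csv"], ["data/model.pkl"]),
    ]
)
# Printed as JSON for inspection; dump it with a YAML library to get dvc.yaml.
print(json.dumps(example, indent=2))
```

In a real plugin the node list would come from the Kedro pipeline and catalog rather than being written by hand, and datasets without a file path (in-memory ones, for instance) would need separate handling.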
-
An initial stab at a design -- https://github.com/FactFiber/kedro-dvc/blob/main/doc/design.md BTW ...
:( ...
-
I've written some more detail into the basic correspondence between Kedro and DVC. A question -- is there a mechanism for attaching extra metadata to data catalog entries, nodes and pipelines? Perhaps a generic mechanism would suffice, e.g. a meta=dict(kedro_dvc=dict( ...)) or the equivalent in YAML for the data catalog. This could also be provided externally, but it gets harder to be DRY that way.
-
Hello Kedro devs/mods, any updates on this work? I read that Kedro provides data versioning, but it simply writes data to a folder with a timestamp. Recently there have been some demos that show how MLflow and DVC can work together. I am wondering whether there are any efforts to make DVC work with the Kedro Data Catalog. It would solve many problems, and Kedro could then play nicely with other mature open-source tools in the MLOps space. Otherwise it feels as if every tool becomes a Swiss-army knife of MLOps. I'd appreciate any pointers, comments, or suggestions!
-
Hi All,
I have been thinking of creating a Kedro plugin which leverages DVC's version-controlled data together with Kedro's beautiful pipelines. I have read a discussion of this idea somewhere before, but I can’t find it anymore. Maybe GitHub, Discord, etc.
Background: At work I had the problem of jobs taking a long time and failing often, but once they ran successfully, all was good. However, after changing parts of the pipeline for a while, you no longer know whether your data is up to date, so you have to rerun everything. And this takes a long time. So I used DVC to track my input data (e.g. CSV, Hadoop data*, etc.), but I also tracked the scripts (.py, .sh, etc.). When any one of those changed, the pipeline would run only what was necessary. For development this was a game changer.
*) I added an extra step, which simply called ‘DESCRIBE my_hive/impala_table’, saved the output somewhere, and then DVC’ed just this proxy. If the Data on Hive changed, DVC would not know - but this was fine.
For Kedro I was thinking of creating a plugin/hook which has very similar behaviour. A node's input is very easily tracked by DVC (especially easy for local data). The code is not so easily tracked.
Some difficulties: I cannot track all the code, because if I change Node17 then I would have to rerun everything, because according to DVC some code changed.
Thus I want a DVC run to look only at the current node's code and data. I was thinking of using Python’s inspect module to track the code for this node. I think this would work. The problem might be that the node imports code from a different file. If this file then changes, would DVC know? Is it possible to track all relevant code?
DVC can only run one task at a time. Thus integrating with kedro-accelerator would be impossible. Right?
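A minimal sketch of the inspect-based idea from the paragraphs above, to show what it does and does not catch. Both helper names are hypothetical and `example_node` is just a stand-in for a real node function. `source_hash` hashes the text returned by `inspect.getsource`, and `referenced_module_files` lists the source files of modules that the function refers to directly through its globals, which could then be handed to DVC as extra dependencies. It does not follow imports transitively, so a change two files away would still go unnoticed.

```python
import hashlib
import inspect
import json
import sys
from types import ModuleType
from typing import Callable, List


def source_hash(func: Callable) -> str:
    """Hash the source text of a single function (needs a real source file)."""
    source = inspect.getsource(func)
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def referenced_module_files(func: Callable) -> List[str]:
    """Files of the function's own module and of modules it names directly."""
    candidates = [sys.modules.get(func.__module__)]
    for name in func.__code__.co_names:
        value = func.__globals__.get(name)
        if isinstance(value, ModuleType):
            candidates.append(value)  # e.g. `import json`
        elif value is not None:
            # e.g. `from helpers import clean`: look up the defining module
            owner = getattr(value, "__module__", None)
            if isinstance(owner, str):
                candidates.append(sys.modules.get(owner))
    files = {getattr(module, "__file__", None) for module in candidates}
    return sorted(path for path in files if path)


def example_node(record: dict) -> str:
    """Stand-in for a node function; it depends on the json module."""
    return json.dumps(record, sort_keys=True)
```

Calling `referenced_module_files(example_node)` returns the path of the `json` package's source file, plus the file that defines `example_node` when there is one. Note that `inspect.getsource` fails for functions defined interactively or via `exec`, so `source_hash` only works for code that lives in files.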
To reiterate: The goal of kedro-dvc would not be data version control per se; rather, it would help during development of Kedro pipelines. You could be sure that the code/data is always up to date while only having to run individual nodes - the rest can be skipped because their data is up to date.
The goal is not to keep the data versioned (I guess you should use versioned datasets xD)
Before I embark on this, I would like to ask the community for feedback.
Thanks!