
store whole DAG in one DVC-file #1871

Closed
Casyfill opened this issue Apr 10, 2019 · 56 comments · Fixed by #3584
Labels: feature request (Requesting a new feature) · p2-medium (Medium priority, should be done, but less important) · product: VSCode (Integration with VSCode extension) · research

Comments

@Casyfill

I understand the merits of having multiple .dvc files for complex processes,
but it would be great to have the option to store the whole DAG in one Dvcfile!

I feel it would help the overall readability of the structure.

@ghost added the feature request (Requesting a new feature) label Apr 11, 2019
@ghost

ghost commented Apr 11, 2019

I might be overestimating this, but it looks to me like we would need a huge refactor to support this, or at least to stop treating Dvcfile as a stage and create a separate class to handle parsing such a file into a pipeline.

I can see the analogy with make on this one, but from my perspective, dvc is a CLI tool to write, read, and execute (stage) files, whereas make is limited to reading and controlling the execution of (make) files.

I do appreciate the aesthetics of having everything in a single file, though.

@Casyfill
Author

Casyfill commented May 2, 2019

I see.
As a last resort, perhaps it would at least be possible to add paths to the previous steps as comments in the file?

@shcheklein
Member

@Casyfill I think we will release a new version very soon (next week? @efiop can confirm) that will preserve comments in the DVC files. I'm not sure we should be generating these comments on the DVC side though - it's hard to keep them up to date. What editor are you using? Maybe it's a good feature to support on the editor level - navigate up/down the pipeline cc @prihoda

@shcheklein
Member

Just for the context - we got a request on the ods.ai #tool_dvc channel to support this:

(translation is mine)

Hi! Is there a way (feature/third-party tool/script/...) to describe the whole dvc-pipeline as a single file (as build tools like make, bazel, maven, etc. do) instead of having a file per stage?

@efiop added the p4-not-important, p3-nice-to-have (It should be done this or next sprint), and p2-medium (Medium priority, should be done, but less important) labels and removed the p4-not-important and p3-nice-to-have labels Jul 23, 2019
@salotz

salotz commented Dec 12, 2019

I can see the analogy with make on this one, but from my perspective, dvc is a CLI tool to write, read, and execute (stage) files, whereas make is limited to reading and controlling the execution of (make) files.

The single-file approach could be built on top of the CLI approach as an extra, optional layer. I would suggest something like tup as an alternative syntax to make, which I happen to like. It inverts the dependency graph into a flow graph, so rules read like Unix pipes; it can also be faster due to the way it checks for updates. A rule reads left to right, inputs through a command into outputs, as in the sketch below.
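
A minimal Tupfile sketch with hypothetical file names (in tup, %f expands to the rule's inputs and %o to its outputs):

: data.csv |> python process.py %f %o |> processed.csv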

I do appreciate the aesthetics of having everything in a single file, though.

It is about more than just aesthetics (although great aesthetics in a build tool would be nice). It's a complex problem, and waterbed theory is real.
If you only want dvc to be analogous to gcc, I suppose that is fine, but as projects get more complex you will want some sort of build tool like make, scons, waf, redo, and so on. Eventually people will build the same thing in some ad hoc, bad way, so I see some sort of build automation tool as an inevitability. I will probably be integrating dvc into my invoke tasks so that I can take care of pipeline things that aren't directly related to ML pipelines and wouldn't require dvc run. You could consider integrating or pairing with scons or waf, which are both Python projects.

Since dvc is a Python project, there is always the option of defining pipelines in Python, even though the syntax suffers a little. The way prefect builds its pipelines using a context manager isn't so bad.
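
A Prefect 1.x-style sketch of that pattern, with hypothetical task names - calling tasks inside the Flow context builds the DAG rather than executing immediately:

from prefect import Flow, task

@task
def prepare():
    return "data"

@task
def train(data):
    return f"model trained on {data}"

with Flow("example-pipeline") as flow:
    model = train(prepare())  # builds the graph, does not execute yet

flow.run()  # executes the DAG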

Not really a solution, since not everyone uses emacs and its awesome UI features can't be emulated on all platforms, but magit is the only way I interact with git, and it makes it very easy. A graphical interface could likewise solve some of these issues without sacrificing the versionability of generated files or introducing new files.

My two cents 😄 Also, this essay is great reading on exactly this kind of stuff: https://ngnghm.github.io/blog/2016/04/26/chapter-9-build-systems/

@efiop added the research and product: VSCode (Integration with VSCode extension) labels Jan 23, 2020
@efiop self-assigned this Jan 30, 2020
@efiop
Contributor

efiop commented Feb 7, 2020

Ok, so this issue has triggered a recurring discussion about separating the pipeline and data subsystems of dvc. The main reason is that dvc-files are modified by dvc on every dvc repro/run, so storing multiple stages in one dvc-file would absolutely turn into a merge nightmare.

If we forget about pipelines, then DVC-files could become one-liner placeholder files that would be easy to merge and to resolve conflicts in. They would be used by commands like pull/push/checkout/etc to handle the data itself:

NOTE: I'm oversimplifying the json/yaml formatting here for brevity, please don't mind it.

$ dvc add path/to/something
$ cat path/to/something.dvc
90104d9e83cfb825cf45507e90aadd27

So they are no longer human-readable/editable. They are still acting as a placeholder, giving visibility to the user that some file is located there, even through github web UI and without running dvc checkout.

Now to the pipeline part. As everyone has experienced, dvc-files are pretty weird in that they are modified both by you and by dvc (writing/changing hashes). This causes non-trivial merge conflicts and is causing us a lot of hassle trying to preserve comments and the like. Plus, dvc run is notoriously cumbersome when it comes to defining a long list of deps and outs on your command line. To solve all of that, we could create new pipeline-files that would be hand-written and never touched by dvc itself. Those pipeline-files would be basically like current dvc-files but without the hashes, and could also contain more than one pipeline stage (this is what this ticket is about). To store the hashes for the dependencies and outputs that those stages were run with, dvc would auto-generate a special hash file: basically a one-line (to avoid git conflicts) yaml or json with the list of deps and outs and their hashes. So it would look something like:

$ cat Pipeline # Feels like `Dvcfile` would be a nice fit here, but not using it for compat reasons
stages:
  - cmd: mycmd1 dep out1 out2
    deps:
      - dep
    outs: # don't worry about metrics and stuff, it could be expanded from this no probs
      - out1
      - out2
$ dvc run Pipeline
$ cat out1.dvc
94614d6650e062655f9f77507dc9c1f2
$ cat out2.dvc
06cda08aa427e73492389a0f17c72d91
$ cat Pipeline.hash # or some better name
{"md5": "45678854", "deps": {"dep": "655f9f7750"}, "outs": [{"out1": "94614d6650e062655f9f77507dc9c1f2"}, {"out2": "06cda08aa427e73492389a0f17c72d91"}]} # note that `md5` is a hash of the Pipeline file itself

This way, when merging .dvc files you choose your data version, but you don't have to touch your Pipeline.hash file at all - dvc will know by itself whether it needs to re-run. So we get a complete decoupling of the data and pipeline parts, which is pretty sweet. External outs/deps could be handled on top of that, I just don't want to get into those details here.
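
To make the proposed check concrete, here is a minimal sketch (hypothetical helper names, not actual dvc internals) of how repro could consult Pipeline.hash to decide whether a stage is stale:

import hashlib
import json

def file_md5(path):
    # dvc uses md5 content hashes for data files
    with open(path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()

def stage_is_stale(pipeline_path="Pipeline", hash_path="Pipeline.hash"):
    with open(hash_path) as f:
        last_run = json.load(f)
    # re-run if the Pipeline file itself changed...
    if file_md5(pipeline_path) != last_run["md5"]:
        return True
    # ...or if any dependency's hash differs from the recorded one
    return any(file_md5(dep) != h for dep, h in last_run["deps"].items())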

Obviously, if we go that route, we have two options: try to keep backward compatibility, or release 1.0 and basically start over. I have two concerns about the former: naming the new files and keeping old code alive. If we just move on to 1.0, we'll be able to drop all the stage-writing code and all dvc run logic as we know it today, and overall it seems like it would be much more elegant. But breaking compatibility is a very big pill to swallow...

Would love to hear what you guys think. CC @iterative/engineering

@prihoda
Contributor

prihoda commented Feb 7, 2020

In my opinion this is a step in the right direction; I would go for adopting this change and breaking compatibility, but it definitely brings up a lot of questions.

  • So this means that all of the stages would be defined using one or multiple Pipeline files in your repo?
  • How do you handle cases where a new version of the Pipeline is committed but not the Pipeline.hash (or vice versa), making it incompatible for other users?
  • Do you actually need to keep the .dvc files, since the hash is stored in the Pipeline.hash file? Are they now only a convenience for checking out out1 using an already existing dvc checkout out1.dvc?
  • What happens if the hash in the .dvc file is different from Pipeline.hash?
  • How do you handle merge conflicts in the Pipeline.hash file? This could occur when the user first designs the Pipeline, commits it, and then different users run a different part of it.

One thing that might make it simpler is to ditch the Pipeline.hash file and keep storing all the hashes in the .dvc file.

On a slightly unrelated note, one thing I would welcome is an option to explicitly configure the locations where Pipeline files can be discovered, so that the whole repo does not have to be listed on every DVC action. This has been a pain point for me personally when working with thousands of files on a shared filesystem on our HPC, since listing can take up to several minutes. I actually opted to use Makefiles in those cases for that reason.

@efiop
Contributor

efiop commented Feb 7, 2020

So this means that all of the stages would be defined using one or multiple Pipeline files in your repo?

I think it would make sense to allow multiple pipeline files, so that you could have one in your subdirs and do stuff like that. Having only one pipeline file seems too strict and limiting.

How do you handle cases where a new version of the Pipeline is committed but not the Pipeline.hash (or vice versa), making it incompatible for other users?

There is also an idea of not having Pipeline.hash at all, but rather storing it in .dvc/build-cache (or similar), keyed as something like <hash of cmd + dep hashes> -> <out hashes>. Kudos to @Suor. That way we will get a build cache by design, and could possibly push it to the dvc remote alongside the data. If some file is missing, other users will simply have to re-run their pipeline, though the results will already be there.
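
As a rough illustration (the exact layout was not specified here, so this is an assumption), a build-cache entry could be a single-line json file named by the hash of the command plus the dependency hashes, containing the resulting output hashes from the example above:

$ cat .dvc/build-cache/3f2a9c...   # file name = hash(cmd + dep hashes), hypothetical
{"outs": {"out1": "94614d6650e062655f9f77507dc9c1f2", "out2": "06cda08aa427e73492389a0f17c72d91"}}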

Do you actually need to keep the .dvc files, since the hash is stored in the Pipeline.hash file? Are they now only a convenience for checking out out1 using an already existing dvc checkout out1.dvc?

Yes, that is what makes the build-cache idea possible (described above). So .dvc files will only deal with caching, and Pipeline.hash (or something in .dvc/build-cache) will deal with pipelines. It is best to think of those as completely separate mechanisms. So the hash files for pipelines are not used by checkout at all.

What happens if the hash in the .dvc file is different from Pipeline.hash?

If the corresponding data file has the same hash as described in the .dvc file, but that hash differs from the one in Pipeline.hash, it means your pipeline needs to be re-run. So think of these files only as metadata about the last execution, which gets checked against the real world to determine whether we need to re-run or not.

How do you handle merge conflicts in the Pipeline.hash file? This could occur when the user first designs the Pipeline, commits it, and then different users run a different part of it.

In the build-cache idea described above, .dvc/build-cache is keyed by the hash of cmd + dep hashes, so there won't be conflicts. And since we are considering pushing those to the dvc remote, there won't be merges for them at all, since they won't be tracked by git.

One thing that might make it simpler is to ditch the Pipeline.hash file and keep storing all the hashes in the .dvc file.

Yes, theoretically we could combine both approaches under one .dvc file umbrella. E.g. if a particular .dvc file only has a hash inside, it is "data" metadata that would be used by checkout and friends. And if it is a list of stages (with commands/deps/outs), then it is treated as a pipeline file, so hashes won't be written into it but will instead be stored in .dvc/build-cache. I really like your idea, I think this will work even if we decide to keep backward compatibility 👍
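
A sketch of the two forms under that one umbrella, reusing the hypothetical names from the examples above:

$ cat something.dvc   # "data" metadata: just a hash, used by checkout/pull/push
90104d9e83cfb825cf45507e90aadd27
$ cat pipeline.dvc    # pipeline file: stages only, hashes never written back
stages:
  - cmd: mycmd1 dep out1 out2
    deps:
      - dep
    outs:
      - out1
      - out2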

One thing I would welcome is an option to explicitly configure the locations where Pipeline files can be discovered, so that the whole repo does not have to be listed on every DVC action. This has been a pain point for me personally when working with thousands of files on a shared filesystem on our HPC, since listing can take up to several minutes. I actually opted to use Makefiles in those cases for that reason.

Have you considered adding those giant directories to .dvcignore? I don't really like the idea of such configuration, as it is less intuitive and less flexible than simply collecting the files. Plus, we would still have to collect .dvcignores and the like, and also collect old dvcfiles for backward compatibility.

@prihoda
Contributor

prihoda commented Feb 7, 2020

There is also an idea of not having Pipeline.hash at all, but rather storing it in .dvc/build-cache (or similar), keyed as something like <hash of cmd + dep hashes> -> <out hashes>.

Can you elaborate more on the build-cache idea? I think I am missing something.

@jorgeorpinel changed the title from "store whole DAG in one Dvcfile" to "store whole DAG in one DVD-file" Feb 7, 2020
@jorgeorpinel changed the title from "store whole DAG in one DVD-file" to "store whole DAG in one DVC-file" Feb 7, 2020
@jorgeorpinel
Contributor

jorgeorpinel commented Feb 7, 2020

If we forget about pipelines, then DVC-files could become one-liner placeholder files that would be easy to merge and resolve conflicts
90104d9e83cfb825cf45507e90aadd27

Are we sure? I suspect that changes to the hash are what cause merge conflicts anyway, and the hash would still be the one thing left in a single-line DVC-file. I recommend trying this out with some mock files in Git first.

As everyone has experienced, dvc-files are pretty weird in that they are modified both by you and by dvc

I personally never change DVC-files manually, and it's not really something we advocate for, in the docs at least. I wonder how many users really need to do this. Maybe I'm totally off, but I think human-readable DVC-files are great and help people understand what's happening (they also help with examples in the docs), but that doesn't necessarily mean people should edit them outside of edge cases.


That said:

we could create new pipeline-files that would be hand-written and never touched by dvc itself... and could also contain more than one pipeline stage

It's not an unattractive idea (and I do think Dvcfile would be a nice default name, kind of like Dockerfile). In fact, it sounds like a cool feature that would allow more automation/integrations in the future.
But would it be possible to make this an alternative feature, keeping the existing possibility to define individual stages via dvc add/run/import?

Pipeline.hash # or some better name

Dvcfile.lock? But as mentioned above this approach reintroduces the merge conflict situation...

try to keep backward compatibility or release 1.0 and basically start over

Seems like a 2.0 for sure, but I'm not seeing why we couldn't keep backward compatibility with multi-line DVC-files, as mentioned in my previous comment (except for the Dvcfile name).

an idea of not having pipeline.hash, but rather storing it in .dvc/build-cache... based on the hash of cmd + dep hashes, so there won't be conflicts. But also since we are considering pushing those to dvc remote, there won't be merges for those at all, since they won't be tracked by git.

I'm also not getting how the build cache would work exactly, what it contains, etc. Maybe a more visual example could help describe it?

Thanks!

@dmpetrov
Member

dmpetrov commented Feb 7, 2020

It is a really great design proposal from @efiop. Very solid ideas. I especially like the idea of rolling out the pipeline hashes under the build-cache, thanks to @Suor. (Except that I don't like the name: run-cache might be better than build-cache.)

It is great that @efiop is deeply immersed in the problem. (To make it even more challenging 😄) I encourage you to think about related DVC areas and how they can affect the proposed design:

  1. So, build-cache (repro: use build cache for deterministic stages #1234) is already involved. Good - a very useful scenario.
  2. If we separate pipelines from data sources, it might affect other scenarios:
    1. Some dvc gc strategies might depend on pipelines - like removing everything but source datasets. How can we make that separation? Should we move these GC strategies under pipeline commands/modules/packages?
    2. Should we mix the data source cache and the pipeline cache? How about extracting the latter to a separate dir .dvc/cache/run/ or even to a separate remote?
    3. We can go even further if we separate dirs in the cache... This relates to scenarios around granular remote management, like Specify what should be pushed to different remotes #2095. A possible solution: pull different remotes into different dirs, or give users the ability to specify a cache subdir for each import.
  3. How about metrics? Should we separate them from data sources as well, and how? Btw, we will be introducing a new type of metric for plots and visualization commands in DVC. How might this affect the design?
  4. Are you thinking about extracting a separate module at the level of DVC code, or a library/package? What are the benefits?

@skshetry
Member

With the previous approach (the lockfile plus the multistage Dvcfile), even though the hashes were duplicated between the output stage file and the lockfile, they were different concepts. So, the user could still think of (lockfile + dvcfile) as a single concept (they need not care about the lockfile at all) and could just dvc repro Dvcfile:stage_name. There's no ambiguity, as there is a clear separation between a pipeline stage and an output file stage.

But with the newly suggested approach, it's quite the opposite. The pipelines.yaml will house stage templates for multiple stages (similar to the Dvcfile in the previous approach). This yaml file will generate multiple stage entries in a single file (just a list of whatever we have in our current single-stage file), which will hence bear the checksums used by data-related commands, etc. So this removes the need for a lockfile and removes the duplication of checksums.

But it does share the same concepts and the same stage structure among different files, which might make things complicated for the user. E.g.:
Should dvc lock pipelines.yaml:stage_name be used, or should it be dvc lock pipelines.dvc:stage_name?
Maybe we should allow both? We might need to duplicate some information in both files anyway (e.g. locked, persist, cache, etc.), so duplication is still there.

@efiop
Contributor

efiop commented Apr 21, 2020

Should dvc lock pipelines.yaml:stage_name be used, or should it be dvc lock pipelines.dvc:stage_name?

pipeline.yaml is the ultimate source of truth for everything except hashes, so lock should be stored in pipeline.yaml and not in pipeline.dvc. locked: True doesn't seem to matter for .dvc, so there is no need to include it there. Same with persist. With cache it is a bit trickier, because it affects things like push/pull/checkout, but we could again treat the corresponding pipeline.yaml as the ultimate source of truth and not store cache: True/False in .dvc.
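
A sketch of where those flags would live under this split (hypothetical stage, following the list-style stage layout from earlier in the thread):

$ cat pipeline.yaml
stages:
  - name: train
    cmd: python train.py
    deps:
      - data.csv
    outs:
      - model.pkl
    locked: true   # stage-level flags live only in the hand-written file
    cache: false   # push/pull/checkout would look them up here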

efiop added a commit to efiop/dvc that referenced this issue Apr 22, 2020
This patch introduces `.dvc/cache/stages` that is used to store previous
runs and their results, which could then be reused later when we stumble
upon the same command with the same deps and outs.

Format of build cache entries is single-line json, which is readable by
humans and might also be used for lock files discussed in iterative#1871.

Related to iterative#1871
Local part of iterative#1234
@jorgeorpinel
Contributor

jorgeorpinel commented Apr 23, 2020

Hi, just sharing one of my answers given privately...

Should dvc lock pipelines.yaml:stage_name be used, or should it be dvc lock pipelines.dvc:stage_name? Maybe we should allow both?

If the base name is the same in both, why not just pipelines:split-into-two? DVC can determine which internal file to get the info from. And when you're using the default pipelines file name (pipelines.yaml/lock) you could just skip it: dvc checkout split-into-two (maybe just print a warning).
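
For illustration, the three addressing forms under discussion (split-into-two being a hypothetical stage name):

$ dvc checkout pipelines.yaml:split-into-two   # fully explicit
$ dvc checkout pipelines:split-into-two        # base name only; DVC infers the file
$ dvc checkout split-into-two                  # default pipelines file assumed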

What will dvc add DVC-files look like? Will those remain the same? And should we call them something else, in order to stop using the term "DVC-file" (esp. if we keep Dvcfile as the default pipeline file name)? Maybe "DVC metadata file", or just "metadata file", or "data source file"... Idk

@skshetry
Member

That was meant to be a secret presentation. 😅
@jorgeorpinel, the dvc add-ed files will still generate .dvc files. We are only changing the format of the stage files.

you could just skip it: dvc checkout split-into-two (maybe just print a warning).

Yes, for the naming: if it's the default file, we can go with that. But for the data-related commands it'd be better to be explicit with the file naming, because otherwise it's going to be confusing.

@jorgeorpinel
Contributor

jorgeorpinel commented Apr 24, 2020

Np. Original comment deleted, text moved to #1871 (comment) above. And thanks for the answers (both the secret ones and the ones above).

for the data-related commands it'd be better to be explicit with the file naming, because otherwise it's going to be confusing

I'm not sure I get why that would be confusing. Is it because you need to open the pipeline file to know/remember the stage names? dvc pipeline show can probably help with this.

efiop added a commit to efiop/dvc that referenced this issue Apr 28, 2020
@skshetry
Member

skshetry commented Apr 29, 2020

@jorgeorpinel, the only concern I have is that because some of the data commands allow granular operations on files (e.g. dvc checkout foo), a file target might overlap with dvc checkout foo where foo is a stage name.
Ideally, we cannot know what the user's intention was. We can check the stages to see if the name is mentioned and then fall back to whichever way is possible, or even tell the user that it's ambiguous. But this is us guessing, right?

Take an example: a user has a file foo and creates a stage with the same name. Suppose that, later, foo is no longer tracked. Now, when the user does dvc checkout foo, we will fall back to the stage checkout, which may or may not be what the user intended. The :stage_name addressing makes the intention clear.

^ is me talking ideally. If we don't have better ideas, we can easily fall back; there's no problem with that.

Another approach is to do nothing and keep the :stage_name addressing as-is. This would let us move away from *.dvc and *.yaml files easily, and the user could just do dvc checkout file --stage or something to check out files from the stage that a specific file belongs to (and we can also keep :stage_name for convenience). My point is to make data-related commands primarily about data, not stages or dvc files, while pipeline-related commands can use a plain and simple stage name (e.g. dvc repro <stage_name>).
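
A sketch of the ambiguity being described, with hypothetical names:

$ dvc run --name foo --outs foo "./make_foo.sh"   # stage named foo, producing a file foo
$ dvc checkout foo    # file target or stage name? ambiguous once foo is untracked
$ dvc checkout :foo   # unambiguous: the stage named foo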

@skshetry
Member

skshetry commented Apr 29, 2020

@iterative/engineering, this should be ready to try out. I'd love to get feedback.
Just check out master, or pip install via:

$ pip install --user https://github.com/iterative/dvc/archive/master.zip

Remember, the pipeline file is only generated if you specify the hidden -n/--name flag on dvc run (otherwise, it falls back to old-style *.dvc files).

A working example to try out:

$ dvc run --name "generate-foo" --outs foo \
      "echo 'foo' > foo"

And, the stage can be addressed via :stage_name (eg: dvc repro :generate-foo).
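
For reference, a run like this should produce a multistage pipeline file looking roughly as follows (the exact file name and schema were still in flux at this point, so this layout is an assumption):

$ cat pipelines.yaml
stages:
  generate-foo:
    cmd: echo 'foo' > foo
    outs:
      - foo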

efiop added a commit to efiop/dvc that referenced this issue Apr 29, 2020
efiop added a commit that referenced this issue Apr 29, 2020
@jorgeorpinel
Contributor

because some of the data commands allow granular operations on files (e.g. dvc checkout foo), a file target might overlap with dvc checkout foo where foo is a stage name... we cannot know what the user's intention was

So are you changing the implicit arguments accepted by these commands? I think either the stage needs a flag, e.g. --stage, or path targets need one, e.g. --
Otherwise, how are you detecting what kind of argument you're getting anyway? Does argparse even support this?

The :stage_name addressing makes the intention clear... do nothing and keep the :stage_name addressing

OK, this also works, but flags are more explicit (which is good). Both could be supported, I guess.

My point is to make data-related commands primarily about data, not stages or dvc files, while pipeline-related commands can use a plain and simple stage name

Agree

@efiop
Contributor

efiop commented May 5, 2020

Closing in favor of #3693. Multistage dvcfiles are now the default for dvc run in alpha release 1.0.0a0.

@efiop closed this as completed May 5, 2020