store whole DAG in one DVC-file #1871
Comments
I might be overestimating this, but it looks to me that we would need to do a huge refactor to support this, or at least not treating I can see the analogy with I do appreciate the aesthetics of having everything in a single file, though. |
I see. |
@Casyfill I think we will release a new version very soon (next week? @efiop can confirm) that will preserve comments in the DVC files. I'm not sure we should be generating these comments on the DVC side though, since it's hard to keep them up to date. What editor are you using? Maybe it's a good feature to support at the editor level: navigating up/down the pipeline. cc @prihoda |
Just for the context - we got a request on the ods.ai #tool_dvc channel to support this: (translation is mine)
|
The single-file approach could be built upon the CLI approach as an extra optional layer. I suggested something like tup as another syntax option to make, which I happen to like. It inverts the dependency graph to be a flow graph and they look like unix pipes. It also can be faster due to the way it checks for updates.
It is about more than just aesthetics (although great aesthetics in a build tool would be nice). It's a complex problem and waterbed theory is real. Since it is a Python project, there is always the option of making pipelines in Python, even though the syntax suffers a little bit. The way that prefect does their pipelines, using a context manager, isn't so bad. Not really a solution, since not everyone uses emacs and you can't emulate its awesome UI features on all platforms, but magit is really the only way I interact with git and it makes that very easy. A graphical interface could also solve some of the issues without sacrificing the versionability of generated files and without introducing new files. My two cents 😄 Also, this essay is awesome reading for exactly this kind of stuff: https://ngnghm.github.io/blog/2016/04/26/chapter-9-build-systems/ |
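To make the "pipelines in Python with a context manager" idea concrete, here is a minimal sketch. It is not DVC's or prefect's API: the `Pipeline` class, the `pipeline()` helper and the stage names are all hypothetical, and the only assumption is that a stage depends on whichever stage produces one of its deps.

```python
from contextlib import contextmanager
from graphlib import TopologicalSorter  # standard library, Python 3.9+


class Pipeline:
    """Hypothetical in-Python pipeline definition (not a DVC or prefect API)."""

    def __init__(self, name):
        self.name = name
        self.stages = {}  # stage name -> {"cmd": ..., "deps": [...], "outs": [...]}

    def stage(self, name, cmd, deps=(), outs=()):
        self.stages[name] = {"cmd": cmd, "deps": list(deps), "outs": list(outs)}

    def order(self):
        """Return stage names so that producers come before their consumers."""
        producer = {out: name for name, s in self.stages.items() for out in s["outs"]}
        # Map each stage to the set of stages that produce its deps.
        graph = {
            name: {producer[d] for d in s["deps"] if d in producer}
            for name, s in self.stages.items()
        }
        return list(TopologicalSorter(graph).static_order())


@contextmanager
def pipeline(name):
    p = Pipeline(name)
    yield p
    p.order()  # validate on exit: raises graphlib.CycleError if stages form a cycle


with pipeline("example") as p:
    # Stages may be declared in any order; the DAG is derived from deps/outs.
    p.stage("train", "python train.py", deps=["features.csv"], outs=["model.pkl"])
    p.stage("featurize", "python featurize.py", deps=["raw.csv"], outs=["features.csv"])

print(p.order())  # ['featurize', 'train']
```

The whole DAG lives in one Python file here, which is the readability argument made in this thread; the trade-off is that the definition is code rather than a declarative file.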
Ok, so this issue has triggered a recurring discussion about separating the pipeline and data subsystems of dvc. The main reason being that dvc-files will be modified by dvc on every If we forget about pipelines, then DVC-files could become one-liner placeholder files that would be easy to merge and resolve conflicts in. They would be used by commands like NOTE: I'm oversimplifying some formatting of json/yaml files here for simplicity, please don't mind it
So they are no longer human-readable/editable. They are still acting as a placeholder, giving visibility to the user that some file is located there, even through the GitHub web UI and without running Now to the pipeline part. As everyone has experienced, dvc-files are pretty weird in the regard that they are modified both by you and by dvc (writing/changing hashes). This causes non-trivial merge conflicts and is causing us a lot of hassle trying to preserve comments and stuff like that. Plus,
this way when merging Obviously, if we go that route, we will have two approaches there: try to keep backward compatibility or release 1.0 and basically start over. I have two concerns for the former one: naming new files and keeping old code alive. If we just move on to 1.0, we'll be able to drop all stage-writing code, all Would love to hear what you guys think. CC @iterative/engineering |
In my opinion this is a step in the right direction. I would go for adopting this change and breaking compatibility, but it definitely brings up a lot of questions.
One thing that might make it simpler is to ditch the On a slightly unrelated note, one thing that I would welcome is an option to explicitly configure the locations where pipeline files can be discovered, so that the whole repo does not have to be listed with every DVC action. This has been a problem for me personally when working with thousands of files on a shared filesystem on our HPC, since listing can take up to several minutes. I actually opted to use Makefiles in those cases for that reason. |
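The request above (only look for pipeline files in configured locations) could be sketched roughly as follows. This is an illustration, not an existing DVC option: the function name, the `search_dirs` parameter and the `pipeline.yaml` filename are all hypothetical.

```python
import os


def find_pipeline_files(repo_root, search_dirs, filename="pipeline.yaml"):
    """Walk only the configured directories instead of the whole repo.

    `search_dirs` and `filename` are hypothetical settings for illustration.
    """
    found = []
    for rel in search_dirs:
        base = os.path.join(repo_root, rel)
        # os.walk silently yields nothing if `base` does not exist.
        for dirpath, _dirnames, filenames in os.walk(base):
            if filename in filenames:
                full = os.path.join(dirpath, filename)
                found.append(os.path.relpath(full, repo_root))
    return sorted(found)
```

On a shared filesystem with thousands of files, the cost then scales with the size of the listed directories rather than with the size of the repository.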
I think it would make sense to allow multiple pipeline files, so that you could have one in your subdirs and do stuff like that. Having only one pipeline file seems too strict and limiting.
There is also an idea of not having
Yes, that will make build-cache idea possible (described above). So .dvc files will only deal with caching, and
If corresponding data-file has the same hash as described in .dvc file, but which differs from the one in
In the build-cache idea described above, .dvc/build-cache is based on the hash of cmd + dep hashes, so there won't be conflicts. But also since we are considering pushing those to dvc remote, there won't be merges for those at all, since they won't be tracked by git.
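As a rough illustration of a key "based on the hash of cmd + dep hashes", consider the sketch below. The serialization (sorted JSON) and the choice of SHA-256 are assumptions made for the example, not something this thread specifies, and `build_cache_key` is a hypothetical name.

```python
import hashlib
import json


def build_cache_key(cmd, dep_hashes):
    """Derive a deterministic key from a command and its dependency hashes.

    `dep_hashes` maps dependency path -> content hash. The JSON layout and
    SHA-256 are assumptions for this sketch.
    """
    payload = json.dumps(
        {"cmd": cmd, "deps": sorted(dep_hashes.items())},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Because the key depends only on the command and the dependency hashes, two branches that run the same command on the same inputs produce the same entry name, which is why such entries would not conflict with each other.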
Yes, theoretically we could combine both approaches under one
Have you considered adding those giant directories to |
Can you elaborate more on the build-cache idea? I think I am missing something. |
Are we sure? I suspect that changes in the commit hash are what cause merge conflicts anyway, and it would still be the one thing left in single-line DVC-files. I recommend trying this with some mock files in Git first.
I don't personally ever change DVC-files manually and it's not really something we advocate for, in docs at least. I wonder how many users really need to do this. Maybe I'm totally off, but I think that human-readable DVC-files are great and help people understand what's happening (they help with examples in docs), but that doesn't necessarily mean people should edit them other than in edge cases. That said:
It's not an unattractive idea (and I do think
Seems like a 2.0 for sure, but I'm not seeing why we couldn't keep back compat with multi-line DVC-files as mentioned in my previous comment (except for
I'm also not getting how the build cache would work exactly, what it contains, etc. Maybe a more visual example could help describe it? Thanks! |
It is a really great design proposal from @efiop. Very solid ideas. I especially like the idea of rolling out the pipeline hashes under build-cache. Thanks to @Suor. (Except that I don't like the name, It is great that @efiop is deeply immersed in the problem. (To make it even more challenging 😄) I encourage you to think about related DVC areas and how they can affect the proposed design:
|
In the previous approach, with the lockfile and the multistage Dvcfile, the hashes were duplicated between the output stage file and the lockfile, but the two were different concepts. So the user could still think of (lockfile + dvcfile) as a single concept (they need not care about the lockfile at all) and could just But with the newly suggested approach, it's quite the opposite. The But this does share the same concepts and the same structure for a given stage across different files. This might make it complicated for the user. Eg: |
|
This patch introduces `.dvc/cache/stages` that is used to store previous runs and their results, which could then be reused later when we stumble upon the same command with the same deps and outs. Format of build cache entries is single-line json, which is readable by humans and might also be used for lock files discussed in iterative#1871. Related to iterative#1871 Local part of iterative#1234
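A minimal sketch of what such a cache of previous runs could look like is given below. It is not the code from the patch: the file layout, the key derivation (SHA-256 over the command, dep hashes and output paths) and the function names are assumptions for illustration. Only the "single-line json" entry format is taken from the description above.

```python
import hashlib
import json
from pathlib import Path


def _run_key(cmd, dep_hashes, out_paths):
    # Assumed key: command + dependency hashes + output paths.
    raw = json.dumps(
        {"cmd": cmd, "deps": dep_hashes, "outs": sorted(out_paths)},
        sort_keys=True,
    )
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def save_run(cache_dir, cmd, dep_hashes, out_hashes):
    """Store one finished run as a single-line JSON entry."""
    entry = {"cmd": cmd, "deps": dep_hashes, "outs": out_hashes}
    path = Path(cache_dir) / _run_key(cmd, dep_hashes, list(out_hashes))
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(entry, sort_keys=True) + "\n", encoding="utf-8")
    return path


def lookup_run(cache_dir, cmd, dep_hashes, out_paths):
    """Return the recorded output hashes for a matching run, or None."""
    path = Path(cache_dir) / _run_key(cmd, dep_hashes, out_paths)
    if not path.exists():
        return None
    return json.loads(path.read_text(encoding="utf-8"))["outs"]
```

With a lookup like this, a command that was already run with the same deps and outs can be skipped and its recorded outputs checked out from the cache instead.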
Hi, just sharing one of my answers given privately...
If the base name is the same in both, why not just How will |
That was meant to be a secret presentation. 😅.
Yes, for the naming and if it's a default file, yes we can go with it. But, for the data-related commands, it'd be better to be explicit with file naming because that's going to be confusing. |
Np. Original comment deleted, text moved to #1871 (comment) above. And thanks for the answers (both secretly and above)
I'm not sure I get why that would be confusing. Is it because you need to open the pipeline file to know/remember the stage names? |
@jorgeorpinel, the only concern I have is, because some of the data commands allow granular operations on files (eg: Take an example, a user has a file ^ is me talking ideally. If we do not have better ideas, we can easily fall back, there's no problem with that. Another approach can be to do nothing and keep |
@iterative/engineering, this should be ready to try out. I'd love to get feedback.
$ pip install --user https://github.com/iterative/dvc/archive/master.zip
Remember, the pipeline file is only generated if you specify a hidden
A working example to try out:
dvc run --name "generate-foo" --outs foo \
    "echo 'foo' > foo"
And the stage can be addressed via |
So are you changing the implicit arguments accepted by these commands? I think either the stage needs a stage e.g.
OK, this also works but flags are more explicit (which is good). Both could be supported I guess.
Agree |
Closing in favor of #3693 . Multistage dvcfiles are now default for |
I understand the merits of having multiple .dvc files for complex processes, but it would be just great to have the option to store the whole DAG in one Dvcfile!
I feel it might help the overall readability of the structure.