Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

repro: force committing .dvc files can cause UX problems #2115

Closed
1 task
jorgeorpinel opened this issue Jan 19, 2021 · 6 comments
Closed
1 task

repro: force committing .dvc files can cause UX problems #2115

jorgeorpinel opened this issue Jan 19, 2021 · 6 comments
Labels
A: docs Area: user documentation (gatsby-theme-iterative) 🐛 type: bug Something isn't working. type: enhancement Something is not clear, small updates, improvement suggestions

Comments

@jorgeorpinel
Copy link
Contributor

Bug Report

This is probably by design but may also be an unintended side effect.

Description

Deleting contents of a tracked dir. which prevent a stage from being executed with repro results in a confusing project state where repro becomes very difficult.

Reproduce

$ mkdir ds
$ echo a > ds/a
$ echo b > ds/b
$ dvc add ds
...
$ cat ds.dvc
outs:
- md5: 56...
$ ls .dvc/cache
3b  56  60     # So far so good. Should prob. git add/commit here.

$ dvc run -n ab -d ds/a -d ds/b -o ab "cat ds/a ds/b > ab"
# Now stage "ab" depends on ds/a and ds/b independently.
$ rm ds/b  # Oops...
$ rm ab  # Oops, let's generate it again:
$ dvc repro
Verifying data sources in stage: 'ds.dvc'

ERROR: failed to reproduce 'dvc.yaml': dependency 'ds/b' does not exist

# OK, this seems to make sense. So let's checkout ds/b to fix this, right?

$ dvc checkout ds/b
$ ls ds
a   # Huh?
$ dvc checkout
$ ls ds
a   # What's going on?

The repro attempt did something to ds/, but the message Verifying data sources in stage: 'ds.dvc' doesn't really give much of an indication of that. The only relatively visible hint to what's happening comes from git status (IF you committed the first ds.dvc version):

$ git status
        modified:   .gitignore
        modified:   ds.dvc   # Why is this modified?
$ git diff ds.dvc
-- md5: 56...
+- md5: 45...
$ ls .dvc/cache
3b  45  56  60  dd  runs
$ cat .dvc/cache/45/b1eb2ff8072c35f82255c3c22c1ae9.dir
[{"md5": "60b725f10c9c85c70d97880dfe8191b3", "relpath": "a"}]%

As you can see it's pretty hard to figure out what's happening without a deep understanding of DVC. And the solution is also not obvious. git checkout -- ds.dvc does the trick but it's only possible if the previous version was committed to Git. Otherwise you'd have to manually restore ds.dvc based on the cache file names (not reasonable), or re-build ds/ manually (which may or may not be doable) and re-add/commit it.

Expected Ideas

  • Maybe repro should not force-commit things by default?

  • At least certainly not silently, for now (improve messaging)

  • BTW this behavior is not documented in https://dvc.org/doc/command-reference/repro! We can transfer this one if a decision is made to keep the current behavior, but otherwise please create a corresponding docs issue.

Environment information

$ dvc version
DVC version: 1.11.10 (pip)
---------------------------------
Platform: Python 3.6.9 on Linux-4.19.128-microsoft-standard-x86_64-with-Ubuntu-18.04-bionic
Supports: gdrive, gs, hdfs, http, https, s3, ssh, oss
Cache types: hardlink, symlink
Caches: local
Remotes: None
Repo: dvc, git
@efiop
Copy link
Contributor

efiop commented Jan 24, 2021

@jorgeorpinel You've modified the data source and repro-ed the whole pipeline. That's correct behavior and is the same as you replacing the input data with another data, we used to have this in get-started awhile back, but looks like not anymore.

Moving to docs for now.

CC @dberenbaum maybe you have any thoughts on this. We could open a discussion https://github.com/iterative/dvc/discussions for this, if we plan on revisiting this behavior. Btw, looks like dvc experiments might make this a bit more intuitive.

@efiop efiop transferred this issue from iterative/dvc Jan 24, 2021
@dberenbaum
Copy link
Contributor

This confused for me for a bit, which might be a sign that it's not an ideal user experience. However, after I worked through the example on my own, I don't see any unexpected behavior. The only change that dvc repro seems to make is updating ds/ds.dvc to reflect the deletion of ds/a, which is what I would expect.

As far as messaging and documentation, there are a couple of things that might be helpful:

  1. As @jorgeorpinel mentioned, messaging during dvc repro could be more explicit and include something like ds changed to reflect the dvc updates.
  2. Documentation should encourage users to git commit after dvc commands (dvc add/run/repro), if it doesn't already.

@jorgeorpinel
Copy link
Contributor Author

Hi sorry for late reply here.

modified the data source and repro-ed the whole pipeline. That's correct behavior... same as you replacing the input data with another data

Hmm. @efiop does repro normally re-add every .dvc file? Ok maybe, that sounds familiar (I should probably know this by now heh). But the repro targets arg. states 'dvc.yaml' by default. (no mention of .dvc files or re-adding data) in help output an docs... So is that wrong then?

dvc repro seems to make is updating ds/ds.dvc to reflect the deletion of ds/a, which is what I would expect

@dberenbaum why do you expect that? (keeping in mind it doesn't seem to be indicated anywhere that repro cares about .dvc files, and also we don't consider .dvc files "stages" anymore so are they part of the pipeline?).

Agree on improving core messages and docs for now. So I guess this could go back in the core repo to address that part first?

@dberenbaum
Copy link
Contributor

Ah, I think I misunderstood, or maybe this is a new issue you are raising. I had assumed that dvc repro with no target would reproduce everything in the project, whether in the dvc.yaml or in .dvc files, but that's inconsistent with the documentation, so I agree that we should confirm expected behavior on that and update the help and docs as needed.

@jorgeorpinel
Copy link
Contributor Author

assumed that dvc repro with no target would reproduce everything in the project

Yes. That sounds reasonable, in fact I think that's what we aim for. Maybe we just need to fix the help output (and docs).

@jorgeorpinel jorgeorpinel added 🐛 type: bug Something isn't working. A: docs Area: user documentation (gatsby-theme-iterative) type: enhancement Something is not clear, small updates, improvement suggestions labels Feb 14, 2021
@jorgeorpinel jorgeorpinel changed the title repro: force committing .dvc files can cause UX problems repro: force committing .dvc files can cause UX problems May 10, 2021
@jorgeorpinel
Copy link
Contributor Author

Reviving this 😬

Documentation should encourage users to git commit after dvc commands (dvc add/run/repro)

Actually @dberenbaum I don't know if that addresses this problem because without fixing the messaging it's pretty hard to even realize that you have to recover some old .dvc file version. So I think we should either transfer this back to the core repo or close it, and since users haven't run into this issue that I know of (wdyt @efiop ?) I'm closing it for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A: docs Area: user documentation (gatsby-theme-iterative) 🐛 type: bug Something isn't working. type: enhancement Something is not clear, small updates, improvement suggestions
Projects
None yet
Development

No branches or pull requests

3 participants