Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Discuss whether we should have consistent output format for both dvc status and dvc status -c. #4482

Closed
karajan1001 opened this issue Aug 27, 2020 · 4 comments · Fixed by #4490
Labels
discussion requires active participation to reach a conclusion

Comments

@karajan1001
Copy link
Contributor

karajan1001 commented Aug 27, 2020

So in case of our current test, if we commit modified alice we will get Data and pipelines up to date on dvc status
and we will get new: alice on dvc status -c. As this issue is about filtering the status I would keep output of dvc status -c as is and create issue discussing whether we should have consistent output format for both dvc status and dvc status -c.

Originally posted by @pared in #4433 (comment)

@skshetry we can discuss the output and format and the Stage::status() with filter_info here

@pared pared added the discussion requires active participation to reach a conclusion label Aug 27, 2020
@pared pared changed the title [Discuss] Whether we should have consistent output format for both dvc status and dvc status -c. Discuss whether we should have consistent output format for both dvc status and dvc status -c. Aug 27, 2020
@pared
Copy link
Contributor

pared commented Aug 27, 2020

So in case of this issue there are actually 2 things to be done:

  1. unify the output of _joint_status so that it returns same structure, no matter whether we provide filter_info or not
  2. discuss whether we should unify output between dvc status and dvc status -c

I think 1. has been discussed under #4433 and we should always return the stage.status with or without flter_info. In case of unchanged files we probably don't have to mention that. Maybe in case of exact output match we could print "unchanged" but in other cases I think its better to report only "edited" statuses.

As to 2:
Lets consider following test:

def test_status(tmp_dir, scm, dvc, local_remote):
    tmp_dir.dvc_gen({"dir": {"file1": "file content", "file2": "file2 content"}}, commit="init")
    dvc.push()

    tmp_dir.gen({"dir": {"file1": "modified file content"}})
    s_local = dvc.status()
    dvc.commit("dir.dvc", force=True)
    s_cloud = dvc.status(cloud=True)

    assert s_local == {"dir.dvc": [{'changed outs': {'dir': 'modified'}}]}
    assert s_cloud == {'dir': 'new', 'dir/file1': 'new'}

Considering that both status and status --cloud is the same command, I think it would make sense that both would return unified status. On the other hand, docs specifically mention that in the case of --remote and --cloud we compare cache's content. And status from the cache point of view is something different than status from workspace point of view. In the case of the repo, we identify outputs by its path, so it's easy to detect whether something changed (if the file under the given path has different md5, it changed). But cache has no notion of outputs, it only has paths and files under it. So, in order to unify status output we would need to implement some logic making connection outputs -> remote so that we can detect that dir/file1 is modified. That would probably mean state for remotes, which would lead to performance degradation. I don't think it is worth it.
@karajan1001 What do you think?

@karajan1001
Copy link
Contributor Author

@pared @skshetry
To question 1, first we should move the filter_info logic into Stage()::status. Currently, all the logics associated with it are all in the base level (stage, output, cache). And at the API level like checkout, we should only pass it as arguments to the lower level without actually handle it.
Screenshot from 2020-08-27 22-20-08

And the question is what info should we return for stage.status with a not none filter_info. And this depends on what we should show in DVC status [outputs], or should we follow the output format of DVC status -c [outputs]. Now I feel the question is that status -c is a totally different command from status. One compares the workspace to the HEAD snapshot with file path info, the other compares the cache with remote cache, no file path info needed. And yes they have a totally different logic in

dvc/dvc/repo/status.py

Lines 31 to 38 in 334556f

def _local_status(self, targets=None, with_deps=False, recursive=False):
targets = targets or [None]
pairs = cat(
self.collect_granular(t, with_deps=with_deps, recursive=recursive)
for t in targets
)
return _joint_status(pairs)

dvc/dvc/repo/status.py

Lines 81 to 109 in 334556f

used = self.used_cache(
targets,
all_branches=all_branches,
all_tags=all_tags,
all_commits=all_commits,
with_deps=with_deps,
force=True,
remote=remote,
jobs=jobs,
recursive=recursive,
)
ret = {}
status_info = self.cloud.status(
used, jobs, remote=remote, log_missing=False
)
for info in status_info.values():
name = info["name"]
status_ = info["status"]
if status_ == cloud.STATUS_OK:
continue
prefix_map = {
cloud.STATUS_DELETED: "deleted",
cloud.STATUS_NEW: "new",
cloud.STATUS_MISSING: "missing",
}
ret[name] = prefix_map[status_]

,but shared with the same interface status. Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]

➜  tests git:(master) ✗ dvc status -c s1
        new:                alice                                                                                                                                               
	new:                bob
➜  tests git:(master) ✗ dvc status -c alice
        new:                alice    

@pared
Copy link
Contributor

pared commented Aug 28, 2020

Now DVC status -c [outputs] shares the same output format with DVC status [stages], and DVC status [outputs] should share the same output format with DVC status [stages] not DVC status -c [outputs]

@karajan1001
I think this is a great summary of the whole issue.

@pared
Copy link
Contributor

pared commented Aug 31, 2020

Thank you @karajan1001 for handling this!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
discussion requires active participation to reach a conclusion
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants