Unified show/diff command for all output types #4

dberenbaum · 2021-03-29T23:48:53Z

No description provided.

pmrowla · 2021-03-30T00:52:33Z

Why not just top-level commands like

$ dvc diff [--data] [--params] [--metrics] [--binary]

where each flag adds the output for a respective diff driver, and --data would be the current dvc diff filesystem/data diff. --binary would use the appropriate binary diff driver set via a gitattributes style path regex + diff program configuration. Image diffing would fall into this --binary category.

pmrowla · 2021-03-30T00:52:37Z

On image diffing, I mentioned this in the other thread, but I don't think we should get into providing/supporting binary diff drivers in DVC. It would be better for us to provide a way for users to define/write their diff drivers to do whatever they need. Providing a basic example for image diffing would be fine here (but I think it should be separate from the core DVC repo).

Basically, this is similar to the remotes issue - we are moving towards supporting user-defined remotes which comply with fsspec , rather than implementing/maintaining/supporting too many remote types ourselves in core DVC. We should take the same approach with binary diffing - provide the API/configuration to work with any user-provided diff driver rather than writing/maintaining them ourselves in core DVC.

pmrowla · 2021-03-30T00:54:00Z

proposals/unified-show-diff.md

+optional arguments:
+    --type  Restrict report to single output type (metrics, plots, images).
+```
+


What would the expected output be for

show --type images

Are we just listing them on the command line, or are we actually opening them in a web browser like with plots?

I also mentioned this in the other thread, but maybe we should consider defining an optional metadata/schema field for mimetype. If an output has a mimetype defined, we would then show it in a browser as needed.

The user would need to explicitly set mimetype themselves (on outputs they know will be images, or directories containing images for example). DVC would not attempt to do any auto detection (whether filename/extension based or by actually reading headers).

I also mentioned this in the other thread, but maybe we should consider defining an optional metadata/schema field for mimetype. If an output has a mimetype defined, we would then show it in a browser as needed.

I like this idea except that we sort of already have a different convention with plots (and metrics although browser support isn't needed). Do you see those as special cases? Should dvc.yaml syntax support mimetype: plots instead of the current plots: section?

I think plots and metrics would stay a special case, partially due to backwards-compatibility/existing behavior, and partially because they don't really fit into the mimetype idea anyways.

Side note - if we go this route, we should avoid creating our own mimetypes - mimetype is a formal IANA standard. There would be some leeway to use something like application/vnd.dvc.plots but it's probably not worth it for us?

Metrics are a special case for DVC since it's not just "yaml/json" files, it's "yaml/json" files containing only numeric values structured in a specific DVC-metrics compatible way.

plots (outputs) are a special case in the same way as metrics. And even though the actual generated plot is just a file with mimetype: text/html, we are really talking about the DVC plots (yaml/json) output and not the dvc plots ... generated HTML file. The plots field also contains configuration information about things like the default template to use for a specific (ploattable) metric.

I think it might be better to just leave binary/image support out of this proposal. There's already a separate discussion in iterative/dvc#5681, and it's a bit separate from the main point of the proposal.

pared · 2021-03-30T09:55:26Z

Related: iterative/dvc#5693

pared · 2021-03-30T09:56:31Z

proposals/unified-show-diff.md

+* Support file/directory `targets` as positional arguments.
+* By default, show all outputs in the workspace.
+
+`diff` implementations will:


Probably worth adding support for "revision:target" style parsing as per iterative/dvc#5693

Also, we might want to consider implementing options similar to git's --no-index.

Git does not have "driver" concept. All files are text ones so making diff --no-index is easy. In our case, we will probably need some sensible default behavior to detect "type" and what driver we should use (basing, on, for example, outputs type), but in some cases (eg. diff for metric file in no-dvc repo) providing the type will be a necessity while now it works out of the box thanks to the fact that the commands are split.

I'm taking out the parts of the proposal related to iterative/dvc#5693 for now since they are a bit unrelated to the concept of unifying the commands.

dberenbaum · 2021-03-30T20:31:39Z

$ dvc diff [--data] [--params] [--metrics] [--binary]

I like the idea. I have a few questions:

What if I want to diff everything together? Part of the goal is collecting all these diffs together so users don't have to run multiple commands to get metrics, plots, etc.
Any reason you prefer individual flags for each type instead of a --type flag?
Binary type might not be quite right since not all binary files necessarily need the same driver.

EDIT: Separating responses into multiple comments.

dberenbaum · 2021-03-30T20:42:15Z

On image diffing, I mentioned this in the other thread, but I don't think we should get into providing/supporting binary diff drivers in DVC. It would be better for us to provide a way for users to define/write their diff drivers to do whatever they need. Providing a basic example for image diffing would be fine here (but I think it should be separate from the core DVC repo).

Basically, this is similar to the remotes issue - we are moving towards supporting user-defined remotes which comply with fsspec , rather than implementing/maintaining/supporting too many remote types ourselves in core DVC. We should take the same approach with binary diffing - provide the API/configuration to work with any user-provided diff driver rather than writing/maintaining them ourselves in core DVC.

Yeah, I agree. I need to clarify what I mean by a diff driver for dvc. I thought of it as an API or a parent class with a method for returning certain fields like old data, new data, etc. I did not think of it as a UI output or HTML. That would instead be handled by the difftool. The current dvc outputs from the diff commands today could be more or less the built-in difftool(s) (maybe they could be consolidated into one html), and users could provide any other difftools they want.

For example, the driver for plots would return all the underlying data needed to produce the Vega/HTML, but the Vega/HTML would be in the difftool. For images, the driver might collect the images, verify that they are recognizable as images, and convert them to some common format. The difftool would handle how these get visualized (maybe dvc provides some very basic default here so that there is some way to visualize them without external tools).

By the way, I will add a summary of this conversation into the doc once we get to some agreement.

pmrowla · 2021-03-31T01:33:31Z

What if I want to diff everything together? Part of the goal is collecting all these diffs together so users don't have to run multiple commands to get metrics, plots, etc.

Any reason you prefer individual flags for each type instead of a --type flag?

To get more than one type of diff, you just add multiple flags and we would just concatenate the respective outputs (similar to what exp diff does now for metrics + plots)

So if I wanted all 3 of data/metrics/params could just run something along the lines of dvc diff -dmp. Being able to combine short flags is why I went with this over --type <...>

pmrowla · 2021-03-31T01:33:37Z

Binary type might not be quite right since not all binary files necessarily need the same driver.

Right, but the diff driver that gets run on binary would be controlled in the git driver + gitattributes way

So in .dvc/config (more likely in my global or system DVC config) I would have something like

[diff "raster-img"]
    command = /usr/bin/my_image_diff

[diff "vector-img"]
    command = /usr/bin/my_vector_image_diff

[diff "archive"]
    command = /usr/bin/my_archive_diff

and in my .dvcattributes (or whatever we decide to call it) I would have

*.jpg diff=raster-img
*.png diff=raster-img
*.svg diff=vector-img
*.zip diff=archive
*.tar diff=archive

When I do dvc diff --binary, when DVC encounters any files matching *.jpg,*.png it will automatically call the vector image diff tool, for files matching *.svg it would call the vector image diff tool, and so on. For files which do not have a binary diff tool set, we will do nothing in --binary mode.

dberenbaum · 2021-03-31T18:02:22Z

When I do dvc diff --binary, when DVC encounters any files matching *.jpg,*.png it will automatically call the vector image diff tool, for files matching *.svg it would call the vector image diff tool, and so on.

Which *.jpg,*.png,*.svg files would be included in dvc diff --binary? Does it include all of those files in the git repo, or only those that are tracked by DVC?
Let's say I have lots of image files but only want to see a diff for some subset (raw data might be images but so might plots and related files explaining my model). I wonder if it makes more sense to keep the .dvcattributes data in dvc.yaml (or allow it to be overriden there) like - mypic.jpg:\n diff: raster-imgto avoid inconsistencies from spreading the info across multiple dvc metafiles?
What if I only want to see diffs for images and not archives?
Should dvc provide basic drivers for images (or other binary types), and should it be the default for image mimetypes? Do we expect every user to have to define/implement this themselves?

pmrowla · 2021-04-01T01:42:31Z

Which .jpg,.png,*.svg files would be included in dvc diff --binary? Does it include all of those files in the git repo, or only those that are tracked by DVC?

I think it makes sense for us to be able to diff any file in the entire repo and not only DVC tracked files. We should just accept the dvc diff ... -- [path1] [path2]... syntax to allow the user to diff specific files (if that's what they want). It would also make sense to have flags like --dvc-only for only dvc tracked files and potentially even --targets [stage_or_dvcfile] to restrict the diff to outputs of a specific dvc stage.

Let's say I have lots of image files but only want to see a diff for some subset (raw data might be images but so might plots and related files explaining my model).

If we are keeping gitattributes syntax, patterns are matched against paths, not just filetype extensions. So if you only wanted images inside a foo/images to be diffed, you can do

/foo/images/*.jpg diff=...

I wonder if it makes more sense to keep the .dvcattributes data in dvc.yaml (or allow it to be overriden there) like - mypic.jpg:\n diff: raster-imgto avoid inconsistencies from spreading the info across multiple dvc metafiles?

I'm not sure if keeping this data in dvc.yaml is right, I'm not even sure if this hypothetical .dvcattributes would be git-tracked at all. This seems more like information that belongs in local/system/global configs, since you have to have both the appropriate diff tools and the appropriate attributes configuration on your machine.

In cases where the user has the diff tools included in their repo itself (maybe in something like /scripts/my_diff.py), then they could git track everything, but in typical cases where this is dependent on a user's specific machine, I don't think this configuration would get git-tracked.

The way I see it, configuring my preferred diff tool for is comparable to "I want my default git text editor to be vim" - Not everyone on my team will want that same configuration.

What if I only want to see diffs for images and not archives?

This is covered by allowing specific paths/outputs/targets to be specified (i.e. dvc diff -- **/*.jpg)

Should dvc provide basic drivers for images (or other binary types), and should it be the default for image mimetypes? Do we expect every user to have to define/implement this themselves?

I think providing basic example drivers is fine (and at least one good example would probably be required so that users actually understand how diff drivers work, images seem like a good place to start here). I don't think any of our (example) drivers should be enabled by default. We could provide a basic template .dvcattributes file with "uncomment the following line to use default image driver" example content. But I'd want to be clear about a few caveats:

This should not be part of the core DVC repo (we should have a separate iterative/example-diff-driver(s) repo)
It should be clearly documented that our drivers are examples, the user can assume they will work out of the box for typical scenarios but will not be actively maintained/developed by the core DVC team
We should not accept requests to add more "default" drivers ourselves

This follows along with our new/current approach to remote support - the core DVC team should focus on developing core DVC. Users with specific use-case diffing needs should write their diff own tools.

jorgeorpinel · 2021-04-07T00:57:28Z

proposals/unified-show-diff.md

+# Summary
+
+Establish a unified command to combine and `show` or `diff` all outputs of
+specified types, including:


I agree the different show subcommands are similar enough to get merged, and even more so for diff subcmds. but are shows similar enough to diffs, or do they have intrinsically different uses? In the latter case we would need to keep 2 top commands. BTW exp show is very different (experiments are not outputs).

jorgeorpinel · 2021-04-07T00:57:32Z

proposals/unified-show-diff.md

+* Plots
+* Images (depends on https://github.com/iterative/dvc/discussions/5681)
+
+These would all be put under a single `dvc outputs show/diff` command.


I like dvc out(put)s. What I suggested originally was dvc compare (since diff is taken) but that only consolidated metrics/plots/experiments diffcommands. If we decide to keep 2 top commands maybe we have have bothouts(mergedshows) and compare(mergeddiff`s).

jorgeorpinel · 2021-04-07T00:57:40Z

proposals/unified-show-diff.md

+A common abstraction for all output types minimizes the flexibility of each
+output type's `show` and `diff` commands.


So the key question is are we forcing the generalization too much? Originally the idea was to merge commands that are already basically the same thing but somehow ended up duplicated. Maybe no need to force it, I think (e.g. throw in show, throw in mime/types, etc.) at least as a first iteration. Then again ERs are useful for big changes like that so up to you!

proposals/unified-show-diff.md

+- Enhancement Proposal PR: (leave this empty)
+- Contributors: dberenbaum


dberenbaum · 2021-04-08T14:50:07Z

I'm migrating this to https://www.notion.so/iterative/Unified-show-diff-command-for-all-output-types-df3c941296e84c6ba6c2fab1c3a55f7e. I'm also going to rewrite a lot of it to try to capture and address the discussions already in this PR. If I don't capture your feedback from here, please add a comment or add it to the list of unresolved questions at the bottom of the proposal.

EDIT: Also, one of the improvements in Notion should be that it's easy to make changes directly on the document, so it hopefully feels less like one person owns the document 😄 .

dberenbaum added 3 commits March 26, 2021 16:40

first draft of unified show/diff

674cd04

minor edits to unified show/diff

3114bea

git comps for unified diff

61dc88d

dberenbaum requested review from pmrowla, pared and skshetry March 29, 2021 23:48

pmrowla reviewed Mar 30, 2021

View reviewed changes

pared reviewed Mar 30, 2021

View reviewed changes

pared mentioned this pull request Mar 30, 2021

plots/metrics/params(?): stop throwing exceptions iterative/dvc#5525

Closed

jorgeorpinel reviewed Apr 7, 2021

View reviewed changes

proposals/unified-show-diff.md

Comment on lines +3 to +4

- Enhancement Proposal PR: (leave this empty)

- Contributors: dberenbaum

This comment was marked as resolved.

Sign in to view

skshetry removed their request for review December 22, 2022 06:07

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Unified show/diff command for all output types #4

Unified show/diff command for all output types #4

dberenbaum commented Mar 29, 2021

pmrowla commented Mar 30, 2021

pmrowla commented Mar 30, 2021 •

edited

Loading

pmrowla Mar 30, 2021 •

edited

Loading

pmrowla Mar 30, 2021

dberenbaum Mar 30, 2021

pmrowla Mar 31, 2021 •

edited

Loading

dberenbaum Apr 8, 2021

pared commented Mar 30, 2021

pared Mar 30, 2021

dberenbaum Apr 8, 2021

dberenbaum commented Mar 30, 2021 •

edited

Loading

dberenbaum commented Mar 30, 2021

pmrowla commented Mar 31, 2021

pmrowla commented Mar 31, 2021 •

edited

Loading

dberenbaum commented Mar 31, 2021

pmrowla commented Apr 1, 2021 •

edited

Loading

jorgeorpinel Apr 7, 2021

jorgeorpinel Apr 7, 2021

jorgeorpinel Apr 7, 2021

This comment was marked as resolved.

dberenbaum commented Apr 8, 2021 •

edited

Loading

		A common abstraction for all output types minimizes the flexibility of each
		output type's `show` and `diff` commands.

		- Enhancement Proposal PR: (leave this empty)
		- Contributors: dberenbaum

Unified show/diff command for all output types #4

Are you sure you want to change the base?

Unified show/diff command for all output types #4

Conversation

dberenbaum commented Mar 29, 2021

pmrowla commented Mar 30, 2021

pmrowla commented Mar 30, 2021 • edited Loading

pmrowla Mar 30, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pmrowla Mar 31, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pared commented Mar 30, 2021

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dberenbaum commented Mar 30, 2021 • edited Loading

dberenbaum commented Mar 30, 2021

pmrowla commented Mar 31, 2021

pmrowla commented Mar 31, 2021 • edited Loading

dberenbaum commented Mar 31, 2021

pmrowla commented Apr 1, 2021 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

This comment was marked as resolved.

dberenbaum commented Apr 8, 2021 • edited Loading

pmrowla commented Mar 30, 2021 •

edited

Loading

pmrowla Mar 30, 2021 •

edited

Loading

pmrowla Mar 31, 2021 •

edited

Loading

dberenbaum commented Mar 30, 2021 •

edited

Loading

pmrowla commented Mar 31, 2021 •

edited

Loading

pmrowla commented Apr 1, 2021 •

edited

Loading

dberenbaum commented Apr 8, 2021 •

edited

Loading