Unified show/diff command for all output types #4
Why not just top-level commands like
where each flag adds the output for a respective diff driver, and |
On image diffing, I mentioned this in the other thread, but I don't think we should get into providing/supporting binary diff drivers in DVC. It would be better for us to provide a way for users to define/write their diff drivers to do whatever they need. Providing a basic example for image diffing would be fine here (but I think it should be separate from the core DVC repo).

Basically, this is similar to the remotes issue - we are moving towards supporting user-defined remotes which comply with fsspec, rather than implementing/maintaining/supporting too many remote types ourselves in core DVC. We should take the same approach with binary diffing - provide the API/configuration to work with any user-provided diff driver rather than writing/maintaining them ourselves in core DVC.
```
optional arguments:
  --type    Restrict report to single output type (metrics, plots, images).
```
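As a rough illustration of the `--type` option in the excerpt above, here is a minimal `argparse` sketch. The program name, the `targets` positional, and the `output_type` destination are assumptions made for illustration; only the three type names come from the help text, and none of this is DVC's actual CLI code.

```python
import argparse

# Output types listed in the proposed --type help text.
OUTPUT_TYPES = ("metrics", "plots", "images")


def build_parser() -> argparse.ArgumentParser:
    """Build a hypothetical parser for a unified `show` command."""
    parser = argparse.ArgumentParser(prog="dvc outputs show")
    # Assumption: zero or more file/directory targets; none means "all outputs".
    parser.add_argument("targets", nargs="*", help="Files or directories to report on.")
    parser.add_argument(
        "--type",
        dest="output_type",
        choices=OUTPUT_TYPES,
        default=None,
        help="Restrict report to single output type (metrics, plots, images).",
    )
    return parser


args = build_parser().parse_args(["--type", "metrics", "metrics.json"])
print(args.output_type, args.targets)
```

Leaving `--type` unset keeps `output_type` as `None`, which a caller could treat as "report every type".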
What would the expected output be for `show --type images`? Are we just listing them on the command line, or are we actually opening them in a web browser like with plots?
I also mentioned this in the other thread, but maybe we should consider defining an optional metadata/schema field for `mimetype`. If an output has a mimetype defined, we would then show it in a browser as needed.
The user would need to explicitly set `mimetype` themselves (on outputs they know will be images, or directories containing images for example). DVC would not attempt to do any auto detection (whether filename/extension based or by actually reading headers).
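A minimal sketch of the explicit-mimetype idea, assuming an output is represented as a plain dict with an optional `mimetype` key. The dict layout, the function name, and the list of browser-viewable prefixes are all hypothetical; nothing here reflects an existing DVC schema.

```python
from typing import Optional

# Hypothetical: mimetype prefixes we would be willing to open in a browser.
BROWSER_PREFIXES = ("image/", "text/html")


def browser_viewable(output: dict) -> bool:
    """Return True only when the user explicitly set a viewable mimetype.

    No auto-detection is attempted: the file name and file contents are
    never inspected, so an output without `mimetype` is never opened.
    """
    mimetype: Optional[str] = output.get("mimetype")
    if mimetype is None:
        return False
    return mimetype.startswith(BROWSER_PREFIXES)


print(browser_viewable({"path": "cat.png"}))
print(browser_viewable({"path": "cat.png", "mimetype": "image/png"}))
```

The first call returns `False` even though the path ends in `.png`, which is the point of requiring an explicit field.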
> I also mentioned this in the other thread, but maybe we should consider defining an optional metadata/schema field for `mimetype`. If an output has a mimetype defined, we would then show it in a browser as needed.

I like this idea except that we sort of already have a different convention with `plots` (and `metrics`, although browser support isn't needed). Do you see those as special cases? Should `dvc.yaml` syntax support `mimetype: plots` instead of the current `plots:` section?
I think plots and metrics would stay a special case, partially due to backwards-compatibility/existing behavior, and partially because they don't really fit into the mimetype idea anyways.
Side note - if we go this route, we should avoid creating our own mimetypes - mimetype is a formal IANA standard. There would be some leeway to use something like `application/vnd.dvc.plots` but it's probably not worth it for us?
Metrics are a special case for DVC since it's not just "yaml/json" files, it's "yaml/json" files containing only numeric values structured in a specific DVC-metrics compatible way.
Plots (outputs) are a special case in the same way as metrics. And even though the actual generated plot is just a file with `mimetype: text/html`, we are really talking about the DVC `plots` (yaml/json) output and not the `dvc plots ...` generated HTML file. The `plots` field also contains configuration information about things like the default template to use for a specific (plottable) metric.
I think it might be better to just leave binary/image support out of this proposal. There's already a separate discussion in iterative/dvc#5681, and it's a bit separate from the main point of the proposal.
Related: iterative/dvc#5693
* Support file/directory `targets` as positional arguments.
* By default, show all outputs in the workspace.

`diff` implementations will:
Probably worth adding support for "revision:target" style parsing as per iterative/dvc#5693.
Also, we might want to consider implementing options similar to `git`'s `--no-index`.
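For illustration only, "revision:target" parsing could look like the sketch below. The function name and the return shape are assumptions, not anything specified in iterative/dvc#5693, and the sketch deliberately ignores paths that themselves contain a colon (for example Windows drive letters).

```python
from typing import Optional, Tuple


def parse_rev_target(spec: str) -> Tuple[Optional[str], str]:
    """Split a hypothetical "revision:target" spec into (revision, path).

    A spec without a colon is treated as a plain workspace path, so the
    revision is None. An empty revision (":path") is also mapped to None.
    """
    rev, sep, path = spec.partition(":")
    if not sep:
        return None, spec
    return (rev or None), path


print(parse_rev_target("main:metrics.json"))
print(parse_rev_target("metrics.json"))
```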
Git does not have a "driver" concept. All files are text, so making `diff --no-index` work is easy. In our case, we will probably need some sensible default behavior to detect the "type" and which driver to use (based, for example, on the output's type), but in some cases (e.g. diffing a metrics file in a non-DVC repo) providing the `type` will be a necessity, whereas now it works out of the box because the commands are split.
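The default-versus-explicit behavior described here can be sketched as follows. The `tracked_types` mapping (path to output type, as it might be read from `dvc.yaml`) and the function name are hypothetical.

```python
from typing import Mapping, Optional


def resolve_type(
    path: str,
    tracked_types: Mapping[str, str],
    explicit_type: Optional[str] = None,
) -> str:
    """Pick the output type (and hence the diff driver) for a path.

    An explicitly provided type always wins. Otherwise fall back to the
    type recorded for a tracked output. Untracked files, such as a
    metrics file in a non-DVC repo, must be given a type explicitly.
    """
    if explicit_type is not None:
        return explicit_type
    try:
        return tracked_types[path]
    except KeyError:
        raise ValueError(f"cannot infer type for {path!r}; pass it explicitly") from None


tracked = {"metrics.json": "metrics", "loss.csv": "plots"}
print(resolve_type("loss.csv", tracked))
print(resolve_type("untracked.json", {}, explicit_type="metrics"))
```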
I'm taking out the parts of the proposal related to iterative/dvc#5693 for now since they are a bit unrelated to the concept of unifying the commands.
I like the idea. I have a few questions:
EDIT: Separating responses into multiple comments.
Yeah, I agree. I need to clarify what I mean by a diff driver for dvc. I thought of it as an API or a parent class with a method for returning certain fields like old data, new data, etc. I did not think of it as a UI output or HTML. That would instead be handled by the difftool. The current dvc outputs from the diff commands today could be more or less the built-in difftool(s) (maybe they could be consolidated into one html), and users could provide any other difftools they want.

For example, the driver for plots would return all the underlying data needed to produce the Vega/HTML, but the Vega/HTML would be in the difftool. For images, the driver might collect the images, verify that they are recognizable as images, and convert them to some common format. The difftool would handle how these get visualized (maybe dvc provides some very basic default here so that there is some way to visualize them without external tools).

By the way, I will add a summary of this conversation into the doc once we get to some agreement.
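To make the driver/difftool split concrete, here is one possible shape for it. Every name below (`DiffData`, `DiffDriver`, `MetricsDriver`, `text_difftool`) is hypothetical and only illustrates the separation described above: the driver returns old and new data, and a difftool decides how to present it.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any, Dict


@dataclass
class DiffData:
    """Presentation-free result returned by a driver."""
    path: str
    old: Any
    new: Any


class DiffDriver(ABC):
    """Parent class: collects and normalizes data, produces no UI."""

    @abstractmethod
    def collect(self, path: str, old: Any, new: Any) -> DiffData:
        ...


class MetricsDriver(DiffDriver):
    """Example driver: aligns two metric dicts on the union of their keys."""

    def collect(self, path: str, old: Dict[str, float], new: Dict[str, float]) -> DiffData:
        keys = sorted(set(old) | set(new))
        return DiffData(
            path,
            {k: old.get(k) for k in keys},
            {k: new.get(k) for k in keys},
        )


def text_difftool(data: DiffData) -> str:
    """Example difftool: renders metrics DiffData as plain text."""
    lines = [data.path]
    for key in data.old:
        lines.append(f"  {key}: {data.old[key]} -> {data.new[key]}")
    return "\n".join(lines)


data = MetricsDriver().collect("metrics.json", {"acc": 0.9}, {"acc": 0.92, "loss": 0.1})
print(text_difftool(data))
```

Under this split, an HTML or Vega-based difftool could consume the same `DiffData` without the driver changing.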
To get more than one type of diff, you just add multiple flags and we would just concatenate the respective outputs (similar to what So if I wanted all 3 of data/metrics/params could just run something along the lines of |
Right, but the diff driver that gets run on So in
and in my
When I do |
I think it makes sense for us to be able to diff any file in the entire repo and not only DVC tracked files. We should just accept the
If we are keeping gitattributes syntax, patterns are matched against paths, not just filetype extensions. So if you only wanted images inside a
I'm not sure if keeping this data in In cases where the user has the diff tools included in their repo itself (maybe in something like The way I see it, configuring my preferred diff tool for is comparable to "I want my default git text editor to be vim" - Not everyone on my team will want that same configuration.
This is covered by allowing specific paths/outputs/targets to be specified (i.e.
I think providing basic example drivers is fine (and at least one good example would probably be required so that users actually understand how diff drivers work, images seem like a good place to start here). I don't think any of our (example) drivers should be enabled by default. We could provide a basic template .dvcattributes file with "uncomment the following line to use default image driver" example content. But I'd want to be clear about a few caveats:
This follows along with our new/current approach to remote support - the core DVC team should focus on developing core DVC. Users with specific use-case diffing needs should write their own diff tools.
# Summary

Establish a unified command to combine and `show` or `diff` all outputs of
specified types, including:
I agree the different `show` subcommands are similar enough to get merged, and even more so for `diff` subcommands. But are `show`s similar enough to `diff`s, or do they have intrinsically different uses? In the latter case we would need to keep 2 top commands. BTW `exp show` is very different (experiments are not outputs).
* Plots
* Images (depends on https://github.com/iterative/dvc/discussions/5681)

These would all be put under a single `dvc outputs show/diff` command.
I like `dvc out(put)s`. What I suggested originally was `dvc compare` (since `diff` is taken) but that only consolidated metrics/plots/experiments `diff` commands. If we decide to keep 2 top commands maybe we could have both `outs` (merged `show`s) and `compare` (merged `diff`s).
A common abstraction for all output types minimizes the flexibility of each
output type's `show` and `diff` commands.
So the key question is are we forcing the generalization too much? Originally the idea was to merge commands that are already basically the same thing but somehow ended up duplicated. Maybe no need to force it, I think (e.g. throw in `show`, throw in mime/types, etc.) at least as a first iteration. Then again ERs are useful for big changes like that so up to you!
- Enhancement Proposal PR: (leave this empty)
- Contributors: dberenbaum
I'm migrating this to https://www.notion.so/iterative/Unified-show-diff-command-for-all-output-types-df3c941296e84c6ba6c2fab1c3a55f7e. I'm also going to rewrite a lot of it to try to capture and address the discussions already in this PR. If I don't capture your feedback from here, please add a comment or add it to the list of unresolved questions at the bottom of the proposal.

EDIT: Also, one of the improvements in Notion should be that it's easy to make changes directly on the document, so it hopefully feels less like one person owns the document 😄.