-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
repro: 1.x update #1572
repro: 1.x update #1572
Conversation
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the initiative @sarthakforwet , definitely a much needed update.
- What about the Examples though? Need to test them all and update if needed — can be done in a separate PR though so no pressure. - Misc. updates #1615
Co-authored-by: Jorge Orpinel <[email protected]>
Co-authored-by: Jorge Orpinel <[email protected]>
This comment has been minimized.
This comment has been minimized.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Getting there @sarthakforwet but a few items are pending from the previous review, and here's a few more:
Yeah actually, targets do support dirs: So basically we have the following difference between usage of -R and -c option - Directory structure - I think the -c option should be updated by removing the following line - and adding an example of using |
This comment has been minimized.
This comment has been minimized.
- `-c <path>`, `--cwd <path>` - directory within the project to reproduce from. | ||
Instead of using `--cwd`, one can alternately specify a target in a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Like I just suggested, I'd like to test whether this option even makes provides any value now that dvc repro accepts directories as targets. But even if it does, the paragraph is wrong or outdated, for example "specify a target in a subdirectory as path/to/target.dvc
" — target.dvc
is not a subdir.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see. You are correct for this scenario (./
which isn't the only scenario). The thing is you should be able to use dvc repro path/to/dir
(without flags) if path/to/dir
is tracked by DVC, I think (maybe not). That's what I'd like you to play with please, and if it does work, compare using or not using -c
and -R
in that scenario, and find a good way to express all your findings in the description (as well, probably, as the new Specific files or directories example you're suggesting).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you haven't done this, let's leave it for a separate PR @sarthakforwet
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The thing is you should be able to use dvc repro path/to/dir (without flags) if path/to/dir is tracked by DVC,
We can specify the corresponding .dvc
file of the DVC tracked directory in dvc repro
(which would update any change in the directory) but not the directory name itself. Moreover we cannot build a pipeline in DVC tracked directories ( as per my experimentation dunno if documented already). Also I think directories have no logical use as a dependency in a pipeline. (although we can add them if we wish). And hence I think it is not incorrect if dvc repro
does not take directories as targets (without flags).
Since we cannot specify a directory in dvc repro
, to reproduce pipelines inside a directory/sub-directory (which is untracked), we make use of -c and -R options.
However no example in the doc for using dvc repro
with a specific target
stage or .dvc file is provided and hence I think this should be done. WDYT @jorgeorpinel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can specify the corresponding .dvc file ... but not the directory name itself
(or a stage name). OK got it. I think you're right. Except when you use -R
.
- This should be clarified in the -R option desc: It expands what the
targets
argument accepts.
we cannot build a pipeline in DVC tracked directories
I never thought about that... I guess it's to be expected. May need a core repo issue, will think about it 🧠 thanks
I think directories have no logical use as a dependency in a pipeline
Why not? Of course you can use a directory as a dependency. They can also be outputs. I don't see how this affects our conversation though.
to reproduce pipelines inside a directory/sub-directory (which is untracked), we make use of -c and -R options
And -P.
- no example in the doc for using dvc repro with a specific target stage or .dvc file is provided and hence I think this should be done
Agree. I'll create an issue about these 2 check boxes, ans possibly something else... I'll tag you there. Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See #1632
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(or a stage name)
I am referencing DVC tracked directories specifically in #1572(review) and I think "stage name" should not be mentioned with that..
And -P.
Yeah sure, but I was talking specifically in comparison of dvc repro
with/without -R
and -c
options.
Agree. I'll create an issue about these 2 check boxes, ans possibly something else...
Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am referencing DVC tracked directories specifically
Stages in dvc.yaml can output directories too. It't not exclusive to .dvc files.
This comment has been minimized.
This comment has been minimized.
@jorgeorpinel Although we cannot perform parallel stage execution but we can perform sequential stage execution (even when having different branches in the pipeline) like this- would it be beneficial to mention this too? |
UPDATE: Let's leave this for another PR. See #1572 (comment) also. |
command: by specifying stage or `.dvc` file to reproduce, or by using the | ||
command: by specifying a stage or a `.dvc` file to reproduce, or by using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only one stage or one .dvc file is accepted as targets
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, there can be many. The sentence should be "specifying stages or .dvc files..." but only the grammar fix in the current text was asked so only did that. Should I update it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"specifying stages or .dvc files" is exactly the grammar fix I was looking for.
The "Example: Downstream" probably needs updation @jorgeorpinel. Following updates could be needed- a.) The "Example: Downstream" describes b. Performing c. Running So should we also mention it in the doc? I would open a new PR and mention this discussion if the changes are valid. |
@jorgeorpinel Are we ready to merge this PR? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes.
A few items are pending which I marked with checkboxes in the discussion above. And I'm about to reply again to the question of dvc repro examples... ⏳
Thanks again @sarthakforwet On the examples, I've started updating them in #1615 since it's mostly typos and corrections, not really 1.x-related updates. Feel free to contribute to that branch (although you would need to open a PR from your fork to my branch... or you can leave suggestions in the changes).
Are you sure? Meaning have you actually tried this? If so, that would probably be 1.x-related, actually.
Interesting finding... I don't know if that's expected, or some bug! Please open a report in the core repo (link here for context). |
In that case like I said the --downstream option description definitely needs an update, as well as that example itself (see #1615 (review)). Would you like to submit a PR to my Feel free to also include the 2 pending items listed under #1572 (review) above. |
Yeah absolutely.
Sure! |
command: by specifying stage file `targets`, or by using the `--single-item`, | ||
`--cwd`, or other options. | ||
There are a few ways to restrict the stages that will be regenerated by this | ||
command: by specifying stages or `.dvc` files to reproduce, or by using the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it accept an output?
stages -> stage names
.dvc - probably it's better not to metion this? what is the use case for this? can it be done w/o repro?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it be done w/o repro?
Yes, .dvc
files can be updated via dvc add
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it accept an output?
Not supported by repro
.dvc - what is the use case for this?
Same as using dvc add
again on a changed data file.
Keeping it in this paragraph because it's not an edge case to have the raw data as .dvc files at the beginning of a pipeline and dvc repro
will by default update these if the data changed.
Updated in #1615.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if it the case (repro updates .dvc) - it's a side effect, not a main goal of the dvc repro
... dvc repro something.dvc
for dvc add
-ed files is confusing and not worth complicating docs for I think.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
dvc repro something.dvc for dvc add-ed files is confusing
Good point. Generalized in ebc1560.
@@ -53,7 +56,8 @@ corresponding commands. <abbr>Outputs</abbr> are deleted from the | |||
|
|||
It saves all the data files, intermediate or final results into the <abbr>DVC | |||
cache</abbr> (unless the `--no-commit` option is used), and updates the hash | |||
values of changed dependencies and outputs in the corresponding stage files. | |||
values of changed dependencies and outputs in the corresponding stages (or | |||
`.dvc` files). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we need to complicate everything with .dvc
files?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately this is a pretty common thing as I just explained. And we no longer call .dvc files "stages", so we need to keep mentioning both some times... In fact I removed the parentheses and changed or->and.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's just use ... in the DVC files (.dvc, dvc.lock, etc)
... essence first, details second
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK. Updated to "in the DVC files (dvc.lock
and .dvc
)" (those are the only 2 that apply here) also in ebc1560.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DVC files (dvc.lock and .dvc)
Wouldn't it be better to just have "dvc.lock and .dvc files" though? Shorter
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, can be that as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed. I like it also because so far I've managed to avoid the term "DVC file" It's confusing vs. .dvc file, which is just one type of DVC file...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would even rename the DVC Files and Dirs guide to Internal Files and Dirs... But anyway, moving on!
|
||
- `-P`, `--all-pipelines` - reproduce all pipelines, for all the stage files | ||
present in `DVC` repository. | ||
- `-P`, `--all-pipelines` - reproduce all pipelines, for all the stages present |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
does it reproduce all pipelines within one dvc.yaml or within all of them? (search recusrsively)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reproduces all of the pipelines (recursive search).
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just tested. -P makes it recursive indeed (checks all dvc.yaml files). I rewrote the description anyway since it didn't make much sense. See #1615.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The funny think is that status
is recursive by default, but repro
needs -P
... Is that a bug? Cc @efiop
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thanks @sarthakforwet, for testing this. I would make the note more explicit then
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I already updated the note.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
status is recursive by default, but repro needs -P
Extacted this matter iterative/dvc/issues/4273 with much more detail. PTAL
* cmd: review repro examples per #1572 (comment) * cmd: fic tyupo in get-url * cmd: updates to repro per #1572 (review) * cmd: rewrite repro -P desc rel #1572 (review) * cmd: simplified and generalize repro targets desc and DVC file mention per #1572 (comment) and #1572 (comment) * cmd: minor update for repro desc wording per #1572 (comment) * term: don't use "synchronize" in the context of checkout * cmd: rewrite Downstream example and added info for sequential execution of stages * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * cmd: Updated Downstream example * Update content/docs/command-reference/repro.md * repro: Updated Downstream example * Update content/docs/command-reference/repro.md * cmd: updated last para for the description of --downstream and improved formatting * cmd: review language of init --subdir * term: revuew usage of "granular", esp. around init --subdir * repro.md: updated Downstream example * cmd: improve init --subdir explanation * cmd: add info about nested subrepos to init * cmd: fix -P option desc. per #1615 (review) * cmd: improve explanation on how --subdir affects commands per #1615 (review) * cmd: simplify nested structures explanation in init per #1615 (comment) * guide: add note aboud `cp` not being a download in external deps per #1643 (review) * cmd: add note about what --cwd means to repro per iterative/dvc#4292 (comment) * guide: nvmd! removing that note in external deps per 42b670f#r41087633 * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * cmd: more small updates to init * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Restyled by prettier * cmd: rewrap metrics diff usage paragraph * term: remove "just" from -j desc in 3 refs per eda27fc * cmd: add command examples to init --subdir use cases per #1615 * cmd: explain nested repo and projects of all kinds outside of --subdir per #1615 (review) * cmd: remove bold names to nested and not-nested structure examples in init --subdir per #1615 (review) * cmd: standardize --jobs option in all refs per #1615 (review) et al. * cmd: add speed note to --jobs desc in all refs. per #1615 (review) * cmd: change versioning command example in init per #1615 (review) * cd: change repo comments in init --subdir examples for #1615 (review) * cmd: improve note on DVC submodules a little for #1615 (review) * cmd: better explain why isolation is important in --subdir bullet per #1615 (review) * cmd: split last --subdir cases explicitly as 2 bullets per #1615 (comment) * cmd: remove most notes and code block examples about nesting projects/repos in init per #1615 (review) Co-authored-by: sarthakforwet <[email protected]> Co-authored-by: Restyled.io <[email protected]>
* cmd: review repro examples per #1572 (comment) * cmd: fic tyupo in get-url * cmd: updates to repro per #1572 (review) * cmd: rewrite repro -P desc rel #1572 (review) * cmd: simplified and generalize repro targets desc and DVC file mention per #1572 (comment) and #1572 (comment) * cmd: minor update for repro desc wording per #1572 (comment) * term: don't use "synchronize" in the context of checkout * cmd: rewrite Downstream example and added info for sequential execution of stages * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * cmd: Updated Downstream example * Update content/docs/command-reference/repro.md * repro: Updated Downstream example * Update content/docs/command-reference/repro.md * cmd: updated last para for the description of --downstream and improved formatting * cmd: review language of init --subdir * term: revuew usage of "granular", esp. around init --subdir * repro.md: updated Downstream example * cmd: improve init --subdir explanation * cmd: add info about nested subrepos to init * cmd: fix -P option desc. per #1615 (review) * cmd: improve explanation on how --subdir affects commands per #1615 (review) * cmd: simplify nested structures explanation in init per #1615 (comment) * guide: add note aboud `cp` not being a download in external deps per #1643 (review) * cmd: add note about what --cwd means to repro per iterative/dvc#4292 (comment) * guide: nvmd! removing that note in external deps per 42b670f#r41087633 * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * cmd: more small updates to init * Update content/docs/command-reference/repro.md * Update content/docs/command-reference/repro.md * Restyled by prettier * cmd: rewrap metrics diff usage paragraph * term: remove "just" from -j desc in 3 refs per eda27fc * cmd: add command examples to init --subdir use cases per #1615 * cmd: explain nested repo and projects of all kinds outside of --subdir per #1615 (review) * cmd: remove bold names to nested and not-nested structure examples in init --subdir per #1615 (review) * cmd: standardize --jobs option in all refs per #1615 (review) et al. * cmd: add speed note to --jobs desc in all refs. per #1615 (review) * cmd: change versioning command example in init per #1615 (review) * cd: change repo comments in init --subdir examples for #1615 (review) * cmd: improve note on DVC submodules a little for #1615 (review) * cmd: better explain why isolation is important in --subdir bullet per #1615 (review) * cmd: split last --subdir cases explicitly as 2 bullets per #1615 (comment) * cmd: remove most notes and code block examples about nesting projects/repos in init per #1615 (review) * cmd: add basic text about nesting to init * cmd: revert nested block examples back into --subdir per #1661 (review) * cmd: remove new section about nesting to revert more Co-authored-by: sarthakforwet <[email protected]> Co-authored-by: Restyled.io <[email protected]>
A PR as an outcome of the dicsussion at #1445(review).
This PR intends to completely update dvc repro documentation to sync with the 1.0 version of DVC.