how-to: add stage outputs existing in the WS but missing from dvc.yaml #1840

Merged
merged 38 commits into from
Nov 6, 2020
Changes from 32 commits
b6d0912
add output to stage how to doc
imhardikj Oct 5, 2020
cb42f52
Update add-output-to-stage.md
imhardikj Oct 5, 2020
9272b35
updates
imhardikj Oct 8, 2020
729fb19
Merge branch 'master' into guide/how-to
imhardikj Oct 8, 2020
d4c7595
updates
imhardikj Oct 8, 2020
8a02881
updates
imhardikj Oct 8, 2020
d1b4596
updates
imhardikj Oct 8, 2020
7934203
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 13, 2020
5f8a8aa
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 13, 2020
049d487
Updates
imhardikj Oct 15, 2020
930dc40
updates
imhardikj Oct 15, 2020
1b2e4f8
Updating run and commit
imhardikj Oct 17, 2020
c63e871
Updates
imhardikj Oct 17, 2020
95ca43a
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 19, 2020
057dd6e
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 19, 2020
393b571
Updates
imhardikj Oct 20, 2020
3bed673
Updates
imhardikj Oct 20, 2020
029f4ab
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
4918e35
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
0d9cc45
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
4ec39e7
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
ef51644
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
35d8007
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
3e521b6
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
74fba60
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 21, 2020
5dd0b95
Updates
imhardikj Oct 21, 2020
2bd07a9
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 27, 2020
537a1df
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 27, 2020
b4fcb9b
Update content/docs/user-guide/how-to/add-output-to-stage.md
jorgeorpinel Oct 27, 2020
d2690a4
Updates
imhardikj Oct 28, 2020
3cdb7df
Update content/docs/command-reference/commit.md
jorgeorpinel Nov 3, 2020
2aeceef
Update content/docs/command-reference/commit.md
jorgeorpinel Nov 3, 2020
3d8ce6b
Update content/docs/command-reference/commit.md
jorgeorpinel Nov 3, 2020
a665c2a
Update content/docs/command-reference/commit.md
jorgeorpinel Nov 3, 2020
60c04e6
Update content/docs/command-reference/commit.md
jorgeorpinel Nov 3, 2020
995f852
Update content/docs/command-reference/commit.md
jorgeorpinel Nov 3, 2020
d741cb7
Updates
imhardikj Nov 4, 2020
bc57658
Updates
imhardikj Nov 6, 2020
28 changes: 18 additions & 10 deletions content/docs/command-reference/commit.md
@@ -20,9 +20,9 @@ positional arguments:
The `dvc commit` command is useful for several scenarios, when data already
tracked by DVC changes: when a [stage](/doc/command-reference/run) or
[pipeline](/doc/command-reference/dag) is in development/experimentation; when
-manually editing or generating DVC <abbr>outputs</abbr>; or to force update the
-`dvc.lock` or `.dvc` files without reproducing stages or pipelines. These
-scenarios are further detailed below.
+force-updating the `dvc.lock` or `.dvc` files without reproducing stages or
+pipelines; or when DVC <abbr>outputs</abbr> are added to a stage, manually
+generated or edited. These scenarios are further detailed below.

- Code or data for a stage is under active development, with multiple iterations
(experiments) in code, configuration, or data. Use the `--no-commit` option of
@@ -32,12 +32,6 @@ scenarios are further detailed below.
💡 For convenience, a pre-commit Git hook is available to remind you to
`dvc commit` when needed. See `dvc install` for more details.

-- It's always possible to manually execute the source code used in a stage
-without DVC (outputs must be unprotected or removed first in certain cases,
-see `dvc unprotect`). Once a desirable result is reached, use `dvc add` or
-`dvc commit` as appropriate to update the `dvc.lock` or `.dvc` files and store
-changed data to the cache.
-
- Sometimes we want to edit source code, config, or data files in a way that
doesn't cause changes in the results of their data pipeline. We might
add code comments, change indentation, remove some debugging printouts, or any
@@ -46,6 +40,20 @@ scenarios are further detailed below.
reproduce the whole pipeline. If you're sure no pipeline results would change,
use `dvc commit` to force update the `dvc.lock` or `.dvc` files and cache.

+- Sometimes we have previously executed a stage (either by writing `dvc.yaml`
+manually and using `dvc repro`, or with `dvc run`), and later notice that
+output files or directories created by the stage command, which are already in
+the workspace, are missing from `dvc.yaml` (`outs` field). We can
+[add missing outputs to an existing stage](/docs/user-guide/how-to/add-output-to-stage)
+without having to execute it again. Use `dvc commit` to update the `dvc.lock`
+file and save outputs to the cache.
+
+- It's always possible to manually execute the terminal command or source code
+used in a stage without DVC (outputs must be unprotected or removed first in
+certain cases, see `dvc unprotect`). Once a desirable result is reached, use
+`dvc add` or `dvc commit` as appropriate to update the `dvc.lock` or `.dvc`
+files and store changed data to the cache.
+
Let's take a closer look at what happens in the first scenario. Normally
DVC commands like `dvc add`, `dvc repro` or `dvc run` commit the data to the
<abbr>cache</abbr> after creating or updating a `dvc.lock` or `.dvc` file. What
@@ -66,7 +74,7 @@ computed and added to the `dvc.lock` or `.dvc` file, but the actual data file is
not saved in the cache. This is where the `dvc commit` command comes into play.
It performs that last step (saving the data in cache).

-Note that it's best to avoid the last two scenarios. They essentially
+Note that it's best to avoid the last three scenarios. They essentially
force-update the `dvc.lock` or `.dvc` files and save data to cache. They are
still useful, but keep in mind that DVC can't guarantee reproducibility in those
cases.
6 changes: 6 additions & 0 deletions content/docs/command-reference/run.md
@@ -106,6 +106,12 @@ Relevant notes:
also means that the stage command needs to recreate any directory structures
defined as outputs every time it's executed by DVC.

+- In some situations we have executed a stage and later notice that output files
+or directories created by the stage command, which are already in the
+workspace, are missing from `dvc.yaml` (`outs` field). We can
+[add missing outputs to an existing stage](/docs/user-guide/how-to/add-output-to-stage)
+without having to execute it again.
+
- Renaming dependencies or outputs requires a
[manual process](/doc/command-reference/move#renaming-stage-outputs) to update
`dvc.yaml` and the project's cache accordingly.
6 changes: 5 additions & 1 deletion content/docs/sidebar.json
@@ -99,7 +99,11 @@
"label": "How To",
"slug": "how-to",
"source": false,
-"children": ["undo-adding-data", "update-tracked-files"]
+"children": [
+  "add-output-to-stage",
+  "undo-adding-data",
+  "update-tracked-files"
+]
},
"setup-google-drive-remote",
"large-dataset-optimization",
46 changes: 46 additions & 0 deletions content/docs/user-guide/how-to/add-output-to-stage.md
@@ -0,0 +1,46 @@
# Add Output to Stage

There are situations where we have executed a stage (either by writing
`dvc.yaml` manually and using `dvc repro`, or with `dvc run`), but later notice
that some of the output files or directories it creates, which are already in
the <abbr>workspace</abbr>, are missing from `dvc.yaml` (`outs` field). Follow
the steps below to add existing files or directories as <abbr>outputs</abbr> to
a stage without executing it again, which can be expensive or time-consuming
and is unnecessary here.

We start with an example stage, `prepare`, which has a single output. To add a
missing output `data/validate` to this stage, we can edit `dvc.yaml` like this:

```git
 stages:
   prepare:
     cmd: python src/prepare.py
     deps:
       - src/prepare.py
     outs:
       - data/train
+      - data/validate
```
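
The same edit can also be made programmatically. The sketch below is only an
illustration and not part of DVC: it assumes `dvc.yaml` has already been parsed
into a plain Python dictionary (for instance with a YAML library), that `outs`
entries are plain strings, and `add_output` is a hypothetical helper name.

```python
from copy import deepcopy


def add_output(dvc_yaml: dict, stage: str, path: str) -> dict:
    """Return a copy of the parsed dvc.yaml with `path` listed in the stage's outs.

    Hypothetical helper for illustration; assumes outs entries are plain strings.
    """
    updated = deepcopy(dvc_yaml)
    outs = updated["stages"][stage].setdefault("outs", [])
    if path not in outs:  # keep the edit idempotent
        outs.append(path)
    return updated


# Parsed form of the example `prepare` stage before the edit.
pipeline = {
    "stages": {
        "prepare": {
            "cmd": "python src/prepare.py",
            "deps": ["src/prepare.py"],
            "outs": ["data/train"],
        }
    }
}

updated = add_output(pipeline, "prepare", "data/validate")
print(updated["stages"]["prepare"]["outs"])  # ['data/train', 'data/validate']
```

Writing the result back to `dvc.yaml` is left out, since it depends on the YAML
library in use.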

> Note that you can also use `dvc run` with the `-f` and `--no-exec` options to
> add another output to the stage:
>
> ```dvc
> $ dvc run -f --no-exec \
> -n prepare \
> -d src/prepare.py \
> -o data/train \
> -o data/validate \
> python src/prepare.py
> ```
>
> `-f` overwrites the stage in `dvc.yaml`, while `--no-exec` updates the stage
> without executing it.
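
If several stages need this treatment, the `dvc run` invocation from the note
can be assembled from a stage definition. This is a rough sketch rather than a
DVC feature: `dvc_run_args` is a hypothetical helper, the stage is assumed to be
a plain dictionary with string `deps` and `outs`, and the command is assumed to
be a simple one without shell operators. It only uses the options that appear in
the note (`-f`, `--no-exec`, `-n`, `-d`, `-o`).

```python
import shlex


def dvc_run_args(name: str, stage: dict) -> list:
    """Build the argument list for `dvc run -f --no-exec` from a stage definition.

    Hypothetical helper for illustration; assumes a simple stage command.
    """
    args = ["dvc", "run", "-f", "--no-exec", "-n", name]
    for dep in stage.get("deps", []):
        args += ["-d", dep]
    for out in stage.get("outs", []):
        args += ["-o", out]
    # The stage command goes last, split into separate arguments.
    return args + shlex.split(stage["cmd"])


stage = {
    "cmd": "python src/prepare.py",
    "deps": ["src/prepare.py"],
    "outs": ["data/train", "data/validate"],
}

# Print the command for review instead of running it.
print(shlex.join(dvc_run_args("prepare", stage)))
```

Printing the command first makes it easy to check the dependency and output
lists before overwriting the stage.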

Finally, we need to run `dvc commit` to save the newly specified output(s) to
the <abbr>cache</abbr> (and to update the corresponding hash values in
`dvc.lock`):

```dvc
$ dvc commit
```