docs: clarifications around external outputs info. #2154

Merged
merged 4 commits into from
Mar 14, 2021
8 changes: 5 additions & 3 deletions content/docs/command-reference/add.md
@@ -146,10 +146,12 @@ not.
Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`

- `--external` - allow `targets` that are outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).
[External Outputs](/doc/user-guide/external-outputs).

> Note that external outputs typically require an external cache setup. See
> link above for more details.
> ⚠️ Note that this is an advanced feature for very specific situations and
> not recommended except if there's absolutely no other alternative.
> Additionally, this typically requires an external cache setup (see link
> above).

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage) it
20 changes: 10 additions & 10 deletions content/docs/command-reference/config.md
@@ -185,30 +185,30 @@ This section contains the following options, which affect the project's
directories, which is useful when you are using a a
[shared development server](/doc/use-cases/shared-development-server).

- `cache.local` - name of a _local remote_ to use as a
[custom cache](/doc/user-guide/managing-external-data#examples) directory.
- `cache.local` - name of a _local remote_ to
[use as external cache](/doc/user-guide/external-outputs#examples) directory.
(Refer to `dvc remote` for more information on "local remotes".) This will
overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`.

- `cache.s3` - name of an Amazon S3 remote to use as
[external cache](/doc/user-guide/managing-external-data#examples).
[use as external cache](/doc/user-guide/external-outputs#examples).

- `cache.gs` - name of a Google Cloud Storage remote to use as
[external cache](/doc/user-guide/managing-external-data#examples).
[use as external cache](/doc/user-guide/external-outputs#examples).

- `cache.ssh` - name of an SSH remote to use as
[external cache](/doc/user-guide/managing-external-data#examples).
[use as external cache](/doc/user-guide/external-outputs#examples).

- `cache.hdfs` - name of an HDFS remote to use as
[external cache](/doc/user-guide/managing-external-data#examples).
[use as external cache](/doc/user-guide/external-outputs#examples).

- `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as
[external cache](/doc/user-guide/managing-external-data#examples).
[use as external cache](/doc/user-guide/external-outputs#examples).

> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for
> `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file
> ⚠️ Avoid using the same [remote storage](/doc/command-reference/remote) used
> for `dvc push` and `dvc pull` as external cache, because it may cause file
> hash overlaps: the hash of an external <abbr>output</abbr> could collide with
> a hash generated locally for another file with different content.
> that of a local file with different content.
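
As a rough illustration of the `cache.s3` option described in this hunk, the sketch below shows how the resulting `.dvc/config` might look once a dedicated remote is set as external cache. The remote name `s3cache` and the bucket path `s3://mybucket/cache` are placeholders and are not taken from this change.

```ini
# Hypothetical .dvc/config fragment; the remote name and URL are placeholders.
['remote "s3cache"']
    url = s3://mybucket/cache
[cache]
    # Name of the S3 remote to use as external cache
    s3 = s3cache
```

Keeping this remote separate from the one used for `dvc push` and `dvc pull` follows the warning above about file hash overlaps.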

### state

8 changes: 5 additions & 3 deletions content/docs/command-reference/run.md
@@ -101,8 +101,9 @@ Relevant notes:
for more info.)

- [external dependencies](/doc/user-guide/external-dependencies) and
[external outputs](/doc/user-guide/managing-external-data) (outside of the
<abbr>workspace</abbr>) are also supported (except metrics and plots).
[external outputs](/doc/user-guide/external-outputs) (outside of the
<abbr>workspace</abbr>) are also supported (except metrics and plots),
although not usually recommended.

- Outputs are deleted from the workspace before executing the command (including
at `dvc repro`) if their paths are found as existing files/directories (unless
@@ -259,7 +260,8 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR'
> considered "always changed", so this option has no effect in those cases.

- `--external` - allow writing outputs outside of the DVC repository. See
[Managing External Data](/doc/user-guide/managing-external-data).
[External Outputs](/doc/user-guide/external-outputs) — not usually
recommended.

- `--desc <text>` - user description of the stage (optional). This doesn't
affect any DVC operations.
2 changes: 1 addition & 1 deletion content/docs/command-reference/version.md
@@ -19,7 +19,7 @@ usage: dvc version [-h] [-q | -v]
| `Supports` | Types of [remote storage](/doc/command-reference/remote/add#supported-storage-types) supported by the current DVC setup (their required dependencies are installed) |
| `Cache types` | [Types of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) supported (between <abbr>workspace</abbr> and <abbr>cache</abbr>) |
| `Cache directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the <abbr>cache</abbr> directory is mounted |
| `Caches` | Cache [location types](/doc/user-guide/managing-external-data) configured in the repo (e.g. local, SSH, S3, etc.) |
| `Caches` | Cache [location types](/doc/user-guide/external-outputs) configured in the repo (e.g. local, SSH, S3, etc.) |
| `Remotes` | Remote [location types](/doc/command-reference/remote/add#supported-storage-types) configured in the repo (e.g. SSH, S3, Google Drive, etc.) |
| `Workspace directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the <abbr>workspace</abbr> is mounted |
| `Repo` | Shows whether we are in a DVC repo and/or Git repo |
4 changes: 2 additions & 2 deletions content/docs/sidebar.json
@@ -136,8 +136,8 @@
"large-dataset-optimization",
"external-dependencies",
{
"label": "Managing External Data",
"slug": "managing-external-data"
"label": "External Outputs",
"slug": "external-outputs"
},
{
"label": "Contributing",
6 changes: 3 additions & 3 deletions content/docs/start/data-versioning.md
@@ -257,10 +257,10 @@ volume?
While these cases are not covered in the Get Started, we recommend reading the
following sections next to learn more about advanced workflows:

- A shared [external cache](/doc/use-cases/shared-development-server) can be set
- A [shared external cache](/doc/use-cases/shared-development-server) can be set
up to store, version and access a lot of data on a large shared volume
efficiently.
- A quite advanced scenario is to track and version data directly on the remote
storage (e.g. S3). Check out
[Managing External Data](https://dvc.org/doc/user-guide/managing-external-data)
to learn more.
[External Outputs](https://dvc.org/doc/user-guide/external-outputs) to learn
more.
19 changes: 9 additions & 10 deletions content/docs/user-guide/external-dependencies.md
@@ -1,25 +1,24 @@
# External Dependencies

There are cases when data is so large, or its processing is organized in such a
way, that its preferable to avoid moving it from its original location. For
example data on a network attached storage (NAS), processing data on HDFS,
way, that its preferable to avoid moving it from its current external location.
For example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or for a script that streams data
from S3 to process it.

External dependencies and
[external outputs](/doc/user-guide/managing-external-data) provide ways to track
and version data outside of the <abbr>project</abbr>.
External dependencies (and [external outputs](/doc/user-guide/external-outputs))
provide ways to track (and version) data outside of the <abbr>project</abbr>.

## How external dependencies work

External <abbr>dependencies</abbr> are considered part of the (extended) DVC
project: DVC will track them, detecting when they change (triggering stage
executions on `dvc repro`, for example).
External <abbr>dependencies</abbr> will be tracked by DVC, detecting when they
change (triggering stage executions on `dvc repro`, for example).

To define files or directories in an external location as
[stage](/doc/command-reference/run) dependencies, put their remote URLs or
[stage](/doc/command-reference/run) dependencies, specify their remote URLs or
external paths in `dvc.yaml` (`deps` field). Use the same format as the `url` of
certain `dvc remote` types. Currently, the following protocols are supported:
certain `dvc remote` types. Currently, the following supported `dvc remote`
types/protocols:

- Amazon S3
- Microsoft Azure Blob Storage
@@ -1,51 +1,48 @@
# Managing External Data
# External Outputs
Review comment from the contributor (PR author):
Apparently this name was quite intentional as "outputs" is not meaningful for end-users (at the user guide level)...


> ⚠️ This is an advanced feature that we don't recommend using unless you really
> know what you are doing. Artifacts added with --external are not affected by
> `dvc push/pull/status -c`. You are likely looking for
> [straight-to-remote/cache](https://github.com/iterative/dvc/issues/4520)
> functionality or `dvc import-url`
> ⚠️ This is an advanced feature for very specific situations and not
> recommended except if there's absolutely no other alternative. In most cases
> alternatives like the `--to-cache` or `--to-remote` options of `dvc add` and
> `dvc import-url` are more convenient. **Note** that external outputs are not
> pushed or pulled from/to [remote storage](/doc/command-reference/remote).

There are cases when data is so large, or its processing is organized in such a
way, that its preferable to avoid moving it from its original location. For
example data on a network attached storage (NAS), processing data on HDFS,
running [Dask](https://dask.org/) via SSH, or for a script that streams data
from S3 to process it.
way, that its impossible to handle it in the local machine disk. For example
versioning existing data on a network attached storage (NAS), processing data on
HDFS, running [Dask](https://dask.org/) via SSH, or any code that generates
massive files directly to the cloud.

External outputs and
[external dependencies](/doc/user-guide/external-dependencies) provide ways to
External outputs (and
[external dependencies](/doc/user-guide/external-dependencies)) provide ways to
track and version data outside of the <abbr>project</abbr>.

## How external outputs work

External <abbr>outputs</abbr> are considered part of the (extended) DVC project:
DVC will track them for
External <abbr>outputs</abbr> are considered part of the (extended)
<abbr>workspace</abbr>: DVC will track them for
[versioning](/doc/use-cases/versioning-data-and-model-files), detecting when
they change (reported by `dvc status`, for example).

To use existing files or directories in an external location as
[stage](/doc/command-reference/run) outputs, give their remote URLs or external
paths to `dvc add`, or put them in `dvc.yaml` (`deps` field). Use the same
format as the `url` of certain `dvc remote` types. Currently, the following
protocols are supported:
To use existing files or directories in an external location as outputs, give
their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml`
(`deps` field). Use the same format as the `url` of the following supported
`dvc remote` types/protocols:

- Amazon S3
- SSH
- HDFS
- Local files and directories outside the <abbr>workspace</abbr>
- Local files and directories outside the workspace

External outputs require an
⚠️ External outputs require an
[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
in the same external/remote file.

> Note that [remote storage](/doc/command-reference/remote) is a different
> feature, and that external outputs are not pushed or pulled from/to DVC
> remotes.
> Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. as
> external cache, because it may cause data collisions: the hash of an external
> output could collide with that of a local file with different content.

> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for
> external outputs, because it may cause data collisions: the hash of an
> external output could collide with that of a local file with different
> content.
> Note that [remote storage](/doc/command-reference/remote) is a different
> feature.

## Examples
