diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
index d59ded757c..27f0512cfd 100644
--- a/content/docs/command-reference/add.md
+++ b/content/docs/command-reference/add.md
@@ -33,23 +33,23 @@ option to avoid this, and `dvc commit` to finish the process when needed).
> See also `dvc.yaml` and `dvc run` for more advanced ways to track and version
> intermediate and final results (like ML models).
-After checking that each `target` hasn't been added before (or tracked with
-other DVC commands), a few actions are taken under the hood:
+After checking that each `target` isn't already tracked with DVC, a few actions
+are taken under the hood:
1. Calculate the file hash.
-2. Move the file contents to the cache (by default in `.dvc/cache`) (or to
- remote storage if `--to-remote` is given), using the file hash to form the
- cached file path. (See
+2. Move the file contents to the cache, using the file hash to form the cached
+ file path (see
[Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
- for more details.)
-3. Attempt to replace the file with a link to the cached data (more details on
- file linking further down). Skipped if `--to-remote` is used.
-4. Create a corresponding `.dvc` file to track the file, using its path and hash
- to identify the cached data (with `--to-remote`/`-o`, an external path is
- moved to the workspace). The `.dvc` file lists the DVC-tracked file as an
- output (`outs` field). Unless the `--file` option is used, the
- `.dvc` file name generated by default is `.dvc`, where `` is the
- file name of the first target.
+ for details). Using `--out`, or `--to-remote` with an external target, the
+ data is copied instead (to cache or remote storage).
+3. Attempt to replace the file with a link to (or copy of) the cached data (more
+ details on file linking ahead). A new link is created if a different `--out`
+ `path` is given. Skipped if `--to-remote` is used.
+4. Create a `.dvc` file to track the file or directory, saving its path, and
+ the hash as a pointer to the cached data. The `.dvc` file lists the data as
+ an output (`outs` field). Unless the `--file` option is used,
+ the `.dvc` file name generated by default is `<file>.dvc`, where `<file>` is
+ the file name of the first target.
5. Add the `targets` to `.gitignore` in order to prevent them from being
committed to the Git repository (unless `dvc init --no-scm` was used when
initializing the DVC project).
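
The hash and path steps above can be mimicked by hand. This is a minimal sketch, not DVC's actual implementation: it assumes GNU `md5sum` is available, the default cache location `.dvc/cache`, and a layout where the first two characters of the hash form a directory name (see the cache structure link above). The `target` file and its contents are hypothetical.

```shell
# Illustrative only: reproduces the naming logic by hand, without calling DVC.
workdir=$(mktemp -d)                   # scratch directory (assumption)
target="$workdir/data.txt"             # hypothetical target file
printf 'hello\n' > "$target"

# Step 1: calculate the file hash (MD5 assumed here).
hash=$(md5sum "$target" | cut -d' ' -f1)

# Step 2: form the cached file path from the hash
# (assumed layout: 2-character directory + remaining characters).
prefix=$(printf '%s' "$hash" | cut -c1-2)
rest=$(printf '%s' "$hash" | cut -c3-)
cache_path=".dvc/cache/$prefix/$rest"

# Step 4: default .dvc file name, derived from the target's file name.
dvc_file="$(basename "$target").dvc"

echo "$hash"
echo "$cache_path"
echo "$dvc_file"
```

The real command also moves the data, links it back, and updates `.gitignore`; none of that is attempted here.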
@@ -145,7 +145,25 @@ not.
[pattern](https://docs.python.org/3/library/glob.html) specified in `targets`.
Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**`
-- `--external` - allow `targets` that are outside of the DVC repository. See
+- `-o <path>`, `--out <path>` - destination `path` inside the workspace to place
+ the data target. By default, the target's base name is used in the current
+ working directory. Directories in the given `path` will be created. Note that
+ for external targets, this can be combined
+ [with an external cache](#example-external-data) to skip the local file
+ system.
+
+- `--to-remote` - allow a target outside of the DVC repository (e.g. an S3
+ object, SSH directory URL, file on mounted volume, etc.) but don't move it
+ into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a DVC
+ remote instead (the default one unless `-r` is specified) to skip the local
+ file system. Use `dvc pull` to get the data later.
+
+- `-r <name>`, `--remote <name>` - name of the
+ [remote](/doc/command-reference/remote) to store data on (can only be used
+ with `--to-remote`).
+
+- `--external` - allow `targets` that are outside of the DVC repository, to
+ track in-place. See
[Managing External Data](/doc/user-guide/managing-external-data).
> ⚠️ Note that this is an advanced feature for very specific situations and
@@ -153,20 +171,6 @@ not.
> Additionally, this typically requires an external cache setup (see link
> above).
-- `-o `, `--out ` - destination `path` to make a local target copy,
- or to [transfer](#example-transfer-to-cache) an external target into the cache
- (and link to workspace). Note that this can be combined with `--to-remote` to
- avoid storing the data locally, while still adding it to the project.
-
-- `--to-remote` - import an external target, but don't move it into the
- workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage) it
- directly to remote storage (the default one, unless `-r` is specified)
- instead. Use `dvc pull` to get the data locally.
-
-- `-r `, `--remote ` - name of the
- [remote storage](/doc/command-reference/remote) to transfer external target to
- (can only be used with `--to-remote`).
-
- `--desc ` - user description of the data (optional). This doesn't affect
any DVC operations.
@@ -336,95 +340,82 @@ $ tree .dvc/cache
Only the hash values of the `dir/` directory (with `.dir` file extension) and
`file2` have been cached.
-## Example: Transfer to the cache
-
-When you have a large dataset in an external location, you may want to add it to
-the project without having to copy it into the workspace. Maybe
-your local disk doesn't have enough space, but you have setup an
-[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache)
-that could handle it.
+## Example: External data
-The `--out` option lets you add external paths in a way that they are
-cached first, and then
-[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
-to a given path inside the workspace. Let's initialize an example
-DVC project to try this:
+Sometimes you may want to add a large dataset currently found in an external
+location. But what if there's not enough disk space to download the data? Here's
+one method!
-```dvc
-$ mkdir example # workspace
-$ cd example
-$ git init
-$ dvc init
-```
+The `--out` option lets you add external data so that it's linked to a given
+path inside the workspace after being copied to the cache.
+Combined with
+[symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache)
+and an
+[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache),
+this lets you avoid using the local file system completely.
-Now we can add a `data.xml` file via HTTP for example, putting it a local path
-in our project:
+For example, we can add a `data.xml` file via HTTP, outputting it to a local
+path in our project:
```dvc
-$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml
+$ dvc add https://data.dvc.org/get-started/data.xml -o raw/data.xml
$ ls
data.xml data.xml.dvc
```
-The resulting `.dvc` file will save the provided local `path` as if the data was
-already in the workspace, while the `md5` hash points to the copy of the data
-that has now been transferred to the cache. Let's check the
-contents of `data.xml.dvc` in this case:
+The local `data.xml` should be a symlink to the (externally) cached
+data copy. The resulting `.dvc` file will save the local `path` as if the data
+was already there before this command. Let's check the contents of
+`data.xml.dvc`:
```yaml
outs:
- md5: a304afb96060aad90176268345e10355
nfiles: 1
- path: data.xml
+ path: raw/data.xml
```
> For a similar operation that actually keeps a connection to the data source,
> please see `dvc import-url`.
-## Example: Transfer to remote storage
+## Example: `--to-remote` usage {#straight-to-remote}
-When you have a large dataset in an external location, you may want to track it
-as if it was in your project, but without downloading it locally (for now). The
-`--to-remote` option lets you do so, while storing a copy
-[remotely](/doc/command-reference/remote) so it can be
-[pulled](/doc/command-reference/plots) later. Let's initialize a DVC project,
-and setup a remote:
+Here's another method to add a large dataset found in an external location
+without downloading the data (refer to previous example).
-```dvc
-$ mkdir example # workspace
-$ cd example
-$ git init
-$ dvc init
-$ mkdir /tmp/dvc-storage
-$ dvc remote add myremote /tmp/dvc-storage
-```
+The `--to-remote` option lets you store a copy of the target data on a
+[DVC remote](/doc/command-reference/remote), while creating a `.dvc` file
+locally so it can be [pulled](/doc/command-reference/pull) later. This is a way
+to "bootstrap" a project on your local machine, to be
+[reproduced](/doc/command-reference/repro) in the right environment later (e.g.
+a GPU cloud server or a CI/CD system).
-Now let's add the `data.xml` to our remote storage from the given remote
-location.
+Let's set up a simple remote and add a `data.xml` file from the web this way:
```dvc
+$ mkdir /tmp/dvc-storage
+$ dvc remote add myremote /tmp/dvc-storage
$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \
--to-remote -r myremote
...
-```
-
-The only difference that dataset is transferred straight to remote, so DVC won't
-control the remote location you gave but rather continue managing your remote
-storage where the data is now on. The operation will still be resulted with an
-`.dvc` file:
-
-```dvc
$ ls
data.xml.dvc
```
-Whenever anyone wants to actually download the added data (for example from a
-system that can handle it), they can use `dvc pull` as usual:
+> Note that this can be combined with `--out` to specify a local destination
+> `path` (written to the `.dvc` file).
-```dvc
- $ dvc pull data.xml.dvc -r tmp_remote
+DVC won't control the original data source after this; it will keep managing
+your remote storage instead, where the data is now found. To actually download
+the data to cache, you can use `dvc fetch` or `dvc pull` as usual
+(on a system that can handle it):
+```dvc
+$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
```
+
+> Note that `dvc repro` will try to download the data too, as part of the
+> pipeline execution.
diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md
index 05312cca53..0d749a3d5c 100644
--- a/content/docs/command-reference/get.md
+++ b/content/docs/command-reference/get.md
@@ -56,10 +56,10 @@ name.
## Options
-- `-o `, `--out ` - specify a path to the desired location in the
- workspace to place the downloaded file or directory (instead of using the
- current working directory). Directories specified in the path will be created
- by this command.
+- `-o <path>`, `--out <path>` - destination `path` to place the downloaded file
+ or directory. By default, the data's base name is used in the current working
+ directory. Directories in the given `path` will be created.
- `--rev ` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md
index 325b58c7e9..744bea83ee 100644
--- a/content/docs/command-reference/import-url.md
+++ b/content/docs/command-reference/import-url.md
@@ -23,9 +23,8 @@ positional arguments:
## Description
In some cases it's convenient to add a data file or directory from an external
-location into the workspace (or to
-[remote storage](/doc/command-reference/remote)), such that it can be updated
-later, if/when the external data source changes. Example scenarios:
+location into the project, such that it can be updated later if/when the
+external data source changes. Example scenarios:
- A remote system may produce occasional data files that are used in other
projects.
@@ -37,22 +36,20 @@ later, if/when the external data source changes. Example scenarios:
`dvc import-url` helps you create such an external data dependency, without
having to manually copy files from the supported locations (listed below), which
-may require installing a different tool for each type.
-
-When you don't want to store the target data in your local system, you can still
-create an import `.dvc` file while transferring a file or directory directly to
-remote storage, by using the `--to-remote` option. See the
-[Transfer to remote storage](#example-transfer-to-remote-storage) example for
-more details.
+would require installing/using a different tool for each type.
The `url` argument specifies the external location of the data to be imported.
The imported data is cached, and linked (or copied) to the current
-working directory with its original file name e.g. `data.txt` (or to a location
-provided with `out`).
+working directory with its original file name e.g. `data.txt`, or to a location
+provided with `out`.
An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` –
-similar to using `dvc add` after downloading the data. This makes it possible to
-update the import later, if the data source has changed (see `dvc update`).
+similar to using `dvc add` after downloading the data. It saves the information
+about the data source, so the import can be updated later if the data source has
+changed (see `dvc update`).
+
+💡 The `--to-remote` option lets you store an import on a
+[DVC remote](/doc/command-reference/remote) without using the local file system.
> Note that the imported data can be [pushed](/doc/command-reference/push) to
> remote storage normally.
@@ -64,8 +61,9 @@ field contains the corresponding local path in the workspace. It
records enough metadata about the imported data to enable DVC efficiently
determining whether the local copy is out of date.
-Note that `dvc repro` doesn't check or update import `.dvc` files, use
-`dvc update` to bring the import up to date from the data source.
+Note that `dvc repro` doesn't check or update import `.dvc` files by default
+(see `dvc freeze`); use `dvc update` to bring the import up to date from the
+data source.
DVC supports several types of external locations (protocols):
@@ -140,13 +138,13 @@ $ dvc run -n download_data \
want to "DVCfy" this state of the project (see also `dvc commit`).
- `--to-remote` - import an external target, but don't move it into the
- workspace, nor cache it. [Transfer](#example-import-straight-to-the-remote) it
- directly to remote storage (the default one, unless `-r` is specified)
- instead. Use `dvc pull` to get the data locally.
+ workspace, nor cache it. [Store a copy](#straight-to-remote) on a remote
+ instead (the default one unless `-r` is specified). Use `dvc pull` to get the
+ data locally.
- `-r `, `--remote ` - name of the
- [remote storage](/doc/command-reference/remote) (can only be used with
- `--to-remote`).
+ [remote](/doc/command-reference/remote) to store data on (can only be used
+ with `--to-remote`).
- `-j `, `--jobs ` - parallelism level for DVC to download data
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
@@ -358,46 +356,36 @@ Running stage 'prepare' with command:
python src/prepare.py data/data.xml
```
-## Example: Transfer to remote storage
+## Example: `--to-remote` usage {#straight-to-remote}
-When you have a large dataset in an external location, you may want to import it
-to your project without downloading it to the local file system (for using it
-later/elsewhere). The `--to-remote` option let you skip the download, while
-storing the imported data [remotely](/doc/command-reference/remote). Let's
-initialize a DVC project, and setup a remote:
+Normally, `dvc import-url` downloads the target data (to the cache)
+in order to link and track it locally. But what if there's not enough disk
+space?
-```dvc
-$ mkdir example # workspace
-$ cd example
-$ git init
-$ dvc init
-$ mkdir /tmp/dvc-storage
-$ dvc remote add myremote /tmp/dvc-storage
-```
+The `--to-remote` option lets you store a copy of the target data on a
+[DVC remote](/doc/command-reference/remote), while creating an import `.dvc`
+file locally so it can be [pulled](/doc/command-reference/pull) later. This is
+a way to "bootstrap" an import on your local machine, to be downloaded in the
+right environment later.
-Now let's create an import `.dvc` file without downloading the target data,
-transferring it directly to remote storage instead:
+Let's set up a simple remote and import a `data.xml` file from the web this way:
```
+$ mkdir /tmp/dvc-storage
+$ dvc remote add myremote /tmp/dvc-storage
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \
--to-remote -r myremote
...
-```
-
-The only change in our local workspace is a newly created import
-`.dvc` file:
-
-```dvc
$ ls
data.xml.dvc
```
-Whenever anyone wants to actually download the imported data (for example from a
-system that can handle it), they can use `dvc pull` as usual:
+The only change in our local workspace is the tiny `.dvc` file that
+was created. To actually download the data to cache, you can use
+`dvc fetch` or `dvc pull` as usual (on a system that can handle it):
```
- $ dvc pull data.xml.dvc -r tmp_remote
-
+$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
```
diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md
index b16a14b367..88727248c1 100644
--- a/content/docs/command-reference/import.md
+++ b/content/docs/command-reference/import.md
@@ -70,19 +70,19 @@ enable DVC efficiently determining whether the local copy is out of date.
To actually [version the data](/doc/tutorials/get-started/data-versioning),
`git add` (and `git commit`) the import `.dvc` file.
-Note that `dvc repro` doesn't check or update import `.dvc` files (see
-`dvc freeze`), use `dvc update` to bring the import up to date from the data
-source.
+Note that `dvc repro` doesn't check or update import `.dvc` files by default
+(see `dvc freeze`); use `dvc update` to bring the import up to date from the
+data source.
Also note that chained imports (importing data that was imported into the source
repo at `url`) are not supported.
## Options
-- `-o `, `--out ` - specify a path to the desired location in the
- workspace to place the downloaded file or directory (instead of using the
- current working directory). Directories specified in the path must already
- exist, otherwise this command will fail.
+- `-o <path>`, `--out <path>` - destination `path` inside the workspace to place
+ the downloaded file or directory. By default, the file's base name is used in
+ the current working directory. Directories in the given `path` will be
+ created.
- `--file ` - specify a path and/or file name for the `.dvc` file
created by this command (e.g. `--file stages/stage.dvc`). This overrides the
@@ -154,7 +154,7 @@ outs:
cache: true
```
-Several of the values above are pulled from the original `.dvc` file
+Several of the values above are obtained from the original `.dvc` file
`model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock`
subfields under `repo` are used to save the origin and version of the
dependency, respectively.
diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md
index f1a6dada55..9821b097a5 100644
--- a/content/docs/command-reference/update.md
+++ b/content/docs/command-reference/update.md
@@ -49,11 +49,12 @@ $ dvc update --rev master
directory and its subdirectories for import stage `.dvc` files to inspect. If
there are no directories among the targets, this option is ignored.
-- `--to-remote` - update a `.dvc` file created with `dvc import-url` and
- [transfer](/doc/command-reference/import-url#example-import-straight-to-the-remote)
- the new data directly to remote storage (the default one unless `-r` is used).
- No changes are done in the workspace. Use `dvc pull` to get the
- data locally. This option can't be used with DVC or Git repository imports.
+- `--to-remote` - update a `.dvc` file created with `dvc import-url` and store
+ the latest data directly
+ [on remote storage](/doc/command-reference/import-url#straight-to-remote) (the
+ default one unless `-r` is specified). Tracked data is not changed in the
+ workspace. Use `dvc pull` to get the data locally. This option
+ can't be used with data imported from DVC or Git repos (with `dvc import`).
- `-r `, `--remote ` - name of the
[remote storage](/doc/command-reference/remote) (can only be used with
diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md
index b7779ea02a..7426c975a1 100644
--- a/content/docs/user-guide/managing-external-data.md
+++ b/content/docs/user-guide/managing-external-data.md
@@ -1,13 +1,13 @@
# External Outputs
> ⚠️ This is an advanced feature for very specific situations and not
-> recommended except if there's absolutely no other alternative. In most cases
-> alternatives like the
-> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or
-> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage)
-> strategies of `dvc add` and `dvc import-url` are more convenient. **Note**
-> that external outputs are not pushed or pulled from/to
+> recommended except if there's absolutely no other alternative. Note that
+> external outputs are not pushed to or pulled from
+> [remote storage](/doc/command-reference/remote).
+>
+> In most cases the [to-cache](/doc/command-reference/add#example-external-data)
+> or [to-remote](/doc/command-reference/add#straight-to-remote) strategies of
+> `dvc add` and `dvc import-url` are better.
There are cases when data is so large, or its processing is organized in such a
way, that it's impossible to handle it on the local machine's disk. For example