From 3f1217ebc950bea4123acd81cb775f41750c7703 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Fri, 12 Mar 2021 20:19:51 -0700 Subject: [PATCH 01/28] guide: update Ext Data guide link to add to-cache/remote examples per https://github.com/iterative/dvc.org/pull/2246#issuecomment-796981567 --- content/docs/user-guide/managing-external-data.md | 7 ++++--- 1 file changed, 4 insertions(+), 3 deletions(-) diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index e038c0e6b7..1a1588358d 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -2,9 +2,10 @@ > ⚠️ This is an advanced feature that we don't recommend using unless you really > know what you are doing. Artifacts added with --external are not affected by -> `dvc push/pull/status -c`. You are likely looking for -> [straight-to-remote/cache](https://github.com/iterative/dvc/issues/4520) -> functionality or `dvc import-url` +> `dvc push/pull/status -c`. You are likely looking for straight +> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or +> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) +> transfers, or `dvc import-url`). There are cases when data is so large, or its processing is organized in such a way, that its preferable to avoid moving it from its original location. 
For From bab95a913c8e1a71afeb7bbe6cd6bf4bac969bc4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 13 Mar 2021 20:22:13 -0700 Subject: [PATCH 02/28] ref: config options copy edits --- content/docs/command-reference/config.md | 29 ++++++++++++------------ 1 file changed, 15 insertions(+), 14 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index aaf7372f46..98831f1420 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -11,7 +11,7 @@ usage: dvc config [-h] [--global | --system | --local] [-q | -v] [-u] positional arguments: name Option name in format: section.option or remote.name.option e.g. 'core.check_update', 'cache.dir', 'remote.myremote.url' - value Option value. + value Option value. ``` ## Description @@ -187,25 +187,26 @@ This section contains the following options, which affect the project's for example, they're determined using [`os.umask`](https://docs.python.org/3/library/os.html#os.umask). -- `cache.local` - name of a _local remote_ to use as a - [custom cache](/doc/user-guide/managing-external-data#examples) directory. - (Refer to `dvc remote` for more information on "local remotes".) This will - overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`. +The following parameters allow setting an +[external cache](/doc/user-guide/managing-external-data#examples) location. A +[DVC remote](/doc/command-reference/remote) name is used (instead of the URL) +because often it's necessary to configure authentication or other connection +settings, and configuring a remote is the way that can be done. -- `cache.s3` - name of an Amazon S3 remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). +- `cache.local` - name of a _local remote_ to use as external cache (refer to + `dvc remote` for more info. on "local remotes".) This will overwrite the value + in `cache.dir` (see `dvc cache dir`). 
-- `cache.gs` - name of a Google Cloud Storage remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). +- `cache.s3` - name of an Amazon S3 remote to use as external cache. -- `cache.ssh` - name of an SSH remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). +- `cache.gs` - name of a Google Cloud Storage remote to use as external cache. -- `cache.hdfs` - name of an HDFS remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). +- `cache.ssh` - name of an SSH remote to use as external cache. + +- `cache.hdfs` - name of an HDFS remote to use as external cache. - `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as - [external cache](/doc/user-guide/managing-external-data#examples). + external cache. > Avoid using the same [DVC remote](/doc/command-reference/remote) (used for > `dvc push`, `dvc pull`, etc.) as external cache, because it may cause file From 69cbbb6588fe09c6823a1b2af6440e920f09fb50 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 13 Mar 2021 20:24:27 -0700 Subject: [PATCH 03/28] ref: destroy copy edit --- content/docs/command-reference/destroy.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/destroy.md b/content/docs/command-reference/destroy.md index 8f1111af76..8c4a4fb695 100644 --- a/content/docs/command-reference/destroy.md +++ b/content/docs/command-reference/destroy.md @@ -12,8 +12,8 @@ usage: dvc destroy [-h] [-q | -v] [-f] ## Description -`dvc destroy` removes `dvc.yaml` and `.dvc` files, as well as the internal -`.dvc/` directory from the workspace. +`dvc destroy` removes `dvc.yaml`, `.dvc` files, and the internal `.dvc/` +directory from the project. Note that the cache directory will be removed as well, unless it's set to an @@ -99,9 +99,9 @@ $ ls -a .git code.py foo ``` -`dvc destroy` command removed `foo.dvc` and the `.dvc/` directory from the -workspace. 
But the cache files that are present in `/mnt/cache` -still persist: +`dvc destroy` removed `foo.dvc` and the internal `.dvc/` directory from +project. But the cache files that are present in `/mnt/cache` +persist: ```dvc $ tree /mnt/cache From cd599c8229f6d49f204acd271986e23ef612c59d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Mar 2021 21:29:39 -0600 Subject: [PATCH 04/28] ref: fix mac config file locs per https://github.com/iterative/dvc.org/issues/2032#issuecomment-794865142 --- content/docs/command-reference/config.md | 8 ++++---- 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 5893f42fe6..77fceaa72c 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -45,10 +45,10 @@ multiple projects and users, respectively: } -| Flag | Priority | Mac location | Linux location (typical\*) | Windows location | -| ---------- | -------- | -------------------------------------- | -------------------------- | --------------------------------------------------------- | -| `--global` | 3 | `$HOME/Library/Preferences/dvc/config` | `$HOME/.config/dvc/config` | `%LocalAppData%\iterative\dvc\config` | -| `--system` | 4 | `/Library/Preferences/dvc/config` | `/etc/xdg/dvc/config` | `%AllUsersProfile%\Application Data\iterative\dvc\config` | +| Flag | Priority | Mac location | Linux location (typical\*) | Windows location | +| ---------- | -------- | ----------------------------------------------- | -------------------------- | --------------------------------------------------------- | +| `--global` | 3 | `$HOME/Library/Application\ Support/dvc/config` | `$HOME/.config/dvc/config` | `%LocalAppData%\iterative\dvc\config` | +| `--system` | 4 | `/Library/Application\ Support/dvc/config` | `/etc/xdg/dvc/config` | `%AllUsersProfile%\Application Data\iterative\dvc\config` | > \* For Linux, the global `dvc/config` may be found 
in `$XDG_CONFIG_HOME`, and > the system-wide one in `$XDG_CONFIG_DIRS[0]`, if those env vars are defined. From eb4af9725c6a17b0bd20e978bb419413aec8b748 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Mar 2021 22:18:37 -0600 Subject: [PATCH 05/28] ref: small update to plots * --open --- content/docs/command-reference/plots/diff.md | 2 +- content/docs/command-reference/plots/show.md | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index cdea58d2f8..cf7a120bf6 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -91,7 +91,7 @@ all the current plots, without comparisons. [Vega specification](https://vega.github.io/vega/docs/specification/) file instead of HTML. See `dvc plots` for more info. -- `--open` - opens the generated plot directly in the browser. +- `--open` - opens the generated plot in the browser automatically. - `--no-header` - lets DVC know that CSV or TSV `--targets` do not have a header. A 0-based numeric index can be used to identify each column instead of diff --git a/content/docs/command-reference/plots/show.md b/content/docs/command-reference/plots/show.md index 692dd49a7e..cc6cfc9c81 100644 --- a/content/docs/command-reference/plots/show.md +++ b/content/docs/command-reference/plots/show.md @@ -65,7 +65,7 @@ please see `dvc plots`. [Vega specification](https://vega.github.io/vega/docs/specification/) file instead of HTML. See `dvc plots` for more info. -- `--open` - opens the generated plot directly in the browser. +- `--open` - opens the generated plot in the browser automatically. - `--no-header` - lets DVC know that CSV or TSV `targets` do not have a header. A 0-based numeric index can be used to identify each column instead of names. 
From 043db23f1f70b1dfbacef5c7b25b662c681c43c9 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 14 Mar 2021 23:32:32 -0600 Subject: [PATCH 06/28] ref: clarify and correct info on add/import to-cache/remote strategies per https://github.com/iterative/dvc.org/pull/2172#discussion_r583065469 --- content/docs/command-reference/add.md | 90 +++++++++---------- content/docs/command-reference/get.md | 6 +- content/docs/command-reference/import-url.md | 77 ++++++++-------- content/docs/command-reference/import.md | 23 +++-- .../docs/user-guide/managing-external-data.md | 4 +- 5 files changed, 95 insertions(+), 105 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index d59ded757c..d14606c443 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -153,10 +153,12 @@ not. > Additionally, this typically requires an external cache setup (see link > above). -- `-o `, `--out ` - destination `path` to make a local target copy, - or to [transfer](#example-transfer-to-cache) an external target into the cache - (and link to workspace). Note that this can be combined with `--to-remote` to - avoid storing the data locally, while still adding it to the project. +- `-o `, `--out ` - destination `path` inside the workspace to link + (or copy) a data target, which will now be tracked by DVC. Note that combining + this with an + [external cache transfer](#example-transfer-to-an-external-cache), or with the + `--to-remote` option, let's you avoid storing an external target locally, + while still adding it to the project. - `--to-remote` - import an external target, but don't move it into the workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage) it @@ -336,28 +338,21 @@ $ tree .dvc/cache Only the hash values of the `dir/` directory (with `.dir` file extension) and `file2` have been cached. 
-## Example: Transfer to the cache +## Example: Transfer to an external cache -When you have a large dataset in an external location, you may want to add it to -the project without having to copy it into the workspace. Maybe -your local disk doesn't have enough space, but you have setup an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -that could handle it. +Sometimes you may want to add a large dataset currently found in an external +location, so it becomes local to the project. However, your local file system +may not have enough space to download it — which is needed to add data in DVC, +right? Not necessarily! -The `--out` option lets you add external paths in a way that they are +The `--out` option lets you add external data in a way that it's cached first, and then [linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -to a given path inside the workspace. Let's initialize an example -DVC project to try this: - -```dvc -$ mkdir example # workspace -$ cd example -$ git init -$ dvc init -``` +to a given path inside the workspace. Combined with an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +setup, this let's you avoid using your local file system completely. -Now we can add a `data.xml` file via HTTP for example, putting it a local path +For example, we can add a `data.xml` file via HTTP, outputting it a local path in our project: ```dvc @@ -368,9 +363,10 @@ data.xml data.xml.dvc ``` The resulting `.dvc` file will save the provided local `path` as if the data was -already in the workspace, while the `md5` hash points to the copy of the data -that has now been transferred to the cache. 
Let's check the -contents of `data.xml.dvc` in this case: +always in the workspace, while the `md5` hash points to the copy of the data +that has now been transferred to the cache (which again, we assume +it's already setup in some storage drive that can handle it). Let's check the +contents of `data.xml.dvc`: ```yaml outs: @@ -384,43 +380,37 @@ outs: ## Example: Transfer to remote storage -When you have a large dataset in an external location, you may want to track it -as if it was in your project, but without downloading it locally (for now). The -`--to-remote` option lets you do so, while storing a copy -[remotely](/doc/command-reference/remote) so it can be -[pulled](/doc/command-reference/plots) later. Let's initialize a DVC project, -and setup a remote: +Similarly to the previous scenario, you may sometimes want to track a large +dataset found externally into a regular project (with a local +cache). Can it be done without downloading the data locally (for +now)? Yes! -```dvc -$ mkdir example # workspace -$ cd example -$ git init -$ dvc init -$ mkdir /tmp/dvc-storage -$ dvc remote add myremote /tmp/dvc-storage -``` +The `--to-remote` option lets you transfer a copy of the target data to +[remote storage](/doc/command-reference/remote), while creating a `.dvc` file +locally so it can be [pulled](/doc/command-reference/plots) later. This is a way +to "bootstrap" your project in your local machine, to be reproduced on the right +environment later (e.g. a GPU cloud server or a CI/CD system). -Now let's add the `data.xml` to our remote storage from the given remote -location. +Let's setup a simple remote and transfer a `data.xml` file from the web into it +via DVC: ```dvc +$ mkdir /tmp/dvc-storage +$ dvc remote add myremote /tmp/dvc-storage $ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \ --to-remote -r myremote ... 
-``` - -The only difference that dataset is transferred straight to remote, so DVC won't -control the remote location you gave but rather continue managing your remote -storage where the data is now on. The operation will still be resulted with an -`.dvc` file: - -```dvc $ ls data.xml.dvc ``` -Whenever anyone wants to actually download the added data (for example from a -system that can handle it), they can use `dvc pull` as usual: +> Note that this can be combined with `--out` to specify a local destination +> `path` (written to the `.dvc` file). + +DVC won't control the original data source after this, but rather continue +managing your remote storage, where the data is now found. Whenever anyone wants +to actually download the added data (from a system that can handle it), they can +use `dvc pull` as usual: ```dvc $ dvc pull data.xml.dvc -r tmp_remote diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 05312cca53..17f67af16d 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -24,13 +24,13 @@ repository (e.g. source code, small image/other files). `dvc get` copies the target file or directory (found at `path` in `url`) to the current working directory. (Analogous to `wget`, but for repos.) +> See `dvc list` for a way to browse repository contents to find files or +> directories to download. + > Note that unlike `dvc import`, this command does not track the downloaded > files (does not create a `.dvc` file). For that reason, it doesn't require an > existing DVC project to run in. -> See `dvc list` for a way to browse repository contents to find files or -> directories to download. - The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. `[user@]server:project.git`). 
`url` can also be a local file system path diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 325b58c7e9..5e873e4e34 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -23,9 +23,8 @@ positional arguments: ## Description In some cases it's convenient to add a data file or directory from an external -location into the workspace (or to -[remote storage](/doc/command-reference/remote)), such that it can be updated -later, if/when the external data source changes. Example scenarios: +location into the project, such that it can be updated later if/when the +external data source changes. Example scenarios: - A remote system may produce occasional data files that are used in other projects. @@ -37,25 +36,27 @@ later, if/when the external data source changes. Example scenarios: `dvc import-url` helps you create such an external data dependency, without having to manually copy files from the supported locations (listed below), which -may require installing a different tool for each type. - -When you don't want to store the target data in your local system, you can still -create an import `.dvc` file while transferring a file or directory directly to -remote storage, by using the `--to-remote` option. See the -[Transfer to remote storage](#example-transfer-to-remote-storage) example for -more details. +would require installing/using a different tool for each type. The `url` argument specifies the external location of the data to be imported. The imported data is cached, and linked (or copied) to the current -working directory with its original file name e.g. `data.txt` (or to a location -provided with `out`). +working directory with its original file name e.g. `data.txt`, or to a location +provided with `out`. An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – -similar to using `dvc add` after downloading the data. 
This makes it possible to -update the import later, if the data source has changed (see `dvc update`). +similar to using `dvc add` after downloading the data. It saves the information +about the data source, so the import can be updated later if the data source has +changed (see `dvc update`). + +💡 Using an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +or the `--to-remote` option lets you +[transfer](#example-transfer-to-remote-storage) an import without using the +local file system. -> Note that the imported data can be [pushed](/doc/command-reference/push) to -> remote storage normally. +> Note that imported data can be [pushed](/doc/command-reference/push) and +> [pulled](/doc/command-reference/pull) to/from +> [remote storage](/doc/command-reference/remote) normally. `.dvc` files support references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such an @@ -64,8 +65,9 @@ field contains the corresponding local path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. -Note that `dvc repro` doesn't check or update import `.dvc` files, use -`dvc update` to bring the import up to date from the data source. +Note that `dvc repro` doesn't check or update import `.dvc` files by default +(see `dvc freeze`), use `dvc update` to bring the import up to date from the +data source. DVC supports several types of external locations (protocols): @@ -360,40 +362,33 @@ Running stage 'prepare' with command: ## Example: Transfer to remote storage -When you have a large dataset in an external location, you may want to import it -to your project without downloading it to the local file system (for using it -later/elsewhere). The `--to-remote` option let you skip the download, while -storing the imported data [remotely](/doc/command-reference/remote). 
Let's -initialize a DVC project, and setup a remote: +Normally, `dvc import-url` downloads the target data (to the cache) +in order to link and track it locally. But what if there's not enough disk space +for the download? -```dvc -$ mkdir example # workspace -$ cd example -$ git init -$ dvc init -$ mkdir /tmp/dvc-storage -$ dvc remote add myremote /tmp/dvc-storage -``` +One option is to setup an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +in a location that can handle the data. Another is to use the `--to-remote` +option so the target data is transferred to +[remote storage](/doc/command-reference/remote), while also tracked via an +import `.dvc` file in the project. -Now let's create an import `.dvc` file without downloading the target data, -transferring it directly to remote storage instead: +Let's setup a simple remote and create an import `.dvc` file without downloading +the target data, transferring it directly to the remote: ``` +$ mkdir /tmp/dvc-storage +$ dvc remote add myremote /tmp/dvc-storage $ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \ --to-remote -r myremote ... -``` - -The only change in our local workspace is a newly created import -`.dvc` file: - -```dvc $ ls data.xml.dvc ``` -Whenever anyone wants to actually download the imported data (for example from a -system that can handle it), they can use `dvc pull` as usual: +The only change in our local workspace is a the tiny `.dvc` file +that was created. 
Whenever anyone wants to actually download the imported data +(into a system that can handle it), they can use `dvc pull` as usual: ``` $ dvc pull data.xml.dvc -r tmp_remote diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index b16a14b367..ca41ac0e11 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -27,20 +27,25 @@ target file or directory (found at `path` in `url`), and tracks it in the local project. This makes it possible to update the import later, if the data source has changed (see `dvc update`). -> Note that `dvc get` corresponds to the first step this command performs (just -> download the data). - > See `dvc list` for a way to browse repository contents to find files or > directories to import. +> Note that `dvc get` corresponds to the first step this command performs (just +> download the data). + The imported data is cached, and linked (or copied) to the current working directory with its original file name e.g. `data.txt` (or to a location provided with `--out`). An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. -(ℹ️) DVC won't push or pull imported data to/from -[remote storage](/doc/command-reference/remote), it will rely on it's original -source. +💡 Using an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) +lets you transfer an import there (and link it in the workspace), +without using the local file system. + +> Note that imported data can be [pushed](/doc/command-reference/push) and +> [pulled](/doc/command-reference/pull) to/from +> [remote storage](/doc/command-reference/remote) normally. The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. 
@@ -70,9 +75,9 @@ enable DVC efficiently determining whether the local copy is out of date. To actually [version the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import `.dvc` file. -Note that `dvc repro` doesn't check or update import `.dvc` files (see -`dvc freeze`), use `dvc update` to bring the import up to date from the data -source. +Note that `dvc repro` doesn't check or update import `.dvc` files by default +(see `dvc freeze`), use `dvc update` to bring the import up to date from the +data source. Also note that chained imports (importing data that was imported into the source repo at `url`) are not supported. diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index b361db2cfe..dae92950d7 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -6,8 +6,8 @@ > [remote storage](/doc/command-reference/remote). > > In most cases the -> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or -> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) +> [to-cache](/doc/command-reference/add#example-transfer-to-an-external-cache) +> or [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) > strategies of `dvc add` and `dvc import-url` are more convenient. 
There are cases when data is so large, or its processing is organized in such a From fa82c89dbf00bc7f2e72c806e082ce9fa96e74a4 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 01:12:44 -0600 Subject: [PATCH 07/28] ref: import-url vs import in terms of remote sync per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-611837892 --- content/docs/command-reference/import-url.md | 6 ++++-- content/docs/command-reference/import.md | 4 ++++ 2 files changed, 8 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 5e873e4e34..0de4ec0080 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -54,9 +54,11 @@ or the `--to-remote` option lets you [transfer](#example-transfer-to-remote-storage) an import without using the local file system. -> Note that imported data can be [pushed](/doc/command-reference/push) and +> Note that data imported from external locaitons can be +> [pushed](/doc/command-reference/push) and > [pulled](/doc/command-reference/pull) to/from -> [remote storage](/doc/command-reference/remote) normally. +> [remote storage](/doc/command-reference/remote) normally (unlike for +> `dvc import`). `.dvc` files support references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such an diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index ca41ac0e11..4ee9d80688 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -43,6 +43,10 @@ e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. lets you transfer an import there (and link it in the workspace), without using the local file system. 
+(ℹ️) DVC won't push or pull data imported from other DVC repos to/from +[remote storage](/doc/command-reference/remote), it will rely on the original +source instead. + > Note that imported data can be [pushed](/doc/command-reference/push) and > [pulled](/doc/command-reference/pull) to/from > [remote storage](/doc/command-reference/remote) normally. From 5729f49bc9416a823dfd3cc547cda5f8659663f8 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 01:19:03 -0600 Subject: [PATCH 08/28] ref: roll back changes unrelated to get/import from this PR per https://github.com/iterative/dvc.org/pull/2302#issuecomment-799157414 --- content/docs/command-reference/config.md | 8 ++++---- content/docs/command-reference/plots/diff.md | 2 +- content/docs/command-reference/plots/show.md | 2 +- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 77fceaa72c..5893f42fe6 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -45,10 +45,10 @@ multiple projects and users, respectively: } -| Flag | Priority | Mac location | Linux location (typical\*) | Windows location | -| ---------- | -------- | ----------------------------------------------- | -------------------------- | --------------------------------------------------------- | -| `--global` | 3 | `$HOME/Library/Application\ Support/dvc/config` | `$HOME/.config/dvc/config` | `%LocalAppData%\iterative\dvc\config` | -| `--system` | 4 | `/Library/Application\ Support/dvc/config` | `/etc/xdg/dvc/config` | `%AllUsersProfile%\Application Data\iterative\dvc\config` | +| Flag | Priority | Mac location | Linux location (typical\*) | Windows location | +| ---------- | -------- | -------------------------------------- | -------------------------- | --------------------------------------------------------- | +| `--global` | 3 | `$HOME/Library/Preferences/dvc/config` | 
`$HOME/.config/dvc/config` | `%LocalAppData%\iterative\dvc\config` | +| `--system` | 4 | `/Library/Preferences/dvc/config` | `/etc/xdg/dvc/config` | `%AllUsersProfile%\Application Data\iterative\dvc\config` | > \* For Linux, the global `dvc/config` may be found in `$XDG_CONFIG_HOME`, and > the system-wide one in `$XDG_CONFIG_DIRS[0]`, if those env vars are defined. diff --git a/content/docs/command-reference/plots/diff.md b/content/docs/command-reference/plots/diff.md index cf7a120bf6..cdea58d2f8 100644 --- a/content/docs/command-reference/plots/diff.md +++ b/content/docs/command-reference/plots/diff.md @@ -91,7 +91,7 @@ all the current plots, without comparisons. [Vega specification](https://vega.github.io/vega/docs/specification/) file instead of HTML. See `dvc plots` for more info. -- `--open` - opens the generated plot in the browser automatically. +- `--open` - opens the generated plot directly in the browser. - `--no-header` - lets DVC know that CSV or TSV `--targets` do not have a header. A 0-based numeric index can be used to identify each column instead of diff --git a/content/docs/command-reference/plots/show.md b/content/docs/command-reference/plots/show.md index cc6cfc9c81..692dd49a7e 100644 --- a/content/docs/command-reference/plots/show.md +++ b/content/docs/command-reference/plots/show.md @@ -65,7 +65,7 @@ please see `dvc plots`. [Vega specification](https://vega.github.io/vega/docs/specification/) file instead of HTML. See `dvc plots` for more info. -- `--open` - opens the generated plot in the browser automatically. +- `--open` - opens the generated plot directly in the browser. - `--no-header` - lets DVC know that CSV or TSV `targets` do not have a header. A 0-based numeric index can be used to identify each column instead of names. 
From c39321237dd72112667fe8fb2166c928f0686277 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 01:27:48 -0600 Subject: [PATCH 09/28] ref: remove wrong info about import* to-cache per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-611840160 --- content/docs/command-reference/import-url.md | 9 ++------- content/docs/command-reference/import.md | 5 ----- 2 files changed, 2 insertions(+), 12 deletions(-) diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 0de4ec0080..6becb48ef8 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -48,9 +48,7 @@ similar to using `dvc add` after downloading the data. It saves the information about the data source, so the import can be updated later if the data source has changed (see `dvc update`). -💡 Using an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -or the `--to-remote` option lets you +💡 Using the `--to-remote` option lets you [transfer](#example-transfer-to-remote-storage) an import without using the local file system. @@ -368,10 +366,7 @@ Normally, `dvc import-url` downloads the target data (to the cache) in order to link and track it locally. But what if there's not enough disk space for the download? -One option is to setup an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -in a location that can handle the data. Another is to use the `--to-remote` -option so the target data is transferred to +You can use the `--to-remote` option so the target data is transferred to [remote storage](/doc/command-reference/remote), while also tracked via an import `.dvc` file in the project. 
diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 4ee9d80688..ee9cf4d345 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -38,11 +38,6 @@ working directory with its original file name e.g. `data.txt` (or to a location provided with `--out`). An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. -💡 Using an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -lets you transfer an import there (and link it in the workspace), -without using the local file system. - (ℹ️) DVC won't push or pull data imported from other DVC repos to/from [remote storage](/doc/command-reference/remote), it will rely on the original source instead. From b020b4eaabb5e0d121bfd256e2349b5067c56318 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 01:40:04 -0600 Subject: [PATCH 10/28] Update content/docs/command-reference/add.md Co-authored-by: Batuhan Taskaya --- content/docs/command-reference/add.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index d14606c443..21ee7d14a9 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -352,7 +352,7 @@ to a given path inside the workspace. Combined with an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) setup, this let's you avoid using your local file system completely.
-For example, we can add a `data.xml` file via HTTP, outputting it a local path +For example, we can add a `data.xml` file via HTTP, outputting it to a local path in our project: ```dvc From 86813adb5500a0d28c5365af6f415263ff8ceb0a Mon Sep 17 00:00:00 2001 From: "Restyled.io" Date: Mon, 15 Mar 2021 07:40:17 +0000 Subject: [PATCH 11/28] Restyled by prettier --- content/docs/command-reference/add.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 21ee7d14a9..3d5aca39cc 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -352,8 +352,8 @@ to a given path inside the workspace. Combined with an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) setup, this let's you avoid using your local file system completely. -For example, we can add a `data.xml` file via HTTP, outputting it to a local path -in our project: +For example, we can add a `data.xml` file via HTTP, outputting it to a local +path in our project: ```dvc $ dvc add https://data.dvc.org/get-started/data.xml -o data.xml From f2350ab68b037cee3b663094fc0903cef0241d36 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 15:59:29 -0600 Subject: [PATCH 12/28] ref: import + push/pull notes --- content/docs/command-reference/import.md | 10 +++------- content/docs/command-reference/pull.md | 3 +++ content/docs/command-reference/push.md | 2 ++ 3 files changed, 8 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index ee9cf4d345..1e37ccf18d 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -39,12 +39,8 @@ provided with `--out`). An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. 
(ℹī¸) DVC won't push or pull data imported from other DVC repos to/from -[remote storage](/doc/command-reference/remote), it will rely on the original -source instead. - -> Note that imported data can be [pushed](/doc/command-reference/push) and -> [pulled](/doc/command-reference/pull) to/from -> [remote storage](/doc/command-reference/remote) normally. +[remote storage](/doc/command-reference/remote). `dvc pull` will download from +the original source instead. The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. @@ -158,7 +154,7 @@ outs: cache: true ``` -Several of the values above are pulled from the original `.dvc` file +Several of the values above are obtained from the original `.dvc` file `model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock` subfields under `repo` are used to save the origin and version of the dependency, respectively. diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index 40c477e48f..c2bcf99ada 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -74,6 +74,9 @@ Note that the command `dvc status -c` can list files referenced in current stages (in `dvc.yaml`) or `.dvc` files, but missing from the cache. It can be used to see what files `dvc pull` would download. +> Note that in the case of `dvc import` data, `dvc pull` downloads from the +> original data source (an external DVC repo's remote storage, typically). 
+ ## Options - `-a`, `--all-branches` - determines the files to download by examining diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index 6bc86f93bd..d3320b234f 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -57,6 +57,8 @@ Note that the `dvc status -c` command can list files tracked by DVC that are new in the cache (compared to the default remote.) It can be used to see what files `dvc push` would upload. +> Note that `dvc push` doesn't upload `dvc import` data. + ## Options - `-a`, `--all-branches` - determines the files to upload by examining From 14b62cc1448b414ce5c33534b33a11dd92674274 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 16:06:07 -0600 Subject: [PATCH 13/28] ref: simplify add -o per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612504838 --- content/docs/command-reference/add.md | 7 ++----- content/docs/command-reference/get.md | 7 +++---- content/docs/command-reference/import.md | 8 ++++---- 3 files changed, 9 insertions(+), 13 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 3d5aca39cc..20723d3a17 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -154,11 +154,8 @@ not. > above). - `-o `, `--out ` - destination `path` inside the workspace to link - (or copy) a data target, which will now be tracked by DVC. Note that combining - this with an - [external cache transfer](#example-transfer-to-an-external-cache), or with the - `--to-remote` option, let's you avoid storing an external target locally, - while still adding it to the project. + (or copy) a data target (instead of using the current working directory). + Directories specified in the path will be created by this command. - `--to-remote` - import an external target, but don't move it into the workspace, nor cache it. 
[Transfer it](#example-transfer-to-remote-storage) it diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 17f67af16d..91f4d348c2 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -56,10 +56,9 @@ name. ## Options -- `-o `, `--out ` - specify a path to the desired location in the - workspace to place the downloaded file or directory (instead of using the - current working directory). Directories specified in the path will be created - by this command. +- `-o `, `--out ` - destination `path` to place the downloaded file + or directory (instead of using the current working directory). Directories + specified in the path will be created by this command. - `--rev ` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 1e37ccf18d..91e3270e97 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -79,10 +79,10 @@ repo at `url`) are not supported. ## Options -- `-o `, `--out ` - specify a path to the desired location in the - workspace to place the downloaded file or directory (instead of using the - current working directory). Directories specified in the path must already - exist, otherwise this command will fail. +- `-o `, `--out ` - destination `path` inside the workspace to place + the downloaded file or directory (instead of using the current working + directory). Directories specified in the path must already exist, otherwise + this command will fail. - `--file ` - specify a path and/or file name for the `.dvc` file created by this command (e.g. `--file stages/stage.dvc`). 
This overrides the From d63b07f3a73e7db214a8c1c9f05f3318f27938cf Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 16:38:23 -0600 Subject: [PATCH 14/28] ref: update add --to-remote desc per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612505065 --- content/docs/command-reference/add.md | 54 ++++++++++---------- content/docs/command-reference/import-url.md | 4 +- content/docs/command-reference/update.md | 7 +-- 3 files changed, 34 insertions(+), 31 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 20723d3a17..c748372eae 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -33,23 +33,22 @@ option to avoid this, and `dvc commit` to finish the process when needed). > See also `dvc.yaml` and `dvc run` for more advanced ways to track and version > intermediate and final results (like ML models). -After checking that each `target` hasn't been added before (or tracked with -other DVC commands), a few actions are taken under the hood: +After checking that each `target` isn't already tracked with DVC, a few actions +are taken under the hood: 1. Calculate the file hash. -2. Move the file contents to the cache (by default in `.dvc/cache`) (or to - remote storage if `--to-remote` is given), using the file hash to form the - cached file path. (See +2. Move the file contents to the cache (transfer them to remote storage with + `--to-remote`), using the file hash to form the cached file path (see [Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) - for more details.) -3. Attempt to replace the file with a link to the cached data (more details on - file linking further down). Skipped if `--to-remote` is used. -4. 
Create a corresponding `.dvc` file to track the file, using its path and hash - to identify the cached data (with `--to-remote`/`-o`, an external path is - moved to the workspace). The `.dvc` file lists the DVC-tracked file as an - output (`outs` field). Unless the `--file` option is used, the - `.dvc` file name generated by default is `.dvc`, where `` is the - file name of the first target. + for details). +3. Attempt to replace the file with a link to (or copy of) the cached data (more + details on file linking ahead). A new link is created if a different `--out` + `path` is given. Skipped if `--to-remote` is used +4. Create a `.dvc` file to track the file or directory, saving it's path, and + the hash as a pointer to the cached data. The `.dvc` file lists the data as + an output (`outs` field). Unless the `--file` option is used, + the `.dvc` file name generated by default is `.dvc`, where `` is + the file name of the first target. 5. Add the `targets` to `.gitignore` in order to prevent them from being committed to the Git repository (unless `dvc init --no-scm` was used when initializing the DVC project). @@ -145,27 +144,30 @@ not. [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` -- `--external` - allow `targets` that are outside of the DVC repository. See - [Managing External Data](/doc/user-guide/managing-external-data). - - > ⚠️ Note that this is an advanced feature for very specific situations and - > not recommended except if there's absolutely no other alternative. - > Additionally, this typically requires an external cache setup (see link - > above). - - `-o `, `--out ` - destination `path` inside the workspace to link (or copy) a data target (instead of using the current working directory). Directories specified in the path will be created by this command.
-- `--to-remote` - import an external target, but don't move it into the - workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage) it - directly to remote storage (the default one, unless `-r` is specified) - instead. Use `dvc pull` to get the data locally. +- `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 + object, SSH directory URL, file on mounted volume, etc.) but don't move it + into the workspace, nor cache it. + [Transfer it](#example-transfer-to-remote-storage) it directly to remote + storage instead (the default one unless `-r` is specified). Use `dvc pull` to + get the data locally. - `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) to transfer external target to (can only be used with `--to-remote`). +- `--external` - allow `targets` that are outside of the DVC repository, to + track in-place. See + [Managing External Data](/doc/user-guide/managing-external-data). + + > ⚠️ Note that this is an advanced feature for very specific situations and + > not recommended except if there's absolutely no other alternative. + > Additionally, this typically requires an external cache setup (see link + > above). + - `--desc ` - user description of the data (optional). This doesn't affect any DVC operations. diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 6becb48ef8..6612af8c57 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -143,8 +143,8 @@ $ dvc run -n download_data \ - `--to-remote` - import an external target, but don't move it into the workspace, nor cache it. [Transfer](#example-import-straight-to-the-remote) it - directly to remote storage (the default one, unless `-r` is specified) - instead. Use `dvc pull` to get the data locally. + directly to remote storage (the default one unless `-r` is specified) instead. + Use `dvc pull` to get the data locally.
- `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) (can only be used with diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index f1a6dada55..76b851972b 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -51,9 +51,10 @@ $ dvc update --rev master - `--to-remote` - update a `.dvc` file created with `dvc import-url` and [transfer](/doc/command-reference/import-url#example-import-straight-to-the-remote) - the new data directly to remote storage (the default one unless `-r` is used). - No changes are done in the workspace. Use `dvc pull` to get the - data locally. This option can't be used with DVC or Git repository imports. + the new data directly to remote storage (the default one unless `-r` is + specified). No changes are done in the workspace. Use `dvc pull` + to get the data locally. This option can't be used with DVC or Git repository + imports. - `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) (can only be used with From 94010d74a2861c24c4a1e27866c2863419d70996 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 16:49:18 -0600 Subject: [PATCH 15/28] ref: simplify add -o example intro per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612517358 --- content/docs/command-reference/add.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index c748372eae..d0c2924a30 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -340,9 +340,8 @@ Only the hash values of the `dir/` directory (with `.dir` file extension) and ## Example: Transfer to an external cache Sometimes you may want to add a large dataset currently found in an external -location, so it becomes local to the project. 
However, your local file system -may not have enough space to download it — which is needed to add data in DVC, -right? Not necessarily! +location, so it becomes local to the project. But what if there's not enough +disk space to download the data first? The `--out` option lets you add external data in a way that it's cached first, and then From 562b63c2bab94567f9da8045e68c8c1b46685101 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 16:59:58 -0600 Subject: [PATCH 16/28] ref: mention soft/hard links in add -o example per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612524924 --- content/docs/command-reference/add.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index d0c2924a30..43ce07cba2 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -348,29 +348,30 @@ The `--out` option lets you add external data in a way that it's [linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) to a given path inside the workspace. Combined with an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -setup, this let's you avoid using your local file system completely. +and the right +[kind of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), +this let's you avoid using the local file system completely. 
For example, we can add a `data.xml` file via HTTP, outputting it to a local path in our project: ```dvc -$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml +$ dvc add https://data.dvc.org/get-started/data.xml -o raw/data.xml $ ls data.xml data.xml.dvc ``` -The resulting `.dvc` file will save the provided local `path` as if the data was -always in the workspace, while the `md5` hash points to the copy of the data -that has now been transferred to the cache (which again, we assume -it's already setup in some storage drive that can handle it). Let's check the -contents of `data.xml.dvc`: +The local `data.xml` should be a symlink or hardlink to the externally +cached data that was transferred. The resulting `.dvc` file will +save the local `path` as if the data was already there before this command. +Let's check the contents of `data.xml.dvc`: ```yaml outs: - md5: a304afb96060aad90176268345e10355 nfiles: 1 - path: data.xml + path: raw/data.xml ``` > For a similar operation that actually keeps a connection to the data source, From f694719bd3079502fcd52a20828959ca8cdcb91e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 17:13:59 -0600 Subject: [PATCH 17/28] ref: external data cop edits --- content/docs/command-reference/add.md | 2 +- content/docs/command-reference/import-url.md | 6 +++--- content/docs/user-guide/managing-external-data.md | 2 +- 3 files changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 43ce07cba2..e04bdab251 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -379,7 +379,7 @@ outs: ## Example: Transfer to remote storage -Similarly to the previous scenario, you may sometimes want to track a large +Similarly to the previous scenario, you may sometimes want to add a large dataset found externally into a regular project (with a local cache). 
Can it be done without downloading the data locally (for now)? Yes! diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 6612af8c57..fa46402d1d 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -383,9 +383,9 @@ $ ls data.xml.dvc ``` -The only change in our local workspace is a the tiny `.dvc` file -that was created. Whenever anyone wants to actually download the imported data -(into a system that can handle it), they can use `dvc pull` as usual: +The only change in our local workspace is the tiny `.dvc` file that +was created. Whenever anyone wants to actually download the imported data (into +a system that can handle it), they can use `dvc pull` as usual: ``` $ dvc pull data.xml.dvc -r tmp_remote diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index dae92950d7..68a5a48871 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -8,7 +8,7 @@ > In most cases the > [to-cache](/doc/command-reference/add#example-transfer-to-an-external-cache) > or [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) -> strategies of `dvc add` and `dvc import-url` are more convenient. +> strategies of `dvc add` and `dvc import-url` are better. There are cases when data is so large, or its processing is organized in such a way, that its impossible to handle it in the local machine disk. 
For example From d5e793e6ce05086876cc2fd01b48cbf6df58749b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 17:43:55 -0600 Subject: [PATCH 18/28] ref: avoid term "transfer" for -o/-to-remote (1) --- content/docs/command-reference/add.md | 32 ++++++++++---------- content/docs/command-reference/import-url.md | 12 ++++---- content/docs/command-reference/update.md | 12 ++++---- 3 files changed, 28 insertions(+), 28 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index e04bdab251..20e5e13412 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -37,10 +37,11 @@ After checking that each `target` isn't already tracked with DVC, a few actions are taken under the hood: 1. Calculate the file hash. -2. Move the file contents to the cache (transfer them to remote storage with - `--to-remote`), using the file hash to form the cached file path (see +2. Move the file contents to the cache, using the file hash to form the cached + file path (see [Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) - for details). + for details). Using the `--out` and `--to-remote` options with an external + target, the data is copied instead (to cache or remote storage). 3. Attempt to replace the file with a link to (or copy of) the cached data (more details on file linking ahead). A new link is created if a different `--out` `path` is given. Skipped if `--to-remote` is used @@ -151,13 +152,13 @@ not. - `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 object, SSH directory URL, file on mounted volume, etc.) but don't move it into the workspace, nor cache it. - [Transfer it](#example-transfer-to-remote-storage) it directly to remote - storage instead (the default one unless `-r` is specified). Use `dvc pull` to - get the data locally. 
+ [Store a copy](#example-transfer-to-remote-storage) on a remote directly + instead (the default one unless `-r` is specified). Use `dvc pull` to get the + data locally later. - `-r `, `--remote ` - name of the - [remote storage](/doc/command-reference/remote) to transfer external target to - (can only be used with `--to-remote`). + [remote](/doc/command-reference/remote) to store an external target on (can + only be used with `--to-remote`). - `--external` - allow `targets` that are outside of the DVC repository, to track in-place. See @@ -362,10 +363,10 @@ $ ls data.xml data.xml.dvc ``` -The local `data.xml` should be a symlink or hardlink to the externally -cached data that was transferred. The resulting `.dvc` file will -save the local `path` as if the data was already there before this command. -Let's check the contents of `data.xml.dvc`: +The local `data.xml` should be a symlink or hard link to the externally +cached data copy. The resulting `.dvc` file will save the local +`path` as if the data was already there before this command. Let's check the +contents of `data.xml.dvc`: ```yaml outs: @@ -384,14 +385,13 @@ dataset found externally into a regular project (with a local cache). Can it be done without downloading the data locally (for now)? Yes! -The `--to-remote` option lets you transfer a copy of the target data to -[remote storage](/doc/command-reference/remote), while creating a `.dvc` file +The `--to-remote` option lets you store a copy of the target data on a +[DVC remote](/doc/command-reference/remote), while creating a `.dvc` file locally so it can be [pulled](/doc/command-reference/plots) later. This is a way to "bootstrap" your project in your local machine, to be reproduced on the right environment later (e.g. a GPU cloud server or a CI/CD system). 
-Let's setup a simple remote and transfer a `data.xml` file from the web into it -via DVC: +Let's setup a simple remote and add a `data.xml` file from the web this way: ```dvc $ mkdir /tmp/dvc-storage diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index fa46402d1d..c288e42813 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -48,9 +48,9 @@ similar to using `dvc add` after downloading the data. It saves the information about the data source, so the import can be updated later if the data source has changed (see `dvc update`). -💡 Using the `--to-remote` option lets you -[transfer](#example-transfer-to-remote-storage) an import without using the -local file system. +💡 The `--to-remote` option lets you store an import +[on remote storage](#example-transfer-to-remote-storage) without using the local +file system. > Note that data imported from external locaitons can be > [pushed](/doc/command-reference/push) and @@ -366,12 +366,12 @@ Normally, `dvc import-url` downloads the target data (to the cache) in order to link and track it locally. But what if there's not enough disk space for the download? -You can use the `--to-remote` option so the target data is transferred to -[remote storage](/doc/command-reference/remote), while also tracked via an +You can use the `--to-remote` option to store a copy of the target on a +[DVC remote](/doc/command-reference/remote) directly, while also tracked via an import `.dvc` file in the project. 
Let's setup a simple remote and create an import `.dvc` file without downloading -the target data, transferring it directly to the remote: +the target data: ``` $ mkdir /tmp/dvc-storage diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 76b851972b..2e082f1e62 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -49,12 +49,12 @@ $ dvc update --rev master directory and its subdirectories for import stage `.dvc` files to inspect. If there are no directories among the targets, this option is ignored. -- `--to-remote` - update a `.dvc` file created with `dvc import-url` and - [transfer](/doc/command-reference/import-url#example-import-straight-to-the-remote) - the new data directly to remote storage (the default one unless `-r` is - specified). No changes are done in the workspace. Use `dvc pull` - to get the data locally. This option can't be used with DVC or Git repository - imports. +- `--to-remote` - update a `.dvc` file created with `dvc import-url` and store + the latest data directly + [on remote storage](/doc/command-reference/import-url#example-import-straight-to-the-remote) + (the default one unless `-r` is specified). No changes are done in the + workspace. Use `dvc pull` to get the data locally. This option + can't be used with DVC or Git repository imports. 
- `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) (can only be used with From 0166b1f5adebe4edeba83aeb6c1fa3dc3be225f6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 18:07:06 -0600 Subject: [PATCH 19/28] ref: relink to add/import -o/-to-remote examples including including reinstating the link from add -o --- content/docs/command-reference/add.md | 17 +++++++++-------- content/docs/command-reference/import-url.md | 11 +++++------ content/docs/command-reference/update.md | 4 ++-- .../docs/user-guide/managing-external-data.md | 7 +++---- 4 files changed, 19 insertions(+), 20 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 20e5e13412..f45edaf275 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -147,14 +147,15 @@ not. - `-o `, `--out ` - destination `path` inside the workspace to link (or copy) a data target (instead of using the current working directory). - Directories specified in the path will be created by this command. + Directories specified in the path will be created by this command. Note that + this can be used [with an external cache](#straight-to-cache) to avoid using + the disk. - `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 object, SSH directory URL, file on mounted volume, etc.) but don't move it - into the workspace, nor cache it. - [Store a copy](#example-transfer-to-remote-storage) on a remote directly - instead (the default one unless `-r` is specified). Use `dvc pull` to get the - data locally later. + into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a + remote instead (the default one unless `-r` is specified). Use `dvc pull` to + get the data locally later. 
- `-r `, `--remote ` - name of the [remote](/doc/command-reference/remote) to store an external target on (can @@ -338,11 +339,11 @@ $ tree .dvc/cache Only the hash values of the `dir/` directory (with `.dir` file extension) and `file2` have been cached. -## Example: Transfer to an external cache +## Example: Adding large data without using the disk {#straight-to-cache} Sometimes you may want to add a large dataset currently found in an external location, so it becomes local to the project. But what if there's not enough -disk space to download the data first? +disk space to download the data? The `--out` option lets you add external data in a way that it's cached first, and then @@ -378,7 +379,7 @@ outs: > For a similar operation that actually keeps a connection to the data source, > please see `dvc import-url`. -## Example: Transfer to remote storage +## Example: Transfer to remote storage {#straight-to-remote} Similarly to the previous scenario, you may sometimes want to add a large dataset found externally into a regular project (with a local diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index c288e42813..f1f05f4b2b 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -49,8 +49,7 @@ about the data source, so the import can be updated later if the data source has changed (see `dvc update`). 💡 The `--to-remote` option lets you store an import -[on remote storage](#example-transfer-to-remote-storage) without using the local -file system. +[on remote storage](#straight-to-remote) without using the local file system. > Note that data imported from external locaitons can be > [pushed](/doc/command-reference/push) and @@ -142,9 +141,9 @@ $ dvc run -n download_data \ want to "DVCfy" this state of the project (see also `dvc commit`). - `--to-remote` - import an external target, but don't move it into the - workspace, nor cache it. 
[Transfer](#example-import-straight-to-the-remote) it - directly to remote storage (the default one unless `-r` is specified) instead. - Use `dvc pull` to get the data locally. + workspace, nor cache it. [Store a copy](#straight-to-remote) on a remote + instead (the default one unless `-r` is specified). Use `dvc pull` to get the + data locally. - `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) (can only be used with @@ -360,7 +359,7 @@ Running stage 'prepare' with command: python src/prepare.py data/data.xml ``` -## Example: Transfer to remote storage +## Example: Transfer to remote storage {#straight-to-remote} Normally, `dvc import-url` downloads the target data (to the cache) in order to link and track it locally. But what if there's not enough disk space diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 2e082f1e62..1d0277ec18 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -51,8 +51,8 @@ $ dvc update --rev master - `--to-remote` - update a `.dvc` file created with `dvc import-url` and store the latest data directly - [on remote storage](/doc/command-reference/import-url#example-import-straight-to-the-remote) - (the default one unless `-r` is specified). No changes are done in the + [on remote storage](/doc/command-reference/import-url#straight-to-remote) (the + default one unless `-r` is specified). No changes are done in the workspace. Use `dvc pull` to get the data locally. This option can't be used with DVC or Git repository imports. diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 68a5a48871..2b9de940b4 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -5,10 +5,9 @@ > external outputs are not pushed or pulled from/to > [remote storage](/doc/command-reference/remote). 
> -> In most cases the -> [to-cache](/doc/command-reference/add#example-transfer-to-an-external-cache) -> or [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) -> strategies of `dvc add` and `dvc import-url` are better. +> In most cases the [to-cache](/doc/command-reference/add#straight-to-cache) or +> [to-remote](/doc/command-reference/add#straight-to-remote) strategies of +> `dvc add` and `dvc import-url` are better. There are cases when data is so large, or its processing is organized in such a way, that its impossible to handle it in the local machine disk. For example From 696fa531fa6c64a3d34f73f96d5a03614fa231b3 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 18:14:04 -0600 Subject: [PATCH 20/28] ref: updated add/import to-cache/remote example titles per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612526585 --- content/docs/command-reference/add.md | 4 ++-- content/docs/command-reference/import-url.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index f45edaf275..6a8abc53c0 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -339,7 +339,7 @@ $ tree .dvc/cache Only the hash values of the `dir/` directory (with `.dir` file extension) and `file2` have been cached. -## Example: Adding large data without using the disk {#straight-to-cache} +## Example: Caching large data externally {#straight-to-cache} Sometimes you may want to add a large dataset currently found in an external location, so it becomes local to the project. But what if there's not enough @@ -379,7 +379,7 @@ outs: > For a similar operation that actually keeps a connection to the data source, > please see `dvc import-url`. 
-## Example: Transfer to remote storage {#straight-to-remote} +## Example: Storing large data remotely {#straight-to-remote} Similarly to the previous scenario, you may sometimes want to add a large dataset found externally into a regular project (with a local diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index f1f05f4b2b..a1ee7f5b53 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -359,7 +359,7 @@ Running stage 'prepare' with command: python src/prepare.py data/data.xml ``` -## Example: Transfer to remote storage {#straight-to-remote} +## Example: Storing large data remotely {#straight-to-remote} Normally, `dvc import-url` downloads the target data (to the cache) in order to link and track it locally. But what if there's not enough disk space From 15097f1d4ad20681b022183e926d78eae9e1436b Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 22:53:43 -0600 Subject: [PATCH 21/28] ref: a couple more copy edits to add -o/-to-remote per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612823191 and https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-612823873 --- content/docs/command-reference/add.md | 14 +++++++------- content/docs/command-reference/import-url.md | 4 ++-- 2 files changed, 9 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 6a8abc53c0..ceff6378fe 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -145,11 +145,11 @@ not. [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` -- `-o `, `--out ` - destination `path` inside the workspace to link - (or copy) a data target (instead of using the current working directory). 
- Directories specified in the path will be created by this command. Note that - this can be used [with an external cache](#straight-to-cache) to avoid using - the disk. +- `-o `, `--out ` - destination `path` inside the workspace to place + a data target (instead of using the current working directory). Directories + specified in the path will be created by this command. Note that this can be + used [with an external cache](#straight-to-cache) to avoid using the local + file system. - `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 object, SSH directory URL, file on mounted volume, etc.) but don't move it @@ -158,8 +158,8 @@ not. get the data locally later. - `-r `, `--remote ` - name of the - [remote](/doc/command-reference/remote) to store an external target on (can - only be used with `--to-remote`). + [remote](/doc/command-reference/remote) to store data on (can only be used + with `--to-remote`). - `--external` - allow `targets` that are outside of the DVC repository, to track in-place. See diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index a1ee7f5b53..5e916d27b0 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -146,8 +146,8 @@ $ dvc run -n download_data \ data locally. - `-r `, `--remote ` - name of the - [remote storage](/doc/command-reference/remote) (can only be used with - `--to-remote`). + [remote](/doc/command-reference/remote) to store data on (can only be used + with `--to-remote`). - `-j `, `--jobs ` - parallelism level for DVC to download data from the source. The default value is `4 * cpu_count()`. 
For SSH remotes, the From 31394de7b9c2b2a7504c4e79b8156fe08bd26a3d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 23:10:53 -0600 Subject: [PATCH 22/28] ref: update --to-remote copy edits --- content/docs/command-reference/update.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 1d0277ec18..9821b097a5 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -52,9 +52,9 @@ $ dvc update --rev master - `--to-remote` - update a `.dvc` file created with `dvc import-url` and store the latest data directly [on remote storage](/doc/command-reference/import-url#straight-to-remote) (the - default one unless `-r` is specified). No changes are done in the + default one unless `-r` is specified). Tracked data is not changed in the workspace. Use `dvc pull` to get the data locally. This option - can't be used with DVC or Git repository imports. + can't be used with data imported from DVC or Git repos (with `dvc import`). - `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) (can only be used with From 0f2a2b11f2d9a626ba5870a6912c59baeee924da Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 15 Mar 2021 23:13:58 -0600 Subject: [PATCH 23/28] ref: roll back changes not related to #2302 --- content/docs/command-reference/get.md | 6 +++--- content/docs/command-reference/import-url.md | 7 ++----- content/docs/command-reference/import.md | 12 ++++++------ content/docs/command-reference/pull.md | 3 --- content/docs/command-reference/push.md | 2 -- 5 files changed, 11 insertions(+), 19 deletions(-) diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 91f4d348c2..97eb45c65c 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -24,13 +24,13 @@ repository (e.g. 
source code, small image/other files). `dvc get` copies the target file or directory (found at `path` in `url`) to the current working directory. (Analogous to `wget`, but for repos.) -> See `dvc list` for a way to browse repository contents to find files or -> directories to download. - > Note that unlike `dvc import`, this command does not track the downloaded > files (does not create a `.dvc` file). For that reason, it doesn't require an > existing DVC project to run in. +> See `dvc list` for a way to browse repository contents to find files or +> directories to download. + The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. `[user@]server:project.git`). `url` can also be a local file system path diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 5e916d27b0..735c34f89b 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -51,11 +51,8 @@ changed (see `dvc update`). 💡 The `--to-remote` option lets you store an import [on remote storage](#straight-to-remote) without using the local file system. -> Note that data imported from external locaitons can be -> [pushed](/doc/command-reference/push) and -> [pulled](/doc/command-reference/pull) to/from -> [remote storage](/doc/command-reference/remote) normally (unlike for -> `dvc import`). +> Note that the imported data can be [pushed](/doc/command-reference/push) to +> remote storage normally. `.dvc` files support references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies).
In such an diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 91e3270e97..d5c1b243ce 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -27,20 +27,20 @@ target file or directory (found at `path` in `url`), and tracks it in the local project. This makes it possible to update the import later, if the data source has changed (see `dvc update`). -> See `dvc list` for a way to browse repository contents to find files or -> directories to import. - > Note that `dvc get` corresponds to the first step this command performs (just > download the data). +> See `dvc list` for a way to browse repository contents to find files or +> directories to import. + The imported data is cached, and linked (or copied) to the current working directory with its original file name e.g. `data.txt` (or to a location provided with `--out`). An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. -(ℹ️) DVC won't push or pull data imported from other DVC repos to/from -[remote storage](/doc/command-reference/remote). `dvc pull` will download from -the original source instead. +(ℹ️) DVC won't push or pull imported data to/from +[remote storage](/doc/command-reference/remote), it will rely on it's original +source. The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. diff --git a/content/docs/command-reference/pull.md b/content/docs/command-reference/pull.md index c2bcf99ada..40c477e48f 100644 --- a/content/docs/command-reference/pull.md +++ b/content/docs/command-reference/pull.md @@ -74,9 +74,6 @@ Note that the command `dvc status -c` can list files referenced in current stages (in `dvc.yaml`) or `.dvc` files, but missing from the cache. It can be used to see what files `dvc pull` would download.
-> Note that in the case of `dvc import` data, `dvc pull` downloads from the -> original data source (an external DVC repo's remote storage, typically). - ## Options - `-a`, `--all-branches` - determines the files to download by examining diff --git a/content/docs/command-reference/push.md b/content/docs/command-reference/push.md index d3320b234f..6bc86f93bd 100644 --- a/content/docs/command-reference/push.md +++ b/content/docs/command-reference/push.md @@ -57,8 +57,6 @@ Note that the `dvc status -c` command can list files tracked by DVC that are new in the cache (compared to the default remote.) It can be used to see what files `dvc push` would upload. -> Note that `dvc push` doesn't upload `dvc import` data. - ## Options - `-a`, `--all-branches` - determines the files to upload by examining From 1082c85cbb0d308b92befc32869f20d6d4ea39f6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 17 Mar 2021 09:39:53 -0600 Subject: [PATCH 24/28] ref: clarfy --out option per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-613870393 --- content/docs/command-reference/add.md | 8 ++++---- content/docs/command-reference/get.md | 5 +++-- content/docs/command-reference/import.md | 6 +++--- 3 files changed, 10 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index ceff6378fe..6f68b2c6c4 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -146,10 +146,10 @@ not. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` - `-o `, `--out ` - destination `path` inside the workspace to place - a data target (instead of using the current working directory). Directories - specified in the path will be created by this command. Note that this can be - used [with an external cache](#straight-to-cache) to avoid using the local - file system. + the data target. 
By default the data file basename is used in the current + working directory (if this option isn't used). Directories in the given `path` + will be created. Note that for external targets, this can be combined + [with an external cache](#straight-to-cache) to skip the local file system. - `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 object, SSH directory URL, file on mounted volume, etc.) but don't move it diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 97eb45c65c..0d749a3d5c 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -57,8 +57,9 @@ name. ## Options - `-o `, `--out ` - destination `path` to place the downloaded file - or directory (instead of using the current working directory). Directories - specified in the path will be created by this command. + or directory. By default the data file basename is used in the current working + directory (if this option isn't used). Directories in the given `path` will be + created. - `--rev ` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index d5c1b243ce..88727248c1 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -80,9 +80,9 @@ repo at `url`) are not supported. ## Options - `-o `, `--out ` - destination `path` inside the workspace to place - the downloaded file or directory (instead of using the current working - directory). Directories specified in the path must already exist, otherwise - this command will fail. + the downloaded file or directory. By default the file basename name is used in + the current working directory (if this option isn't used). Directories in the + given `path` will be created. 
- `--file ` - specify a path and/or file name for the `.dvc` file created by this command (e.g. `--file stages/stage.dvc`). This overrides the From b943df5c44f227de33d63400791058e45221f67d Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 17 Mar 2021 10:23:33 -0600 Subject: [PATCH 25/28] ref: rename add -o/-to-remote examples --- content/docs/command-reference/add.md | 8 ++++---- content/docs/command-reference/import-url.md | 6 +++--- 2 files changed, 7 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 6f68b2c6c4..a1806ed15e 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -40,8 +40,8 @@ are taken under the hood: 2. Move the file contents to the cache, using the file hash to form the cached file path (see [Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) - for details). Using the `--out` and `--to-remote` options with an external - target, the data is copied instead (to cache or remote storage). + for details). Using `--out`, or `--to-remote` with an external target, the + data is copied instead (to cache or remote storage). 3. Attempt to replace the file with a link to (or copy of) the cached data (more details on file linking ahead). A new link is created if a different `--out` `path` is given. Skipped if `--to-remote` is used @@ -339,7 +339,7 @@ $ tree .dvc/cache Only the hash values of the `dir/` directory (with `.dir` file extension) and `file2` have been cached. -## Example: Caching large data externally {#straight-to-cache} +## Example: Large external data {#straight-to-cache} Sometimes you may want to add a large dataset currently found in an external location, so it becomes local to the project. But what if there's not enough @@ -379,7 +379,7 @@ outs: > For a similar operation that actually keeps a connection to the data source, > please see `dvc import-url`. 
-## Example: Storing large data remotely {#straight-to-remote} +## Example: External data onto remote storage {#straight-to-remote} Similarly to the previous scenario, you may sometimes want to add a large dataset found externally into a regular project (with a local diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 735c34f89b..1a033aadde 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -48,8 +48,8 @@ similar to using `dvc add` after downloading the data. It saves the information about the data source, so the import can be updated later if the data source has changed (see `dvc update`). -💡 The `--to-remote` option lets you store an import -[on remote storage](#straight-to-remote) without using the local file system. +💡 The `--to-remote` option lets you store an import on a +[DVC remote](/doc/command-reference/remote) without using the local file system. > Note that the imported data can be [pushed](/doc/command-reference/push) to > remote storage normally. @@ -356,7 +356,7 @@ Running stage 'prepare' with command: python src/prepare.py data/data.xml ``` -## Example: Storing large data remotely {#straight-to-remote} +## Example: Importing onto remote storage {#straight-to-remote} Normally, `dvc import-url` downloads the target data (to the cache) in order to link and track it locally.
But what if there's not enough disk space From b25da5c23a6b3f0c99be5810134c6c1ce5a8dd0a Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 17 Mar 2021 10:48:53 -0600 Subject: [PATCH 26/28] ref: other copy edits to add -o/-to-remote --- content/docs/command-reference/add.md | 12 +++++------- 1 file changed, 5 insertions(+), 7 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index a1806ed15e..f1686c2e92 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -153,9 +153,9 @@ not. - `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 object, SSH directory URL, file on mounted volume, etc.) but don't move it - into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a - remote instead (the default one unless `-r` is specified). Use `dvc pull` to - get the data locally later. + into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a DVC + remote instead (the default one unless `-r` is specified) to skip the local + file system. Use `dvc pull` to get the data later. - `-r `, `--remote ` - name of the [remote](/doc/command-reference/remote) to store data on (can only be used @@ -342,8 +342,7 @@ Only the hash values of the `dir/` directory (with `.dir` file extension) and ## Example: Large external data {#straight-to-cache} Sometimes you may want to add a large dataset currently found in an external -location, so it becomes local to the project. But what if there's not enough -disk space to download the data? +location. But what if there's not enough disk space to download the data? The `--out` option lets you add external data in a way that it's cached first, and then @@ -383,8 +382,7 @@ outs: Similarly to the previous scenario, you may sometimes want to add a large dataset found externally into a regular project (with a local -cache). Can it be done without downloading the data locally (for -now)? Yes! +cache). 
Can it be done without downloading the data locally? Yes! The `--to-remote` option lets you store a copy of the target data on a [DVC remote](/doc/command-reference/remote), while creating a `.dvc` file From aa171d422be363aef578658d0cae55e30601623f Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 17 Mar 2021 14:59:05 -0600 Subject: [PATCH 27/28] ref: no hard links for add -o + ext cache per https://github.com/iterative/dvc.org/pull/2302#pullrequestreview-613876633 --- content/docs/command-reference/add.md | 19 +++++++++---------- 1 file changed, 9 insertions(+), 10 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index f1686c2e92..3950a3f1db 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -345,12 +345,11 @@ Sometimes you may want to add a large dataset currently found in an external location. But what if there's not enough disk space to download the data? The `--out` option lets you add external data in a way that it's -cached first, and then -[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -to a given path inside the workspace. Combined with an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -and the right -[kind of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache), +cached first, and then linked to a given path inside the +workspace. Combined with +[symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache), this let's you avoid using the local file system completely. For example, we can add a `data.xml` file via HTTP, outputting it to a local @@ -363,10 +362,10 @@ $ ls data.xml data.xml.dvc ``` -The local `data.xml` should be a symlink or hard link to the externally -cached data copy. 
The resulting `.dvc` file will save the local -`path` as if the data was already there before this command. Let's check the -contents of `data.xml.dvc`: +The local `data.xml` should be a symlink to the externally cached +data copy. The resulting `.dvc` file will save the local `path` as if the data +was already there before this command. Let's check the contents of +`data.xml.dvc`: ```yaml outs: From 61f8806cc488dcfbf739e142c6519189f09516aa Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 17 Mar 2021 15:22:11 -0600 Subject: [PATCH 28/28] ref: more edits to add/import-url to-cache/remote --- content/docs/command-reference/add.md | 40 ++++++++++--------- content/docs/command-reference/import-url.md | 24 +++++------ .../docs/user-guide/managing-external-data.md | 4 +- 3 files changed, 36 insertions(+), 32 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 3950a3f1db..27f0512cfd 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -149,7 +149,8 @@ not. the data target. By default the data file basename is used in the current working directory (if this option isn't used). Directories in the given `path` will be created. Note that for external targets, this can be combined - [with an external cache](#straight-to-cache) to skip the local file system. + [with an external cache](#example-external-data) to skip the local file + system. - `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 object, SSH directory URL, file on mounted volume, etc.) but don't move it @@ -339,14 +340,15 @@ $ tree .dvc/cache Only the hash values of the `dir/` directory (with `.dir` file extension) and `file2` have been cached. -## Example: Large external data {#straight-to-cache} +## Example: External data Sometimes you may want to add a large dataset currently found in an external -location. But what if there's not enough disk space to download the data? +location. 
But what if there's not enough disk space to download the data? Here's +one method! -The `--out` option lets you add external data in a way that it's -cached first, and then linked to a given path inside the -workspace. Combined with +The `--out` option lets you add external so that it's linked to a given path +inside the workspace after being copied to the cache. +Combined with [symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache), @@ -362,7 +364,7 @@ $ ls data.xml data.xml.dvc ``` -The local `data.xml` should be a symlink to the externally cached +The local `data.xml` should be a symlink to the (externally) cached data copy. The resulting `.dvc` file will save the local `path` as if the data was already there before this command. Let's check the contents of `data.xml.dvc`: @@ -377,17 +379,17 @@ outs: > For a similar operation that actually keeps a connection to the data source, > please see `dvc import-url`. -## Example: External data onto remote storage {#straight-to-remote} +## Example: `--to-remote` usage {#straight-to-remote} -Similarly to the previous scenario, you may sometimes want to add a large -dataset found externally into a regular project (with a local -cache). Can it be done without downloading the data locally? Yes! +Here's another method to add a large dataset found in an external location +without downloading the data (refer to previous example). The `--to-remote` option lets you store a copy of the target data on a [DVC remote](/doc/command-reference/remote), while creating a `.dvc` file locally so it can be [pulled](/doc/command-reference/plots) later. This is a way -to "bootstrap" your project in your local machine, to be reproduced on the right -environment later (e.g. a GPU cloud server or a CI/CD system). 
+to "bootstrap" a project in your local machine, to be +[reproduced](/doc/command-reference/repro) on the right environment later (e.g. +a GPU cloud server or a CI/CD system). Let's setup a simple remote and add a `data.xml` file from the web this way: @@ -405,13 +407,15 @@ data.xml.dvc > `path` (written to the `.dvc` file). DVC won't control the original data source after this, but rather continue -managing your remote storage, where the data is now found. Whenever anyone wants -to actually download the added data (from a system that can handle it), they can -use `dvc pull` as usual: +managing your remote storage, where the data is now found. To actually download +the data to cache, you can use `dvc fetch` or `dvc pull` as usual +(on a system that can handle it): ```dvc - $ dvc pull data.xml.dvc -r tmp_remote - +$ dvc pull data.xml.dvc -r tmp_remote A data.xml 1 file added and 1 file fetched ``` + +> Note that `dvc repro` will try to download the data too, as part of the +> pipeline execution. diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 1a033aadde..744bea83ee 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -356,18 +356,19 @@ Running stage 'prepare' with command: python src/prepare.py data/data.xml ``` -## Example: Importing onto remote storage {#straight-to-remote} +## Example: `--to-remote` usage {#straight-to-remote} Normally, `dvc import-url` downloads the target data (to the cache) -in order to link and track it locally. But what if there's not enough disk space -for the download? +in order to link and track it locally. But what if there's not enough disk +space? -You can use the `--to-remote` option to store a copy of the target on a -[DVC remote](/doc/command-reference/remote) directly, while also tracked via an -import `.dvc` file in the project. 
+The `--to-remote` option lets you store a copy of the target data on a +[DVC remote](/doc/command-reference/remote), while creating an import `.dvc` +file locally so it can be [pulled](/doc/command-reference/plots) later. This is +a way to "bootstrap" an import in your local machine, to be downloaded on the +right environment later. -Let's setup a simple remote and create an import `.dvc` file without downloading -the target data: +Let's setup a simple remote and add a `data.xml` file from the web this way: ``` $ mkdir /tmp/dvc-storage @@ -380,12 +381,11 @@ data.xml.dvc ``` The only change in our local workspace is the tiny `.dvc` file that -was created. Whenever anyone wants to actually download the imported data (into -a system that can handle it), they can use `dvc pull` as usual: +was created. To actually download the data to cache, you can use +`dvc fetch` or `dvc pull` as usual (on a system that can handle it): ``` - $ dvc pull data.xml.dvc -r tmp_remote - +$ dvc pull data.xml.dvc -r tmp_remote A data.xml 1 file added and 1 file fetched ``` diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index 2b9de940b4..7426c975a1 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -5,8 +5,8 @@ > external outputs are not pushed or pulled from/to > [remote storage](/doc/command-reference/remote). > -> In most cases the [to-cache](/doc/command-reference/add#straight-to-cache) or -> [to-remote](/doc/command-reference/add#straight-to-remote) strategies of +> In most cases the [to-cache](/doc/command-reference/add#example-external-data) +> or [to-remote](/doc/command-reference/add#straight-to-remote) strategies of > `dvc add` and `dvc import-url` are better. There are cases when data is so large, or its processing is organized in such a