diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index d59ded757c..27f0512cfd 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -33,23 +33,23 @@ option to avoid this, and `dvc commit` to finish the process when needed). > See also `dvc.yaml` and `dvc run` for more advanced ways to track and version > intermediate and final results (like ML models). -After checking that each `target` hasn't been added before (or tracked with -other DVC commands), a few actions are taken under the hood: +After checking that each `target` isn't already tracked with DVC, a few actions +are taken under the hood: 1. Calculate the file hash. -2. Move the file contents to the cache (by default in `.dvc/cache`) (or to - remote storage if `--to-remote` is given), using the file hash to form the - cached file path. (See +2. Move the file contents to the cache, using the file hash to form the cached + file path (see [Structure of cache directory](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) - for more details.) -3. Attempt to replace the file with a link to the cached data (more details on - file linking further down). Skipped if `--to-remote` is used. -4. Create a corresponding `.dvc` file to track the file, using its path and hash - to identify the cached data (with `--to-remote`/`-o`, an external path is - moved to the workspace). The `.dvc` file lists the DVC-tracked file as an - output (`outs` field). Unless the `--file` option is used, the - `.dvc` file name generated by default is `.dvc`, where `` is the - file name of the first target. + for details). Using `--out`, or `--to-remote` with an external target, the + data is copied instead (to cache or remote storage). +3. Attempt to replace the file with a link to (or copy of) the cached data (more + details on file linking ahead). A new link is created if a different `--out` + `path` is given. 
Skipped if `--to-remote` is used. +4. Create a `.dvc` file to track the file or directory, saving its path, and + the hash as a pointer to the cached data. The `.dvc` file lists the data as + an output (`outs` field). Unless the `--file` option is used, + the `.dvc` file name generated by default is `.dvc`, where `` is + the file name of the first target. 5. Add the `targets` to `.gitignore` in order to prevent them from being committed to the Git repository (unless `dvc init --no-scm` was used when initializing the DVC project). @@ -145,7 +145,25 @@ not. [pattern](https://docs.python.org/3/library/glob.html) specified in `targets`. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` -- `--external` - allow `targets` that are outside of the DVC repository. See +- `-o `, `--out ` - destination `path` inside the workspace to place + the data target. By default the data file basename is used in the current + working directory (if this option isn't used). Directories in the given `path` + will be created. Note that for external targets, this can be combined + [with an external cache](#example-external-data) to skip the local file + system. + +- `--to-remote` - allow a target outside of the DVC repository (e.g. an S3 + object, SSH directory URL, file on mounted volume, etc.) but don't move it + into the workspace, nor cache it. [Store a copy](#straight-to-remote) on a DVC + remote instead (the default one unless `-r` is specified) to skip the local + file system. Use `dvc pull` to get the data later. + +- `-r `, `--remote ` - name of the + [remote](/doc/command-reference/remote) to store data on (can only be used + with `--to-remote`). + +- `--external` - allow `targets` that are outside of the DVC repository, to + track in-place. See [Managing External Data](/doc/user-guide/managing-external-data). > ⚠️ Note that this is an advanced feature for very specific situations and @@ -153,20 +171,6 @@ not. 
> Additionally, this typically requires an external cache setup (see link > above). -- `-o `, `--out ` - destination `path` to make a local target copy, - or to [transfer](#example-transfer-to-cache) an external target into the cache - (and link to workspace). Note that this can be combined with `--to-remote` to - avoid storing the data locally, while still adding it to the project. - -- `--to-remote` - import an external target, but don't move it into the - workspace, nor cache it. [Transfer it](#example-transfer-to-remote-storage) it - directly to remote storage (the default one, unless `-r` is specified) - instead. Use `dvc pull` to get the data locally. - -- `-r `, `--remote ` - name of the - [remote storage](/doc/command-reference/remote) to transfer external target to - (can only be used with `--to-remote`). - - `--desc ` - user description of the data (optional). This doesn't affect any DVC operations. @@ -336,95 +340,82 @@ $ tree .dvc/cache Only the hash values of the `dir/` directory (with `.dir` file extension) and `file2` have been cached. -## Example: Transfer to the cache - -When you have a large dataset in an external location, you may want to add it to -the project without having to copy it into the workspace. Maybe -your local disk doesn't have enough space, but you have setup an -[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) -that could handle it. +## Example: External data -The `--out` option lets you add external paths in a way that they are -cached first, and then -[linked](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -to a given path inside the workspace. Let's initialize an example -DVC project to try this: +Sometimes you may want to add a large dataset currently found in an external +location. But what if there's not enough disk space to download the data? Here's +one method! 
-```dvc -$ mkdir example # workspace -$ cd example -$ git init -$ dvc init -``` +The `--out` option lets you add external data so that it's linked to a given path +inside the workspace after being copied to the cache. +Combined with +[symlinking](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +and an +[external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache), +this lets you avoid using the local file system completely. -Now we can add a `data.xml` file via HTTP for example, putting it a local path -in our project: +For example, we can add a `data.xml` file via HTTP, outputting it to a local +path in our project: ```dvc -$ dvc add https://data.dvc.org/get-started/data.xml -o data.xml +$ dvc add https://data.dvc.org/get-started/data.xml -o raw/data.xml $ ls data.xml data.xml.dvc ``` -The resulting `.dvc` file will save the provided local `path` as if the data was -already in the workspace, while the `md5` hash points to the copy of the data -that has now been transferred to the cache. Let's check the -contents of `data.xml.dvc` in this case: +The local `data.xml` should be a symlink to the (externally) cached +data copy. The resulting `.dvc` file will save the local `path` as if the data +was already there before this command. Let's check the contents of +`data.xml.dvc`: ```yaml outs: - md5: a304afb96060aad90176268345e10355 nfiles: 1 - path: data.xml + path: raw/data.xml ``` > For a similar operation that actually keeps a connection to the data source, > please see `dvc import-url`. -## Example: Transfer to remote storage +## Example: `--to-remote` usage {#straight-to-remote} -When you have a large dataset in an external location, you may want to track it -as if it was in your project, but without downloading it locally (for now). The -`--to-remote` option lets you do so, while storing a copy -[remotely](/doc/command-reference/remote) so it can be -[pulled](/doc/command-reference/plots) later. 
Let's initialize a DVC project, -and setup a remote: +Here's another method to add a large dataset found in an external location +without downloading the data (refer to the previous example). -```dvc -$ mkdir example # workspace -$ cd example -$ git init -$ dvc init -$ mkdir /tmp/dvc-storage -$ dvc remote add myremote /tmp/dvc-storage -``` +The `--to-remote` option lets you store a copy of the target data on a +[DVC remote](/doc/command-reference/remote), while creating a `.dvc` file +locally so it can be [pulled](/doc/command-reference/pull) later. This is a way +to "bootstrap" a project on your local machine, to be +[reproduced](/doc/command-reference/repro) in the right environment later (e.g. +a GPU cloud server or a CI/CD system). -Now let's add the `data.xml` to our remote storage from the given remote -location. +Let's set up a simple remote and add a `data.xml` file from the web this way: ```dvc +$ mkdir /tmp/dvc-storage +$ dvc remote add myremote /tmp/dvc-storage $ dvc add https://data.dvc.org/get-started/data.xml -o data.xml \ --to-remote -r myremote ... -``` - -The only difference that dataset is transferred straight to remote, so DVC won't -control the remote location you gave but rather continue managing your remote -storage where the data is now on. The operation will still be resulted with an -`.dvc` file: - -```dvc $ ls data.xml.dvc ``` -Whenever anyone wants to actually download the added data (for example from a -system that can handle it), they can use `dvc pull` as usual: +> Note that this can be combined with `--out` to specify a local destination +> `path` (written to the `.dvc` file). -```dvc - $ dvc pull data.xml.dvc -r tmp_remote +DVC won't control the original data source after this, but rather continue +managing your remote storage, where the data is now found. 
To actually download +the data to the cache, you can use `dvc fetch` or `dvc pull` as usual +(on a system that can handle it): +```dvc +$ dvc pull data.xml.dvc -r myremote A data.xml 1 file added and 1 file fetched ``` + +> Note that `dvc repro` will try to download the data too, as part of the +> pipeline execution. diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 05312cca53..0d749a3d5c 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -56,10 +56,10 @@ name. ## Options -- `-o `, `--out ` - specify a path to the desired location in the - workspace to place the downloaded file or directory (instead of using the - current working directory). Directories specified in the path will be created - by this command. +- `-o `, `--out ` - destination `path` to place the downloaded file + or directory. By default the data file basename is used in the current working + directory (if this option isn't used). Directories in the given `path` will be + created. - `--rev ` - commit hash, branch or tag name, etc. (any [Git revision](https://git-scm.com/docs/revisions)) of the repository to diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 325b58c7e9..744bea83ee 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -23,9 +23,8 @@ positional arguments: ## Description In some cases it's convenient to add a data file or directory from an external -location into the workspace (or to -[remote storage](/doc/command-reference/remote)), such that it can be updated -later, if/when the external data source changes. Example scenarios: +location into the project, such that it can be updated later if/when the +external data source changes. Example scenarios: - A remote system may produce occasional data files that are used in other projects. 
@@ -37,22 +36,20 @@ later, if/when the external data source changes. Example scenarios: `dvc import-url` helps you create such an external data dependency, without having to manually copy files from the supported locations (listed below), which -may require installing a different tool for each type. - -When you don't want to store the target data in your local system, you can still -create an import `.dvc` file while transferring a file or directory directly to -remote storage, by using the `--to-remote` option. See the -[Transfer to remote storage](#example-transfer-to-remote-storage) example for -more details. +would require installing/using a different tool for each type. The `url` argument specifies the external location of the data to be imported. The imported data is cached, and linked (or copied) to the current -working directory with its original file name e.g. `data.txt` (or to a location -provided with `out`). +working directory with its original file name e.g. `data.txt`, or to a location +provided with `out`. An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – -similar to using `dvc add` after downloading the data. This makes it possible to -update the import later, if the data source has changed (see `dvc update`). +similar to using `dvc add` after downloading the data. It saves the information +about the data source, so the import can be updated later if the data source has +changed (see `dvc update`). + +💡 The `--to-remote` option lets you store an import on a +[DVC remote](/doc/command-reference/remote) without using the local file system. > Note that the imported data can be [pushed](/doc/command-reference/push) to > remote storage normally. @@ -64,8 +61,9 @@ field contains the corresponding local path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. 
-Note that `dvc repro` doesn't check or update import `.dvc` files, use -`dvc update` to bring the import up to date from the data source. +Note that `dvc repro` doesn't check or update import `.dvc` files by default +(see `dvc freeze`), use `dvc update` to bring the import up to date from the +data source. DVC supports several types of external locations (protocols): @@ -140,13 +138,13 @@ $ dvc run -n download_data \ want to "DVCfy" this state of the project (see also `dvc commit`). - `--to-remote` - import an external target, but don't move it into the - workspace, nor cache it. [Transfer](#example-import-straight-to-the-remote) it - directly to remote storage (the default one, unless `-r` is specified) - instead. Use `dvc pull` to get the data locally. + workspace, nor cache it. [Store a copy](#straight-to-remote) on a remote + instead (the default one unless `-r` is specified). Use `dvc pull` to get the + data locally. - `-r `, `--remote ` - name of the - [remote storage](/doc/command-reference/remote) (can only be used with - `--to-remote`). + [remote](/doc/command-reference/remote) to store data on (can only be used + with `--to-remote`). - `-j `, `--jobs ` - parallelism level for DVC to download data from the source. The default value is `4 * cpu_count()`. For SSH remotes, the @@ -358,46 +356,36 @@ Running stage 'prepare' with command: python src/prepare.py data/data.xml ``` -## Example: Transfer to remote storage +## Example: `--to-remote` usage {#straight-to-remote} -When you have a large dataset in an external location, you may want to import it -to your project without downloading it to the local file system (for using it -later/elsewhere). The `--to-remote` option let you skip the download, while -storing the imported data [remotely](/doc/command-reference/remote). Let's -initialize a DVC project, and setup a remote: +Normally, `dvc import-url` downloads the target data (to the cache) +in order to link and track it locally. 
But what if there's not enough disk +space? -```dvc -$ mkdir example # workspace -$ cd example -$ git init -$ dvc init -$ mkdir /tmp/dvc-storage -$ dvc remote add myremote /tmp/dvc-storage -``` +The `--to-remote` option lets you store a copy of the target data on a +[DVC remote](/doc/command-reference/remote), while creating an import `.dvc` +file locally so it can be [pulled](/doc/command-reference/pull) later. This is +a way to "bootstrap" an import on your local machine, to be downloaded in the +right environment later. -Now let's create an import `.dvc` file without downloading the target data, -transferring it directly to remote storage instead: +Let's set up a simple remote and import a `data.xml` file from the web this way: ``` +$ mkdir /tmp/dvc-storage +$ dvc remote add myremote /tmp/dvc-storage $ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \ --to-remote -r myremote ... -``` - -The only change in our local workspace is a newly created import -`.dvc` file: - -```dvc $ ls data.xml.dvc ``` -Whenever anyone wants to actually download the imported data (for example from a -system that can handle it), they can use `dvc pull` as usual: +The only change in our local workspace is the tiny `.dvc` file that +was created. To actually download the data to the cache, you can use +`dvc fetch` or `dvc pull` as usual (on a system that can handle it): ``` - $ dvc pull data.xml.dvc -r tmp_remote - +$ dvc pull data.xml.dvc -r myremote A data.xml 1 file added and 1 file fetched ``` diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index b16a14b367..88727248c1 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -70,19 +70,19 @@ enable DVC efficiently determining whether the local copy is out of date. To actually [version the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import `.dvc` file. 
-Note that `dvc repro` doesn't check or update import `.dvc` files (see -`dvc freeze`), use `dvc update` to bring the import up to date from the data -source. +Note that `dvc repro` doesn't check or update import `.dvc` files by default +(see `dvc freeze`), use `dvc update` to bring the import up to date from the +data source. Also note that chained imports (importing data that was imported into the source repo at `url`) are not supported. ## Options -- `-o `, `--out ` - specify a path to the desired location in the - workspace to place the downloaded file or directory (instead of using the - current working directory). Directories specified in the path must already - exist, otherwise this command will fail. +- `-o `, `--out ` - destination `path` inside the workspace to place + the downloaded file or directory. By default the file basename is used in + the current working directory (if this option isn't used). Directories in the + given `path` will be created. - `--file ` - specify a path and/or file name for the `.dvc` file created by this command (e.g. `--file stages/stage.dvc`). This overrides the @@ -154,7 +154,7 @@ outs: cache: true ``` -Several of the values above are pulled from the original `.dvc` file +Several of the values above are obtained from the original `.dvc` file `model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock` subfields under `repo` are used to save the origin and version of the dependency, respectively. diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index f1a6dada55..9821b097a5 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -49,11 +49,12 @@ $ dvc update --rev master directory and its subdirectories for import stage `.dvc` files to inspect. If there are no directories among the targets, this option is ignored. 
-- `--to-remote` - update a `.dvc` file created with `dvc import-url` and - [transfer](/doc/command-reference/import-url#example-import-straight-to-the-remote) - the new data directly to remote storage (the default one unless `-r` is used). - No changes are done in the workspace. Use `dvc pull` to get the - data locally. This option can't be used with DVC or Git repository imports. +- `--to-remote` - update a `.dvc` file created with `dvc import-url` and store + the latest data directly + [on remote storage](/doc/command-reference/import-url#straight-to-remote) (the + default one unless `-r` is specified). Tracked data is not changed in the + workspace. Use `dvc pull` to get the data locally. This option + can't be used with data imported from DVC or Git repos (with `dvc import`). - `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) (can only be used with diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/managing-external-data.md index b7779ea02a..7426c975a1 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/managing-external-data.md @@ -1,13 +1,13 @@ # External Outputs > ⚠️ This is an advanced feature for very specific situations and not -> recommended except if there's absolutely no other alternative. In most cases -> alternatives like the -> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or -> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) -> strategies of `dvc add` and `dvc import-url` are more convenient. **Note** -> that external outputs are not pushed or pulled from/to +> recommended except if there's absolutely no other alternative. Note that +> external outputs are not pushed or pulled from/to > [remote storage](/doc/command-reference/remote). 
+> +> In most cases the [to-cache](/doc/command-reference/add#example-external-data) +> or [to-remote](/doc/command-reference/add#straight-to-remote) strategies of +> `dvc add` and `dvc import-url` are better. There are cases when data is so large, or its processing is organized in such a way, that its impossible to handle it in the local machine disk. For example