diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index a86bf0dbdec..05312cca53c 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -24,13 +24,13 @@ repository (e.g. source code, small image/other files). `dvc get` copies the target file or directory (found at `path` in `url`) to the current working directory. (Analogous to `wget`, but for repos.) -> See `dvc list` for a way to browse repository contents to find files or -> directories to download. - > Note that unlike `dvc import`, this command does not track the downloaded > files (does not create a `.dvc` file). For that reason, it doesn't require an > existing DVC project to run in. +> See `dvc list` for a way to browse repository contents to find files or +> directories to download. + The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. `[user@]server:project.git`). `url` can also be a local file system path @@ -56,10 +56,10 @@ name. ## Options -- `-o `, `--out ` - destination `path` to place the downloaded file - or directory. By default the data file basename is used in the current working - directory (if this option isn't used). Directories in the given `path` will be - created. +- `-o `, `--out ` - specify a path to the desired location in the + workspace to place the downloaded file or directory (instead of using the + current working directory). Directories specified in the path will be created + by this command. - `--rev ` - commit hash, branch or tag name, etc. 
(any [Git revision](https://git-scm.com/docs/revisions)) of the repository to diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 13bcaba28f0..325b58c7e99 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -23,8 +23,9 @@ positional arguments: ## Description In some cases it's convenient to add a data file or directory from an external -location into the project, such that it can be updated later if/when the -external data source changes. Example scenarios: +location into the workspace (or to +[remote storage](/doc/command-reference/remote)), such that it can be updated +later, if/when the external data source changes. Example scenarios: - A remote system may produce occasional data files that are used in other projects. @@ -36,26 +37,25 @@ external data source changes. Example scenarios: `dvc import-url` helps you create such an external data dependency, without having to manually copy files from the supported locations (listed below), which -would require installing/using a different tool for each type. +may require installing a different tool for each type. + +When you don't want to store the target data in your local system, you can still +create an import `.dvc` file while transferring a file or directory directly to +remote storage, by using the `--to-remote` option. See the +[Transfer to remote storage](#example-transfer-to-remote-storage) example for +more details. The `url` argument specifies the external location of the data to be imported. The imported data is cached, and linked (or copied) to the current -working directory with its original file name e.g. `data.txt`, or to a location -provided with `out`. +working directory with its original file name e.g. `data.txt` (or to a location +provided with `out`). An _import `.dvc` file_ is created in the same location e.g. 
`data.txt.dvc` – -similar to using `dvc add` after downloading the data. It saves the information -about the data source, so the import can be updated later if the data source has -changed (see `dvc update`). - -💡 The `--to-remote` option lets you store an import on a -[DVC remote](/doc/command-reference/remote) without using the local file system. +similar to using `dvc add` after downloading the data. This makes it possible to +update the import later, if the data source has changed (see `dvc update`). -> Note that data imported from external locaitons can be -> [pushed](/doc/command-reference/push) and -> [pulled](/doc/command-reference/pull) to/from -> [remote storage](/doc/command-reference/remote) normally (unlike for -> `dvc import`). +> Note that the imported data can be [pushed](/doc/command-reference/push) to +> remote storage normally. `.dvc` files support references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such an @@ -64,9 +64,8 @@ field contains the corresponding local path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. -Note that `dvc repro` doesn't check or update import `.dvc` files by default -(see `dvc freeze`), use `dvc update` to bring the import up to date from the -data source. +Note that `dvc repro` doesn't check or update import `.dvc` files, use +`dvc update` to bring the import up to date from the data source. DVC supports several types of external locations (protocols): @@ -141,13 +140,13 @@ $ dvc run -n download_data \ want to "DVCfy" this state of the project (see also `dvc commit`). - `--to-remote` - import an external target, but don't move it into the - workspace, nor cache it. [Store a copy](#straight-to-remote) on a remote - instead (the default one unless `-r` is specified). Use `dvc pull` to get the - data locally. + workspace, nor cache it. 
[Transfer](#example-transfer-to-remote-storage) it + directly to remote storage (the default one, unless `-r` is specified) + instead. Use `dvc pull` to get the data locally. - `-r `, `--remote ` - name of the - [remote](/doc/command-reference/remote) to store data on (can only be used - with `--to-remote`). + [remote storage](/doc/command-reference/remote) (can only be used with + `--to-remote`). - `-j `, `--jobs ` - parallelism level for DVC to download data from the source. The default value is `4 * cpu_count()`. For SSH remotes, the @@ -359,36 +358,46 @@ Running stage 'prepare' with command: python src/prepare.py data/data.xml ``` -## Example: `--to-remote` usage {#straight-to-remote} +## Example: Transfer to remote storage -Normally, `dvc import-url` downloads the target data (to the cache) -in order to link and track it locally. But what if there's not enough disk -space? +When you have a large dataset in an external location, you may want to import it +to your project without downloading it to the local file system (for using it +later/elsewhere). The `--to-remote` option lets you skip the download, while +storing the imported data [remotely](/doc/command-reference/remote). Let's +initialize a DVC project, and set up a remote: -The `--to-remote` option lets you store a copy of the target data on a -[DVC remote](/doc/command-reference/remote), while creating an import `.dvc` -file locally so it can be [pulled](/doc/command-reference/plots) later. This is -a way to "bootstrap" an import in your local machine, to be downloaded on the -right environment later. 
+```dvc +$ mkdir example # workspace +$ cd example +$ git init +$ dvc init +$ mkdir /tmp/dvc-storage +$ dvc remote add myremote /tmp/dvc-storage +``` -Let's setup a simple remote and add a `data.xml` file from the web this way: +Now let's create an import `.dvc` file without downloading the target data, +transferring it directly to remote storage instead: ``` -$ mkdir /tmp/dvc-storage -$ dvc remote add myremote /tmp/dvc-storage $ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \ --to-remote -r myremote ... +``` + +The only change in our local workspace is a newly created import +`.dvc` file: + +```dvc $ ls data.xml.dvc ``` -The only change in our local workspace is the tiny `.dvc` file that -was created. To actually download the data to cache, you can use -`dvc fetch` or `dvc pull` as usual (on a system that can handle it): +Whenever anyone wants to actually download the imported data (for example from a +system that can handle it), they can use `dvc pull` as usual: ``` -$ dvc pull data.xml.dvc -r tmp_remote + $ dvc pull data.xml.dvc -r myremote + A data.xml 1 file added and 1 file fetched ``` diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index a853537edf2..b16a14b3677 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -27,20 +27,20 @@ target file or directory (found at `path` in `url`), and tracks it in the local project. This makes it possible to update the import later, if the data source has changed (see `dvc update`). -> See `dvc list` for a way to browse repository contents to find files or -> directories to import. - > Note that `dvc get` corresponds to the first step this command performs (just > download the data). +> See `dvc list` for a way to browse repository contents to find files or +> directories to import. + The imported data is cached, and linked (or copied) to the current working directory with its original file name e.g. 
`data.txt` (or to a location provided with `--out`). An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. -(ℹ️) DVC won't push or pull data imported from other DVC repos to/from -[remote storage](/doc/command-reference/remote). `dvc pull` will download from -the original source instead. +(ℹ️) DVC won't push or pull imported data to/from +[remote storage](/doc/command-reference/remote); it will rely on its original +source. The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. @@ -70,19 +70,19 @@ enable DVC efficiently determining whether the local copy is out of date. To actually [version the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import `.dvc` file. -Note that `dvc repro` doesn't check or update import `.dvc` files by default -(see `dvc freeze`), use `dvc update` to bring the import up to date from the -data source. +Note that `dvc repro` doesn't check or update import `.dvc` files (see +`dvc freeze`), use `dvc update` to bring the import up to date from the data +source. Also note that chained imports (importing data that was imported into the source repo at `url`) are not supported. ## Options -- `-o `, `--out ` - destination `path` inside the workspace to place - the downloaded file or directory. By default the file basename name is used in - the current working directory (if this option isn't used). Directories in the - given `path` will be created. +- `-o `, `--out ` - specify a path to the desired location in the + workspace to place the downloaded file or directory (instead of using the + current working directory). Directories specified in the path must already + exist, otherwise this command will fail. - `--file ` - specify a path and/or file name for the `.dvc` file created by this command (e.g. `--file stages/stage.dvc`). This overrides the 
This overrides the @@ -154,7 +154,7 @@ outs: cache: true ``` -Several of the values above are obtained from the original `.dvc` file +Several of the values above are pulled from the original `.dvc` file `model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock` subfields under `repo` are used to save the origin and version of the dependency, respectively. diff --git a/content/docs/command-reference/list.md b/content/docs/command-reference/list.md index 829307e22af..0298dd67167 100644 --- a/content/docs/command-reference/list.md +++ b/content/docs/command-reference/list.md @@ -1,9 +1,7 @@ # list -List project contents, including files, models, and directories tracked by DVC -and by Git. - -> Useful to find data to `dvc get`, `dvc import`, or for `dvc.api` functions. +List repository contents, including files, models, and directories tracked by +DVC (as outputs) and by Git. ## Synopsis @@ -18,15 +16,17 @@ positional arguments: ## Description -Produces a view of a DVC repository (usually online), listing data -files and directories tracked by DVC alongside the remaining Git repo contents. -This is useful because when you browse a hosted repository (e.g. on GitHub or -with `git ls-remote`), you only see the `dvc.yaml` and `.dvc` files with your -code (files tracked by Git). +A side-effect of DVC is that it hides actual data paths, by effectively +replacing files and directories with DVC files. So you don't see +data files/dirs when you browse a DVC repository on Git hosting +(e.g. GitHub), you just see the `dvc.yaml` and `.dvc` files. This can make it +hard to navigate the project, for example to find files or directories for use +with `dvc get`, `dvc import`, or `dvc.api` functions. 
-This command's output is equivalent to cloning the repo and -[pulling](/doc/command-reference/pull) the data (except that nothing is -downloaded), like this: +This command produces a view of a DVC repository, as if files and directories +tracked by DVC were found directly in the Git repo. Its output is equivalent to +cloning the repo and [pulling](/doc/command-reference/pull) the data (except +that nothing is downloaded by `dvc list`), like this: ```dvc $ git clone example @@ -35,17 +35,17 @@ $ dvc pull $ ls ``` +Only the root directory is listed by default, but the `-R` option can be used to +list files recursively. + The `url` argument specifies the address of the DVC or Git repository containing the data source. Both HTTP and SSH protocols are supported (e.g. `[user@]server:project.git`). `url` can also be a local file system path (including the current project e.g. `.`). The optional `path` argument is used to specify a directory to list within the -Git repo at `url` (including paths inside tracked directories). It's similar to -providing a path to list to commands such as `ls` or `aws s3 ls`. - -Only the root directory is listed by default, but the `-R` option can be used to -list files recursively. +source repository at `url` (including paths inside tracked directories). It's +similar to providing a path to list to commands such as `ls` or `aws s3 ls`. Please note that `dvc list` doesn't check whether the listed data (tracked by DVC) actually exists in remote storage, so it's not guaranteed whether it can be