ref: revert changed related to #2302
jorgeorpinel committed Mar 18, 2021
1 parent 4d5bd5f commit d10a572
Showing 4 changed files with 87 additions and 78 deletions.
14 changes: 7 additions & 7 deletions content/docs/command-reference/get.md
Expand Up @@ -24,13 +24,13 @@ repository (e.g. source code, small image/other files). `dvc get` copies the
target file or directory (found at `path` in `url`) to the current working
directory. (Analogous to `wget`, but for repos.)

> See `dvc list` for a way to browse repository contents to find files or
> directories to download.
> Note that unlike `dvc import`, this command does not track the downloaded
> files (does not create a `.dvc` file). For that reason, it doesn't require an
> existing DVC project to run in.
> See `dvc list` for a way to browse repository contents to find files or
> directories to download.
The `url` argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
`[user@]server:project.git`). `url` can also be a local file system path
Expand All @@ -56,10 +56,10 @@ name.

## Options

- `-o <path>`, `--out <path>` - destination `path` to place the downloaded file
or directory. By default the data file basename is used in the current working
directory (if this option isn't used). Directories in the given `path` will be
created.
- `-o <path>`, `--out <path>` - specify a path to the desired location in the
workspace to place the downloaded file or directory (instead of using the
current working directory). Directories specified in the path will be created
by this command.

- `--rev <commit>` - commit hash, branch or tag name, etc. (any
[Git revision](https://git-scm.com/docs/revisions)) of the repository to
Expand Down
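The default destination rule described for `--out` above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `resolve_destination` is a hypothetical helper written for this note, not a DVC function, and it ignores directory creation and any other checks the real command performs.

```python
import posixpath
from pathlib import Path
from typing import Optional


def resolve_destination(path_in_repo: str, out: Optional[str] = None, cwd: str = ".") -> Path:
    """Hypothetical helper: pick where a downloaded target would be placed.

    Assumption: without an explicit `out`, the basename of the target is used
    in the current working directory, as the option description states.
    """
    if out is not None:
        # An explicit destination wins over the default.
        return Path(cwd) / out
    # Strip a trailing slash so directory targets keep their name.
    return Path(cwd) / posixpath.basename(path_in_repo.rstrip("/"))


print(resolve_destination("data/data.xml"))
print(resolve_destination("data/data.xml", out="raw/input.xml"))
```

The first call falls back to the target's basename; the second uses the path given as `out`.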
89 changes: 49 additions & 40 deletions content/docs/command-reference/import-url.md
Expand Up @@ -23,8 +23,9 @@ positional arguments:
## Description

In some cases it's convenient to add a data file or directory from an external
location into the project, such that it can be updated later if/when the
external data source changes. Example scenarios:
location into the workspace (or to
[remote storage](/doc/command-reference/remote)), such that it can be updated
later, if/when the external data source changes. Example scenarios:

- A remote system may produce occasional data files that are used in other
projects.
Expand All @@ -36,26 +37,25 @@ external data source changes. Example scenarios:
`dvc import-url` helps you create such an external data dependency, without
having to manually copy files from the supported locations (listed below), which
would require installing/using a different tool for each type.
may require installing a different tool for each type.

When you don't want to store the target data in your local system, you can still
create an import `.dvc` file while transferring a file or directory directly to
remote storage, by using the `--to-remote` option. See the
[Transfer to remote storage](#example-transfer-to-remote-storage) example for
more details.

The `url` argument specifies the external location of the data to be imported.
The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
working directory with its original file name e.g. `data.txt`, or to a location
provided with `out`.
working directory with its original file name e.g. `data.txt` (or to a location
provided with `out`).

An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc`
similar to using `dvc add` after downloading the data. It saves the information
about the data source, so the import can be updated later if the data source has
changed (see `dvc update`).

💡 The `--to-remote` option lets you store an import on a
[DVC remote](/doc/command-reference/remote) without using the local file system.
similar to using `dvc add` after downloading the data. This makes it possible to
update the import later, if the data source has changed (see `dvc update`).

> Note that data imported from external locations can be
> [pushed](/doc/command-reference/push) and
> [pulled](/doc/command-reference/pull) to/from
> [remote storage](/doc/command-reference/remote) normally (unlike for
> `dvc import`).
> Note that the imported data can be [pushed](/doc/command-reference/push) to
> remote storage normally.
`.dvc` files support references to data in an external location, see
[External Dependencies](/doc/user-guide/external-dependencies). In such an
Expand All @@ -64,9 +64,8 @@ field contains the corresponding local path in the <abbr>workspace</abbr>. It
records enough metadata about the imported data to enable DVC efficiently
determining whether the local copy is out of date.

Note that `dvc repro` doesn't check or update import `.dvc` files by default
(see `dvc freeze`); use `dvc update` to bring the import up to date from the
data source.
Note that `dvc repro` doesn't check or update import `.dvc` files; use
`dvc update` to bring the import up to date from the data source.

DVC supports several types of external locations (protocols):

Expand Down Expand Up @@ -141,13 +140,13 @@ $ dvc run -n download_data \
want to "DVCfy" this state of the project (see also `dvc commit`).

- `--to-remote` - import an external target, but don't move it into the
workspace, nor cache it. [Store a copy](#straight-to-remote) on a remote
instead (the default one unless `-r` is specified). Use `dvc pull` to get the
data locally.
workspace, nor cache it. [Transfer](#example-transfer-to-remote-storage) it
directly to remote storage (the default one, unless `-r` is specified)
instead. Use `dvc pull` to get the data locally.

- `-r <name>`, `--remote <name>` - name of the
[remote](/doc/command-reference/remote) to store data on (can only be used
with `--to-remote`).
[remote storage](/doc/command-reference/remote) (can only be used with
`--to-remote`).

- `-j <number>`, `--jobs <number>` - parallelism level for DVC to download data
from the source. The default value is `4 * cpu_count()`. For SSH remotes, the
Expand Down Expand Up @@ -359,36 +358,46 @@ Running stage 'prepare' with command:
python src/prepare.py data/data.xml
```

## Example: `--to-remote` usage {#straight-to-remote}
## Example: Transfer to remote storage

Normally, `dvc import-url` downloads the target data (to the <abbr>cache</abbr>)
in order to link and track it locally. But what if there's not enough disk
space?
When you have a large dataset in an external location, you may want to import it
to your project without downloading it to the local file system (for using it
later/elsewhere). The `--to-remote` option lets you skip the download, while
storing the imported data [remotely](/doc/command-reference/remote). Let's
initialize a DVC project, and set up a remote:

The `--to-remote` option lets you store a copy of the target data on a
[DVC remote](/doc/command-reference/remote), while creating an import `.dvc`
file locally so it can be [pulled](/doc/command-reference/pull) later. This is
a way to "bootstrap" an import on your local machine, to be downloaded in the
right environment later.
```dvc
$ mkdir example # workspace
$ cd example
$ git init
$ dvc init
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
```

Let's set up a simple remote and add a `data.xml` file from the web this way:
Now let's create an import `.dvc` file without downloading the target data,
transferring it directly to remote storage instead:

```
$ mkdir /tmp/dvc-storage
$ dvc remote add myremote /tmp/dvc-storage
$ dvc import-url https://data.dvc.org/get-started/data.xml data.xml \
--to-remote -r myremote
...
```

The only change in our local <abbr>workspace</abbr> is a newly created import
`.dvc` file:

```dvc
$ ls
data.xml.dvc
```

The only change in our local <abbr>workspace</abbr> is the tiny `.dvc` file that
was created. To actually download the data to <abbr>cache</abbr>, you can use
`dvc fetch` or `dvc pull` as usual (on a system that can handle it):
Whenever anyone wants to actually download the imported data (for example from a
system that can handle it), they can use `dvc pull` as usual:

```
$ dvc pull data.xml.dvc -r myremote
$ dvc pull data.xml.dvc -r myremote
A data.xml
1 file added and 1 file fetched
```
28 changes: 14 additions & 14 deletions content/docs/command-reference/import.md
Expand Up @@ -27,20 +27,20 @@ target file or directory (found at `path` in `url`), and tracks it in the local
project. This makes it possible to update the import later, if the data source
has changed (see `dvc update`).

> See `dvc list` for a way to browse repository contents to find files or
> directories to import.
> Note that `dvc get` corresponds to the first step this command performs (just
> download the data).
> See `dvc list` for a way to browse repository contents to find files or
> directories to import.
The imported data is <abbr>cached</abbr>, and linked (or copied) to the current
working directory with its original file name e.g. `data.txt` (or to a location
provided with `--out`). An _import `.dvc` file_ is created in the same location
e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data.

(ℹ️) DVC won't push or pull data imported from other DVC repos to/from
[remote storage](/doc/command-reference/remote). `dvc pull` will download from
the original source instead.
(ℹ️) DVC won't push or pull imported data to/from
[remote storage](/doc/command-reference/remote); it will rely on its original
source.

The `url` argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
Expand Down Expand Up @@ -70,19 +70,19 @@ enable DVC efficiently determining whether the local copy is out of date.
To actually [version the data](/doc/tutorials/get-started/data-versioning),
`git add` (and `git commit`) the import `.dvc` file.

Note that `dvc repro` doesn't check or update import `.dvc` files by default
(see `dvc freeze`); use `dvc update` to bring the import up to date from the
data source.
Note that `dvc repro` doesn't check or update import `.dvc` files (see
`dvc freeze`); use `dvc update` to bring the import up to date from the data
source.

Also note that chained imports (importing data that was imported into the source
repo at `url`) are not supported.

## Options

- `-o <path>`, `--out <path>` - destination `path` inside the workspace to place
the downloaded file or directory. By default the file basename is used in
the current working directory (if this option isn't used). Directories in the
given `path` will be created.
- `-o <path>`, `--out <path>` - specify a path to the desired location in the
workspace to place the downloaded file or directory (instead of using the
current working directory). Directories specified in the path must already
exist, otherwise this command will fail.

- `--file <filename>` - specify a path and/or file name for the `.dvc` file
created by this command (e.g. `--file stages/stage.dvc`). This overrides the
Expand Down Expand Up @@ -154,7 +154,7 @@ outs:
cache: true
```
Several of the values above are obtained from the original `.dvc` file
Several of the values above are pulled from the original `.dvc` file
`model.pkl.dvc` in the external DVC repository. The `url` and `rev_lock`
subfields under `repo` are used to save the origin and version of the
dependency, respectively.
Expand Down
34 changes: 17 additions & 17 deletions content/docs/command-reference/list.md
@@ -1,9 +1,7 @@
# list

List project contents, including files, models, and directories tracked by DVC
and by Git.

> Useful to find data to `dvc get`, `dvc import`, or for `dvc.api` functions.
List repository contents, including files, models, and directories tracked by
DVC (as <abbr>outputs</abbr>) and by Git.

## Synopsis

Expand All @@ -18,15 +16,17 @@ positional arguments:

## Description

Produces a view of a <abbr>DVC repository</abbr> (usually online), listing data
files and directories tracked by DVC alongside the remaining Git repo contents.
This is useful because when you browse a hosted repository (e.g. on GitHub or
with `git ls-remote`), you only see the `dvc.yaml` and `.dvc` files with your
code (files tracked by Git).
A side-effect of DVC is that it hides actual data paths, by effectively
replacing files and directories with <abbr>DVC files</abbr>. So you don't see
data files/dirs when you browse a <abbr>DVC repository</abbr> on Git hosting
(e.g. GitHub), you just see the `dvc.yaml` and `.dvc` files. This can make it
hard to navigate the project, for example to find files or directories for use
with `dvc get`, `dvc import`, or `dvc.api` functions.

This command's output is equivalent to cloning the repo and
[pulling](/doc/command-reference/pull) the data (except that nothing is
downloaded), like this:
This command produces a view of a DVC repository, as if files and directories
tracked by DVC were found directly in the Git repo. Its output is equivalent to
cloning the repo and [pulling](/doc/command-reference/pull) the data (except
that nothing is downloaded by `dvc list`), like this:

```dvc
$ git clone <url> example
Expand All @@ -35,17 +35,17 @@ $ dvc pull
$ ls <path>
```
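The idea of a combined view, where DVC-tracked data appears alongside the files Git tracks, can be illustrated with a toy Python sketch. This is not how `dvc list` is implemented: `merged_listing` and the sample file names are assumptions made up for this example, and no repository is actually read.

```python
def merged_listing(git_tracked, dvc_outputs):
    """Hypothetical helper: names visible in Git plus names tracked by DVC."""
    # A set union avoids duplicates; sorting gives a stable, ls-like order.
    return sorted(set(git_tracked) | set(dvc_outputs))


# Assumed sample data: what a Git host would show, and what the
# `.dvc` file and `dvc.yaml` would declare as outputs.
git_tracked = [".gitignore", "README.md", "data.xml.dvc", "dvc.yaml"]
dvc_outputs = ["data.xml", "model.pkl"]

for name in merged_listing(git_tracked, dvc_outputs):
    print(name)
```

Browsing the Git host alone would show only the first list; the merged result also includes `data.xml` and `model.pkl`.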

Only the root directory is listed by default, but the `-R` option can be used to
list files recursively.

The `url` argument specifies the address of the DVC or Git repository containing
the data source. Both HTTP and SSH protocols are supported (e.g.
`[user@]server:project.git`). `url` can also be a local file system path
(including the current project e.g. `.`).

The optional `path` argument is used to specify a directory to list within the
Git repo at `url` (including paths inside tracked directories). It's similar to
providing a path to list to commands such as `ls` or `aws s3 ls`.

Only the root directory is listed by default, but the `-R` option can be used to
list files recursively.
source repository at `url` (including paths inside tracked directories). It's
similar to providing a path to list to commands such as `ls` or `aws s3 ls`.

Please note that `dvc list` doesn't check whether the listed data (tracked by
DVC) actually exists in remote storage, so it's not guaranteed whether it can be
Expand Down
