From 64cdfa8919c4d08a870b378647893ca4eed74cff Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 10 Jan 2021 09:53:02 -0600 Subject: [PATCH 1/3] typo --- content/docs/command-reference/remote/modify.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 0f93345120..95ecce7bc4 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -799,7 +799,7 @@ by HDFS. Read more about by expanding the WebHDFS section in > written to a Git-ignored config file. > Note that `user/password` and `token` authentication are incompatible. You -> should authenticate against yout WebDAV remote by either `user/password` or +> should authenticate against your WebDAV remote by either `user/password` or > `token`. - `ask_password` - ask each time for the password to use for `user/password` From e69bdc3e3fad58d9377c8aac6e2de4217bf0d402 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 10 Jan 2021 09:58:50 -0600 Subject: [PATCH 2/3] cmd: update imports references --- content/docs/command-reference/import-url.md | 56 ++++++++--------- content/docs/command-reference/import.md | 63 ++++++++------------ 2 files changed, 50 insertions(+), 69 deletions(-) diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 6719f202c7..bc068d1aba 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -2,7 +2,7 @@ Download a file or directory from a supported URL (for example `s3://`, `ssh://`, and other protocols) into the workspace, and track -changes in the remote data source. Creates a `.dvc` file. +changes in the external data source. Creates a `.dvc` file. > See `dvc import` to download and tack data/model files or directories from > other DVC repositories (e.g. hosted on GitHub). 
@@ -21,22 +21,21 @@ positional arguments: ## Description -In some cases it's convenient to add a data file or directory from a remote +In some cases it's convenient to add a data file or directory from an external location into the workspace, such that it can be updated later, if/when the external data source changes. Example scenarios: - A remote system may produce occasional data files that are used in other projects. - A batch process running regularly updates a data file to import. -- A shared dataset on a remote storage that is managed and updated outside DVC. +- A shared dataset on cloud storage that is managed and updated outside DVC. > Note that `dvc get-url` corresponds to the first step this command performs > (just download the file or directory). The `dvc import-url` command helps the user create such an external data -dependency without having to manually copying files from the supported remote -locations (listed below), which may require installing a different tool for each -type. +dependency without having to manually copying files from the supported locations +(listed below), which may require installing a different tool for each type. The `url` argument specifies the external location of the data to be imported, while `out` can be used to specify the directory and/or file name desired for @@ -45,15 +44,15 @@ directory will be placed inside. `.dvc` files support references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such an -import `.dvc` file, the `deps` field stores the remote URL, and the `outs` field -contains the corresponding local path in the workspace. It records -enough metadata about the imported data to enable DVC efficiently determining -whether the local copy is out of date. +import `.dvc` file, the `deps` field stores the external URL, and the `outs` +field contains the corresponding local path in the workspace. 
It +records enough metadata about the imported data to enable DVC efficiently +determining whether the local copy is out of date. Note that `dvc repro` doesn't check or update import `.dvc` files, use `dvc update` to bring the import up to date from the data source. -DVC supports several types of (local or) remote locations (protocols): +DVC supports several types of external locations (protocols): | Type | Description | `url` format example | | --------- | ---------------------------- | --------------------------------------------- | @@ -82,8 +81,7 @@ DVC supports several types of (local or) remote locations (protocols): - In case of HTTP, [ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) is - necessary to track if the specified remote file (URL) changed to download it - again. + necessary to track if the specified URL changed. - `remote://myremote/path/to/file` notation just means that a DVC [remote](/doc/command-reference/remote) `myremote` is defined and when DVC is @@ -110,12 +108,8 @@ $ dvc run -n download_data \ wget https://data.dvc.org/get-started/data.xml -O data.xml ``` -`dvc import-url` generates an _import stage_ `.dvc` file and `dvc run` a regular -stage (in `dvc.yaml`). - -⚠️ DVC won't push or pull imported data to/from -[remote storage](/doc/command-reference/remote), it will rely on it's original -source. +`dvc import-url` generates an _import `.dvc` file_ and `dvc run` a regular stage +(in `dvc.yaml`). ## Options @@ -163,7 +157,7 @@ $ git checkout 3-config-remote -## Example: Tracking a remote file +## Example: Tracking a file from the web An advanced alternate to the intro of the [Versioning Basics](/doc/tutorials/get-started/data-versioning) part of the _Get @@ -195,18 +189,18 @@ Let's take a look at the changes to the `data.xml.dvc`: The `etag` field in the `.dvc` file contains the [ETag](https://en.wikipedia.org/wiki/HTTP_ETag) recorded from the HTTP request. -If the remote file changes, its ETag will be different. 
This metadata allows DVC -to determine whether it's necessary to download it again. +If the imported file changes online, its ETag will be different. This metadata +allows DVC to determine whether it's necessary to download it again. > See `.dvc` files for more details on the format above. You may want to get out of and remove the `example-get-started/` directory after trying this example (especially if trying out the following one). -## Example: Detecting remote file changes +## Example: Detecting external file changes -What if that remote file is updated regularly? The project goals might include -regenerating some results based on the updated data source. +What if an imported file is updated regularly at its source? The project goals +might include regenerating some results based on the updated data source. [Pipeline](/doc/command-reference/dag) reproduction can be triggered based on a changed external dependency. @@ -214,9 +208,9 @@ Let's use the [Get Started](/doc/tutorials/get-started) project again, simulating an updated external data source. (Remember to prepare the workspace, as explained in [Examples](#examples)) -To illustrate this scenario, let's use a local file system directory (external -to the workspace) to simulate a remote data source location. (In real life, the -data file will probably be on a remote server.) Run these commands: +To illustrate this scenario, let's use a local file system directory external to +the workspace (in real life, the data file could be on a remote server instead). +Run these commands: ```dvc $ mkdir /tmp/dvc-import-url-example @@ -319,15 +313,15 @@ Data and pipelines are up to date. In the data store directory, edit `data.xml`. It doesn't matter what you change, as long as it remains a valid XML file, because any change will result in a -different dependency file hash (`md5`) in the import stage `.dvc` file. 
Once we -do so, we can run `dvc update` to make sure the import is up to date: +different dependency file hash (`md5`) in the import `.dvc` file. Once we do so, +we can run `dvc update` to make sure the import is up to date: ```dvc $ dvc update data.xml.dvc Importing '.../tmp/dvc-import-url-example/data.xml' -> 'data/data.xml' ``` -DVC notices the "external" data source has changed, and updates the import stage +DVC notices the external data source has changed, and updates the `.dvc` file (reproduces it). In this case it's also necessary to run `dvc repro` so that the remaining pipeline results are also regenerated: diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 7d687ff3f0..c64f08ad4d 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -1,9 +1,7 @@ # import -Download a file or directory tracked by DVC or by Git into the -workspace. It also creates a `.dvc` file with information about the -data source, which can later be used to [update](/doc/command-reference/update) -the import. +Download a file or directory tracked by another DVC or Git repository into the +workspace, and track it (an import `.dvc` file is created). > See also our `dvc.api.open()` Python API function. @@ -25,9 +23,9 @@ positional arguments: Provides an easy way to reuse files or directories tracked in any DVC repository (e.g. datasets, intermediate results, ML models) or Git repository (e.g. code, small image/other files). `dvc import` downloads the -target file or directory (found at `path` in `url`) into the workspace and -tracks it in the project. This makes it possible to update the import later, if -it has changed in its data source (see `dvc update`). +target file or directory (found at `path` in `url`), and tracks it in the local +project. This makes it possible to update the import later, if the data source +has changed (see `dvc update`). 
> Note that `dvc get` corresponds to the first step this command performs (just > download the data). @@ -46,33 +44,22 @@ tracked by either Git or DVC (including paths inside tracked directories). Note that DVC-tracked targets must be found in a `dvc.yaml` or `.dvc` file of the repo. -⚠️ DVC repos should have a default [DVC remote](/doc/command-reference/remote) -containing the target actual for this command to work. The only exception is for -local repos, where DVC will try to copy the data from its cache -first. +⚠️ Source DVC repos should have a default +[DVC remote](/doc/command-reference/remote) containing the target data for this +command to work. The only exception is for local repos, where DVC will try to +copy the data from its cache first. > See `dvc import-url` to download and track data from other supported locations > such as S3, SSH, HTTP, etc. -After running this command successfully, the imported data is placed in the -current working directory (unless `-o` is used) with its original file name e.g. -`data.txt`. An _import stage_ (`.dvc` file) is also created in the same -location, extending the name of the imported data e.g. `data.txt.dvc` – similar -to having used `dvc run` to generate the data as a stage output. - `.dvc` files support references to data in an external DVC repository (hosted on -a Git server). In such a `.dvc` file, the `deps` field specifies the remote -`url` and data `path`, and the `outs` field contains the corresponding local -path in the workspace. It records enough metadata about the -imported data to enable DVC efficiently determining whether the local copy is -out of date. - -⚠️ DVC won't push or pull imported data to/from -[remote storage](/doc/command-reference/remote), it will rely on it's original -source. +a Git server). In such a `.dvc` file, the `deps` field specifies the `url` and +data `path`, and the `outs` field contains the corresponding local path in the +workspace. 
It records enough metadata about the imported data to +enable DVC efficiently determining whether the local copy is out of date. To actually [version the data](/doc/tutorials/get-started/data-versioning), -`git add` (and `git commit`) the import stage. +`git add` (and `git commit`) the import `.dvc` file. Note that `dvc repro` doesn't check or update import `.dvc` files (see `dvc freeze`), use `dvc update` to bring the import up to date from the data @@ -98,8 +85,8 @@ repo at `url`) are not supported. download the file or directory from. The latest commit in `master` (tip of the default branch) is used by default when this option is not specified. - > Note that this adds a `rev` field in the import stage that fixes it to the - > revision. This can impact the behavior of `dvc update` (see the + > Note that this adds a `rev` field in the import `.dvc` file that fixes it to + > the revision. This can impact the behavior of `dvc update` (see the > [Importing and updating fixed revisions](#example-importing-and-updating-fixed-revisions) > example below). @@ -140,8 +127,8 @@ Importing 'data/data.xml (git@github.com:iterative/example-get-started)' ``` In contrast with `dvc get`, this command doesn't just download the data file, -but it also creates an import stage (`.dvc` file) with a link to the data source -(as explained in the description above). (This `.dvc` file can later be used to +but it also creates an import `.dvc` file with a link to the data source (as +explained in the description above). (This `.dvc` file can later be used to [update](/doc/command-reference/update) the import.) 
Check `data.xml.dvc`: ```yaml @@ -176,8 +163,8 @@ Importing -> 'cats-dogs' ``` -When using this option, the import stage (`.dvc` file) will also have a `rev` -subfield under `repo`: +When using this option, the import `.dvc` file will also have a `rev` subfield +under `repo`: ```yaml deps: @@ -192,14 +179,14 @@ If `rev` is a Git branch or tag (where the underlying commit changes), the data source may have updates at a later time. To bring it up to date if so (and update `rev_lock` in the `.dvc` file), simply use `dvc update .dvc`. If `rev` is a specific commit hash (does not change), `dvc update` without options -will not have an effect on the import stage. You may force-update it to a +will not have an effect on the import `.dvc` file. You may force-update it to a different commit with `dvc update --rev`: ```dvc $ dvc update --rev cats-dogs-v2 ``` -> In the above example, the value for `rev` in the new import stage will be +> In the above example, the value for `rev` in the new `.dvc` file will be > `master` (a branch) so it will be able update normally going forward. ## Example: Data registry @@ -230,7 +217,7 @@ $ dvc import git@github.com:iterative/dataset-registry.git \ `dvc import` provides a better way to incorporate data files tracked in external DVC repositories because it saves the connection between the current project and the source repo. This means that enough information is -recorded in an import stage (`.dvc` file) in order to +recorded in an import `.dvc` file in order to [reproduce](/doc/command-reference/repro) downloading of this same data version in the future, where and when needed. This is achieved with the `repo` field, for example (matching the import command above): @@ -265,8 +252,8 @@ Importing ... > Note that Git-tracked files can be imported from DVC repos as well. -The file is imported, and along with it, an import stage (`.dvc` file) is -created. 
Check `it-standards.csv.dvc`: +The file is imported, and along with it, an import `.dvc` file is created. Check +`it-standards.csv.dvc`: ```yaml deps: From e720a8e236aa33d6ef4116485d478ea62e493cb6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 10 Jan 2021 10:10:39 -0600 Subject: [PATCH 3/3] cmd: explain caching (and remote storage) "semantics" for import[-url] per https://github.com/iterative/dvc/issues/4520#issuecomment-759617964 --- content/docs/command-reference/import-url.md | 25 +++++++++++++------- content/docs/command-reference/import.md | 9 +++++++ 2 files changed, 25 insertions(+), 9 deletions(-) diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index bc068d1aba..58564f7276 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -1,8 +1,8 @@ # import-url Download a file or directory from a supported URL (for example `s3://`, -`ssh://`, and other protocols) into the workspace, and track -changes in the external data source. Creates a `.dvc` file. +`ssh://`, and other protocols) into the workspace, and track it (an +import `.dvc` file is created). > See `dvc import` to download and tack data/model files or directories from > other DVC repositories (e.g. hosted on GitHub). @@ -33,14 +33,21 @@ external data source changes. Example scenarios: > Note that `dvc get-url` corresponds to the first step this command performs > (just download the file or directory). -The `dvc import-url` command helps the user create such an external data -dependency without having to manually copying files from the supported locations -(listed below), which may require installing a different tool for each type. +`dvc import-url` helps you create such an external data dependency, without +having to manually copy files from the supported locations (listed below), which +may require installing a different tool for each type. 
-The `url` argument specifies the external location of the data to be imported, -while `out` can be used to specify the directory and/or file name desired for -the downloaded data. If an existing directory is specified, the file or -directory will be placed inside. +The `url` argument specifies the external location of the data to be imported. +The imported data is cached, and linked (or copied) to the current +working directory with its original file name e.g. `data.txt` (or to a location +provided with `out`). + +An _import `.dvc` file_ is created in the same location e.g. `data.txt.dvc` – +similar to using `dvc add` after downloading the data. This makes it possible to +update the import later, if the data source has changed (see `dvc update`). + +> Note that the imported data can be [pushed](/doc/command-reference/push) to +> remote storage normally. `.dvc` files support references to data in an external location, see [External Dependencies](/doc/user-guide/external-dependencies). In such an diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index c64f08ad4d..525cbd9a6f 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -33,6 +33,15 @@ has changed (see `dvc update`). > See `dvc list` for a way to browse repository contents to find files or > directories to import. +The imported data is cached, and linked (or copied) to the current +working directory with its original file name e.g. `data.txt` (or to a location +provided with `--out`). An _import `.dvc` file_ is created in the same location +e.g. `data.txt.dvc` – similar to using `dvc add` after downloading the data. + +⚠️ DVC won't push or pull data imported from other DVC repos to/from +[remote storage](/doc/command-reference/remote). It will rely on its original +source. + The `url` argument specifies the address of the DVC or Git repository containing the data source. 
Both HTTP and SSH protocols are supported (e.g. `[user@]server:project.git`). `url` can also be a local file system path