From 4742d9c80627c875aa3d8f8b5b8965a4d08ddc1e Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 3 Feb 2021 12:52:37 -0600 Subject: [PATCH 01/19] guide: disclaim x data (impro #2104) --- content/docs/command-reference/add.md | 8 +-- content/docs/command-reference/config.md | 20 +++---- content/docs/command-reference/run.md | 8 +-- content/docs/command-reference/version.md | 2 +- content/docs/sidebar.json | 4 +- content/docs/start/data-versioning.md | 6 +-- .../docs/user-guide/external-dependencies.md | 19 ++++--- ...g-external-data.md => external-outputs.md} | 53 +++++++++---------- .../user-guide/project-structure/dvc-files.md | 31 ++++++----- redirects-list.json | 1 + 10 files changed, 76 insertions(+), 76 deletions(-) rename content/docs/user-guide/{managing-external-data.md => external-outputs.md} (75%) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 984be5ce75..a32bedc56a 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -143,10 +143,12 @@ not. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` - `--external` - allow `targets` that are outside of the DVC repository. See - [Managing External Data](/doc/user-guide/managing-external-data). + [External Outputs](/doc/user-guide/external-outputs). - > Note that external outputs typically require an external cache setup. See - > link above for more details. + > ⚠️ Note that this is an advanced feature for very specific situations and + > not recommended except if there's absolutely no other alternative. + > Additionally, this typically requires an external cache setup (see link + > above). - `--desc ` - user description of the data (optional). This doesn't affect any DVC operations. 
diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index b39ee0731a..34dafabf72 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -179,30 +179,30 @@ This section contains the following options, which affect the project's directories, which is useful when you are using a a [shared development server](/doc/use-cases/shared-development-server). -- `cache.local` - name of a _local remote_ to use as a - [custom cache](/doc/user-guide/managing-external-data#examples) directory. +- `cache.local` - name of a _local remote_ to + [use as external cache](/doc/user-guide/external-outputs#examples) directory. (Refer to `dvc remote` for more information on "local remotes".) This will overwrite the value provided to `dvc config cache.dir` or `dvc cache dir`. - `cache.s3` - name of an Amazon S3 remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). + [external cache](/doc/user-guide/external-outputs#examples). - `cache.gs` - name of a Google Cloud Storage remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). + [external cache](/doc/user-guide/external-outputs#examples). - `cache.ssh` - name of an SSH remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). + [external cache](/doc/user-guide/external-outputs#examples). - `cache.hdfs` - name of an HDFS remote to use as - [external cache](/doc/user-guide/managing-external-data#examples). + [external cache](/doc/user-guide/external-outputs#examples). - `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as - [external cache](/doc/user-guide/managing-external-data#examples). + [external cache](/doc/user-guide/external-outputs#examples). -> Avoid using the same [DVC remote](/doc/command-reference/remote) (used for -> `dvc push`, `dvc pull`, etc.)
as external cache, because it may cause file +> ⚠️ Avoid using the same [remote storage](/doc/command-reference/remote) used +> for `dvc push` and `dvc pull` as external cache, because it may cause file > hash overlaps: the hash of an external output could collide with -> a hash generated locally for another file with different content. +> that of a local file with different content. ### state diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index c6862d8850..254634238e 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -101,8 +101,9 @@ Relevant notes: for more info.) - [external dependencies](/doc/user-guide/external-dependencies) and - [external outputs](/doc/user-guide/managing-external-data) (outside of the - workspace) are also supported (except metrics and plots). + [external outputs](/doc/user-guide/external-outputs) (outside of the + workspace) are also supported (except metrics and plots), + although not usually recommended. - Outputs are deleted from the workspace before executing the command (including at `dvc repro`) if their paths are found as existing files/directories (unless @@ -259,7 +260,8 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR' > considered "always changed", so this option has no effect in those cases. - `--external` - allow writing outputs outside of the DVC repository. See - [Managing External Data](/doc/user-guide/managing-external-data). + [External Outputs](/doc/user-guide/external-outputs) — not usually + recommended. - `--desc ` - user description of the stage (optional). This doesn't affect any DVC operations. 
diff --git a/content/docs/command-reference/version.md b/content/docs/command-reference/version.md index 4a1dae6f2c..b8673569a4 100644 --- a/content/docs/command-reference/version.md +++ b/content/docs/command-reference/version.md @@ -19,7 +19,7 @@ usage: dvc version [-h] [-q | -v] | `Supports` | Types of [remote storage](/doc/command-reference/remote/add#supported-storage-types) supported by the current DVC setup (their required dependencies are installed) | | `Cache types` | [Types of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) supported (between workspace and cache) | | `Cache directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the cache directory is mounted | -| `Caches` | Cache [location types](/doc/user-guide/managing-external-data) configured in the repo (e.g. local, SSH, S3, etc.) | +| `Caches` | Cache [location types](/doc/user-guide/external-outputs) configured in the repo (e.g. local, SSH, S3, etc.) | | `Remotes` | Remote [location types](/doc/command-reference/remote/add#supported-storage-types) configured in the repo (e.g. SSH, S3, Google Drive, etc.) | | `Workspace directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the workspace is mounted | | `Repo` | Shows whether we are in a DVC repo and/or Git repo | diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 720f0f9a14..40745cacbd 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -131,8 +131,8 @@ "large-dataset-optimization", "external-dependencies", { - "label": "Managing External Data", - "slug": "managing-external-data" + "label": "External Outputs", + "slug": "external-outputs" }, { "label": "Contributing", diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-versioning.md index c26dc16619..619a507bca 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-versioning.md @@ -257,10 +257,10 @@ volume? 
While these cases are not covered in the Get Started, we recommend reading the following sections next to learn more about advanced workflows: -- A shared [external cache](/doc/use-cases/shared-development-server) can be set +- A [shared external cache](/doc/use-cases/shared-development-server) can be set up to store, version and access a lot of data on a large shared volume efficiently. - A quite advanced scenario is to track and version data directly on the remote storage (e.g. S3). Check out - [Managing External Data](https://dvc.org/doc/user-guide/managing-external-data) - to learn more. + [External Outputs](https://dvc.org/doc/user-guide/external-outputs) to learn + more. diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 76381e088b..51f6a36bf3 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -1,25 +1,24 @@ # External Dependencies There are cases when data is so large, or its processing is organized in such a -way, that its preferable to avoid moving it from its original location. For -example data on a network attached storage (NAS), processing data on HDFS, +way, that it's preferable to avoid moving it from its current external location. +For example data on a network attached storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via SSH, or for a script that streams data from S3 to process it. -External dependencies and -[external outputs](/doc/user-guide/managing-external-data) provide ways to track -and version data outside of the project. +External dependencies (and [external outputs](/doc/user-guide/external-outputs)) +provide ways to track (and version) data outside of the project. ## How external dependencies work -External dependencies are considered part of the (extended) DVC -project: DVC will track them, detecting when they change (triggering stage -executions on `dvc repro`, for example).
+External dependencies will be tracked by DVC, detecting when they +change (triggering stage executions on `dvc repro`, for example). To define files or directories in an external location as -[stage](/doc/command-reference/run) dependencies, put their remote URLs or +[stage](/doc/command-reference/run) dependencies, specify their remote URLs or external paths in `dvc.yaml` (`deps` field). Use the same format as the `url` of -certain `dvc remote` types. Currently, the following protocols are supported: +certain `dvc remote` types. The following `dvc remote` types/protocols are +currently supported: - Amazon S3 - Microsoft Azure Blob Storage diff --git a/content/docs/user-guide/managing-external-data.md b/content/docs/user-guide/external-outputs.md similarity index 75% rename from content/docs/user-guide/managing-external-data.md rename to content/docs/user-guide/external-outputs.md index 34c511c89c..871c038314 100644 --- a/content/docs/user-guide/managing-external-data.md +++ b/content/docs/user-guide/external-outputs.md @@ -1,52 +1,49 @@ -# Managing External Data +# External Outputs -> ⚠️ This is an advanced feature that we don't recommend using unless you really -> know what you are doing. Artifacts added with --external are not affected by -> `dvc push/pull/status -c`. You are likely looking for -> [straight-to-remote/cache](https://github.com/iterative/dvc/issues/4520) -> functionality or `dvc import-url` +> ⚠️ This is an advanced feature for very specific situations and not +> recommended except if there's absolutely no other alternative. In most cases +> alternatives like the `--to-cache` or `--to-remote` options of `dvc add` and +> `dvc import-url` are more convenient. **Note** that external outputs are not +> pushed or pulled from/to [remote storage](/doc/command-reference/remote). There are cases when data is so large, or its processing is organized in such a -way, that its preferable to avoid moving it from its original location.
For -example data on a network attached storage (NAS), processing data on HDFS, -running [Dask](https://dask.org/) via SSH, or for a script that streams data -from S3 to process it. +way, that it's impossible to handle it on the local machine's disk. For example +versioning existing data on a network attached storage (NAS), processing data on +HDFS, running [Dask](https://dask.org/) via SSH, or any code that generates +massive files directly to the cloud. -External outputs and -[external dependencies](/doc/user-guide/external-dependencies) provide ways to +External outputs (and +[external dependencies](/doc/user-guide/external-dependencies)) provide ways to track and version data outside of the project. ## How external outputs work -External outputs are considered part of the (extended) DVC project: -DVC will track them for +External outputs are considered part of the (extended) +workspace: DVC will track them for [versioning](/doc/use-cases/versioning-data-and-model-files), detecting when they change (reported by `dvc status`, for example). -To use existing files or directories in an external location as -[stage](/doc/command-reference/run) outputs, give their remote URLs or external -paths to `dvc add`, or put them in `dvc.yaml` (`deps` field). Use the same -format as the `url` of certain `dvc remote` types. Currently, the following -protocols are supported: +To use existing files or directories in an external location as outputs, give +their remote URLs or external paths to `dvc add`, or put them in `dvc.yaml` +(`outs` field). Use the same format as the `url` of the following supported +`dvc remote` types/protocols: - Amazon S3 - Google Cloud Storage - SSH - HDFS -- Local files and directories outside the workspace +- Local files and directories outside the workspace -External outputs require an +⚠️ External outputs require an [external cache](/doc/use-cases/shared-development-server#configure-the-external-shared-cache) in the same external/remote file.
-> Note that [remote storage](/doc/command-reference/remote) is a different -> feature, and that external outputs are not pushed or pulled from/to DVC -> remotes. +> Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. as +> external cache, because it may cause data collisions: the hash of an external +> output could collide with that of a local file with different content. -> ⚠️ Avoid using the same DVC remote used for `dvc push`, `dvc pull`, etc. for -> external outputs, because it may cause data collisions: the hash of an -> external output could collide with that of a local file with different -> content. +> Note that [remote storage](/doc/command-reference/remote) is a different +> feature. ## Examples diff --git a/content/docs/user-guide/project-structure/dvc-files.md b/content/docs/user-guide/project-structure/dvc-files.md index 2683d512a3..ce488f5375 100644 --- a/content/docs/user-guide/project-structure/dvc-files.md +++ b/content/docs/user-guide/project-structure/dvc-files.md @@ -1,13 +1,12 @@ # `.dvc` Files You can use `dvc add` to track data files or directories located in your current -workspace, or in supported -[external locations](/doc/user-guide/managing-external-data). Additionally, -`dvc import` and `dvc import-url` let you bring data from external locations to -your project, and start tracking it locally. +workspace\*. Additionally, `dvc import` and `dvc import-url` let +you bring data from external locations to your project, and start tracking it +locally. See [Data Versioning](/doc/start/data-versioning) for more info. -> See [Data Versioning](/doc/start/data-versioning) and -> [Data Access](/doc/start/data-access) for more info. +> \* Certain [external locations](/doc/user-guide/external-outputs) are also +> supported. Files ending with the `.dvc` extension ("dot DVC file") are created by these commands as data placeholders that can be versioned with Git. 
They contain the @@ -55,16 +54,16 @@ Comments can be entered using the `# comment` format. The following subfields may be present under `outs` entries: -| Field | Description | -| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `path` | (Required) Path to the file or directory (relative to `wdir`, which defaults to the file's location) | -| `md5`
`etag`
`checksum` | Hash value for the file or directory being tracked with DVC. MD5 is used for most locations (local file system and SSH); [ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) for HTTP, S3, or Azure [external outputs](/doc/user-guide/managing-external-data); and a special _checksum_ for HDFS and WebHDFS. | -| `size` | Size of the file or directory (sum of all files). | -| `nfiles` | If this output is a directory, the number of files inside (recursive). | -| `isexec` | Whether this is an executable file. DVC preserves execute permissions upon `dvc checkout` and `dvc pull`. This has no effect on directories, or in general on Windows. | -| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. | -| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts) | -| `desc` | (Optional) user description for this output (supported in metrics and plots too). This doesn't affect any DVC operations. | +| Field | Description | +| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | (Required) Path to the file or directory (relative to `wdir`, which defaults to the file's location) | +| `md5`
`etag`
`checksum` | Hash value for the file or directory being tracked with DVC. MD5 is used for most locations (local file system and SSH); [ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) for HTTP, S3, or Azure [external outputs](/doc/user-guide/external-outputs); and a special _checksum_ for HDFS and WebHDFS. | +| `size` | Size of the file or directory (sum of all files). | +| `nfiles` | If this output is a directory, the number of files inside (recursive). | +| `isexec` | Whether this is an executable file. DVC preserves execute permissions upon `dvc checkout` and `dvc pull`. This has no effect on directories, or in general on Windows. | +| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. | +| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts) | +| `desc` | (Optional) user description for this output (supported in metrics and plots too). This doesn't affect any DVC operations. 
| ## Dependency entries diff --git a/redirects-list.json b/redirects-list.json index 52a08710d8..796c2b71a9 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -37,6 +37,7 @@ "^/doc/user-guide/dvc-files/.dvc$ /doc/user-guide/project-structure/dvc-files", "^/doc/user-guide/dvc-internals(/.*)?$ /doc/user-guide/project-structure/internal-files$1", "^/doc/user-guide/dvcignore$ /doc/user-guide/project-structure/dvcignore-files", + "^/doc/user-guide/managing-external-data$ /doc/user-guide/external-outputs", "^/doc/understanding-dvc(/.*)?$ /doc/user-guide/what-is-dvc", "^/doc/commands-reference(/.*)?$ /doc/command-reference$1", "^/doc/command-reference/plot$ /doc/command-reference/plots", From 8dea9639424e3928ec83bb56716ae2dee022d873 Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Thu, 18 Feb 2021 13:36:35 +0300 Subject: [PATCH 02/19] Added changes from PR #2188 and modified paths & titles - Changes title of "Data Access" to "Data and Model Access" - Changes title of "Data Versioning" to "Data and Model Versioning" - Renames path of Data Access and Data Versioning to `data-and-model-access.md` and `data-and-model-versioning.md` respectively. - Adds redirects -- `/doc/start/data-access` -> `/doc/start/data-and-model-access` -- `/doc/start/data-versioning` -> `/doc/start/data-and-model-versioning` - Replaces links in `/doc/start` with the new links.
--- ...ata-access.md => data-and-model-access.md} | 15 +++++++++------ ...ioning.md => data-and-model-versioning.md} | 14 ++++++++++++-- content/docs/start/data-pipelines.md | 2 +- content/docs/start/index.md | 19 ++++++++++--------- redirects-list.json | 2 ++ 5 files changed, 34 insertions(+), 18 deletions(-) rename content/docs/start/{data-access.md => data-and-model-access.md} (87%) rename content/docs/start/{data-versioning.md => data-and-model-versioning.md} (91%) diff --git a/content/docs/start/data-access.md b/content/docs/start/data-and-model-access.md similarity index 87% rename from content/docs/start/data-access.md rename to content/docs/start/data-and-model-access.md index 0c40d58df6..1f3417288e 100644 --- a/content/docs/start/data-access.md +++ b/content/docs/start/data-and-model-access.md @@ -1,13 +1,16 @@ --- -title: 'Get Started: Data Access' +title: 'Get Started: Data and Model Access' --- -# Get Started: Data Access +# Get Started: Data and Model Access -Okay, now that we've learned how to _track_ data and models with DVC and how to -version them with Git, next question is how can we _use_ these artifacts outside -of the project? How do I download a model to deploy it? How do I download a -specific version of a model? How do I reuse datasets across different projects? +Okay, now that we've learned how to _track_ data files in DVC and how to version +them with Git. _Models_ in a machine learning project are also files written and +read by programs and DVC can track and version them similar to data files. + +Next question is how can we _use_ these artifacts outside of the project? How do +I download a model to deploy it? How do I download a specific version of a +model? How do I reuse datasets across different projects? > These questions tend to come up when you browse the files that DVC saves to > remote storage, e.g. 
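Review note: the redirects this series adds to `redirects-list.json` appear in two forms, one with a `(/.*)?$` suffix and one ending in a plain `$`. The sketch below is a rough way to compare them. It assumes each entry is a `<regex> <target>` pair separated by whitespace and that the first matching rule wins; the `resolve` helper is hypothetical and may not reflect how the site actually applies the list. `$1` substitutions are not handled.

```python
import re

# Rules copied from the redirects-list.json hunks in this series.
# Sub-path form: also matches anything below the old path.
RULES_SUBPATH = [
    "^/doc/start/data-versioning(/.*)?$ /doc/start/data-and-model-versioning",
    "^/doc/start/data-access(/.*)?$ /doc/start/data-and-model-access",
]
# Exact form: matches only the old path itself.
RULES_EXACT = [
    "^/doc/start/data-versioning$ /doc/start/data-and-model-versioning",
    "^/doc/start/data-access$ /doc/start/data-and-model-access",
]


def resolve(path, rules):
    """Return the redirect target for `path`, or None if no rule matches.

    Assumption: a rule is "<regex> <target>" and the first match wins.
    """
    for rule in rules:
        pattern, target = rule.split(maxsplit=1)
        if re.search(pattern, path):
            return target
    return None


if __name__ == "__main__":
    for path in ("/doc/start/data-access", "/doc/start/data-access/extra"):
        print(path, "->", resolve(path, RULES_SUBPATH), "|", resolve(path, RULES_EXACT))
```

Under these assumptions, only the sub-path form redirects a URL such as `/doc/start/data-access/extra`; the exact form leaves it unmatched.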
diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-and-model-versioning.md similarity index 91% rename from content/docs/start/data-versioning.md rename to content/docs/start/data-and-model-versioning.md index c26dc16619..3ca1b43bc8 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -1,6 +1,6 @@ --- -title: 'Get Started: Data Versioning' -description: 'Get started with data versioning in DVC. Learn how to use a +title: 'Get Started: Data and Model Versioning' +description: 'Get started with data and model versioning in DVC. Learn how to use a regular Git workflow for datasets and ML models, without storing large files in Git.' --- @@ -247,6 +247,16 @@ defines data file versions. Git itself provides the version control. DVC in turn creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in the workspace efficiently to match them. +## Model versioning + +Apart from data files, DVC eases the way you work with models. Models in a +project usually change more frequently than data files and they need to be kept +in sync with changes in other elements of a project. Model files are no +different than data files when it comes to tracking their versions. DVC also +provides means to track minor changes in model files without fully checking in +to underlying VCS. In later sections of this series, you'll see how DVC enables +to track changes in pipelines consisting of multiple model and data files. + ## Large datasets versioning In cases where you process very large datasets, you need an efficient mechanism diff --git a/content/docs/start/data-pipelines.md b/content/docs/start/data-pipelines.md index 863808c75f..f9fac2ddcf 100644 --- a/content/docs/start/data-pipelines.md +++ b/content/docs/start/data-pipelines.md @@ -143,7 +143,7 @@ stages: There's no need to use `dvc add` for DVC to track stage outputs (`data/prepared` in this case); `dvc run` already took care of this. 
You only need to run `dvc push` if you want to save them to -[remote storage](/doc/tutorials/get-started/data-versioning#storing-and-sharing), +[remote storage](/doc/start/data-and-model-versioning#storing-and-sharing), (usually along with `git commit` to version `dvc.yaml` itself). ## Dependency graphs (DAGs) diff --git a/content/docs/start/index.md b/content/docs/start/index.md index b92e72be3f..65a7959abd 100644 --- a/content/docs/start/index.md +++ b/content/docs/start/index.md @@ -53,15 +53,16 @@ Now you're ready to DVC! DVC's features can be grouped into functional components. We'll explore them one by one in the next few pages: -- [**Data versioning**](/doc/start/data-versioning) (try this next) is the base - layer of DVC for large files, datasets, and machine learning models. Use a - regular Git workflow, but without storing large files in the repo (think "Git - for data"). Data is stored separately, which allows for efficient sharing. - -- [**Data access**](/doc/start/data-access) shows how to use data artifacts from - outside of the project and how to import data artifacts from another DVC - project. This can help to download a specific version of an ML model to a - deployment server or import a model to another project. +- [**Data and model versioning**](/doc/start/data-and-model-versioning) (try + this next) is the base layer of DVC for large files, datasets, and machine + learning models. Use a regular Git workflow, but without storing large files + in the repo (think "Git for data"). Data is stored separately, which allows + for efficient sharing. + +- [**Data and model access**](/doc/start/data-and-model-access) shows how to use + data artifacts from outside of the project and how to import data artifacts + from another DVC project. This can help to download a specific version of an + ML model to a deployment server or import a model to another project. 
- [**Data pipelines**](/doc/start/data-pipelines) describe how models and other data artifacts are built, and provide an efficient way to reproduce them. diff --git a/redirects-list.json b/redirects-list.json index 52a08710d8..f870e50363 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -26,6 +26,8 @@ "^/doc/tutorials/get-started(/.*)?$ /doc/start", "^/doc/tutorials/versioning(/.*)?$ /doc/use-cases/versioning-data-and-model-files/tutorial", "^/doc/tutorials(/.*)? /doc/start", + "^/doc/start/data-versioning(/.*)?$ /doc/start/data-and-model-versioning", + "^/doc/start/data-access(/.*)?$ /doc/start/data-and-model-access", "^/doc/use-cases/data-and-model-files-versioning/?$ /doc/use-cases/versioning-data-and-model-files", "^/doc/user-guide/updating-tracked-files$ /doc/user-guide/how-to/update-tracked-data", From 45ba85126abb0b5d4c98c302258bd131bc4f500f Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 11:10:14 +0300 Subject: [PATCH 03/19] Update redirects-list.json with fixed subsection redirects. Co-authored-by: Jorge Orpinel --- redirects-list.json | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/redirects-list.json b/redirects-list.json index f870e50363..e3f6bd0529 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -26,8 +26,8 @@ "^/doc/tutorials/get-started(/.*)?$ /doc/start", "^/doc/tutorials/versioning(/.*)?$ /doc/use-cases/versioning-data-and-model-files/tutorial", "^/doc/tutorials(/.*)? 
/doc/start", - "^/doc/start/data-versioning(/.*)?$ /doc/start/data-and-model-versioning", - "^/doc/start/data-access(/.*)?$ /doc/start/data-and-model-access", + "^/doc/start/data-versioning$ /doc/start/data-and-model-versioning", + "^/doc/start/data-access$ /doc/start/data-and-model-access", "^/doc/use-cases/data-and-model-files-versioning/?$ /doc/use-cases/versioning-data-and-model-files", "^/doc/user-guide/updating-tracked-files$ /doc/user-guide/how-to/update-tracked-data", From 09dc8ca7c498ee0414c3b6bce7430cc82054f012 Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 11:42:05 +0300 Subject: [PATCH 04/19] Fixed incomplete looking sentence --- content/docs/start/data-and-model-access.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 1f3417288e..64e68e9a67 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -4,12 +4,13 @@ title: 'Get Started: Data and Model Access' # Get Started: Data and Model Access -Okay, now that we've learned how to _track_ data files in DVC and how to version -them with Git. _Models_ in a machine learning project are also files written and -read by programs and DVC can track and version them similar to data files. +Okay, now that we've learned how to _track_ data files in DVC and how to commit +their versions to Git. _Models_ in a machine learning project are also files +written and read by programs and DVC can track and version them similar to data +files. -Next question is how can we _use_ these artifacts outside of the project? How do -I download a model to deploy it? How do I download a specific version of a +Next questions are: How can we _use_ these artifacts outside of the project? How +do I download a model to deploy it? How do I download a specific version of a model? How do I reuse datasets across different projects? 
> These questions tend to come up when you browse the files that DVC saves to From a3b15ba4e4a463fe47884150fc25238b85853cbe Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:09:38 +0300 Subject: [PATCH 05/19] merged into a single paragraph --- content/docs/start/data-and-model-access.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 64e68e9a67..574cca47b7 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -7,11 +7,9 @@ title: 'Get Started: Data and Model Access' Okay, now that we've learned how to _track_ data files in DVC and how to commit their versions to Git. _Models_ in a machine learning project are also files written and read by programs and DVC can track and version them similar to data -files. - -Next questions are: How can we _use_ these artifacts outside of the project? How -do I download a model to deploy it? How do I download a specific version of a -model? How do I reuse datasets across different projects? +files. Next questions are: How can we _use_ these artifacts outside of the +project? How do I download a model to deploy it? How do I download a specific +version of a model? How do I reuse datasets across different projects? > These questions tend to come up when you browse the files that DVC saves to > remote storage, e.g. From 7731587eb58c58805f5ec7c5f398dc385b1dce9c Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:12:52 +0300 Subject: [PATCH 06/19] Divided models sentence and added "large files" phrase. 
--- content/docs/start/data-and-model-access.md | 11 ++++++----- 1 file changed, 6 insertions(+), 5 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 574cca47b7..594e253b2f 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -5,11 +5,12 @@ title: 'Get Started: Data and Model Access' # Get Started: Data and Model Access Okay, now that we've learned how to _track_ data files in DVC and how to commit -their versions to Git. _Models_ in a machine learning project are also files -written and read by programs and DVC can track and version them similar to data -files. Next questions are: How can we _use_ these artifacts outside of the -project? How do I download a model to deploy it? How do I download a specific -version of a model? How do I reuse datasets across different projects? +their versions to Git. _Models_ in a machine learning project are typically +large files written and read by programs. DVC can track and version model files +similar to data files. Next questions are: How can we _use_ these artifacts +outside of the project? How do I download a model to deploy it? How do I +download a specific version of a model? How do I reuse datasets across different +projects? > These questions tend to come up when you browse the files that DVC saves to > remote storage, e.g. 
From bb84a996042c3abfa0e039cfbc96392bdc1f6617 Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:48:15 +0300 Subject: [PATCH 07/19] Adds new paths to sidebar --- content/docs/sidebar.json | 4 ++-- redirects-list.json | 1 + 2 files changed, 3 insertions(+), 2 deletions(-) diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index e6fa86237d..9c769dff50 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -35,13 +35,13 @@ }, "children": [ { - "slug": "data-versioning", + "slug": "data-and-model-versioning", "tutorials": { "katacoda": "https://katacoda.com/dvc/courses/get-started/versioning" } }, { - "slug": "data-access", + "slug": "data-and-model-access", "tutorials": { "katacoda": "https://katacoda.com/dvc/courses/get-started/accessing" } diff --git a/redirects-list.json b/redirects-list.json index e3f6bd0529..9b28f2977a 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -26,6 +26,7 @@ "^/doc/tutorials/get-started(/.*)?$ /doc/start", "^/doc/tutorials/versioning(/.*)?$ /doc/use-cases/versioning-data-and-model-files/tutorial", "^/doc/tutorials(/.*)? 
/doc/start", + "^/doc/start/data-versioning$ /doc/start/data-and-model-versioning", "^/doc/start/data-access$ /doc/start/data-and-model-access", From 9ef97c6c1fa4076e818f1213fd3aa8644df0b952 Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:49:22 +0300 Subject: [PATCH 08/19] Updated links to data-access and data-versioning cmd ref --- content/docs/command-reference/diff.md | 5 +++-- content/docs/command-reference/get.md | 2 +- content/docs/command-reference/import-url.md | 2 +- content/docs/command-reference/import.md | 4 ++-- 4 files changed, 7 insertions(+), 6 deletions(-) diff --git a/content/docs/command-reference/diff.md b/content/docs/command-reference/diff.md index dc2103f2ae..cef6de9a1c 100644 --- a/content/docs/command-reference/diff.md +++ b/content/docs/command-reference/diff.md @@ -123,8 +123,9 @@ $ dvc diff Let's checkout the [2-track-data](https://github.com/iterative/example-get-started/releases/tag/2-track-data) -tag, corresponding to the [Data Versioning](/doc/start/data-versioning) _Get -Started_ chapter, right after we added `data.xml` file with DVC: +tag, corresponding to the +[Data Versioning](/doc/start/data-and-model-versioning) _Get Started_ chapter, +right after we added `data.xml` file with DVC: ```dvc $ git checkout 2-track-data diff --git a/content/docs/command-reference/get.md b/content/docs/command-reference/get.md index 05312cca53..083327b647 100644 --- a/content/docs/command-reference/get.md +++ b/content/docs/command-reference/get.md @@ -151,7 +151,7 @@ file or directory from. It also has the `--out` option to specify the location to place the target data within the workspace. 
Combining these two options allows us to do something we can't achieve with the regular `git checkout` + `dvc checkout` process – see for example the -[Get Older Data Version](/doc/tutorials/get-started/data-versioning#navigate-versions) +[Get Older Data Version](/doc/start/data-and-model-versioning#switching-between-versions) chapter of our _Get Started_. Let's use the diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index c00998fabb..5eba467734 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -187,7 +187,7 @@ $ git checkout 3-config-remote ## Example: Tracking a file from the web An advanced alternate to the intro of the -[Versioning Basics](/doc/tutorials/get-started/data-versioning) part of the _Get +[Versioning Basics](/doc/start/data-and-model-versioning) part of the _Get Started_ is to use `dvc import-url`: ```dvc diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index b16a14b367..46da6e4039 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -67,8 +67,8 @@ data `path`, and the `outs` field contains the corresponding local path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. -To actually [version the data](/doc/tutorials/get-started/data-versioning), -`git add` (and `git commit`) the import `.dvc` file. +To actually [version the data](/doc/start/data-and-model-versioning), `git add` +(and `git commit`) the import `.dvc` file. 
Note that `dvc repro` doesn't check or update import `.dvc` files (see `dvc freeze`), use `dvc update` to bring the import up to date from the data From 2593bb7923715b43802acccdaf82227faa075e6a Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:50:12 +0300 Subject: [PATCH 09/19] updated links to data-access and data-versioning in blog --- content/blog/2020-10-12-october-20-dvc-heartbeat.md | 2 +- content/blog/2020-11-11-november-20-dvc-heartbeat.md | 2 +- content/blog/2020-12-18-december-20-dvc-heartbeat.md | 8 ++++---- 3 files changed, 6 insertions(+), 6 deletions(-) diff --git a/content/blog/2020-10-12-october-20-dvc-heartbeat.md b/content/blog/2020-10-12-october-20-dvc-heartbeat.md index f34c309029..8a4697d612 100644 --- a/content/blog/2020-10-12-october-20-dvc-heartbeat.md +++ b/content/blog/2020-10-12-october-20-dvc-heartbeat.md @@ -107,7 +107,7 @@ few weeks, so stay tuned. Another big initative is adding videos to our docs: since video seems like a popular format for a lot of learners, we're working to supplement our official docs with embedded videos. Check out our first installment on the -[Getting Started with Data Versioning](https://dvc.org/doc/start/data-versioning). +[Getting Started with Data Versioning](https://dvc.org/doc/start/data-and-model-versioning). https://youtu.be/kLKBcPonMYw diff --git a/content/blog/2020-11-11-november-20-dvc-heartbeat.md b/content/blog/2020-11-11-november-20-dvc-heartbeat.md index 6613dea11f..b5ddd08750 100644 --- a/content/blog/2020-11-11-november-20-dvc-heartbeat.md +++ b/content/blog/2020-11-11-november-20-dvc-heartbeat.md @@ -64,7 +64,7 @@ welcome referrals if you know a good candidate)! We're continuing to develop our video docs, and now half of our "Getting Started" section has video accompaniments. 
Check out our latest release on -[data access with DVC](https://dvc.org/doc/start/data-access): +[data access with DVC](https://dvc.org/doc/start/data-and-model-access): https://youtu.be/EE7Gk84OZY8 diff --git a/content/blog/2020-12-18-december-20-dvc-heartbeat.md b/content/blog/2020-12-18-december-20-dvc-heartbeat.md index 1822be48c4..271725f8b8 100644 --- a/content/blog/2020-12-18-december-20-dvc-heartbeat.md +++ b/content/blog/2020-12-18-december-20-dvc-heartbeat.md @@ -53,11 +53,11 @@ As you may have heard on adding complete video docs to the "Getting Started" section of the DVC site. We now have 100% coverage! We have videos that mirror the tutorials for: -- [Data versioning](https://dvc.org/doc/start/data-versioning) - how to use Git - and DVC together to track different versions of a dataset +- [Data versioning](https://dvc.org/doc/start/data-and-model-versioning) - how + to use Git and DVC together to track different versions of a dataset -- [Data access](https://dvc.org/doc/start/data-access) - how to share models and - datasets across projects and environments +- [Data access](https://dvc.org/doc/start/data-and-model-access) - how to share + models and datasets across projects and environments - [Pipelines](https://dvc.org/doc/start/data-pipelines) - how to create reproducible pipelines to transform datasets to features to models From 9ed0867cb7eb7572eba2036e77955c7e9ec2c74a Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:50:55 +0300 Subject: [PATCH 10/19] Updated links to data-access and data-versioning in UC --- content/docs/use-cases/data-registries.md | 8 ++++---- .../use-cases/versioning-data-and-model-files/index.md | 2 +- 2 files changed, 5 insertions(+), 5 deletions(-) diff --git a/content/docs/use-cases/data-registries.md b/content/docs/use-cases/data-registries.md index b9b1adef1c..e2c9882af8 100644 --- a/content/docs/use-cases/data-registries.md +++ b/content/docs/use-cases/data-registries.md @@ -2,10 +2,10 @@ One of the 
main uses of DVC repositories is the [versioning of data and model files](/doc/use-cases/data-and-model-files-versioning). -DVC also enables cross-project [reusability](/doc/start/data-access) of these -data artifacts. This means that your projects can depend on data -from other DVC repositories — like a **package management system for data -science**. +DVC also enables cross-project [reusability](/doc/start/data-and-model-access) +of these data artifacts. This means that your projects can depend +on data from other DVC repositories — like a **package management system for +data science**. ![](/img/data-registry.png) _Data management middleware_ diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 9eff871022..099a4b9905 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -65,7 +65,7 @@ Benefits of our approach include: - **Collaboration**: Easily distribute your project development and share its data [internally](/doc/use-cases/shared-development-server) and [remotely](/doc/use-cases/sharing-data-and-model-files), or - [reuse](/doc/start/data-access) it in other places. + [reuse](/doc/start/data-and-model-access) it in other places. - **Data compliance**: Review data modification attempts as Git [pull requests](https://www.dummies.com/web-design-development/what-are-github-pull-requests/). 
From 3d7d61defce67be42121c5788fe8f7d93d8fd43f Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Sat, 20 Feb 2021 12:51:21 +0300 Subject: [PATCH 11/19] Updated links to data-access and data-versioning in UG --- content/docs/user-guide/project-structure/dvc-files.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/project-structure/dvc-files.md b/content/docs/user-guide/project-structure/dvc-files.md index 2683d512a3..fe5740459d 100644 --- a/content/docs/user-guide/project-structure/dvc-files.md +++ b/content/docs/user-guide/project-structure/dvc-files.md @@ -6,8 +6,8 @@ You can use `dvc add` to track data files or directories located in your current `dvc import` and `dvc import-url` let you bring data from external locations to your project, and start tracking it locally. -> See [Data Versioning](/doc/start/data-versioning) and -> [Data Access](/doc/start/data-access) for more info. +> See [Data Versioning](/doc/start/data-and-model-versioning) and +> [Data Access](/doc/start/data-and-model-access) for more info. Files ending with the `.dvc` extension ("dot DVC file") are created by these commands as data placeholders that can be versioned with Git. 
They contain the From b65de408e12c5c8b1854c130db6da685878e26ac Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Mon, 22 Feb 2021 21:14:45 +0300 Subject: [PATCH 12/19] updated yarn.lock --- yarn.lock | 7 ++++++- 1 file changed, 6 insertions(+), 1 deletion(-) diff --git a/yarn.lock b/yarn.lock index af66d9231b..e9e5b57e02 100644 --- a/yarn.lock +++ b/yarn.lock @@ -13378,11 +13378,16 @@ prettier-linter-helpers@^1.0.0: dependencies: fast-diff "^1.1.2" -prettier@^2.0.4, prettier@^2.0.5: +prettier@^2.0.5: version "2.0.5" resolved "https://registry.yarnpkg.com/prettier/-/prettier-2.0.5.tgz#d6d56282455243f2f92cc1716692c08aa31522d4" integrity sha512-7PtVymN48hGcO4fGjybyBSIWDsLU4H4XlvOHfq91pz9kkGlonzwTfYkaIEwiRg/dAJF9YlbsduBAgtYLi+8cFg== +prettier@^2.2.1: + version "2.2.1" + resolved "https://registry.yarnpkg.com/prettier/-/prettier-2.2.1.tgz#795a1a78dd52f073da0cd42b21f9c91381923ff5" + integrity sha512-PqyhM2yCjg/oKkFPtTGUojv7gnZAoG80ttl45O6x2Ug/rMJw4wcc9k6aaf2hibP7BGVCCM33gZoGjyvt9mm16Q== + pretty-bytes@^5.3.0: version "5.3.0" resolved "https://registry.yarnpkg.com/pretty-bytes/-/pretty-bytes-5.3.0.tgz#f2849e27db79fb4d6cfe24764fc4134f165989f2" From 3555c5ee36a892fa8082d8da3261880c3faf9c99 Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Tue, 23 Feb 2021 11:23:21 +0300 Subject: [PATCH 13/19] Update content/docs/start/data-and-model-versioning.md Co-authored-by: Jorge Orpinel --- content/docs/start/data-and-model-versioning.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 3ca1b43bc8..7b982e1c33 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -249,7 +249,7 @@ the workspace efficiently to match them. ## Model versioning -Apart from data files, DVC eases the way you work with models. Models in a +DVC helps you handle model files as well. 
Models in a project usually change more frequently than data files and they need to be kept in sync with changes in other elements of a project. Model files are no different than data files when it comes to tracking their versions. DVC also From b83d00d1a58143c10eee548d8abec44ea1dbb59c Mon Sep 17 00:00:00 2001 From: "Restyled.io" Date: Wed, 24 Feb 2021 11:21:07 +0000 Subject: [PATCH 14/19] Restyled by prettier --- content/docs/start/data-and-model-versioning.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 7b982e1c33..f8d8f36f3d 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -249,13 +249,13 @@ the workspace efficiently to match them. ## Model versioning -DVC helps you handle model files as well. Models in a -project usually change more frequently than data files and they need to be kept -in sync with changes in other elements of a project. Model files are no -different than data files when it comes to tracking their versions. DVC also -provides means to track minor changes in model files without fully checking in -to underlying VCS. In later sections of this series, you'll see how DVC enables -to track changes in pipelines consisting of multiple model and data files. +DVC helps you handle model files as well. Models in a project usually change +more frequently than data files and they need to be kept in sync with changes in +other elements of a project. Model files are no different than data files when +it comes to tracking their versions. DVC also provides means to track minor +changes in model files without fully checking in to underlying VCS. In later +sections of this series, you'll see how DVC enables to track changes in +pipelines consisting of multiple model and data files. 
## Large datasets versioning From e6d6bf7518df4ec09e3818829ea4708bd15a3f3f Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Wed, 24 Feb 2021 14:51:09 +0300 Subject: [PATCH 15/19] fixes hardcoded links to data-and-model-access in the blog --- .../blog/2020-10-12-october-20-dvc-heartbeat.md | 2 +- .../blog/2020-11-11-november-20-dvc-heartbeat.md | 2 +- .../blog/2020-12-18-december-20-dvc-heartbeat.md | 16 ++++++++-------- 3 files changed, 10 insertions(+), 10 deletions(-) diff --git a/content/blog/2020-10-12-october-20-dvc-heartbeat.md b/content/blog/2020-10-12-october-20-dvc-heartbeat.md index 8a4697d612..6a2e031020 100644 --- a/content/blog/2020-10-12-october-20-dvc-heartbeat.md +++ b/content/blog/2020-10-12-october-20-dvc-heartbeat.md @@ -107,7 +107,7 @@ few weeks, so stay tuned. Another big initative is adding videos to our docs: since video seems like a popular format for a lot of learners, we're working to supplement our official docs with embedded videos. Check out our first installment on the -[Getting Started with Data Versioning](https://dvc.org/doc/start/data-and-model-versioning). +[Getting Started with Data Versioning](/doc/start/data-and-model-versioning). https://youtu.be/kLKBcPonMYw diff --git a/content/blog/2020-11-11-november-20-dvc-heartbeat.md b/content/blog/2020-11-11-november-20-dvc-heartbeat.md index b5ddd08750..bf058e973b 100644 --- a/content/blog/2020-11-11-november-20-dvc-heartbeat.md +++ b/content/blog/2020-11-11-november-20-dvc-heartbeat.md @@ -64,7 +64,7 @@ welcome referrals if you know a good candidate)! We're continuing to develop our video docs, and now half of our "Getting Started" section has video accompaniments. 
Check out our latest release on -[data access with DVC](https://dvc.org/doc/start/data-and-model-access): +[data access with DVC](/doc/start/data-and-model-access): https://youtu.be/EE7Gk84OZY8 diff --git a/content/blog/2020-12-18-december-20-dvc-heartbeat.md b/content/blog/2020-12-18-december-20-dvc-heartbeat.md index 271725f8b8..ab4599b4ff 100644 --- a/content/blog/2020-12-18-december-20-dvc-heartbeat.md +++ b/content/blog/2020-12-18-december-20-dvc-heartbeat.md @@ -53,17 +53,17 @@ As you may have heard on adding complete video docs to the "Getting Started" section of the DVC site. We now have 100% coverage! We have videos that mirror the tutorials for: -- [Data versioning](https://dvc.org/doc/start/data-and-model-versioning) - how - to use Git and DVC together to track different versions of a dataset +- [Data versioning](/doc/start/data-and-model-versioning) - how to use Git and + DVC together to track different versions of a dataset -- [Data access](https://dvc.org/doc/start/data-and-model-access) - how to share - models and datasets across projects and environments +- [Data access](/doc/start/data-and-model-access) - how to share models and + datasets across projects and environments -- [Pipelines](https://dvc.org/doc/start/data-pipelines) - how to create - reproducible pipelines to transform datasets to features to models +- [Pipelines](/doc/start/data-pipelines) - how to create reproducible pipelines + to transform datasets to features to models -- [Experiments](https://dvc.org/doc/start/experiments) - how to do a `git diff` - for models that compares and visualizes metrics +- [Experiments](/doc/start/experiments) - how to do a `git diff` for models that + compares and visualizes metrics https://media.giphy.com/media/L4ZZNbDpOCfiX8uYSd/giphy.gif From b46cca318f07ee9bf04e8259107ac2c2e3edc0db Mon Sep 17 00:00:00 2001 From: Emre Sahin Date: Wed, 24 Feb 2021 15:11:16 +0300 Subject: [PATCH 16/19] minor fixes --- content/docs/start/data-and-model-access.md | 13 
++++++------- content/docs/start/data-and-model-versioning.md | 8 ++++---- redirects-list.json | 5 ++--- 3 files changed, 12 insertions(+), 14 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index 594e253b2f..a7266d2c82 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -4,13 +4,12 @@ title: 'Get Started: Data and Model Access' # Get Started: Data and Model Access -Okay, now that we've learned how to _track_ data files in DVC and how to commit -their versions to Git. _Models_ in a machine learning project are typically -large files written and read by programs. DVC can track and version model files -similar to data files. Next questions are: How can we _use_ these artifacts -outside of the project? How do I download a model to deploy it? How do I -download a specific version of a model? How do I reuse datasets across different -projects? +We've learned how to _track_ data files in DVC and how to commit their versions +to Git. Machine learning models are typically large files written and read by +programs. DVC can track and version model files similar to data files. The next +questions are: How can we _use_ these artifacts outside of the project? How do I +download a model to deploy it? How do I download a specific version of a model? +How do I reuse datasets across different projects? > These questions tend to come up when you browse the files that DVC saves to > remote storage, e.g. diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index f8d8f36f3d..57583e016a 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -249,13 +249,13 @@ the workspace efficiently to match them. ## Model versioning -DVC helps you handle model files as well. Models in a project usually change +DVC helps you to handle model files as well. 
Models in a project usually change more frequently than data files and they need to be kept in sync with changes in other elements of a project. Model files are no different than data files when it comes to tracking their versions. DVC also provides means to track minor -changes in model files without fully checking in to underlying VCS. In later -sections of this series, you'll see how DVC enables to track changes in -pipelines consisting of multiple model and data files. +changes in model files without fully checking in to Git. In later sections of +this series, you'll see how DVC enables to track changes to synchronize multiple +model and data files. ## Large datasets versioning diff --git a/redirects-list.json b/redirects-list.json index 9b28f2977a..6778768558 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -22,14 +22,13 @@ "^/(?:docs|documentation)(/.*)?$ /doc$1", "^/doc/get-started(/.*)?$ /doc/start", + "^/doc/start/data-versioning$ /doc/start/data-and-model-versioning", + "^/doc/start/data-access$ /doc/start/data-and-model-access", "^/doc/tutorial(/.*)?$ /doc/start", "^/doc/tutorials/get-started(/.*)?$ /doc/start", "^/doc/tutorials/versioning(/.*)?$ /doc/use-cases/versioning-data-and-model-files/tutorial", "^/doc/tutorials(/.*)? 
/doc/start", - "^/doc/start/data-versioning$ /doc/start/data-and-model-versioning", - "^/doc/start/data-access$ /doc/start/data-and-model-access", - "^/doc/use-cases/data-and-model-files-versioning/?$ /doc/use-cases/versioning-data-and-model-files", "^/doc/user-guide/updating-tracked-files$ /doc/user-guide/how-to/update-tracked-data", "^/doc/user-guide/how-to/update-tracked-files$ /doc/user-guide/how-to/update-tracked-data", From 49eefb0689799e82da741bff91563e578a6d8a01 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 13 Mar 2021 21:55:44 -0700 Subject: [PATCH 17/19] guide: revert Exp Outs guide rename per https://github.com/iterative/dvc.org/pull/2154#pullrequestreview-610353476 --- content/docs/command-reference/add.md | 2 +- content/docs/command-reference/run.md | 8 +++---- content/docs/command-reference/version.md | 2 +- content/docs/sidebar.json | 4 ++-- content/docs/start/data-versioning.md | 6 ++--- .../docs/user-guide/external-dependencies.md | 5 ++-- ...l-outputs.md => managing-external-data.md} | 9 ++++--- .../user-guide/project-structure/dvc-files.md | 24 +++++++++---------- redirects-list.json | 1 - 9 files changed, 31 insertions(+), 30 deletions(-) rename content/docs/user-guide/{external-outputs.md => managing-external-data.md} (93%) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 4b9c1d52d9..d59ded757c 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -146,7 +146,7 @@ not. Shell style wildcards supported: `*`, `?`, `[seq]`, `[!seq]`, and `**` - `--external` - allow `targets` that are outside of the DVC repository. See - [External Outputs](/doc/user-guide/external-outputs). + [Managing External Data](/doc/user-guide/managing-external-data). > ⚠️ Note that this is an advanced feature for very specific situations and > not recommended except if there's absolutely no other alternative. 
diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index b3a871d01e..b78d37e76c 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -101,9 +101,8 @@ Relevant notes: for more info.) - [external dependencies](/doc/user-guide/external-dependencies) and - [external outputs](/doc/user-guide/external-outputs) (outside of the - workspace) are also supported (except metrics and plots), - although not usually recommended. + [external outputs](/doc/user-guide/managing-external-data) (outside of the + workspace) are also supported (except metrics and plots). - Outputs are deleted from the workspace before executing the command (including at `dvc repro`) if their paths are found as existing files/directories (unless @@ -264,8 +263,7 @@ $ dvc run -n second_stage './another_script.sh $MYENVVAR' > considered "always changed", so this option has no effect in those cases. - `--external` - allow writing outputs outside of the DVC repository. See - [External Outputs](/doc/user-guide/external-outputs) — not usually - recommended. + [Managing External Data](/doc/user-guide/managing-external-data). - `--desc ` - user description of the stage (optional). This doesn't affect any DVC operations. diff --git a/content/docs/command-reference/version.md b/content/docs/command-reference/version.md index b8673569a4..4a1dae6f2c 100644 --- a/content/docs/command-reference/version.md +++ b/content/docs/command-reference/version.md @@ -19,7 +19,7 @@ usage: dvc version [-h] [-q | -v] | `Supports` | Types of [remote storage](/doc/command-reference/remote/add#supported-storage-types) supported by the current DVC setup (their required dependencies are installed) | | `Cache types` | [Types of links](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) supported (between workspace and cache) | | `Cache directory` | Filesystem type (e.g. ext4, FAT, etc.) 
and drive on which the cache directory is mounted | -| `Caches` | Cache [location types](/doc/user-guide/external-outputs) configured in the repo (e.g. local, SSH, S3, etc.) | +| `Caches` | Cache [location types](/doc/user-guide/managing-external-data) configured in the repo (e.g. local, SSH, S3, etc.) | | `Remotes` | Remote [location types](/doc/command-reference/remote/add#supported-storage-types) configured in the repo (e.g. SSH, S3, Google Drive, etc.) | | `Workspace directory` | Filesystem type (e.g. ext4, FAT, etc.) and drive on which the workspace is mounted | | `Repo` | Shows whether we are in a DVC repo and/or Git repo | diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 0b0f8d7332..2bd1f2fb3e 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -136,8 +136,8 @@ "large-dataset-optimization", "external-dependencies", { - "label": "External Outputs", - "slug": "external-outputs" + "label": "Managing External Data", + "slug": "managing-external-data" }, { "label": "Contributing", diff --git a/content/docs/start/data-versioning.md b/content/docs/start/data-versioning.md index 4b1718af0e..49f8801246 100644 --- a/content/docs/start/data-versioning.md +++ b/content/docs/start/data-versioning.md @@ -256,10 +256,10 @@ volume? While these cases are not covered in the Get Started, we recommend reading the following sections next to learn more about advanced workflows: -- A [shared external cache](/doc/use-cases/shared-development-server) can be set +- A shared [external cache](/doc/use-cases/shared-development-server) can be set up to store, version and access a lot of data on a large shared volume efficiently. - A quite advanced scenario is to track and version data directly on the remote storage (e.g. S3). Check out - [External Outputs](https://dvc.org/doc/user-guide/external-outputs) to learn - more. + [Managing External Data](https://dvc.org/doc/user-guide/managing-external-data) + to learn more. 
diff --git a/content/docs/user-guide/external-dependencies.md b/content/docs/user-guide/external-dependencies.md index 8260dd6bac..87b0645c54 100644 --- a/content/docs/user-guide/external-dependencies.md +++ b/content/docs/user-guide/external-dependencies.md @@ -6,8 +6,9 @@ For example data on a network attached storage (NAS), processing data on HDFS, running [Dask](https://dask.org/) via SSH, or for a script that streams data from S3 to process it. -External dependencies (and [external outputs](/doc/user-guide/external-outputs)) -provide ways to track (and version) data outside of the project. +External dependencies and +[external outputs](/doc/user-guide/managing-external-data) provide ways to track +and version data outside of the project. ## How external dependencies work diff --git a/content/docs/user-guide/external-outputs.md b/content/docs/user-guide/managing-external-data.md similarity index 93% rename from content/docs/user-guide/external-outputs.md rename to content/docs/user-guide/managing-external-data.md index e8a62b2c9e..b7779ea02a 100644 --- a/content/docs/user-guide/external-outputs.md +++ b/content/docs/user-guide/managing-external-data.md @@ -2,9 +2,12 @@ > ⚠️ This is an advanced feature for very specific situations and not > recommended except if there's absolutely no other alternative. In most cases -> alternatives like the `--to-cache` or `--to-remote` options of `dvc add` and -> `dvc import-url` are more convenient. **Note** that external outputs are not -> pushed or pulled from/to [remote storage](/doc/command-reference/remote). +> alternatives like the +> [to-cache](/doc/command-reference/add#example-transfer-to-the-cache) or +> [to-remote](/doc/command-reference/add#example-transfer-to-remote-storage) +> strategies of `dvc add` and `dvc import-url` are more convenient. **Note** +> that external outputs are not pushed or pulled from/to +> [remote storage](/doc/command-reference/remote). 
There are cases when data is so large, or its processing is organized in such a way, that its impossible to handle it in the local machine disk. For example diff --git a/content/docs/user-guide/project-structure/dvc-files.md b/content/docs/user-guide/project-structure/dvc-files.md index ce488f5375..fd60691f8b 100644 --- a/content/docs/user-guide/project-structure/dvc-files.md +++ b/content/docs/user-guide/project-structure/dvc-files.md @@ -5,8 +5,8 @@ You can use `dvc add` to track data files or directories located in your current you bring data from external locations to your project, and start tracking it locally. See [Data Versioning](/doc/start/data-versioning) for more info. -> \* Certain [external locations](/doc/user-guide/external-outputs) are also -> supported. +> \* Certain [external locations](/doc/user-guide/managing-external-data) are +> also supported. Files ending with the `.dvc` extension ("dot DVC file") are created by these commands as data placeholders that can be versioned with Git. They contain the @@ -54,16 +54,16 @@ Comments can be entered using the `# comment` format. The following subfields may be present under `outs` entries: -| Field | Description | -| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | -| `path` | (Required) Path to the file or directory (relative to `wdir`, which defaults to the file's location) | -| `md5`
`etag`
`checksum` | Hash value for the file or directory being tracked with DVC. MD5 is used for most locations (local file system and SSH); [ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) for HTTP, S3, or Azure [external outputs](/doc/user-guide/external-outputs); and a special _checksum_ for HDFS and WebHDFS. | -| `size` | Size of the file or directory (sum of all files). | -| `nfiles` | If this output is a directory, the number of files inside (recursive). | -| `isexec` | Whether this is an executable file. DVC preserves execute permissions upon `dvc checkout` and `dvc pull`. This has no effect on directories, or in general on Windows. | -| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. | -| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts) | -| `desc` | (Optional) user description for this output (supported in metrics and plots too). This doesn't affect any DVC operations. | +| Field | Description | +| ------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | +| `path` | (Required) Path to the file or directory (relative to `wdir`, which defaults to the file's location) | +| `md5`
`etag`
`checksum` | Hash value for the file or directory being tracked with DVC. MD5 is used for most locations (local file system and SSH); [ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) for HTTP, S3, or Azure [external outputs](/doc/user-guide/managing-external-data); and a special _checksum_ for HDFS and WebHDFS. | +| `size` | Size of the file or directory (sum of all files). | +| `nfiles` | If this output is a directory, the number of files inside (recursive). | +| `isexec` | Whether this is an executable file. DVC preserves execute permissions upon `dvc checkout` and `dvc pull`. This has no effect on directories, or in general on Windows. | +| `cache` | Whether or not this file or directory is cached (`true` by default). See the `--no-commit` option of `dvc add`. | +| `persist` | Whether the output file/dir should remain in place while `dvc repro` runs (`false` by default: outputs are deleted when `dvc repro` starts) | +| `desc` | (Optional) user description for this output (supported in metrics and plots too). This doesn't affect any DVC operations. 
| ## Dependency entries diff --git a/redirects-list.json b/redirects-list.json index 796c2b71a9..52a08710d8 100644 --- a/redirects-list.json +++ b/redirects-list.json @@ -37,7 +37,6 @@ "^/doc/user-guide/dvc-files/.dvc$ /doc/user-guide/project-structure/dvc-files", "^/doc/user-guide/dvc-internals(/.*)?$ /doc/user-guide/project-structure/internal-files$1", "^/doc/user-guide/dvcignore$ /doc/user-guide/project-structure/dvcignore-files", - "^/doc/user-guide/managing-external-data$ /doc/user-guide/user-guide/external-outputs", "^/doc/understanding-dvc(/.*)?$ /doc/user-guide/what-is-dvc", "^/doc/commands-reference(/.*)?$ /doc/command-reference$1", "^/doc/command-reference/plot$ /doc/command-reference/plots", From c1433426b79921b2802b0426f12282cca99888ff Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 28 Mar 2021 23:58:13 -0600 Subject: [PATCH 18/19] start: roll back unnecessary changes unnecessary for #2214 --- content/docs/start/data-and-model-access.md | 10 ++++------ content/docs/start/data-and-model-versioning.md | 10 ---------- 2 files changed, 4 insertions(+), 16 deletions(-) diff --git a/content/docs/start/data-and-model-access.md b/content/docs/start/data-and-model-access.md index a7266d2c82..776db281f2 100644 --- a/content/docs/start/data-and-model-access.md +++ b/content/docs/start/data-and-model-access.md @@ -4,12 +4,10 @@ title: 'Get Started: Data and Model Access' # Get Started: Data and Model Access -We've learned how to _track_ data files in DVC and how to commit their versions -to Git. Machine learning models are typically large files written and read by -programs. DVC can track and version model files similar to data files. The next -questions are: How can we _use_ these artifacts outside of the project? How do I -download a model to deploy it? How do I download a specific version of a model? -How do I reuse datasets across different projects? +Okay, we've learned how to _track_ data and models with DVC, and how to commit +their versions to Git. 
The next questions are: How can we _use_ these artifacts +outside of the project? How do I download a model to deploy it? How to download +a specific version of a model? Or reuse datasets across different projects? > These questions tend to come up when you browse the files that DVC saves to > remote storage, e.g. diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 57583e016a..aff32104b1 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -247,16 +247,6 @@ defines data file versions. Git itself provides the version control. DVC in turn creates these `.dvc` files, updates them, and synchronizes DVC-tracked data in the workspace efficiently to match them. -## Model versioning - -DVC helps you to handle model files as well. Models in a project usually change -more frequently than data files and they need to be kept in sync with changes in -other elements of a project. Model files are no different than data files when -it comes to tracking their versions. DVC also provides means to track minor -changes in model files without fully checking in to Git. In later sections of -this series, you'll see how DVC enables to track changes to synchronize multiple -model and data files. 
- ## Large datasets versioning In cases where you process very large datasets, you need an efficient mechanism From a4ed206686844a8e5e36030e346e7c077231f6d7 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 28 Mar 2021 21:14:34 -0600 Subject: [PATCH 19/19] start: emphasize models are files (assumption) --- .../docs/start/data-and-model-versioning.md | 22 +++++++++---------- 1 file changed, 11 insertions(+), 11 deletions(-) diff --git a/content/docs/start/data-and-model-versioning.md b/content/docs/start/data-and-model-versioning.md index 3de8a103fa..9f61fc1e05 100644 --- a/content/docs/start/data-and-model-versioning.md +++ b/content/docs/start/data-and-model-versioning.md @@ -14,8 +14,8 @@ to a different version of a 100Gb file in less than a second with a `git checkout`. The foundation of DVC consists of a few commands that you can run along with -`git` to track large files, directories, or ML models. Think "Git for data". -Read on or watch our video to learn about versioning data with DVC! +`git` to track large files, directories, or ML model files. Think "Git for +data". Read on or watch our video to learn about versioning data with DVC! https://youtu.be/kLKBcPonMYw @@ -34,8 +34,8 @@ $ dvc get https://github.com/iterative/dataset-registry \ ``` We use the fancy `dvc get` command to jump ahead a bit and show how Git repo -becomes a source for datasets or models - what we call "data registry" or "model -registry". `dvc get` can download any file or directory tracked in a DVC +becomes a source for datasets or models - what we call "data/model registry". +`dvc get` can download any file or directory tracked in a DVC repository. It's like `wget`, but for DVC or Git repos. 
In this case we download the latest version of the `data.xml` file from the [dataset registry](https://github.com/iterative/dataset-registry) repo as the @@ -90,10 +90,10 @@ outs: ## Storing and sharing -You can upload DVC-tracked data or models with `dvc push`, so they're safely -stored [remotely](/doc/command-reference/remote). This also means they can be -retrieved on other environments later with `dvc pull`. First, we need to setup a -storage: +You can upload DVC-tracked data or model files with `dvc push`, so they're +safely stored [remotely](/doc/command-reference/remote). This also means they +can be retrieved on other environments later with `dvc pull`. First, we need to +set up storage: ```dvc $ dvc remote add -d storage s3://mybucket/dvcstore @@ -154,9 +154,9 @@ a3 ## Retrieving -Having DVC-tracked data stored remotely, it can be downloaded when needed in -other copies of this project with `dvc pull`. Usually, we run it -after `git clone` and `git pull`. +Having DVC-tracked data and models stored remotely, they can be downloaded when +needed in other copies of this project with `dvc pull`. Usually, we +run it after `git clone` and `git pull`.