From 7d13ab249a08be1f146a3e26fc17df0aba6f52cf Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 23 Aug 2020 01:15:26 -0500 Subject: [PATCH 01/10] cases: restructure and copy edit Versioning, et al. --- .../versioning-data-and-model-files/index.md | 92 ++++++++++--------- 1 file changed, 48 insertions(+), 44 deletions(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 4ef8955a59..9a40306666 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -2,28 +2,32 @@ > This document provides an overview the file versioning workflow with DVC. To > get more hands-on experience on this, we recommend following along the -> [Versioning](/doc/tutorials/versioning) tutorial. +> [versioning tutorial](/doc/tutorials/versioning). -DVC allows versioning data files and directories, intermediate results, and ML -models using Git, but without storing the file contents in the Git repository. -It's useful when dealing with files that are too large for Git to handle -properly in general. DVC saves information about your data in special `.dvc` -files, and these files can be used for versioning. To actually store the data, -DVC supports various types of [remote storage](/doc/command-reference/remote). -This allows easily saving and sharing data alongside code. +DVC enables versioning large files and directories such as datasets, data +science features, and machine learning models with Git, without storing the file +contents in Git. DVC saves information about your data in special +[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in +the repository. These can versioned with regular Git workflows (commits, +branches, pull requests, etc.) 
To actually store the data, DVC uses a built-in +cache, and supports synchronizing it with various types of +[remote storage](/doc/command-reference/remote). This allows easily storing and +sharing data alongside code. -![](/img/model-versioning-diagram.png) +![](/img/model-versioning-diagram.png) _Code and data flows in DVC_ -In this basic scenario, DVC is a better replacement for Git-LFS (see -[Related Technologies](/doc/user-guide/related-technologies)) and for ad-hoc -scripts on top of Amazon S3 (or any other cloud) used to manage ML data -artifacts like raw data, models, etc. Unlike Git-LFS, DVC doesn't require -installing a dedicated server; It can be used on-premises (e.g. SSH, NAS) or -with any major cloud storage provider (Amazon S3, Microsoft Azure Blob Storage, -Google Drive, Google Cloud Storage, etc). +In a basic scenario, DVC is a better replacement for Git-LFS (and +[the like](/doc/user-guide/related-technologies)) and for ad-hoc scripts on top +of cloud storage that are used to manage ML artifacts like training +data, models, etc. DVC doesn't depend on 3rd party services and can leverage +on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider +(Amazon S3, Microsoft Azure, Google Drive, +[among others](/doc/command-reference/remote/add#supported-storage-types)). -Let's say you already have a Git repository and put a bunch of images in the -`images/` directory, and build a `model.pkl` ML model file using them. +Let's say you already have a Git repo and put a bunch of images in the `images/` +directory, Then you build a `model.pkl` using them. 
+ +## Track data (DVC) and version it (Git) ```dvc $ ls images @@ -33,24 +37,25 @@ $ ls model.pkl ``` -To start using DVC we need to [initialize](/doc/command-reference/init) a -DVC project on top of the existing Git repo: +To start using DVC, we need to [initialize](/doc/command-reference/init) a +DVC project in the existing repo: ```dvc $ dvc init ``` -Start tracking the images directory and the model with `dvc add`: +Start tracking the data directory and the model with `dvc add`: ```dvc $ dvc add images $ dvc add model.pkl ``` -> Refer also to `dvc run` for more advanced ways to version data and data -> processes. +> See [Data Pipelines](/doc/start/data-pipelines) for more advanced ways to +> version ML projects. -Commit your changes: +This generates `.dvc` files, and puts the originals in `.gitignore`. Commit this +project's version: ```dvc $ git status @@ -61,17 +66,16 @@ Untracked files: model.pkl.dvc $ git add images.dvc model.pkl.dvc .gitignore -$ git commit -m "Track images and model with DVC" +$ git commit -m "Track images and model with DVC." +$ git tag -a "v1.0" -m "images and model 1.0" ``` -There are two ways to get to the previous version of the dataset or model: a -full workspace checkout, or checkout of a specific data or model -file. Let's consider the full checkout first. It's quite straightforward: +## Switching versions -> `v1.0` below is a Git tag that identifies the dataset version you are -> interested in. Any -> [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) -> (for example `HEAD^` or a commit hash) can be used instead. +After iterating on this process and producing several versions, there are two +ways to get the original version of the dataset or model, using `dvc checkout`. +You can either do a full workspace checkout, or checkout specific +parts of the project. 
Let's consider the full checkout first: ```dvc $ git checkout v1.0 @@ -82,13 +86,16 @@ M model.pkl These commands will restore the workspace to the first snapshot we made - code, dataset and model files all matching each other. DVC can -[optimize](/doc/user-guide/large-dataset-optimization) this operation to avoid -copying files each time, so `dvc checkout` is quick even if you have large -dataset or model files. +[optimize](/doc/user-guide/large-dataset-optimization) this operation by +avoiding copying files each time, so checking out data is quick even if you have +large dataset or model files. + +> See `dvc install` to auto-checkout data after `git checkout`, and other useful +> hooks. On the other hand, if we want to keep the current version of code and go back to -the previous dataset only, we can do something like this (make sure that you -don't have uncommitted changes in the `data.dvc`): +the previous dataset only, we can do something like this (assuming no +uncommitted changes in `images.dvc`): ```dvc $ git checkout v1.0 images.dvc @@ -96,16 +103,13 @@ $ dvc checkout images.dvc M images ``` -If you run `git status` you will see that `data.dvc` is modified and currently -points to the `v1.0` version of the cached data. Meanwhile, code -and model files are their latest versions. +If you run `git status` you will see that `images.dvc` matches the `v1.0` +version of the cached images. Meanwhile, code and model files +remain on their latest versions. ![](/img/versioning.png) To share your data with others you need to setup a [data storage](/doc/command-reference/remote). See the -[Sharing Data And Model Files](/doc/use-cases/sharing-data-and-model-files) use +[Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files) use case to get an overview on how to do this. - -Please also don't forget to see the [Versioning](/doc/tutorials/versioning) -example to get a hands-on experience with datasets and models versioning. 
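The metafile idea described in the patch above (small placeholder files are versioned with Git while the data itself lives in a cache) can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not DVC's actual implementation: the `.cache` directory, the `.meta` suffix, the JSON layout, the use of SHA-256, and the function names `track`, `checkout` and `demo` are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a placeholder.

    Hypothetical stand-in for what a tool like `dvc add` does conceptually.
    """
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    cached = cache_dir / digest
    if not cached.exists():
        shutil.copy2(data_file, cached)
    # The placeholder is tiny, so it is cheap to version with Git.
    meta = data_file.parent / (data_file.name + ".meta")
    meta.write_text(json.dumps({"hash": digest, "path": data_file.name}))
    return meta


def checkout(meta_file: Path, cache_dir: Path) -> Path:
    """Restore the data file that a placeholder points to."""
    info = json.loads(meta_file.read_text())
    target = meta_file.parent / info["path"]
    shutil.copy2(cache_dir / info["hash"], target)
    return target


def demo():
    """Track two versions of a file, then switch back to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        work = Path(tmp)
        cache = work / ".cache"
        model = work / "model.pkl"

        model.write_text("model v1")
        meta = track(model, cache)
        v1_meta = meta.read_text()  # what Git would keep for the first version

        model.write_text("model v2")
        track(model, cache)

        # Going back: restore the old placeholder (Git's role in the real
        # workflow), then restore the data it points to (the cache's role).
        meta.write_text(v1_meta)
        restored = checkout(meta, cache)
        return restored.read_text(), len(list(cache.iterdir()))


if __name__ == "__main__":
    print(demo())
```

The two-step restore in `demo` mirrors the `git checkout` followed by `dvc checkout` sequence shown in the patch: the first step changes which version the placeholder names, the second brings the workspace data in line with it.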
From 72be3e725bc71b57e3fc3fecc5812da0288f3e42 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sun, 23 Aug 2020 16:45:29 -0500 Subject: [PATCH 02/10] cases: be very clear and explicit about the role of DVC vs Git (Versioning) --- .../versioning-data-and-model-files/index.md | 52 +++++++++++++------ 1 file changed, 36 insertions(+), 16 deletions(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 9a40306666..30c073df50 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -1,19 +1,18 @@ # Versioning Data and Model Files -> This document provides an overview the file versioning workflow with DVC. To -> get more hands-on experience on this, we recommend following along the -> [versioning tutorial](/doc/tutorials/versioning). - DVC enables versioning large files and directories such as datasets, data science features, and machine learning models with Git, without storing the file contents in Git. DVC saves information about your data in special [metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in -the repository. These can versioned with regular Git workflows (commits, +the repository. These can be versioned with regular Git workflows (commits, branches, pull requests, etc.) To actually store the data, DVC uses a built-in cache, and supports synchronizing it with various types of [remote storage](/doc/command-reference/remote). This allows easily storing and sharing data alongside code. +> To get more hands-on experience on this, we recommend following along the +> [versioning tutorial](/doc/tutorials/versioning). 
+ ![](/img/model-versioning-diagram.png) _Code and data flows in DVC_ In a basic scenario, DVC is a better replacement for Git-LFS (and @@ -22,12 +21,33 @@ of cloud storage that are used to manage ML artifacts like training data, models, etc. DVC doesn't depend on 3rd party services and can leverage on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider (Amazon S3, Microsoft Azure, Google Drive, -[among others](/doc/command-reference/remote/add#supported-storage-types)). +[among others](/doc/command-reference/remote/add#supported-storage-types)) that +you manage separately. -Let's say you already have a Git repo and put a bunch of images in the `images/` -directory, Then you build a `model.pkl` using them. +## DVC is not Git! + +DVC metafiles such as `dvc.yaml` and `.dvc` files serve various purposes. They +work as placeholders to track data files and directories needed by your project. +DVC also provides basic versioning by storing file hash values inside them, +corresponding to specific data contents (versions). + +However, we don't aim to reinvent the wheel. Git is a mature and well known +[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) +tool that provides multiple ways to manage a commit history: branches and tags, +merging or rebasing, etc. Widely used hosting services on op of Git enhance the +experience even further (GitHub, GitLab) — you can keep all of these +capabilities when using DVC. + +Git is however, designed for source code management (SCM), and thus ill-equipped +to support data science needs. That's where DVC comes in: implementing a +built-in data cache, allowing reproducible +[pipelines](/doc/start/data-pipelines), among several other novel feature layers +(please see [Get Started](/doc/start/) for more info.) 
-## Track data (DVC) and version it (Git) +## Track and version data and models + +Let's say you already have a Git repo and put a bunch of images in the `images/` +directory. Then you build a `model.pkl` based on them. ```dvc $ ls images @@ -37,23 +57,20 @@ $ ls model.pkl ``` -To start using DVC, we need to [initialize](/doc/command-reference/init) a -DVC project in the existing repo: +To start using DVC, [initialize](/doc/command-reference/init) a DVC +project in the existing repo: ```dvc $ dvc init ``` -Start tracking the data directory and the model with `dvc add`: +Start tracking the data directory and the model file with `dvc add`: ```dvc -$ dvc add images +$ dvc add images/ $ dvc add model.pkl ``` -> See [Data Pipelines](/doc/start/data-pipelines) for more advanced ways to -> version ML projects. - This generates `.dvc` files, and puts the originals in `.gitignore`. Commit this project's version: @@ -70,6 +87,9 @@ $ git commit -m "Track images and model with DVC." $ git tag -a "v1.0" -m "images and model 1.0" ``` +> See [Data Pipelines](/doc/start/data-pipelines) for more advanced ways to +> version ML projects. + ## Switching versions After iterating on this process and producing several versions, there are two From d7825b2ae3daa0d44d711f1f67c65e4877ed39a6 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 25 Aug 2020 17:09:30 -0500 Subject: [PATCH 03/10] Expand on run/dag (pipelines) and/or dvc.yaml/lock versioning. 
--- .../versioning-data-and-model-files/index.md | 45 +++++++++++++++++-- 1 file changed, 41 insertions(+), 4 deletions(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 30c073df50..8fa2bc0d34 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -44,7 +44,7 @@ built-in data cache, allowing reproducible [pipelines](/doc/start/data-pipelines), among several other novel feature layers (please see [Get Started](/doc/start/) for more info.) -## Track and version data and models +## Track data and models for versioning Let's say you already have a Git repo and put a bunch of images in the `images/` directory. Then you build a `model.pkl` based on them. @@ -84,11 +84,48 @@ Untracked files: $ git add images.dvc model.pkl.dvc .gitignore $ git commit -m "Track images and model with DVC." -$ git tag -a "v1.0" -m "images and model 1.0" +$ git tag -a "v1.0a" -m "First images and model" ``` -> See [Data Pipelines](/doc/start/data-pipelines) for more advanced ways to -> version ML projects. +## Track pipeline artifacts for versioning + +In the example above, the process to build the model file is omitted for +simplicity. But in fact some of DVC's most important features allow for defining +one or many such processes in simple `dvc.yaml` files, in order to run them and +reproduce them later. + +> See [Data Pipelines](/doc/start/data-pipelines) for more information. + +Instead of training the model file on your own and adding the `model.pkl` to DVC +manually, we can add only the images directory as a previous step, and then use +this `dvc.yaml`: + +```yaml +stages: + train: + cmd: python train.py images/ + deps: + - images + outs: + - model.pkl +``` + +> Note that `dvc.yaml` can have multiple stages, forming a pipeline. 
+ +DVC can now execute the above pipeline for you (see `dvc run` and `dvc repro`) +and track all of its outputs (`outs`) automatically. These get listed in +`.gitignore`. This project version can be committed like this: + +```dvc +$ dvc repro +Running stage 'train' with command: + python train.py images/ +Updating lock file 'dvc.lock' +... +$ git add dvc.yaml dvc.lock .gitignore +$ git commit -m "Train model via DVC." +$ git tag -a "v1.0b" -m "Fist model" +``` ## Switching versions From bcdb60483da796fcc7d174101ab0b78f0be3d560 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 25 Aug 2020 17:37:53 -0500 Subject: [PATCH 04/10] cases: simplify index --- content/docs/use-cases/index.md | 23 +++++++++-------------- 1 file changed, 9 insertions(+), 14 deletions(-) diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md index 879a41f5d8..971ad2e99f 100644 --- a/content/docs/use-cases/index.md +++ b/content/docs/use-cases/index.md @@ -1,15 +1,9 @@ # Use Cases We provide short articles on common ML workflow or data management scenarios -that DVC can help with or improve. These include a motivation (usually from -real-life cases), and approaches which combine several features of DVC. Use -cases are not written to be run end-to-end like tutorials. For more general, -hands-on experience with DVC, please see our -[Get Started](/doc/tutorials/get-started) instead. - -> We keep reviewing our docs and will include interesting scenarios that surface -> in the community. Please, [contact us](/support) if you need help or have -> suggestions! +that DVC can help with or improve. Our use cases are not written to be run +end-to-end like tutorials. For more general, hands-on experience with DVC, +please see our [Get Started](/doc/tutorials/get-started) instead. ## Why DVC? 
@@ -29,12 +23,13 @@ learning models, and you want to - track and switch between different versions of data or models easily; - understand how data or models were built in the first place; - be able to compare models and metrics to each other; -- bring software engineering best practices to your data science team; -- among other [use cases](/doc/use-cases) +- bring software engineering best practices to your data science team DVC is for you! ---- +> We keep reviewing our docs and will include interesting scenarios that surface +> in the community. Please, [contact us](/support) if you need help or have +> suggestions! -Our use case pages range from basic to more advanced. Please choose from the -navigation sidebar to the left, or click the `Next` button below ↘ +Please choose from the navigation sidebar to the left, or click the `Next` +button below ↘ From 7b550fc4e3bfad9e26c170806ec8c1b2530af032 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 26 Aug 2020 14:39:36 -0500 Subject: [PATCH 05/10] cases: higher level explanation (intro and first section) per https://github.com/iterative/dvc.org/pull/1716#issuecomment-679316702 --- .../versioning-data-and-model-files/index.md | 58 +++++++++---------- 1 file changed, 28 insertions(+), 30 deletions(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 8fa2bc0d34..8cdce611e7 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -1,48 +1,46 @@ # Versioning Data and Model Files DVC enables versioning large files and directories such as datasets, data -science features, and machine learning models with Git, without storing the file -contents in Git. DVC saves information about your data in special +science features, and machine learning models using Git, but without storing the +contents in Git. 
+ +This is achieved by saving information about the data in special [metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in -the repository. These can be versioned with regular Git workflows (commits, -branches, pull requests, etc.) To actually store the data, DVC uses a built-in -cache, and supports synchronizing it with various types of -[remote storage](/doc/command-reference/remote). This allows easily storing and -sharing data alongside code. +the repository. These can be versioned with regular Git workflows (branches, +pull requests, etc.) -> To get more hands-on experience on this, we recommend following along the -> [versioning tutorial](/doc/tutorials/versioning). +To actually store the data, DVC uses a built-in cache, and supports +synchronizing it with various types of +[remote storage](/doc/command-reference/remote). This allows storing and sharing +data easily, and alongside code. ![](/img/model-versioning-diagram.png) _Code and data flows in DVC_ -In a basic scenario, DVC is a better replacement for Git-LFS (and -[the like](/doc/user-guide/related-technologies)) and for ad-hoc scripts on top -of cloud storage that are used to manage ML artifacts like training -data, models, etc. DVC doesn't depend on 3rd party services and can leverage +In this basic use case, DVC is a better alternative to +[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc +scripts used to manage ML artifacts (training data, models, etc.) +on cloud storage. DVC doesn't require special services, and works with on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider (Amazon S3, Microsoft Azure, Google Drive, -[among others](/doc/command-reference/remote/add#supported-storage-types)) that -you manage separately. +[among others](/doc/command-reference/remote/add#supported-storage-types)). + +> For hands-on experience, we recommend following the +> [versioning tutorial](/doc/tutorials/versioning). ## DVC is not Git! 
-DVC metafiles such as `dvc.yaml` and `.dvc` files serve various purposes. They -work as placeholders to track data files and directories needed by your project. -DVC also provides basic versioning by storing file hash values inside them, -corresponding to specific data contents (versions). +DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track +data files and directories (among other purposes). They point to specific data +contents in the cache, providing the ability to store multiple data +versions out-of-the-box. -However, we don't aim to reinvent the wheel. Git is a mature and well known +Full-fledged [version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control) -tool that provides multiple ways to manage a commit history: branches and tags, -merging or rebasing, etc. Widely used hosting services on op of Git enhance the -experience even further (GitHub, GitLab) — you can keep all of these -capabilities when using DVC. - -Git is however, designed for source code management (SCM), and thus ill-equipped -to support data science needs. That's where DVC comes in: implementing a -built-in data cache, allowing reproducible -[pipelines](/doc/start/data-pipelines), among several other novel feature layers -(please see [Get Started](/doc/start/) for more info.) +is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. These +are designed for source code management (SCM) however, and thus ill-equipped to +support data science needs. That's where DVC comes in: with its built-in data +cache, reproducible [pipelines](/doc/start/data-pipelines), among +several other novel features (see [Get Started](/doc/start/) for a primer.) 
## Track data and models for versioning From 5f801dfdec0d6393825f297777cb1723364452ba Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Wed, 26 Aug 2020 15:22:11 -0500 Subject: [PATCH 06/10] docs: fix and improve links to Versioning use case --- content/docs/command-reference/add.md | 14 ++++++++------ content/docs/command-reference/import.md | 5 +++-- .../versioning-data-and-model-files/index.md | 2 +- content/docs/user-guide/related-technologies.md | 6 +++--- content/docs/user-guide/what-is-dvc.md | 2 +- 5 files changed, 16 insertions(+), 13 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index e2146257b1..c91edf2006 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -19,17 +19,18 @@ The `dvc add` command is analogous to `git add`, in that it makes DVC aware of the target data, in order to start versioning it. It creates a `.dvc` file to track the added data. -This command can be used to -[version control](/doc/use-cases/versioning-data-and-model-files) large files, -models, dataset directories, etc. that are too big for Git. +This command can be used to track large files, models, dataset directories, etc. +that are too big for Git to handle directly. This enables +[versioning](/doc/use-cases/versioning-data-and-model-files) them indirectly +with Git. The `targets` are the files or directories to add, which are turned into data artifacts of the project. These are stored in the cache by default (use the `--no-commit` option to avoid this, and `dvc commit` to finish the process when needed). -> See also `dvc run` for more advanced ways to version intermediate and final -> results (like ML models). +> See also `dvc.yaml` and `dvc run` for more advanced ways to track & version +> intermediate and final results (like ML models). 
After checking that each `target` file (or directory) hasn't been added before (or tracked with other DVC commands), a few actions are taken under the hood for @@ -208,7 +209,8 @@ $ dvc run -n train \ python train.py ``` -> To try this example, see the [Versioning](/doc/tutorials/versioning) tutorial. +> To try this example, see the +> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial). If instead we use the `--recursive` (`-R`) option, the output looks like this: diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md index 900377cada..d580868cc8 100644 --- a/content/docs/command-reference/import.md +++ b/content/docs/command-reference/import.md @@ -65,7 +65,7 @@ path in the workspace. It records enough metadata about the imported data to enable DVC efficiently determining whether the local copy is out of date. -To actually [track the data](/doc/tutorials/get-started/data-versioning), +To actually [version the data](/doc/tutorials/get-started/data-versioning), `git add` (and `git commit`) the import stage. Note that import stages are considered always @@ -191,7 +191,8 @@ $ dvc get https://github.com/iterative/dataset-registry \ tutorial/ver/data.zip ``` -> Used in our [versioning tutorial](/doc/tutorials/versioning) +> Used in our +> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial) Or diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 8cdce611e7..4849f33105 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -25,7 +25,7 @@ on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider [among others](/doc/command-reference/remote/add#supported-storage-types)). 
> For hands-on experience, we recommend following the -> [versioning tutorial](/doc/tutorials/versioning). +> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files). ## DVC is not Git! diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md index be4b8e2cfd..3a7d72c704 100644 --- a/content/docs/user-guide/related-technologies.md +++ b/content/docs/user-guide/related-technologies.md @@ -9,11 +9,11 @@ bringing best practices from software engineering into the data science field - DVC builds upon Git by introducing the concept of data files – large files that should not be stored in a Git repository, but still need to be tracked and versioned. It leverages Git's features to enable managing different - versions of data itself, data pipelines, and experiments. + versions of data, data pipelines, and experiments. - DVC is not fundamentally bound to Git, and can work without it (except - versioning-related features). This also applies to Git-LFS and Git-annex, - below. + [versioning-related](/doc/use-cases/versioning-data-and-model-files) + features). ## Git-LFS (Large File Storage) diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index bddf12cc5d..3351f84802 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -19,7 +19,7 @@ software engineers. - DVC works **on top of Git repositories** and has a similar command line interface and flow as Git. DVC can also work stand-alone, but without - versioning capabilities. + [versioning](/doc/use-cases/versioning-data-and-model-files) capabilities. - **Data versioning** is enabled by replacing large files], dataset directories, ML models, etc. 
with small From e9b70aeb39302bb8b1075c3ed03cc7f89a7e8f42 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 29 Aug 2020 13:51:25 -0500 Subject: [PATCH 07/10] cases: shorten the more instructional part of Versioning per https://github.com/iterative/dvc.org/pull/1716#issuecomment-681229160 --- .../versioning-data-and-model-files/index.md | 85 +++++++------------ 1 file changed, 32 insertions(+), 53 deletions(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 4849f33105..6a2f6ef00f 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -44,25 +44,17 @@ several other novel features (see [Get Started](/doc/start/) for a primer.) ## Track data and models for versioning -Let's say you already have a Git repo and put a bunch of images in the `images/` -directory. Then you build a `model.pkl` based on them. +Let's say you already have a DVC repository and put a bunch of +images in the `images/` directory. Then you build a `model.pkl` based on them. ```dvc -$ ls images +$ ls images/ 0001.jpg 0002.jpg 0003.jpg 0004.jpg ... - $ ls -model.pkl -``` - -To start using DVC, [initialize](/doc/command-reference/init) a DVC -project in the existing repo: - -```dvc -$ dvc init +images model.pkl ``` -Start tracking the data directory and the model file with `dvc add`: +Start tracking the dataset and the model file with `dvc add`: ```dvc $ dvc add images/ @@ -73,46 +65,37 @@ This generates `.dvc` files, and puts the originals in `.gitignore`. Commit this project's version: ```dvc -$ git status -... -Untracked files: - .gitignore - images.dvc - model.pkl.dvc - $ git add images.dvc model.pkl.dvc .gitignore $ git commit -m "Track images and model with DVC." 
-$ git tag -a "v1.0a" -m "First images and model" ``` ## Track pipeline artifacts for versioning -In the example above, the process to build the model file is omitted for -simplicity. But in fact some of DVC's most important features allow for defining -one or many such processes in simple `dvc.yaml` files, in order to run them and +Some of DVC's most important features allow for defining the processes to build +artifacts such as ML models in a simple `dvc.yaml` file, in order to run and reproduce them later. > See [Data Pipelines](/doc/start/data-pipelines) for more information. Instead of training the model file on your own and adding the `model.pkl` to DVC -manually, we can add only the images directory as a previous step, and then use -this `dvc.yaml`: +manually, we can add only the images dataset in the previous step, and use this +`dvc.yaml`: ```yaml stages: train: cmd: python train.py images/ deps: - - images + - images # Already tracked by DVC outs: - model.pkl ``` -> Note that `dvc.yaml` can have multiple stages, forming a pipeline. +> The file can be written manually or generated with `dvc run`. -DVC can now execute the above pipeline for you (see `dvc run` and `dvc repro`) -and track all of its outputs (`outs`) automatically. These get listed in -`.gitignore`. This project version can be committed like this: +`dvc repro` can now execute the above stage for you. DVC will track all of its +outputs (`outs`) automatically, which get listed in `.gitignore`. Let's do that, +and commit this project version: ```dvc $ dvc repro @@ -120,17 +103,19 @@ Running stage 'train' with command: python train.py images/ Updating lock file 'dvc.lock' ... + $ git add dvc.yaml dvc.lock .gitignore $ git commit -m "Train model via DVC." -$ git tag -a "v1.0b" -m "Fist model" +$ git tag -a "v1.0" -m "First model via DVC" # We'll use this soon ;) ``` +> See also `dvc.lock`.
+ ## Switching versions After iterating on this process and producing several versions, there are two -ways to get the original version of the dataset or model, using `dvc checkout`. -You can either do a full workspace checkout, or checkout specific -parts of the project. Let's consider the full checkout first: +ways to get a previous version of data or models using `dvc checkout`: either a +full or a partial project checkout. ```dvc $ git checkout v1.0 @@ -139,18 +124,18 @@ M images M model.pkl ``` -These commands will restore the workspace to the first snapshot we made - code, -dataset and model files all matching each other. DVC can -[optimize](/doc/user-guide/large-dataset-optimization) this operation by +These commands will restore the full workspace to the first +snapshot we made: code, dataset and model files all match each other. DVC +[optimizes](/doc/user-guide/large-dataset-optimization) this operation by avoiding copying files each time, so checking out data is quick even if you have -large dataset or model files. +large data files. -> See `dvc install` to auto-checkout data after `git checkout`, and other useful -> hooks. +![](/img/versioning.png) _Code and data checkout_ -On the other hand, if we want to keep the current version of code and go back to -the previous dataset only, we can do something like this (assuming no -uncommitted changes in `images.dvc`): +> See also `dvc install` to auto-checkout data after `git checkout`. + +On the other hand, if we want to keep the latest source code and model, but +rewind to the previous dataset only, we can do a partial checkout like this: ```dvc $ git checkout v1.0 images.dvc @@ -158,13 +143,7 @@ $ dvc checkout images.dvc M images ``` -If you run `git status` you will see that `images.dvc` matches the `v1.0` -version of the cached images. Meanwhile, code and model files -remain on their latest versions.
- -![](/img/versioning.png) +--- -To share your data with others you need to setup a -[data storage](/doc/command-reference/remote). See the -[Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files) use -case to get an overview on how to do this. +A typical next step is +[Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files). From 835ebf7fefb304b3c7123999185246244ef0e5d2 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 29 Aug 2020 15:43:15 -0500 Subject: [PATCH 08/10] cmd: &->and in add per https://github.com/iterative/dvc.org/pull/1716#pullrequestreview-478151254 --- content/docs/command-reference/add.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index c91edf2006..e7ed472bf8 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -29,7 +29,7 @@ The `targets` are the files or directories to add, which are turned into cache by default (use the `--no-commit` option to avoid this, and `dvc commit` to finish the process when needed). -> See also `dvc.yaml` and `dvc run` for more advanced ways to track & version +> See also `dvc.yaml` and `dvc run` for more advanced ways to track and version > intermediate and final results (like ML models). 
After checking that each `target` file (or directory) hasn't been added before From d224c9857f79f7bd8ca559206193bad446bcc778 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 29 Aug 2020 19:26:17 -0500 Subject: [PATCH 09/10] cases: further simplify technical sections of Versioning per https://github.com/iterative/dvc.org/pull/1716#pullrequestreview-478151716 --- .../versioning-data-and-model-files/index.md | 83 +++++++------------ 1 file changed, 32 insertions(+), 51 deletions(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index 6a2f6ef00f..f4ba8e69d7 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -44,58 +44,43 @@ several other novel features (see [Get Started](/doc/start/) for a primer.) ## Track data and models for versioning -Let's say you already have a DVC repository and put a bunch of -images in the `images/` directory. Then you build a `model.pkl` based on them. +Let's say you have an empty DVC repository and put a dataset of +images in the `images/` directory. You can start tracking it with `dvc add`. +This generate a `.dvc` file, which can be committed to Git in order to save the +project's version: ```dvc $ ls images/ 0001.jpg 0002.jpg 0003.jpg 0004.jpg ... -$ ls -images model.pkl -``` - -Start tracking the dataset and the model file with `dvc add`: -```dvc $ dvc add images/ -$ dvc add model.pkl -``` - -This generates `.dvc` files, and puts the originals in `.gitignore`. Commit this -project's version: +... -```dvc -$ git add images.dvc model.pkl.dvc .gitignore -$ git commit -m "Track images and model with DVC." +$ git add images.dvc .gitignore +$ git commit -m "Track images dataset with DVC." 
``` -## Track pipeline artifacts for versioning +DVC's also allows to define the processes that build artifacts based on tracked +data, such as an ML model, by writing a simple `dvc.yaml` file that connects the +pieces together: -Some of DVC's most important features allow for defining the processes to build -artifacts such as ML models in a simple `dvc.yaml` file, in order to run and -reproduce them later. - -> See [Data Pipelines](/doc/start/data-pipelines) for more information. - -Instead of training the model file on your own and adding the `model.pkl` to DVC -manually, we can add only the images dataset in the previous step, and use this -`dvc.yaml`: +> `dvc.yaml` files can be written manually or generated with `dvc run`. ```yaml stages: train: cmd: python train.py images/ deps: - - images # Already tracked by DVC + - images outs: - model.pkl ``` -> The file can be written manually or generated with `dvc run`. +> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to +> this feature. -`dvc repro` can now execute the above stage for you. DVC will track all of its -outputs (`outs`) automatically, which get listed in `.gitignore`. Let's do that, -and commit this project version: +`dvc repro` can now execute the `train` stage for you. DVC will track all of its +outputs (`outs`) automatically. Let's do that, and commit this project version: ```dvc $ dvc repro @@ -106,16 +91,23 @@ Updating lock file 'dvc.lock' $ git add dvc.yaml dvc.lock .gitignore $ git commit -m "Train model via DVC." -$ git tag -a "v1.0" -m "Fist model via DVC" # We'll use this soon ;) +$ git tag -a "v1.0" -m "Fist model" # We'll use this soon ;) ``` > See also `dvc.lock`. ## Switching versions -After iterating on this process and producing several versions, there are two -ways to get previous version of data or models using `dvc checkout`: either a -full or a partial project checkout. 
+After iterating on this process and producing several versions, you can combine +`git checkout` and `dvc checkout` to perform full or partial +workspace restorations. + +![](/img/versioning.png) _Code and data checkout_ + +> Note that `dvc install` enables auto-checkouts of data after `git checkout`. + +A full checkout brings the whole project back to a previous version +— code, dataset and model files all match each other: ```dvc $ git checkout v1.0 @@ -124,18 +116,8 @@ M images M model.pkl ``` -These commands will restore the full workspace to the first -snapshot we made — code, dataset and model files all match each other. DVC -[optimizes](/doc/user-guide/large-dataset-optimization) this operation by -avoiding copying files each time, so checking out data is quick even if you have -large data files. - -![](/img/versioning.png) _Code and data checkout_ - -> See also `dvc install` to auto-checkout data after `git checkout`. - -On the other hand, if we want to keep the latest source code and model, but -rewind to the previous dataset only, we can do a partial checkout like this: +However, we can checkout certain parts only, for example if we want to keep the +latest source code and model but rewind to the previous dataset only: ```dvc $ git checkout v1.0 images.dvc @@ -143,7 +125,6 @@ $ dvc checkout images.dvc M images ``` ---- - -A typical next step is -[Sharing Data and Model Files](/doc/use-cases/sharing-data-and-model-files). +DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by +avoiding copying files each time, so checking out data is quick even if you have +large data files. From 54d82c0e2daa98179646078bb3de9266c4c9b1ee Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 29 Aug 2020 19:54:44 -0500 Subject: [PATCH 10/10] cases: remove ... 
from code block in Versioning --- content/docs/use-cases/versioning-data-and-model-files/index.md | 1 - 1 file changed, 1 deletion(-) diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md index f4ba8e69d7..be28edd905 100644 --- a/content/docs/use-cases/versioning-data-and-model-files/index.md +++ b/content/docs/use-cases/versioning-data-and-model-files/index.md @@ -54,7 +54,6 @@ $ ls images/ 0001.jpg 0002.jpg 0003.jpg 0004.jpg ... $ dvc add images/ -... $ git add images.dvc .gitignore $ git commit -m "Track images dataset with DVC."