Merge pull request #1716 from iterative/use-cases

cases: Versioning Data and Model Files (review)
iterative · Sep 2, 2020 · cb44c7c
2 parents 527ff87 + 990f42f

Showing 6 changed files with 118 additions and 102 deletions.
diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md
@@ -19,17 +19,18 @@ The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
 the target data, in order to start versioning it. It creates a `.dvc` file to
 track the added data.
 
-This command can be used to
-[version control](/doc/use-cases/versioning-data-and-model-files) large files,
-models, dataset directories, etc. that are too big for Git.
+This command can be used to track large files, models, dataset directories, etc.
+that are too big for Git to handle directly. This enables
+[versioning](/doc/use-cases/versioning-data-and-model-files) them indirectly
+with Git.
 
 The `targets` are the files or directories to add, which are turned into
 <abbr>data artifacts</abbr> of the <abbr>project</abbr>. These are stored in the
 <abbr>cache</abbr> by default (use the `--no-commit` option to avoid this, and
 `dvc commit` to finish the process when needed).
 
-> See also `dvc run` for more advanced ways to version intermediate and final
-> results (like ML models).
+> See also `dvc.yaml` and `dvc run` for more advanced ways to track and version
+> intermediate and final results (like ML models).
 
 After checking that each `target` file (or directory) hasn't been added before
 (or tracked with other DVC commands), a few actions are taken under the hood for
@@ -208,7 +209,8 @@ $ dvc run -n train \
           python train.py
 ```
 
-> To try this example, see the [Versioning](/doc/tutorials/versioning) tutorial.
+> To try this example, see the
+> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial).
 
 If instead we use the `--recursive` (`-R`) option, the output looks like this:
 

diff --git a/content/docs/command-reference/import.md b/content/docs/command-reference/import.md
@@ -66,7 +66,7 @@ path in the <abbr>workspace</abbr>. It records enough metadata about the
 imported data to enable DVC efficiently determining whether the local copy is
 out of date.
 
-To actually [track the data](/doc/tutorials/get-started/data-versioning),
+To actually [version the data](/doc/tutorials/get-started/data-versioning),
 `git add` (and `git commit`) the import stage.
 
 Note that import stages are considered always
@@ -192,7 +192,8 @@ $ dvc get https://github.com/iterative/dataset-registry \
           tutorial/ver/data.zip
 ```
 
-> Used in our [versioning tutorial](/doc/tutorials/versioning)
+> Used in our
+> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial)
 
 Or
 

diff --git a/content/docs/use-cases/index.md b/content/docs/use-cases/index.md
@@ -1,15 +1,9 @@
 # Use Cases
 
 We provide short articles on common ML workflow or data management scenarios
-that DVC can help with or improve. These include a motivation (usually from
-real-life cases), and approaches which combine several features of DVC. Use
-cases are not written to be run end-to-end like tutorials. For more general,
-hands-on experience with DVC, please see our
-[Get Started](/doc/tutorials/get-started) instead.
-
-> We keep reviewing our docs and will include interesting scenarios that surface
-> in the community. Please, [contact us](/support) if you need help or have
-> suggestions!
+that DVC can help with or improve. Our use cases are not written to be run
+end-to-end like tutorials. For more general, hands-on experience with DVC,
+please see our [Get Started](/doc/tutorials/get-started) instead.
 
 ## Why DVC?
 
@@ -29,12 +23,13 @@ learning models, and you want to
 - track and switch between different versions of data or models easily;
 - understand how data or models were built in the first place;
 - be able to compare models and metrics to each other;
-- bring software engineering best practices to your data science team;
-- among other [use cases](/doc/use-cases)
+- bring software engineering best practices to your data science team
 
 DVC is for you!
 
----
+> We keep reviewing our docs and will include interesting scenarios that surface
+> in the community. Please, [contact us](/support) if you need help or have
+> suggestions!
 
-Our use case pages range from basic to more advanced. Please choose from the
-navigation sidebar to the left, or click the `Next` button below ↘
+Please choose from the navigation sidebar to the left, or click the `Next`
+button below ↘
diff --git a/content/docs/use-cases/versioning-data-and-model-files/index.md b/content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -1,77 +1,112 @@
 # Versioning Data and Model Files
 
-> This document provides an overview the file versioning workflow with DVC. To
-> get more hands-on experience on this, we recommend following along the
-> [Versioning](/doc/tutorials/versioning) tutorial.
-
-DVC allows versioning data files and directories, intermediate results, and ML
-models using Git, but without storing the file contents in the Git repository.
-It's useful when dealing with files that are too large for Git to handle
-properly in general. DVC saves information about your data in special `.dvc`
-files, and these files can be used for versioning. To actually store the data,
-DVC supports various types of [remote storage](/doc/command-reference/remote).
-This allows easily saving and sharing data alongside code.
-
-![](/img/model-versioning-diagram.png)
-
-In this basic scenario, DVC is a better replacement for Git-LFS (see
-[Related Technologies](/doc/user-guide/related-technologies)) and for ad-hoc
-scripts on top of Amazon S3 (or any other cloud) used to manage ML <abbr>data
-artifacts</abbr> like raw data, models, etc. Unlike Git-LFS, DVC doesn't require
-installing a dedicated server; It can be used on-premises (e.g. SSH, NAS) or
-with any major cloud storage provider (Amazon S3, Microsoft Azure Blob Storage,
-Google Drive, Google Cloud Storage, etc).
-
-Let's say you already have a Git repository and put a bunch of images in the
-`images/` directory, and build a `model.pkl` ML model file using them.
+DVC enables versioning large files and directories such as datasets, data
+science features, and machine learning models using Git, but without storing the
+contents in Git.
+
+This is achieved by saving information about the data in special
+[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in
+the repository. These can be versioned with regular Git workflows (branches,
+pull requests, etc.).
+
+To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
+synchronizing it with various types of
+[remote storage](/doc/command-reference/remote). This allows storing and sharing
+data easily, and alongside code.
+
+![](/img/model-versioning-diagram.png) _Code and data flows in DVC_
+
+In this basic use case, DVC is a better alternative to
+[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc
+scripts used to manage ML <abbr>artifacts</abbr> (training data, models, etc.)
+on cloud storage. DVC doesn't require special services, and works with
+on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
+(Amazon S3, Microsoft Azure, Google Drive,
+[among others](/doc/command-reference/remote/add#supported-storage-types)).
+
+> For hands-on experience, we recommend following the
+> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files).
+
+## DVC is not Git!
+
+DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
+data files and directories (among other purposes). They point to specific data
+contents in the <abbr>cache</abbr>, providing the ability to store multiple data
+versions out-of-the-box.
+
+Full-fledged
+[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
+is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. These
+are designed for source code management (SCM) however, and thus ill-equipped to
+support data science needs. That's where DVC comes in: with its built-in data
+<abbr>cache</abbr>, reproducible [pipelines](/doc/start/data-pipelines), among
+several other novel features (see [Get Started](/doc/start/) for a primer.)
+
+## Track data and models for versioning
+
+Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
+images in the `images/` directory. You can start tracking it with `dvc add`.
+This generates a `.dvc` file, which can be committed to Git in order to save the
+project's version:
 
 ```dvc
-$ ls images
+$ ls images/
 0001.jpg 0002.jpg 0003.jpg 0004.jpg ...
 
-$ ls
-model.pkl
-```
-
-To start using DVC we need to [initialize](/doc/command-reference/init) a
-<abbr>DVC project</abbr> on top of the existing Git repo:
+$ dvc add images/
 
-```dvc
-$ dvc init
+$ git add images.dvc .gitignore
+$ git commit -m "Track images dataset with DVC."
 ```
 
-Start tracking the images directory and the model with `dvc add`:
+DVC also lets you define the processes that build artifacts based on tracked
+data, such as an ML model, by writing a simple `dvc.yaml` file that connects the
+pieces together:
 
-```dvc
-$ dvc add images
-$ dvc add model.pkl
+> `dvc.yaml` files can be written manually or generated with `dvc run`.
+
+```yaml
+stages:
+  train:
+    cmd: python train.py images/
+    deps:
+      - images
+    outs:
+      - model.pkl
 ```
 
-> Refer also to `dvc run` for more advanced ways to version data and data
-> processes.
+> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to
+> this feature.
 
-Commit your changes:
+`dvc repro` can now execute the `train` stage for you. DVC will track all of its
+outputs (`outs`) automatically. Let's do that, and commit this project version:
 
 ```dvc
-$ git status
+$ dvc repro
+Running stage 'train' with command:
+        python train.py images/
+Updating lock file 'dvc.lock'
 ...
-Untracked files:
-    .gitignore
-    images.dvc
-    model.pkl.dvc
 
-$ git add images.dvc model.pkl.dvc .gitignore
-$ git commit -m "Track images and model with DVC"
+$ git add dvc.yaml dvc.lock .gitignore
+$ git commit -m "Train model via DVC."
+$ git tag -a "v1.0" -m "First model"   # We'll use this soon ;)
 ```
 
-There are two ways to get to the previous version of the dataset or model: a
-full <abbr>workspace</abbr> checkout, or checkout of a specific data or model
-file. Let's consider the full checkout first. It's quite straightforward:
+> See also `dvc.lock`.
+
+## Switching versions
+
+After iterating on this process and producing several versions, you can combine
+`git checkout` and `dvc checkout` to perform full or partial
+<abbr>workspace</abbr> restorations.
 
-> `v1.0` below is a Git tag that identifies the dataset version you are
-> interested in. Any
-> [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
-> (for example `HEAD^` or a commit hash) can be used instead.
+![](/img/versioning.png) _Code and data checkout_
+
+> Note that `dvc install` enables auto-checkouts of data after `git checkout`.
+
+A full checkout brings the whole <abbr>project</abbr> back to a previous version
+— code, dataset and model files all match each other:
 
 ```dvc
 $ git checkout v1.0
@@ -80,32 +115,15 @@ M       images
 M       model.pkl
 ```
 
-These commands will restore the workspace to the first snapshot we made - code,
-dataset and model files all matching each other. DVC can
-[optimize](/doc/user-guide/large-dataset-optimization) this operation to avoid
-copying files each time, so `dvc checkout` is quick even if you have large
-dataset or model files.
-
-On the other hand, if we want to keep the current version of code and go back to
-the previous dataset only, we can do something like this (make sure that you
-don't have uncommitted changes in the `images.dvc`):
+However, we can also check out only certain parts, for example to keep the
+latest source code and model but rewind to the previous dataset:
 
 ```dvc
 $ git checkout v1.0 images.dvc
 $ dvc checkout images.dvc
 M       images
 ```
 
-If you run `git status` you will see that `data.dvc` is modified and currently
-points to the `v1.0` version of the <abbr>cached</abbr> data. Meanwhile, code
-and model files are their latest versions.
-
-![](/img/versioning.png)
-
-To share your data with others you need to setup a
-[data storage](/doc/command-reference/remote). See the
-[Sharing Data And Model Files](/doc/use-cases/sharing-data-and-model-files) use
-case to get an overview on how to do this.
-
-Please also don't forget to see the [Versioning](/doc/tutorials/versioning)
-example to get a hands-on experience with datasets and models versioning.
+DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
+avoiding copying files each time, so checking out data is quick even if you have
+large data files.
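
The "DVC is not Git!" section in the diff above describes metafiles as placeholders that point to specific data contents in the cache. The following Python sketch illustrates that idea in isolation. It is a toy model under stated assumptions, not DVC's implementation: the `track`, `checkout`, and `demo` helpers, the `.meta.json` placeholder format, and the cache layout are all hypothetical and chosen only for illustration.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a small placeholder.

    Hypothetical helper: the placeholder name and JSON layout are assumptions.
    """
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    cached = cache_dir / digest[:2] / digest[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():
        # Each distinct content is stored once, keyed by its hash.
        shutil.copy2(data_file, cached)
    placeholder = data_file.with_name(data_file.name + ".meta.json")
    placeholder.write_text(json.dumps({"md5": digest, "path": data_file.name}))
    return placeholder


def checkout(placeholder: Path, cache_dir: Path) -> None:
    """Restore the data file that the placeholder currently points to."""
    meta = json.loads(placeholder.read_text())
    digest = meta["md5"]
    target = placeholder.with_name(meta["path"])
    shutil.copy2(cache_dir / digest[:2] / digest[2:], target)


def demo() -> tuple:
    """Track two versions of a file, then rewind the data to the first one."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "data.txt"

        data.write_text("version 1")
        placeholder = track(data, cache)
        v1_meta = placeholder.read_text()  # the small file a VCS would version

        data.write_text("version 2")
        track(data, cache)
        after_update = data.read_text()

        # Restoring the old placeholder stands in for checking out an older
        # revision of the metafile; the data itself is restored from the cache.
        placeholder.write_text(v1_meta)
        checkout(placeholder, cache)
        return after_update, data.read_text()


if __name__ == "__main__":
    print(demo())
```

Only the small placeholder would need to live in version control here. Both versions of the data stay in the cache, which is why switching versions amounts to rewriting the placeholder and copying the matching content back into place.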
diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md
@@ -9,11 +9,11 @@ bringing best practices from software engineering into the data science field
 - DVC builds upon Git by introducing the concept of data files – large files
   that should not be stored in a Git repository, but still need to be tracked
   and versioned. It leverages Git's features to enable managing different
-  versions of data itself, data pipelines, and experiments.
+  versions of data, data pipelines, and experiments.
 
 - DVC is not fundamentally bound to Git, and can work without it (except
-  versioning-related features). This also applies to Git-LFS and Git-annex,
-  below.
+  [versioning-related](/doc/use-cases/versioning-data-and-model-files)
+  features).
 
 ## Git-LFS (Large File Storage)
 

diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md
@@ -19,7 +19,7 @@ software engineers.
 
 - DVC works **on top of Git repositories** and has a similar command line
   interface and flow as Git. DVC can also work stand-alone, but without
-  versioning capabilities.
+  [versioning](/doc/use-cases/versioning-data-and-model-files) capabilities.
 
 - **Data versioning** is enabled by replacing large files, dataset directories,
   ML models, etc. with small