Merge pull request #1716 from iterative/use-cases
cases: Versioning Data and Model Files (review)
jorgeorpinel authored Sep 2, 2020
2 parents 527ff87 + 990f42f commit cb44c7c
Showing 6 changed files with 118 additions and 102 deletions.
14 changes: 8 additions & 6 deletions content/docs/command-reference/add.md
@@ -19,17 +19,18 @@ The `dvc add` command is analogous to `git add`, in that it makes DVC aware of
the target data, in order to start versioning it. It creates a `.dvc` file to
track the added data.

This command can be used to
[version control](/doc/use-cases/versioning-data-and-model-files) large files,
models, dataset directories, etc. that are too big for Git.
This command can be used to track large files, models, dataset directories, etc.
that are too big for Git to handle directly. This enables
[versioning](/doc/use-cases/versioning-data-and-model-files) them indirectly
with Git.

The `targets` are the files or directories to add, which are turned into
<abbr>data artifacts</abbr> of the <abbr>project</abbr>. These are stored in the
<abbr>cache</abbr> by default (use the `--no-commit` option to avoid this, and
`dvc commit` to finish the process when needed).

> See also `dvc run` for more advanced ways to version intermediate and final
> results (like ML models).
> See also `dvc.yaml` and `dvc run` for more advanced ways to track and version
> intermediate and final results (like ML models).
After checking that each `target` file (or directory) hasn't been added before
(or tracked with other DVC commands), a few actions are taken under the hood for
@@ -208,7 +209,8 @@ $ dvc run -n train \
python train.py
```

> To try this example, see the [Versioning](/doc/tutorials/versioning) tutorial.
> To try this example, see the
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial).

If instead we use the `--recursive` (`-R`) option, the output looks like this:

5 changes: 3 additions & 2 deletions content/docs/command-reference/import.md
@@ -66,7 +66,7 @@ path in the <abbr>workspace</abbr>. It records enough metadata about the
imported data to enable DVC to efficiently determine whether the local copy is
out of date.

To actually [track the data](/doc/tutorials/get-started/data-versioning),
To actually [version the data](/doc/tutorials/get-started/data-versioning),
`git add` (and `git commit`) the import stage.

Note that import stages are considered always
@@ -192,7 +192,8 @@ $ dvc get https://github.com/iterative/dataset-registry \
tutorial/ver/data.zip
```

> Used in our [versioning tutorial](/doc/tutorials/versioning)
> Used in our
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files/tutorial)

Or

23 changes: 9 additions & 14 deletions content/docs/use-cases/index.md
@@ -1,15 +1,9 @@
# Use Cases

We provide short articles on common ML workflow or data management scenarios
that DVC can help with or improve. These include a motivation (usually from
real-life cases), and approaches which combine several features of DVC. Use
cases are not written to be run end-to-end like tutorials. For more general,
hands-on experience with DVC, please see our
[Get Started](/doc/tutorials/get-started) instead.

> We keep reviewing our docs and will include interesting scenarios that surface
> in the community. Please, [contact us](/support) if you need help or have
> suggestions!
that DVC can help with or improve. Our use cases are not written to be run
end-to-end like tutorials. For more general, hands-on experience with DVC,
please see our [Get Started](/doc/tutorials/get-started) instead.

## Why DVC?

@@ -29,12 +23,13 @@ learning models, and you want to
- track and switch between different versions of data or models easily;
- understand how data or models were built in the first place;
- be able to compare models and metrics to each other;
- bring software engineering best practices to your data science team;
- among other [use cases](/doc/use-cases)
- bring software engineering best practices to your data science team

DVC is for you!

---
> We keep reviewing our docs and will include interesting scenarios that surface
> in the community. Please, [contact us](/support) if you need help or have
> suggestions!
Our use case pages range from basic to more advanced. Please choose from the
navigation sidebar to the left, or click the `Next` button below ↘
Please choose from the navigation sidebar to the left, or click the `Next`
button below ↘
170 changes: 94 additions & 76 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -1,77 +1,112 @@
# Versioning Data and Model Files

> This document provides an overview of the file versioning workflow with DVC. To
> get more hands-on experience on this, we recommend following along the
> [Versioning](/doc/tutorials/versioning) tutorial.
DVC allows versioning data files and directories, intermediate results, and ML
models using Git, but without storing the file contents in the Git repository.
It's useful when dealing with files that are too large for Git to handle
properly in general. DVC saves information about your data in special `.dvc`
files, and these files can be used for versioning. To actually store the data,
DVC supports various types of [remote storage](/doc/command-reference/remote).
This allows easily saving and sharing data alongside code.

![](/img/model-versioning-diagram.png)

In this basic scenario, DVC is a better replacement for Git-LFS (see
[Related Technologies](/doc/user-guide/related-technologies)) and for ad-hoc
scripts on top of Amazon S3 (or any other cloud) used to manage ML <abbr>data
artifacts</abbr> like raw data, models, etc. Unlike Git-LFS, DVC doesn't require
installing a dedicated server; it can be used on-premises (e.g. SSH, NAS) or
with any major cloud storage provider (Amazon S3, Microsoft Azure Blob Storage,
Google Drive, Google Cloud Storage, etc).

Let's say you already have a Git repository and put a bunch of images in the
`images/` directory, and build a `model.pkl` ML model file using them.
DVC enables versioning large files and directories such as datasets, data
science features, and machine learning models using Git, but without storing the
contents in Git.

This is achieved by saving information about the data in special
[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in
the repository. These can be versioned with regular Git workflows (branches,
pull requests, etc.)

To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
synchronizing it with various types of
[remote storage](/doc/command-reference/remote). This allows storing and sharing
data easily, and alongside code.

![](/img/model-versioning-diagram.png) _Code and data flows in DVC_

In this basic use case, DVC is a better alternative to
[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc
scripts used to manage ML <abbr>artifacts</abbr> (training data, models, etc.)
on cloud storage. DVC doesn't require special services, and works with
on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
(Amazon S3, Microsoft Azure, Google Drive,
[among others](/doc/command-reference/remote/add#supported-storage-types)).

> For hands-on experience, we recommend following the
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files).

## DVC is not Git!

DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
data files and directories (among other purposes). They point to specific data
contents in the <abbr>cache</abbr>, providing the ability to store multiple data
versions out-of-the-box.
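
The placeholder idea can be illustrated with a minimal Python sketch. It is not DVC's implementation: the SHA-256 hash, the `.ptr` suffix, the JSON pointer format and the two-character cache subdirectory are all assumptions made for illustration. It only shows how a small pointer file, which is what Git would version, can select one of several data versions kept in a content-addressed cache.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_file(path: Path, cache_dir: Path) -> str:
    """Copy `path` into `cache_dir` under its content hash and return the hash."""
    data = path.read_bytes()
    digest = hashlib.sha256(data).hexdigest()  # hash choice is an assumption
    # Hypothetical layout: first two hex characters as a subdirectory.
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():
        target.write_bytes(data)
    return digest


def write_placeholder(path: Path, digest: str) -> Path:
    """Write a small pointer file next to `path` (JSON here, for simplicity)."""
    pointer = path.with_name(path.name + ".ptr")
    pointer.write_text(json.dumps({"path": path.name, "hash": digest}))
    return pointer


def restore(pointer: Path, cache_dir: Path) -> None:
    """Rewrite the data file from the cache entry the pointer refers to."""
    meta = json.loads(pointer.read_text())
    cached = cache_dir / meta["hash"][:2] / meta["hash"][2:]
    (pointer.parent / meta["path"]).write_bytes(cached.read_bytes())


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cache = root / "cache"
    data = root / "data.txt"

    data.write_text("version 1")
    pointer = write_placeholder(data, cache_file(data, cache))
    v1_pointer_text = pointer.read_text()  # what Git would keep for version 1

    data.write_text("version 2")
    write_placeholder(data, cache_file(data, cache))

    # Both versions now sit side by side in the cache.
    cached_versions = sorted(
        p.read_text() for p in cache.rglob("*") if p.is_file()
    )

    # "Checking out" version 1: put the old pointer back, then restore the data.
    pointer.write_text(v1_pointer_text)
    restore(pointer, cache)
    restored = data.read_text()
```

In this sketch, switching versions only requires swapping the tiny pointer file. That is why the pointer, rather than the data itself, is a good fit for Git.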

Full-fledged
[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
is left for Git and its hosting platforms (e.g. GitHub, GitLab) to handle. These
are designed for source code management (SCM) however, and thus ill-equipped to
support data science needs. That's where DVC comes in: with its built-in data
<abbr>cache</abbr>, reproducible [pipelines](/doc/start/data-pipelines), among
several other novel features (see [Get Started](/doc/start/) for a primer.)

## Track data and models for versioning

Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
images in the `images/` directory. You can start tracking it with `dvc add`.
This generates a `.dvc` file, which can be committed to Git in order to save the
project's version:

```dvc
$ ls images
$ ls images/
0001.jpg 0002.jpg 0003.jpg 0004.jpg ...
$ ls
model.pkl
```

To start using DVC we need to [initialize](/doc/command-reference/init) a
<abbr>DVC project</abbr> on top of the existing Git repo:
$ dvc add images/
```dvc
$ dvc init
$ git add images.dvc .gitignore
$ git commit -m "Track images dataset with DVC."
```

Start tracking the images directory and the model with `dvc add`:
DVC also allows defining the processes that build artifacts based on tracked
data, such as an ML model, by writing a simple `dvc.yaml` file that connects the
pieces together:

```dvc
$ dvc add images
$ dvc add model.pkl
> `dvc.yaml` files can be written manually or generated with `dvc run`.
```yaml
stages:
train:
cmd: python train.py images/
deps:
- images
outs:
- model.pkl
```
> Refer also to `dvc run` for more advanced ways to version data and data
> processes.
> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to
> this feature.
Commit your changes:
`dvc repro` can now execute the `train` stage for you. DVC will track all of its
outputs (`outs`) automatically. Let's do that, and commit this project version:

```dvc
$ git status
$ dvc repro
Running stage 'train' with command:
python train.py images/
Updating lock file 'dvc.lock'
...
Untracked files:
.gitignore
images.dvc
model.pkl.dvc
$ git add images.dvc model.pkl.dvc .gitignore
$ git commit -m "Track images and model with DVC"
$ git add dvc.yaml dvc.lock .gitignore
$ git commit -m "Train model via DVC."
$ git tag -a "v1.0" -m "First model" # We'll use this soon ;)
```

There are two ways to get to the previous version of the dataset or model: a
full <abbr>workspace</abbr> checkout, or checkout of a specific data or model
file. Let's consider the full checkout first. It's quite straightforward:
> See also `dvc.lock`.

## Switching versions

After iterating on this process and producing several versions, you can combine
`git checkout` and `dvc checkout` to perform full or partial
<abbr>workspace</abbr> restorations.

> `v1.0` below is a Git tag that identifies the dataset version you are
> interested in. Any
> [Git reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References)
> (for example `HEAD^` or a commit hash) can be used instead.
![](/img/versioning.png) _Code and data checkout_

> Note that `dvc install` enables auto-checkouts of data after `git checkout`.

A full checkout brings the whole <abbr>project</abbr> back to a previous version
— code, dataset and model files all match each other:

```dvc
$ git checkout v1.0
@@ -80,32 +115,15 @@ M images
M model.pkl
```

These commands will restore the workspace to the first snapshot we made - code,
dataset and model files all matching each other. DVC can
[optimize](/doc/user-guide/large-dataset-optimization) this operation to avoid
copying files each time, so `dvc checkout` is quick even if you have large
dataset or model files.

On the other hand, if we want to keep the current version of code and go back to
the previous dataset only, we can do something like this (make sure that you
don't have uncommitted changes in the `images.dvc`):
However, we can check out certain parts only, for example if we want to keep the
latest source code and model but rewind to the previous dataset only:

```dvc
$ git checkout v1.0 images.dvc
$ dvc checkout images.dvc
M images
```

If you run `git status` you will see that `images.dvc` is modified and currently
points to the `v1.0` version of the <abbr>cached</abbr> data. Meanwhile, code
and model files are at their latest versions.

![](/img/versioning.png)

To share your data with others you need to set up a
[data storage](/doc/command-reference/remote). See the
[Sharing Data And Model Files](/doc/use-cases/sharing-data-and-model-files) use
case to get an overview on how to do this.

Please also don't forget to see the [Versioning](/doc/tutorials/versioning)
example to get a hands-on experience with datasets and models versioning.
DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
avoiding copying files each time, so checking out data is quick even if you have
large data files.
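
As a rough illustration of why avoiding copies makes checkout fast, the sketch below places a cached file into the workspace with a hard link and falls back to a copy only when linking fails. This is a simplified assumption, not DVC's code: the `checkout` helper and the `cache/model.pkl` path are hypothetical, and the linked optimization guide describes the link types DVC actually supports.

```python
import os
import shutil
import tempfile
from pathlib import Path


def checkout(cached: Path, workspace_path: Path) -> str:
    """Place `cached` at `workspace_path`, preferring a hard link over a copy."""
    if workspace_path.exists():
        workspace_path.unlink()
    try:
        os.link(cached, workspace_path)  # no file contents are duplicated
        return "hardlink"
    except OSError:
        # Fallback for file systems or platforms without hard link support.
        shutil.copy2(cached, workspace_path)
        return "copy"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    cached = root / "cache" / "model.pkl"  # hypothetical cache entry
    cached.parent.mkdir()
    cached.write_bytes(b"\0" * 1024)

    target = root / "model.pkl"
    method = checkout(cached, target)

    same_content = target.read_bytes() == cached.read_bytes()
    shares_storage = os.path.samefile(cached, target)
```

With a hard link, the cost of the operation does not grow with file size, because both paths refer to the same data on disk. The trade-off is that editing the workspace file in place would also change the cached version, so linked files are usually treated as read-only.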
6 changes: 3 additions & 3 deletions content/docs/user-guide/related-technologies.md
@@ -9,11 +9,11 @@ bringing best practices from software engineering into the data science field
- DVC builds upon Git by introducing the concept of data files – large files
that should not be stored in a Git repository, but still need to be tracked
and versioned. It leverages Git's features to enable managing different
versions of data itself, data pipelines, and experiments.
versions of data, data pipelines, and experiments.

- DVC is not fundamentally bound to Git, and can work without it (except
versioning-related features). This also applies to Git-LFS and Git-annex,
below.
[versioning-related](/doc/use-cases/versioning-data-and-model-files)
features).

## Git-LFS (Large File Storage)

2 changes: 1 addition & 1 deletion content/docs/user-guide/what-is-dvc.md
@@ -19,7 +19,7 @@ software engineers.

- DVC works **on top of Git repositories** and has a similar command line
interface and flow as Git. DVC can also work stand-alone, but without
versioning capabilities.
[versioning](/doc/use-cases/versioning-data-and-model-files) capabilities.

- **Data versioning** is enabled by replacing large files, dataset directories,
ML models, etc. with small