cases: [WIP] begin rewriting Versioning:
explain why versioning large files is important/a thing
per #1716 (comment)
jorgeorpinel committed Sep 1, 2020
1 parent 39d4400 commit 87264eb
Showing 1 changed file with 19 additions and 105 deletions.
content/docs/use-cases/versioning-data-and-model-files/index.md

# Versioning Data and Model Files

DVC enables versioning large files and directories such as datasets, data
science features, and machine learning models using Git, but without storing the
contents in Git.

SCM, or _version control_, was a disruptive introduction to software
development because it enables effective collaboration on source code by all
the stakeholders of a project. In [Git](https://git-scm.com/), this means
commits, branches and tags, merging or rebasing, etc.

This is achieved by saving information about the data in special
[metafiles](/doc/user-guide/dvc-files-and-directories) that replace the data in
the repository. These can be versioned with regular Git workflows (branches,
pull requests, etc.)
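
For illustration, here is a minimal sketch of what such a metafile might
contain (the file name and `md5` value are made-up placeholders, not output
from a real project):

```yaml
# images.dvc: hypothetical metafile standing in for the images/ directory
outs:
- md5: 3863d0e317dee0a55c4e59d2ec0eef33.dir # placeholder content hash
  path: images
```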

Source code versioning features require storing text files and other small
assets in the code repository, but **storage itself** is not the goal of SCM.
In fact, keeping large or binary files in code repos can be considered a side
effect, and it's severely limited by Git hosting services
([e.g. GitHub](https://docs.github.com/en/github/managing-large-files/what-is-my-disk-quota)).

To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
synchronizing it with various types of
[remote storage](/doc/command-reference/remote). This allows storing and sharing
data easily, and alongside code.
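
For example, here is a hedged sketch of setting up a remote and synchronizing
data (the S3 bucket URL is a placeholder):

```dvc
$ dvc remote add -d storage s3://mybucket/dvcstore   # hypothetical bucket
$ dvc push   # upload cached data to the remote
$ dvc pull   # download it elsewhere, e.g. on a teammate's machine
```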

Traditional storage solutions like hard drives or NAS, as well as cloud
storage services like Amazon S3 and Google Drive, are far better suited to
storing big data files and folders. So what if we could combine their
advantages with the versioning capabilities of Git?

![](/img/model-versioning-diagram.png) _DVC's hybrid versioned storage model_

In this basic use case, DVC is a better alternative to
[Git-LFS / Git-annex](/doc/user-guide/related-technologies) and to ad-hoc
scripts used to manage ML <abbr>artifacts</abbr> (training data, models, etc.)
on cloud storage. DVC doesn't require special services, and works with
on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
(Amazon S3, Microsoft Azure, Google Drive,
[among others](/doc/command-reference/remote/add#supported-storage-types)).
...

## How it Looks

...

> For hands-on experience, we recommend following the
> [versioning tutorial](/doc/use-cases/versioning-data-and-model-files).
...

are designed for source code management (SCM) however, and thus ill-equipped to
support data science needs. That's where DVC comes in: with its built-in data
<abbr>cache</abbr>, reproducible [pipelines](/doc/start/data-pipelines), and
several other novel features (see [Get Started](/doc/start/) for a primer).

## Track data and models for versioning

Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
images in the `images/` directory. You can start tracking it with `dvc add`.
This generates a `.dvc` file, which can be committed to Git in order to save
the project's version:

```dvc
$ ls images/
0001.jpg 0002.jpg 0003.jpg 0004.jpg ...
$ dvc add images/
$ git add images.dvc .gitignore
$ git commit -m "Track images dataset with DVC."
```

DVC also allows you to define the processes that build artifacts based on
tracked data, such as an ML model, by writing a simple `dvc.yaml` file that
connects the pieces together:

> `dvc.yaml` files can be written manually or generated with `dvc run`.

```yaml
stages:
  train:
    cmd: python train.py images/
    deps:
    - images
    outs:
    - model.pkl
```

> See [Data Pipelines](/doc/start/data-pipelines) for a comprehensive intro to
> this feature.
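
As a sketch of that alternative, the equivalent stage could be generated with
`dvc run` instead of writing the YAML by hand:

```dvc
$ dvc run -n train \
          -d images \
          -o model.pkl \
          python train.py images/
```
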
`dvc repro` can now execute the `train` stage for you. DVC will track all of its
outputs (`outs`) automatically. Let's do that, and commit this project version:

```dvc
$ dvc repro
Running stage 'train' with command:
python train.py images/
Updating lock file 'dvc.lock'
...
$ git add dvc.yaml dvc.lock .gitignore
$ git commit -m "Train model via DVC."
$ git tag -a "v1.0" -m "First model" # We'll use this soon ;)
```

> See also `dvc.lock`.
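
As an illustration, `dvc.lock` records the exact data versions used by each
stage; a sketch of its contents after this run (the hash values are
placeholders):

```yaml
train:
  cmd: python train.py images/
  deps:
  - path: images
    md5: 3863d0e317dee0a55c4e59d2ec0eef33.dir # placeholder
  outs:
  - path: model.pkl
    md5: a304afb96060aad90176268345e10355 # placeholder
```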

## Switching versions

After iterating on this process and producing several versions, you can combine
`git checkout` and `dvc checkout` to perform full or partial
<abbr>workspace</abbr> restorations.

![](/img/versioning.png) _Code and data checkout_

> Note that `dvc install` enables auto-checkouts of data after `git checkout`.
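
A quick sketch of that convenience (using the tag created earlier):

```dvc
$ dvc install         # sets up Git hooks in the repo
$ git checkout v1.0   # data checkout now happens automatically
```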

A full checkout brings the whole <abbr>project</abbr> back to a previous version
— code, dataset and model files all match each other:

```dvc
$ git checkout v1.0
$ dvc checkout
M images
M model.pkl
```

However, we can check out only certain parts, for example if we want to keep
the latest source code and model but rewind only the dataset:

```dvc
$ git checkout v1.0 images.dvc
$ dvc checkout images.dvc
M images
```

DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
avoiding copying files each time, so checking out data is quick even if you have
large data files.
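
For instance, one way to prefer file links over copies, assuming the
underlying filesystem supports them (this tweak is optional):

```dvc
$ dvc config cache.type reflink,hardlink,symlink,copy
$ dvc checkout --relink   # re-apply the link strategy to workspace files
```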
