use-cases: second iteration of Data Registry case #818

Merged
merged 32 commits into from
Dec 16, 2019
Commits
c31d971
use-cases: address smaller points from review (#795)
jorgeorpinel Nov 20, 2019
6002cba
use-cases: reinforce hypothetical phrasing in data registry intro par…
jorgeorpinel Nov 21, 2019
47ebae5
use-cases: partitioned->split in data registry case
jorgeorpinel Nov 21, 2019
a578c15
use-cases: geatly simplify mention about project inter-dependency in …
jorgeorpinel Nov 21, 2019
d9ad1ab
use-cases: improve intro to example in data registry case
jorgeorpinel Nov 22, 2019
50b772e
use-cases: rephrase much of the data registry example to improve its …
jorgeorpinel Nov 23, 2019
55ab757
review usage of ellipses thoughout docs
jorgeorpinel Nov 24, 2019
d125437
use-cases: remove remark about imports getting messy
jorgeorpinel Nov 25, 2019
283eef5
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Nov 25, 2019
3cba8f8
use-cases: further simplify intro of data registry case
jorgeorpinel Nov 25, 2019
131a27e
use-cases: separate example into 2 sections, expand on them
jorgeorpinel Nov 25, 2019
a7dc465
use-cases: comlpete "Building a data registry" section in data-registry
jorgeorpinel Nov 25, 2019
57d4059
use-cases: provide high level abstract overview of the Git and DVC co…
jorgeorpinel Nov 26, 2019
c49bc0c
use-cases: simplify intro and 2nd section in data-registry
jorgeorpinel Nov 26, 2019
8c300a2
use-cases: fix typo in data-registry
jorgeorpinel Nov 26, 2019
6854a8b
WIP: use-cases: simplofy middle sections per discussion with Ivan, by
jorgeorpinel Nov 28, 2019
e2d93c7
WIP: use-cases: rewrite middle section of data registry without cats-…
jorgeorpinel Nov 28, 2019
faeb057
use-cases: review Construction and workflow section per private revie…
jorgeorpinel Nov 30, 2019
f4997cb
use-cases: more updates to data registry per private discussion
jorgeorpinel Dec 1, 2019
707a507
use-cases: draft of new Usage section in data registry
jorgeorpinel Dec 3, 2019
f30c1e7
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Dec 10, 2019
7954f59
use-cases: add diagram to data registry
jorgeorpinel Dec 10, 2019
51ee72b
use-cases: improve usage section (adding API section) and
jorgeorpinel Dec 11, 2019
485fc49
use-cases: add note about deployment via dvc.api.open to data registr…
jorgeorpinel Dec 11, 2019
6ccc49f
use-cases: Some updates per private discussion with Ivan
jorgeorpinel Dec 11, 2019
b42c9cf
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Dec 12, 2019
de65290
use-cases: more feedback per private chat with Ivan
jorgeorpinel Dec 12, 2019
53ea7c6
use-cases: updated img subscript for data registry
jorgeorpinel Dec 12, 2019
7887ca2
use-cases: address Alex' feedback on data registry 2nd iteration
jorgeorpinel Dec 13, 2019
175b75a
use-cases: addressing more feedback from Ivan
jorgeorpinel Dec 16, 2019
7a395f8
use-cases: address Alex's feedback from
jorgeorpinel Dec 16, 2019
f9c1a74
Merge branch 'master' into use-cases/data-registry
jorgeorpinel Dec 16, 2019
2 changes: 1 addition & 1 deletion src/Documentation/sidebar.json
@@ -109,7 +109,7 @@
"slug": "sharing-data-and-model-files"
},
"shared-development-server",
-"data-registry"
+"data-registries"
]
},
{
2 changes: 1 addition & 1 deletion static/docs/command-reference/get.md
@@ -175,7 +175,7 @@ different names, and not currently tracked by Git:
$ git status
...
Untracked files:
-(use "git add <file>..." to include in what will be committed)
+(use "git add <file> ..." to include in what will be committed)

model.bigrams.pkl
model.monograms.pkl
7 changes: 3 additions & 4 deletions static/docs/command-reference/install.md
@@ -155,7 +155,7 @@ checkout the `6-featurization` tag:
$ git checkout 6-featurization
Note: checking out '6-featurization'.

-You are in 'detached HEAD' state. ...
+You are in 'detached HEAD' state...

$ dvc status

@@ -216,7 +216,7 @@ We can now repeat the command run earlier, to see the difference.
$ git checkout 6-featurization
Note: checking out '6-featurization'.

-You are in 'detached HEAD' state. ...
+You are in 'detached HEAD' state...

HEAD is now at d13ba9a add featurization stage

@@ -257,8 +257,7 @@ helpfully informs us the workspace is out of sync. We should therefore run the

```dvc
$ dvc repro evaluate.dvc
-
-... much output
+...
To track the changes with git run:

git add featurize.dvc train.dvc evaluate.dvc
2 changes: 1 addition & 1 deletion static/docs/tutorials/deep/reproducibility.md
@@ -34,7 +34,7 @@ $ dvc repro model.p.dvc
$ dvc repro
```

-Tries to reproduce the same pipeline... But there is still nothing to reproduce.
+Tries to reproduce the same pipeline, but there is still nothing to reproduce.

## Adding bigrams

210 changes: 210 additions & 0 deletions static/docs/use-cases/data-registries.md
@@ -0,0 +1,210 @@
# Data Registries

One of the main uses of <abbr>DVC repositories</abbr> is the
[versioning of data and model files](/doc/use-cases/data-and-model-files-versioning),
with commands such as `dvc add`. To enable reuse of these
<abbr>data artifacts</abbr> across different projects, DVC also provides the
`dvc import` and `dvc get` commands, among others. This means that a project can
depend on data from an external <abbr>DVC project</abbr>, **similar to package
management systems, but for data science projects**.

![](/static/img/data-registry.png) _Data and models as code_

Keeping this in mind, we could build a <abbr>DVC project</abbr> dedicated to
tracking and versioning _datasets_ (or any large data, even ML models). This way
we would have a repository with all the metadata and history of changes of
different datasets. We could see who updated what, and when, and use pull
requests to update data (the same way we do with code). This is what we call a
**data registry**, which can work as data management _middleware_ between ML
projects and cloud storage.

> Note that a single dedicated repository is just one possible pattern to create
> data registries with DVC.

Advantages of using a DVC **data registry** project:

- Data as code: Improve _lifecycle management_ with versioning of simple
directory structures (like Git on cloud storage), without ad-hoc conventions.
Leverage Git and Git hosting features such as commits, branching, pull
requests, reviews, and even continuous deployment of ML models.
- Reusability: Reproduce and organize _feature stores_ with a simple CLI
(`dvc get` and `dvc import` commands, similar to software package management
systems like `pip`).
- Persistence: The DVC registry-controlled
[remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves
data security. It is less likely that someone can delete or overwrite a model,
for example.
- Storage Optimization: Track data
[shared](/doc/use-cases/share-data-and-model-files) by multiple projects
centralized in a single location (with the ability to create distributed
copies on other remotes). This simplifies data management and optimizes space
requirements.
- Security: Registries can be set up to have read-only remote storage (e.g. an
HTTP location). Git versioning of [DVC-files](/doc/user-guide/dvc-file-format)
allows us to track and audit data changes.

## Building registries

Data registries can be created like any other <abbr>DVC repository</abbr> with
`git init` and `dvc init`. A good way to organize them is with different
directories, to group the data into separate uses, such as `images/`,
`natural-language/`, etc. For example, our
[dataset-registry](https://github.com/iterative/dataset-registry) uses a
directory for each section in our website documentation, like `get-started/`,
`use-cases/`, etc.

Adding datasets to a registry can be as simple as placing the data file or
directory in question inside the <abbr>workspace</abbr>, and telling DVC to
track it, with `dvc add`. For example:

```dvc
$ mkdir -p music
$ cp -r ~/Downloads/millionsongsubset_full music/songs
$ dvc add music/songs
```

> This example dataset actually exists. See
> [MillionSongSubset](http://millionsongdataset.com/pages/getting-dataset/#subset).

A regular Git workflow can be followed with the tiny
[DVC-files](/doc/user-guide/dvc-file-format) that substitute the actual data
(`music/songs.dvc` in this example). This enables team collaboration on data at
the same level as with source code (commit history, branching, pull requests,
reviews, etc.):

```dvc
$ git add music/songs.dvc music/.gitignore
$ git commit -m "Track 1.8 GB 10,000 song dataset in music/"
```

The actual data is stored in the project's <abbr>cache</abbr> and should be
[pushed](/doc/command-reference/push) to one or more
[remote storage](/doc/command-reference/remote) locations, so the registry can
be accessed from other locations or by other people:

```dvc
$ dvc remote add myremote s3://bucket/path
$ dvc push
```
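
To give an intuition for how the <abbr>cache</abbr> mentioned above can address data by its content, here is a minimal sketch in Python. It assumes a simplified layout where the MD5 checksum of a file is split into a two-character directory and a file name. The `cache_path` helper is hypothetical and only illustrative; it is not DVC's actual implementation, which handles further details (directories, for example).

```python
import hashlib
from pathlib import Path


def cache_path(data: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Return a content-addressed path for `data` (hypothetical helper)."""
    checksum = hashlib.md5(data).hexdigest()
    # The first two hex characters name a subdirectory, the rest name the file.
    return Path(cache_dir) / checksum[:2] / checksum[2:]


# Identical content always maps to the same location,
# while any change in the data produces a different one.
same_a = cache_path(b"la la la")
same_b = cache_path(b"la la la")
other = cache_path(b"la la la!")
print(same_a == same_b, same_a == other)
```

Because the location depends only on the content, the same file shared by several projects would need to be stored just once under this scheme.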

## Using registries

The main methods to consume <abbr>data artifacts</abbr> from a **data registry**
are the `dvc import` and `dvc get` commands, as well as the `dvc.api` Python
API.

### Simple download (get)

This is analogous to using direct download tools like
[`wget`](https://www.gnu.org/software/wget/) (HTTP),
[`aws s3 cp`](https://docs.aws.amazon.com/cli/latest/reference/s3/cp.html) (S3),
etc. To get a dataset, for example, we can run something like:

```dvc
$ dvc get https://github.com/example/registry \
music/songs/
```

This downloads `music/songs/` from the <abbr>project</abbr>'s
[default remote](/doc/command-reference/remote/default) and places it in the
current working directory (anywhere in the file system with user write access).

> Note that this command (as well as `dvc import`) has a `--rev` option to
> download specific versions of the data.

### Import workflow

`dvc import` uses the same syntax as `dvc get`:

```dvc
$ dvc import https://github.com/example/registry \
images/faces/
```

> Note that unlike `dvc get`, which can be used from any directory, `dvc import`
> needs to run within an [initialized](/doc/command-reference/init) DVC project.

Besides downloading the data, importing records the local project's dependency
on the data source (the registry repository). This is achieved by creating a
particular kind of [DVC-file](/doc/user-guide/dvc-file-format) (a.k.a. _import
stage_). This file can be staged and committed with Git.

Thanks to the saved dependency, we can easily bring the imported data up to
date in our consumer project with `dvc update` whenever the dataset changes in
the source project (data registry):

```dvc
$ dvc update faces.dvc
```

`dvc update` downloads new and changed files, or removes deleted ones, from
`images/faces/`, based on the latest version of the source project. It also
updates the project dependency metadata in the import stage (DVC-file).
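
The kind of comparison described above can be pictured with a small Python sketch. It assumes each version of a dataset is summarized as a mapping from relative file path to checksum. The `plan_update` function and the sample file names and checksums are hypothetical, and this is not how DVC is implemented internally; it only illustrates the idea of downloading new or changed files and removing deleted ones.

```python
def plan_update(old: dict, new: dict) -> dict:
    """Compare two {relative_path: checksum} mappings (hypothetical helper)."""
    # Files that are new, or whose checksum differs, need to be downloaded.
    to_download = sorted(p for p, h in new.items() if old.get(p) != h)
    # Files that no longer exist in the source need to be removed locally.
    to_delete = sorted(p for p in old if p not in new)
    return {"download": to_download, "delete": to_delete}


# Made-up listings of the dataset before and after a change in the registry.
old = {"a.jpg": "111", "b.jpg": "222", "c.jpg": "333"}
new = {"a.jpg": "111", "b.jpg": "999", "d.jpg": "444"}

plan = plan_update(old, new)
print(plan)
```

In this example `a.jpg` is left untouched, since its checksum is the same in both versions.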

### Programmatic reusability of DVC data

Our Python API, included with the `dvc` package installed with DVC, includes the
`open` function to load/stream data directly from external DVC projects:

```python
import pickle

import dvc.api

model_path = 'model.pkl'
repo_url = 'https://github.com/example/registry'

# Open in binary mode, since pickle reads bytes.
with dvc.api.open(model_path, repo_url, mode='rb') as fd:
    model = pickle.load(fd)
    # ... Use the model!
```

This opens `model.pkl` as a file-like object. The example above illustrates a
hardcoded ML model **deployment** method.
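
A deployment like this may also want to pin a specific version of the model. The sketch below wraps the call in a small function that passes the `rev` argument of `dvc.api.open`. The `load_model` and `fake_opener` names, the `opener` parameter, and the `v1.0` revision are assumptions made for illustration; the fake opener only exists so the sketch can run without network access or a real registry.

```python
import io
import pickle


def load_model(path, repo, rev=None, opener=None):
    """Load a pickled model from a DVC repo, optionally pinned to `rev`."""
    if opener is None:
        import dvc.api  # imported lazily, only needed for real registries

        opener = dvc.api.open
    # Binary mode, since pickle reads bytes.
    with opener(path, repo=repo, rev=rev, mode="rb") as fd:
        return pickle.load(fd)


def fake_opener(path, repo=None, rev=None, mode="r"):
    """Stand-in for dvc.api.open that returns an in-memory pickle."""
    return io.BytesIO(pickle.dumps({"path": path, "rev": rev}))


model = load_model(
    "model.pkl",
    "https://github.com/example/registry",
    rev="v1.0",  # hypothetical Git tag in the registry
    opener=fake_opener,
)
print(model)
```

Leaving out `opener` makes the function fall back to `dvc.api.open`, so the same code path can be used against an actual data registry.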

## Updating registries

Datasets evolve, and DVC is prepared to handle it. Just change the data in the
registry, and apply the updates by running `dvc add` again:

```dvc
$ cp -r /path/to/1000/songs/dir/. music/songs/
$ dvc add music/songs
```

DVC then modifies the corresponding DVC-file to reflect the changes in the data,
and this will be noticed by Git:

```dvc
$ git status
Changes not staged for commit:
...
modified: music/songs.dvc
$ git commit -am "Add 1,000 more songs to music/ dataset."
```

Iterating on this process for several datasets can give shape to a robust
registry, which is basically a repository that mainly versions a set of
DVC-files, as you can see in the hypothetical example below.

```dvc
$ tree --filelimit=100
.
├── images
│ ├── .gitignore
│ ├── cats-dogs [2800 entries] # Listed in .gitignore
│ ├── faces [10000 entries] # Listed in .gitignore
│ ├── cats-dogs.dvc
│ └── faces.dvc
├── music
│ ├── .gitignore
│ ├── songs [11000 entries] # Listed in .gitignore
│ └── songs.dvc
├── text
...
```

And let's not forget to `dvc push` data changes to the
[remote storage](/doc/command-reference/remote), so others can obtain them!

```dvc
$ dvc push
```