docs: add key terms to use case intros/tutorial and what is dvc? docs [SEO] #1806

Merged on Oct 8, 2020 (16 commits)
Added model and data versioning references to expand search terms
jeremydesroches committed Sep 24, 2020
commit 916ca5fa861e8847053cd6da86f0125a5082ed4e
19 changes: 10 additions & 9 deletions content/docs/use-cases/versioning-data-and-model-files/index.md
@@ -11,8 +11,8 @@ pull requests, etc.)

To actually store the data, DVC uses a built-in <abbr>cache</abbr>, and supports
synchronizing it with various types of
-[remote storage](/doc/command-reference/remote). This allows storing and sharing
-data easily, and alongside code.
+[remote storage](/doc/command-reference/remote). This allows for easy data and
+model versioning, storage, and sharing — right alongside code.

![](/img/model-versioning-diagram.png) _Code and data flows in DVC_

@@ -30,9 +30,9 @@ on-premises storage (e.g. SSH, NAS) as well as any major cloud storage provider
## DVC is not Git!

DVC metafiles such as `dvc.yaml` and `.dvc` files serve as placeholders to track
-data files and directories (among other purposes). They point to specific data
-contents in the <abbr>cache</abbr>, providing the ability to store multiple data
-versions out-of-the-box.
+the version of data files and directories (among other purposes). They point to
+specific data contents in the <abbr>cache</abbr>, providing the ability to store
+multiple data versions out-of-the-box.
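
To make the placeholder idea concrete, here is a minimal Python sketch of a content-addressed store in which small metafiles point at cached data. It is an illustration only and not DVC's implementation: the helper names (`cache_put`, `write_placeholder`, `read_through_placeholder`), the JSON placeholder layout, and the choice of SHA-256 are all assumptions made for this example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cache_put(cache_dir: Path, data: bytes) -> str:
    """Store data under its content hash and return that hash (hypothetical helper)."""
    digest = hashlib.sha256(data).hexdigest()
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        target.write_bytes(data)
    return digest


def write_placeholder(path: Path, digest: str) -> None:
    """Write a tiny metafile recording which cached content it refers to."""
    path.write_text(json.dumps({"hash": digest}))


def read_through_placeholder(cache_dir: Path, placeholder: Path) -> bytes:
    """Resolve a placeholder back to the bytes it points at."""
    digest = json.loads(placeholder.read_text())["hash"]
    return (cache_dir / digest[:2] / digest[2:]).read_bytes()


def demo_placeholders() -> tuple:
    """Keep two versions of a dataset side by side, each behind its own placeholder."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        v1 = root / "data.v1.meta"
        v2 = root / "data.v2.meta"
        write_placeholder(v1, cache_put(cache, b"first version"))
        write_placeholder(v2, cache_put(cache, b"second version"))
        return (
            read_through_placeholder(cache, v1),
            read_through_placeholder(cache, v2),
        )


if __name__ == "__main__":
    print(demo_placeholders())
```

Because each placeholder stores only a hash, versioning the placeholder (for example in Git) is enough to select which cached content belongs to a given project version.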

Full-fledged
[version control](https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control)
@@ -46,7 +46,7 @@ several other novel features (see [Get Started](/doc/start/) for a primer.)

Let's say you have an empty <abbr>DVC repository</abbr> and put a dataset of
images in the `images/` directory. You can start tracking it with `dvc add`.
-This generate a `.dvc` file, which can be committed to Git in order to save the
+This generates a `.dvc` file, which can be committed to Git in order to save the
project's version:

```dvc
@@ -116,7 +116,8 @@ M model.pkl
```

However, we can checkout certain parts only, for example if we want to keep the
-latest source code and model but rewind to the previous dataset only:
+latest source code and model versions, but rewind to the previous version of the
+dataset:

```dvc
$ git checkout v1.0 images.dvc
@@ -125,5 +126,5 @@ M images
```

DVC [optimizes](/doc/user-guide/large-dataset-optimization) this operation by
-avoiding copying files each time, so checking out data is quick even if you have
-large data files.
+avoiding copying files each time, so checking out data is quick even if you are
+versioning large data files.
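
One way a checkout can avoid copying is to link the workspace path to the cached file rather than duplicating its bytes. The Python sketch below shows that idea with a hard link. It assumes a filesystem that supports hard links, `link_checkout` is a hypothetical helper rather than DVC code, and the link type DVC actually uses depends on its configuration and the filesystem.

```python
import os
import tempfile
from pathlib import Path


def link_checkout(cached: Path, workspace: Path) -> None:
    """Place `cached` at `workspace` via a hard link instead of copying bytes."""
    if workspace.exists():
        workspace.unlink()
    os.link(cached, workspace)


def demo_link_checkout() -> bool:
    """Return True if the workspace file shares storage with the cache entry."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cached = root / "cache_entry"
        cached.write_bytes(b"x" * 1024)
        workspace = root / "images.bin"
        link_checkout(cached, workspace)
        # Both names refer to the same underlying file, so no data was duplicated.
        return os.path.samefile(cached, workspace)


if __name__ == "__main__":
    print(demo_link_checkout())
```

Creating a link takes roughly the same time regardless of file size, which is why this style of checkout stays quick for large files.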