guide: Data Mgmt intro + note updates
jorgeorpinel committed Oct 29, 2022
1 parent 2f31bb6 commit a84c442
Showing 1 changed file with 42 additions and 16 deletions.
content/docs/user-guide/data-management/index.md (42 additions & 16 deletions)
@@ -1,29 +1,55 @@
# Data Management with DVC

<!--
Focus on (changed) workflows (e.g. from aws s3 cp to dvc get)
> It's a big paradigm shift.
Storing and transferring datasets and ML models can vary depending on project
needs, available infrastructure, etc. DVC helps you avoid logistics like object
permissions on cloud storage, sync tools and schedules, and backup snapshots, so
you can focus on machine learning.

Understand why it's important to "pay this price" (codify, separate storage, go through git repo)
-->
You work with data normally in a local <abbr>workspace</abbr>. DVC tracks,
restores, and synchronizes everything with a few straightforward commands
(similar to Git) that do not change regardless of the underlying file systems,
transfer protocols, etc.

Managing datasets and ML models tends to be a manual and different process for
each team and project.
![]() _Separating data from code_

<!--
Benefits (similar to use cases); Indirection: DVC orgs objects into dirs, you deal with project-specific refs; deduplication
-->
To achieve this, DVC relies on data _codification_: replacing large files and
directories with small [metafiles] that describe the assets. Data files are
moved to a separate <abbr>cache</abbr> but kept virtually (linked) in the
workspace. This **separates your data from code** (including metafiles).
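For example, tracking a file replaces its contents in the Git repo with a small `.dvc` metafile along these lines (file name, hash, and size are illustrative):

```yaml
# data.xml.dvc -- versioned with Git in place of the data itself
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml
```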

With DVC, you manipulate the project files normally in your local workspace; DVC
tracks, restores, and synchronizes them across locations.
<admon type="tip">

This also allows you to [version] all project files with Git, a battle-tested
[SCM] tool.

</admon>

DVC operations stay the same because they work [indirectly], by going through
the metafiles and [configuration] of your <abbr>project</abbr> to find out where
and how to handle files. This is transparent to you as a user, but it's
important to understand the mechanics in general.
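As a sketch, the per-project configuration DVC consults lives in `.dvc/config`; the remote name and URL below are illustrative:

```ini
[core]
    remote = storage
['remote "storage"']
    url = s3://mybucket/dvcstore
```

Commands read this file (plus the metafiles) to resolve where data lives, so switching storage backends means editing configuration, not your workflow.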

## Workflow and benefits

<!-- Focus on (changed) workflows (e.g. from aws s3 cp to dvc get); It's a big paradigm shift. -->

...

<!-- Benefits (similar to use cases); Indirection: DVC orgs objects into dirs, you deal with project-specific refs; deduplication -->

[metafiles]: /doc/user-guide/project-structure
[indirectly]: https://en.wikipedia.org/wiki/Indirection
[configuration]: /doc/command-reference/config
[version]: /doc/user-guide/data-management/data-versioning
[scm]: https://www.atlassian.com/git/tutorials/source-code-management

## How it works
## Storage locations

<!-- Too abstract -->

DVC helps you manage and share arbitrarily large files anywhere: cloud storage,
SSH servers, network resources (e.g. NAS), mounted drives, local file systems,
etc. To do so, several storage locations can be defined.
DVC can manage data anywhere: cloud storage, SSH servers, network resources
(e.g. NAS), mounted drives, local file systems, etc. These locations can be
separated into three groups.

<!-- (Relevant) implementation detail? -->

