guide: draft of all contents + remove comments

jorgeorpinel committed Oct 29, 2022
1 parent a84c442 commit ab55389
Showing 1 changed file with 44 additions and 26 deletions.
70 changes: 44 additions & 26 deletions content/docs/user-guide/data-management/index.md
@@ -6,53 +6,71 @@ permissions on cloud storage, sync tools and schedules, back up snapshots, etc.
and focus on machine learning.

You work with data normally in a local <abbr>workspace</abbr>. DVC tracks,
restores, and synchronizes everything with a few straightforward operations that
do not change regardless of the underlying file systems, transfer protocols,
etc.
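
For instance, a typical cycle looks the same no matter where the data actually
lives (a minimal sketch; the dataset path and the project's configured remote
are placeholders):

```cli
$ dvc add data/images   # start tracking a dataset in the workspace
$ dvc push              # upload it to whatever storage the project is set up with
$ dvc pull              # restore it elsewhere with the exact same command
```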

![]() _Separating data from code (codification)_

<details>

## Click to learn more about data _codification_

To achieve this, DVC replaces large files and directories with small [metafiles]
that describe the assets. Data files are moved to a separate <abbr>cache</abbr>
but kept virtually (linked) in the workspace. This separates your data from code
(including metafiles).
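
For instance, tracking a file leaves a small metafile behind while the data
itself goes to the cache (a sketch; the file name, hash, and size are made-up
illustrative values, and the exact metafile fields vary by DVC version):

```cli
$ dvc add data.xml      # data.xml moves to the cache, linked back into the workspace
$ cat data.xml.dvc      # the small metafile left in its place
outs:
- md5: 22a1a2931c8370d3aeedd7183606fd7f
  size: 14445097
  path: data.xml
```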

<admon type="tip">

This also allows you to [version] project files with Git, a battle-tested [SCM]
tool.
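
For instance (a minimal sketch, assuming a `data.xml.dvc` metafile from a
previous `dvc add`):

```cli
$ git add data.xml.dvc .gitignore          # the metafile and ignore entry, not the data
$ git commit -m "Track data.xml with DVC"
```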

</admon>

[version]: /doc/user-guide/data-management/data-versioning
[scm]: https://www.atlassian.com/git/tutorials/source-code-management

</details>

Your experience can stay consistent because DVC works [indirectly], by checking
the [metafiles] and [configuration] of your <abbr>project</abbr> to find out
where and how to handle files. This is transparent to you as a user, but it's
important to understand the mechanics in general.
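
For instance, you can inspect that configuration at any time (a sketch; the
remote name and URL are placeholders):

```cli
$ cat .dvc/config        # project configuration, versioned with Git
[core]
    remote = myremote
['remote "myremote"']
    url = s3://mybucket/dvcstore
$ dvc remote list        # the same information via the CLI
myremote    s3://mybucket/dvcstore
```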

[metafiles]: /doc/user-guide/project-structure
[indirectly]: https://en.wikipedia.org/wiki/Indirection
[configuration]: /doc/command-reference/config

## Workflow and benefits

**Before**: Files are scattered in the cloud; You use low-level operations
specific to each storage platform (e.g. `aws s3 cp`); Ad hoc file names are used
to save versions; It's easy to lose track of which data produced which results;
Everyone can read and write.

**After**: Stored objects are organized by DVC and you don't need to touch them
directly; DVC exposes a few commands to manage them; Everything happens through
a code repository that can be controlled with Git; Project versions (Git
commits) guarantee reproducibility of ML processes (e.g. training models with
the same datasets, hyperparameters, features, etc.).
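
As a rough sketch of that shift (bucket, paths, and file names below are
placeholders):

```cli
# Before: storage-specific commands and ad hoc version names
$ aws s3 cp s3://mybucket/datasets/images_v3_final/ data/images/ --recursive

# After: one project-relative path, any storage backend, tied to a Git commit
$ dvc pull data/images
```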

<!-- Optionally in cloud versioning etc can be accessed directly -->

<!-- Too abstract -->
**Benefits**: You always work with project-specific paths; Efficient usage of
storage space (file deduplication); Small repository; [Data versioning]; [Fast
caching]; [GitOps].

[data versioning]: /doc/use-cases/versioning-data-and-models
[fast caching]: /doc/use-cases/fast-data-caching-hub
[gitops]: https://www.gitops.tech/

## Storage locations

DVC can manage data anywhere: cloud storage, SSH servers, network resources
(e.g. NAS), mounted drives, local file systems, etc. These locations can be
separated into three groups.
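
For instance, DVC remote storage (one of these groups) can be set up on any of
these systems with the same command (names and URLs below are placeholders):

```cli
$ dvc remote add -d cloud s3://mybucket/dvcstore        # cloud object storage
$ dvc remote add onprem ssh://user@nas.local/dvcstore   # SSH server / NAS
$ dvc remote add backup /mnt/drive/dvcstore             # mounted or local drive
```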

![Storage locations](/img/storage-locations.png) _Local, external, and remote
storage locations_
