guide: Data Management #4042

Closed
wants to merge 88 commits into from
7350938
guide: draft structure of Data Mgmt and
jorgeorpinel Oct 13, 2022
203f6a6
guide: full text for draft intro to DM
jorgeorpinel Oct 14, 2022
90eaa5d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 17, 2022
eb246bb
guide: hide cloud versioning info
jorgeorpinel Oct 17, 2022
a3687ec
guide: clarify Data Mgmt parts and
jorgeorpinel Oct 18, 2022
fad0bad
guide: add figure drafts to Data Mgmt
jorgeorpinel Oct 19, 2022
4e3c3da
guide: SCM->VC (Data Mgmt)
jorgeorpinel Oct 19, 2022
7f02c15
guide: update 2 figs and add 1 more (Data Mgmt)
jorgeorpinel Oct 19, 2022
f41d16e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
3a9a045
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
c0b92f1
guide: roll back unrelated changes
jorgeorpinel Oct 21, 2022
c2303c0
guide: mention clouds first (DM) and
jorgeorpinel Oct 22, 2022
62997ab
guide: flatten DM index
jorgeorpinel Oct 22, 2022
fc74c53
guide: udpates to DM/ DV
jorgeorpinel Oct 22, 2022
8c40a03
guide: add DM/ Data Versioning page
jorgeorpinel Oct 22, 2022
1a8ca61
guide: update outdated link
jorgeorpinel Oct 22, 2022
27be87f
guide: revert more unrelatedly chaqnged files
jorgeorpinel Oct 22, 2022
aaee7af
guide: remove unused ref link
jorgeorpinel Oct 22, 2022
24c331a
guide: remove a comment
jorgeorpinel Oct 22, 2022
73e2f55
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 27, 2022
ec1af6d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 28, 2022
2f31bb6
guide: splits and notes around Data Mgmt index page
jorgeorpinel Oct 28, 2022
a84c442
guide: Data Mgmt intro + note updates
jorgeorpinel Oct 29, 2022
ab55389
guide: draft of all contents +
jorgeorpinel Oct 29, 2022
31d5288
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 1, 2022
a13f989
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 2, 2022
601c99e
guide: small impros to Data Mgmt
jorgeorpinel Nov 2, 2022
a8bad84
guide: rewrite Data Mgmt index in before/after form
jorgeorpinel Nov 3, 2022
c8cc17b
guide: add draft figure for Data Mgmt
jorgeorpinel Nov 4, 2022
3cb84cb
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 8, 2022
a13cb0f
guide: simplify/refocus data mgmt index
jorgeorpinel Nov 8, 2022
e3ba70b
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 17, 2022
c29d9ec
work around commented header bug
jorgeorpinel Nov 17, 2022
875fba3
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 23, 2022
831ad1d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 25, 2022
8ddda9c
guide: drop DM/ DV page
jorgeorpinel Nov 25, 2022
28322e5
guide: rewrite DM intro and
jorgeorpinel Nov 25, 2022
179d172
guide: use DM table instead of figure for now
jorgeorpinel Nov 25, 2022
d979a5e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 30, 2022
74bc156
guide: rewrite Data Mgmt story
jorgeorpinel Nov 30, 2022
e138096
guide: add draft figures to Data Mgmt
jorgeorpinel Nov 30, 2022
f904038
guide: simplify Data Mgmt story and benefits
jorgeorpinel Dec 1, 2022
e1772ea
guide: remove unused images (DM)
jorgeorpinel Dec 1, 2022
cc0390e
guide: update Data Mgmt figures (v1)
jorgeorpinel Dec 2, 2022
4ee3223
guide: rewrite text of Data Mgmt index
jorgeorpinel Dec 8, 2022
149599b
Merge branch 'main' of github.com:iterative/dvc.org into guide/data-m…
rogermparent Dec 8, 2022
f2acb66
guide: update Data Mgmt figures
jorgeorpinel Dec 8, 2022
723eb50
guide: iterate on Data Mgmt again
jorgeorpinel Dec 14, 2022
4b67b64
guide: update Data Mgmt figs
jorgeorpinel Dec 14, 2022
9eb7143
guide: more supporting info about Data Mgmt
jorgeorpinel Dec 18, 2022
e598839
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 21, 2022
dd4466e
guide: update figures (much more concrete) and
jorgeorpinel Dec 21, 2022
d637179
guide: edits to How it works (Data Mgmt)
jorgeorpinel Dec 21, 2022
c007817
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
5a0fd57
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
3eb81ff
guide: update Data Mgmt figures
jorgeorpinel Dec 22, 2022
98e73ff
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 23, 2022
67b1717
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 27, 2022
f3af183
guide: emphaisze dataset versions in UG fig 1
jorgeorpinel Dec 27, 2022
206ce77
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 4, 2023
075aaf3
guide: update Data Mgmt figures (with notes),
jorgeorpinel Jan 5, 2023
7377500
guide: more updates to text and figure styles,
jorgeorpinel Jan 5, 2023
baf5b4c
guide: update figures and text (Data Mgmt) ...
jorgeorpinel Jan 9, 2023
fb35df5
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 11, 2023
4475f78
guide: Data Management text (section 1)
jorgeorpinel Jan 11, 2023
20fbaae
guide: Data Management (main text)
jorgeorpinel Jan 11, 2023
1da7b8a
guide: Data Management (secondary text)
jorgeorpinel Jan 12, 2023
61e2865
Merge branch 'guide/data-mgmt-flows' of github.com:iterative/dvc.org …
jorgeorpinel Jan 12, 2023
ed63127
guide: add DVC data mgmt technical diagram &
jorgeorpinel Jan 12, 2023
0109cf3
guide: update Data Mgmt text
jorgeorpinel Jan 18, 2023
77330cc
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 18, 2023
956b03d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 19, 2023
7152ad3
guide: udpate text and 2nd figure (Data Mgmt)
jorgeorpinel Jan 19, 2023
f29da1e
guide: draft 2nd and 3rd figures
jorgeorpinel Jan 19, 2023
8f49a72
guide: rewrite Data Mgmt/ How it works &
jorgeorpinel Jan 20, 2023
f876c17
guide: update drafts of Data Mgmt figures 2, 3
jorgeorpinel Jan 20, 2023
ee3f721
guide: Data Mgmt improvements and
jorgeorpinel Jan 24, 2023
061a918
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 24, 2023
c10bda6
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 26, 2023
c3ca226
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Feb 17, 2023
d341645
guide: update Data Mgmt figures
jorgeorpinel Feb 17, 2023
311dd3c
guide: 2 typos
jorgeorpinel Feb 17, 2023
0299ebd
guide: Data Mgmt/ Tradeoff section
jorgeorpinel Feb 17, 2023
185f78d
guide: mention remote storage in Data Mgmt
jorgeorpinel Feb 17, 2023
22fde5a
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Feb 18, 2023
d1d54f6
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Feb 21, 2023
3ef82c1
guide: shorten Data Mgmt intro, hide...
jorgeorpinel Mar 24, 2023
cf11bd6
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Mar 24, 2023
2 changes: 1 addition & 1 deletion content/docs/sidebar.json
@@ -122,7 +122,7 @@
},
{
"slug": "data-management",
- "source": false,
+ "source": "data-management/index.md",
"children": [
"large-dataset-optimization",
{
188 changes: 188 additions & 0 deletions content/docs/user-guide/data-management/index.md
@@ -0,0 +1,188 @@
# Data Management for Machine Learning

<!--
## Data Management for Machine Learning
-->

Where and how to store data and ML model files is one of the first decisions
your team will face, but traditional backup strategies do not fit the data
science lifecycle. Large files end up scattered across multiple buckets;
overlapping dataset versions coexist, causing data leakage and inefficient use
of space; and the project's evolution becomes hard to track. What was the name
of the best model? Is it safe to delete `2020-dset_v2.zip`? Can others reproduce
my results?

![Direct access storage](/img/direct_access_storage.png) _The S3 bucket on the
right is shared (and bloated) by several people and projects. You need to know
the exact location of the correct files, and use cloud-specific tools (e.g. AWS
CLI) to access them directly._

To maintain control and visibility over all your data and models, DVC stores
large files and directories for you in a structured way. It tracks them by
recording their locations and content hashes in YAML files. Committing these
to Git along with ML source code creates reproducible project versions (no need
for special file naming schemes to identify data or model variants). The project
history becomes easy to review, rewind, and repeat.

![DVC-cached storage](/img/dvc_managed_storage.png) _DVC writes `.dvc` files
with YAML content next to large files. A data cache indexes them with `md5`
checksums. Mass storage holds all unique files pushed with DVC for backup or
sharing._

## How it works

<!--
![Versioning data with Git](/img/project_versioning.png) _You can use Git
history to store different datasets and model versions without renaming any
files in your workspace. The project cache grows as more relevant versions are
tracked._
-->

Let's consider a simple ML project that looks like this:

```
training.csv
validation.xml
model.bin
src/train.py
```

![]() _The first two data files are very large (multiple gigabytes). The model
file is not as large (several megabytes) but still large enough to avoid storing
it in Git. The `.py` code file (last) is safe to commit to Git (some
kilobytes)._

DVC appends unique large files to a hidden <abbr>cache</abbr>, organized by
content hashes (similar to an index). As the data changes, its full history can
be preserved this way, while preventing accidental file deletions.

```cli
.dvc/cache
├── 0a/aa77e # training.csv
├── 3f/db533 # validation.xml before
├── 6a/2aa4b # validation.xml now
├── a7/28107 # first model.bin
...
```
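
The layout above can be sketched in a few lines of Python. This is only an
illustration of content-addressed storage, not DVC's actual implementation: the
function names (`file_md5`, `cache_path`, `add_to_cache`) are hypothetical, and
the two-character directory split is assumed from the listing above.

```python
import hashlib
import shutil
from pathlib import Path


def file_md5(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def cache_path(cache_dir: Path, md5: str) -> Path:
    """Map a hash to <cache>/<first 2 hex chars>/<remaining chars>."""
    return cache_dir / md5[:2] / md5[2:]


def add_to_cache(path: Path, cache_dir: Path) -> str:
    """Copy a file into the cache unless identical content is already there."""
    md5 = file_md5(path)
    target = cache_path(cache_dir, md5)
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(path, target)
    return md5
```

Because the location depends only on the content, adding the same bytes twice
stores one object, while a changed file gets a new entry next to the old one.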

Now that they're cached safely, DVC-tracked files in your <abbr>workspace</abbr>
can be replaced with [file links], so you continue seeing and using them as
usual. File hashes (usually MD5) are written in human-readable YAML [metafiles]
next to the original data.

```git
training.csv -> .dvc/cache/0a/aa77e
+ training.csv.dvc
validation.xml -> .dvc/cache/6a/2aa4b
+ validation.xml.dvc
model.bin
src/train.py
```

```yaml
# validation.xml.dvc
outs:
  - md5: 6a2aa4b # Note: actual hashes are longer
    path: validation.xml
```
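
As a rough sketch of the linking step, the snippet below swaps a workspace file
for a symbolic link to its cached copy. The helper name `replace_with_link` and
the copy fallback are assumptions made for illustration; DVC supports several
link types and chooses among them according to its own configuration.

```python
import os
import shutil
from pathlib import Path


def replace_with_link(workspace_file: Path, cached_file: Path) -> str:
    """Point a workspace path at cached content; return the strategy used."""
    if workspace_file.is_symlink() or workspace_file.exists():
        workspace_file.unlink()
    try:
        os.symlink(cached_file.resolve(), workspace_file)
        return "symlink"
    except OSError:
        # Assumed fallback for file systems without symlink support.
        shutil.copy2(cached_file, workspace_file)
        return "copy"
```

Either way, code that opens `validation.xml` keeps working, since the path still
resolves to the same bytes.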

[metafiles]: /doc/user-guide/project-structure
[file links]: /doc/user-guide/data-management/large-dataset-optimization

<admon type="tip" title="Remote storage">

Data tracked by DVC can be stored in more than one location. You get a project
cache by default, but it's possible to synchronize all or parts of it with
[remote storage]. The same content-addressable file structure is used remotely
unless you enable [cloud versioning], which lets you see a similar directory
structure in your cloud buckets as in the local project.

[remote storage]: /doc/user-guide/data-management/remote-storage
[cloud versioning]: /doc/user-guide/data-management/cloud-versioning

</admon>
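
To make the synchronization idea concrete, here is a minimal sketch in which a
plain local directory stands in for remote storage. The `push` function is
hypothetical and only mirrors the content-addressed layout; real remotes are
configured and synced through DVC itself.

```python
import shutil
from pathlib import Path


def push(cache_dir: Path, remote_dir: Path) -> list:
    """Copy cache objects that the remote lacks; return their relative paths."""
    uploaded = []
    for obj in sorted(p for p in cache_dir.rglob("*") if p.is_file()):
        rel = obj.relative_to(cache_dir)
        dest = remote_dir / rel
        if not dest.exists():
            dest.parent.mkdir(parents=True, exist_ok=True)
            shutil.copy2(obj, dest)
            uploaded.append(rel.as_posix())
    return uploaded
```

Since objects are named by their hashes, checking whether a path exists is
enough to skip content the remote already holds, so a second `push` transfers
nothing.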

To keep track of relevant versions of the data, models, etc. cached by DVC, the
corresponding metafiles should be [versioned with Git] (or any SCM) along with
the rest of the code. This also means that a single file name can represent
different contents, keeping your project structure clean (use branches or tags
to organize data versions instead).

[versioned with git]:
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

```cli
$ git checkout dev-branch
$ dvc checkout
$ ls
training.csv 2 G # old data
model.bin 2.7 M # old model
src/train.py 214 K

$ git checkout latest-tag
$ dvc checkout
$ ls
training.csv 3 G # latest data
validation.xml 1 G
model.bin 3.2 M # better model
src/train.py 354 K
src/evaluate.py 175 K # more code
```
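
Conceptually, the second command in each pair re-points tracked files at the
cache objects named by the metafiles that Git just restored. The sketch below
illustrates that idea under simplifying assumptions: `checkout` is a
hypothetical helper, the `recorded` mapping stands in for hashes read from the
metafiles, and symbolic links are the only link type considered.

```python
import os
from pathlib import Path
from typing import Mapping


def checkout(workspace: Path, cache_dir: Path, recorded: Mapping[str, str]) -> None:
    """Make each tracked path link to the cache object named by its hash."""
    for rel_path, md5 in recorded.items():
        target = (cache_dir / md5[:2] / md5[2:]).resolve()
        if not target.exists():
            raise FileNotFoundError(f"{rel_path}: {md5} is missing from the cache")
        link = workspace / rel_path
        if link.is_symlink() or link.exists():
            link.unlink()
        link.parent.mkdir(parents=True, exist_ok=True)
        os.symlink(target, link)
```

Calling it with the hashes recorded on one branch and then with those of
another switches the contents behind the same file name without renaming
anything in the workspace.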

<admon type="info" title="Data codification">

DVC replaces data assets in the project with code-like YAML [metafiles] (and
links). Codifying data lets you treat it as a first-class citizen in any code
repository.

</admon>

<!-- ## Tradeoff

Adopting DVC's approach requires a few key changes to your workflow:
Member commented:

these changes are hard to understand if you are not familiar with the tool

we need to rethink and make them more explicit, down to the ground ...

Contributor Author @jorgeorpinel commented on Feb 17, 2023:

Created a separate Tradeoff section for now but it still has that problem (too long, not grounded)... Hopefully I can summarize and simplify it ⏳

Rel. #4042 (review) below.

1. Relevant data and models are registered in a code repository (typically Git).
1. Data operations (add, remove, move, etc.) happen [indirectly]: DVC checks the
metadata to locate files on both sides.
1. Stored objects managed with DVC are not meant to be handled manually.

[indirectly]: https://en.wikipedia.org/wiki/Indirection

At the same time, it comes with many benefits:

- Easily manage **data as code** and [optimize space usage][file links]
automatically.

Member commented:

Here and everywhere, review carefully these items:

Easily manage **data as code** and [optimize space usage][file links]

  • two benefits at once?
  • one is very abstract - data as code - it won't resonate
  • second is about optimization - fine, but it comes first in the list - is it the main thing?
  • I would need to follow the link to understand the benefit?

do we need the whole benefits section? maybe How it works can be enough as it describes how you change your workflow ... ?

not sure about the structure here tbh

- DVC keeps track of large files and directories for you, mapping them between
your <abbr>workspace</abbr> and storage.
- Easily share, distribute, and migrate data among one or more storage locations
([multiple providers supported]).
- Your <abbr>repository</abbr> stays small and easy to **collaborate** on (using
regular [Git workflows]).
- [Data versioning] guarantees ML **reproducibility**.
- Use a **consistent interface** to access and sync data anywhere (via [CLI],
[API], [IDE], or [web]), regardless of the storage platform (S3, GDrive, NAS,
etc.).
- Data **integrity** built on Git-based storage; data **security** through an
authored project history that can be audited.
- Advanced features: [Data registries], [ML pipelines], [CI/CD for ML],
[productize] your ML models, and more!

[multiple providers supported]:
/doc/command-reference/remote/add#supported-storage-types
[git workflows]:
https://git-scm.com/book/en/v2/Distributed-Git-Distributed-Workflows
[data versioning]: /doc/use-cases/versioning-data-and-models
[cli]: /doc/command-reference
[api]: /doc/api-reference
[ide]: /doc/vs-code-extension
[web]: /doc/studio
[data registries]: /doc/use-cases/data-registry
[ml pipelines]: /doc/user-guide/pipelines
[ci/cd for ml]: https://cml.dev/
[productize]: https://mlem.ai/

---

In summary, DVC establishes a mature method to manage data assets for ML
projects, letting you focus on more important tasks like exploration,
preparation, cross validation, etc.
-->
38 changes: 38 additions & 0 deletions content/docs/user-guide/data-management/storage-locations.md
@@ -0,0 +1,38 @@
# Storage locations
Contributor commented:

Not sure this needs its own page, but just noticed it is not included in the sidebar.


DVC can manage data anywhere: cloud storage, SSH servers, network resources
(e.g. NAS), mounted drives, local file systems, etc. These locations can be put
into three groups.

![Storage locations](/img/storage-locations.png) _Local, external, and remote
storage locations_

Every <abbr>DVC project</abbr> starts with two locations. The
<abbr>workspace</abbr> is the main project directory, containing your data,
models, source code, etc. DVC also creates a <abbr>data cache</abbr> (found
locally in `.dvc/cache` by default), which will be used as fast-access storage
for DVC operations.

<admon type="tip">

The cache can be moved to an external location in the file system or network,
for example to [share it] among several projects. It could even be set up on a
remote system (accessed over the Internet), but this is typically too slow for
working with data regularly.

</admon>

[share it]: /doc/user-guide/how-to/share-a-dvc-cache

DVC supports additional storage locations such as cloud services (Amazon S3,
Google Drive, Azure Blob Storage, etc.), SSH servers, network-attached storage,
etc. These are called [DVC remotes], and help you to share or back up copies of
your data assets.

<admon type="info">

DVC remotes are similar to Git remotes, but for <abbr>cached</abbr> data.

</admon>

[dvc remotes]: /doc/command-reference/remote
Binary file added static/img/direct_access_storage.png
Binary file added static/img/dvc_managed_storage.png
Binary file added static/img/project_versioning.png
Binary file added static/img/storage-locations.png