guide: Data Management #4042

Closed
wants to merge 88 commits into from
Commits
7350938
guide: draft structure of Data Mgmt and
jorgeorpinel Oct 13, 2022
203f6a6
guide: full text for draft intro to DM
jorgeorpinel Oct 14, 2022
90eaa5d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 17, 2022
eb246bb
guide: hide cloud versioning info
jorgeorpinel Oct 17, 2022
a3687ec
guide: clarify Data Mgmt parts and
jorgeorpinel Oct 18, 2022
fad0bad
guide: add figure drafts to Data Mgmt
jorgeorpinel Oct 19, 2022
4e3c3da
guide: SCM->VC (Data Mgmt)
jorgeorpinel Oct 19, 2022
7f02c15
guide: update 2 figs and add 1 more (Data Mgmt)
jorgeorpinel Oct 19, 2022
f41d16e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
3a9a045
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
c0b92f1
guide: roll back unrelated changes
jorgeorpinel Oct 21, 2022
c2303c0
guide: mention clouds first (DM) and
jorgeorpinel Oct 22, 2022
62997ab
guide: flatten DM index
jorgeorpinel Oct 22, 2022
fc74c53
guide: udpates to DM/ DV
jorgeorpinel Oct 22, 2022
8c40a03
guide: add DM/ Data Versioning page
jorgeorpinel Oct 22, 2022
1a8ca61
guide: update outdated link
jorgeorpinel Oct 22, 2022
27be87f
guide: revert more unrelatedly chaqnged files
jorgeorpinel Oct 22, 2022
aaee7af
guide: remove unused ref link
jorgeorpinel Oct 22, 2022
24c331a
guide: remove a comment
jorgeorpinel Oct 22, 2022
73e2f55
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 27, 2022
ec1af6d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 28, 2022
2f31bb6
guide: splits and notes around Data Mgmt index page
jorgeorpinel Oct 28, 2022
a84c442
guide: Data Mgmt intro + note updates
jorgeorpinel Oct 29, 2022
ab55389
guide: draft of all contents +
jorgeorpinel Oct 29, 2022
31d5288
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 1, 2022
a13f989
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 2, 2022
601c99e
guide: small impros to Data Mgmt
jorgeorpinel Nov 2, 2022
a8bad84
guide: rewrite Data Mgmt index in before/after form
jorgeorpinel Nov 3, 2022
c8cc17b
guide: add draft figure for Data Mgmt
jorgeorpinel Nov 4, 2022
3cb84cb
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 8, 2022
a13cb0f
guide: simplify/refocus data mgmt index
jorgeorpinel Nov 8, 2022
e3ba70b
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 17, 2022
c29d9ec
work around commented header bug
jorgeorpinel Nov 17, 2022
875fba3
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 23, 2022
831ad1d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 25, 2022
8ddda9c
guide: drop DM/ DV page
jorgeorpinel Nov 25, 2022
28322e5
guide: rewrite DM intro and
jorgeorpinel Nov 25, 2022
179d172
guide: use DM table instead of figure for now
jorgeorpinel Nov 25, 2022
d979a5e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 30, 2022
74bc156
guide: rewrite Data Mgmt story
jorgeorpinel Nov 30, 2022
e138096
guide: add draft figures to Data Mgmt
jorgeorpinel Nov 30, 2022
f904038
guide: simplify Data Mgmt story and benefits
jorgeorpinel Dec 1, 2022
e1772ea
guide: remove unused images (DM)
jorgeorpinel Dec 1, 2022
cc0390e
guide: update Data Mgmt figures (v1)
jorgeorpinel Dec 2, 2022
4ee3223
guide: rewrite text of Data Mgmt index
jorgeorpinel Dec 8, 2022
149599b
Merge branch 'main' of github.com:iterative/dvc.org into guide/data-m…
rogermparent Dec 8, 2022
f2acb66
guide: update Data Mgmt figures
jorgeorpinel Dec 8, 2022
723eb50
guide: iterate on Data Mgmt again
jorgeorpinel Dec 14, 2022
4b67b64
guide: update Data Mgmt figs
jorgeorpinel Dec 14, 2022
9eb7143
guide: more supporting info about Data Mgmt
jorgeorpinel Dec 18, 2022
e598839
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 21, 2022
dd4466e
guide: update figures (much more concrete) and
jorgeorpinel Dec 21, 2022
d637179
guide: edits to How it works (Data Mgmt)
jorgeorpinel Dec 21, 2022
c007817
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
5a0fd57
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
3eb81ff
guide: update Data Mgmt figures
jorgeorpinel Dec 22, 2022
98e73ff
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 23, 2022
67b1717
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 27, 2022
f3af183
guide: emphaisze dataset versions in UG fig 1
jorgeorpinel Dec 27, 2022
206ce77
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 4, 2023
075aaf3
guide: update Data Mgmt figures (with notes),
jorgeorpinel Jan 5, 2023
7377500
guide: more updates to text and figure styles,
jorgeorpinel Jan 5, 2023
baf5b4c
guide: update figures and text (Data Mgmt) ...
jorgeorpinel Jan 9, 2023
fb35df5
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 11, 2023
4475f78
guide: Data Management text (section 1)
jorgeorpinel Jan 11, 2023
20fbaae
guide: Data Management (main text)
jorgeorpinel Jan 11, 2023
1da7b8a
guide: Data Management (secondary text)
jorgeorpinel Jan 12, 2023
61e2865
Merge branch 'guide/data-mgmt-flows' of github.com:iterative/dvc.org …
jorgeorpinel Jan 12, 2023
ed63127
guide: add DVC data mgmt technical diagram &
jorgeorpinel Jan 12, 2023
0109cf3
guide: update Data Mgmt text
jorgeorpinel Jan 18, 2023
77330cc
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 18, 2023
956b03d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 19, 2023
7152ad3
guide: udpate text and 2nd figure (Data Mgmt)
jorgeorpinel Jan 19, 2023
f29da1e
guide: draft 2nd and 3rd figures
jorgeorpinel Jan 19, 2023
8f49a72
guide: rewrite Data Mgmt/ How it works &
jorgeorpinel Jan 20, 2023
f876c17
guide: update drafts of Data Mgmt figures 2, 3
jorgeorpinel Jan 20, 2023
ee3f721
guide: Data Mgmt improvements and
jorgeorpinel Jan 24, 2023
061a918
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 24, 2023
c10bda6
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 26, 2023
c3ca226
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Feb 17, 2023
d341645
guide: update Data Mgmt figures
jorgeorpinel Feb 17, 2023
311dd3c
guide: 2 typos
jorgeorpinel Feb 17, 2023
0299ebd
guide: Data Mgmt/ Tradeoff section
jorgeorpinel Feb 17, 2023
185f78d
guide: mention remote storage in Data Mgmt
jorgeorpinel Feb 17, 2023
22fde5a
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Feb 18, 2023
d1d54f6
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Feb 21, 2023
3ef82c1
guide: shorten Data Mgmt intro, hide...
jorgeorpinel Mar 24, 2023
cf11bd6
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Mar 24, 2023
40 changes: 27 additions & 13 deletions content/docs/command-reference/config.md
@@ -165,24 +165,31 @@ See `dvc remote add` and `dvc remote modify` for more information.
value is `cache`, that resolves to `.dvc/cache` (relative to the project
config file location).

> See also the helper command `dvc cache dir` to intuitively set this config
> option, properly transforming paths relative to the current working
> directory into paths relative to the config file location.
<admon type="tip">

See also the helper command `dvc cache dir` to intuitively set this config
option, properly transforming paths relative to the current working directory
into paths relative to the config file location.

</admon>

- `cache.type` - link type that DVC should use to link data files from cache to
the workspace. Possible values: `reflink`, `symlink`, `hardlink`, `copy` or an
ordered combination of those, separated by commas e.g:
`reflink,hardlink,copy`. Default: `reflink,copy`

<admon type="info">

There are pros and cons to different link types. Refer to [File link types]
for a full explanation of each one.

</admon>

If you set `cache.type` to `hardlink` or `symlink`, manually modifying tracked
data files in the workspace would corrupt the cache. To prevent this, DVC
automatically protects those kinds of links (making them read-only). Use
`dvc unprotect` to be able to modify them safely.

There are pros and cons to different link types. Refer to
[File link types](/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache)
for a full explanation of each one.

To apply changes to this config option in the workspace, restore all file
links/copies from cache with `dvc checkout --relink`.
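
The "ordered combination" of link types described above behaves like a fallback chain: each type is tried in turn until one succeeds. The sketch below illustrates that idea with the Python standard library only. It is not DVC's implementation; `LINKERS` and `link_from_cache` are hypothetical names, and `reflink` is skipped because the standard library has no portable call for it.

```python
import os
import shutil

# Hypothetical mapping from link-type names to stdlib calls. "reflink" is
# deliberately absent: the standard library has no portable reflink call.
LINKERS = {
    "hardlink": os.link,
    "symlink": os.symlink,
    "copy": shutil.copyfile,
}


def link_from_cache(src, dst, link_types=("reflink", "copy")):
    """Try each link type in order; return the name of the first that works."""
    for name in link_types:
        linker = LINKERS.get(name)
        if linker is None:
            continue  # unsupported in this sketch (e.g. "reflink")
        try:
            linker(src, dst)
            return name
        except OSError:
            continue  # e.g. cross-device hard link; try the next type
    raise OSError(f"none of {link_types!r} could link {src!r} to {dst!r}")
```

Note that a hard link or symlink shares the underlying data with `src`, so editing `dst` in place would also change the cached copy. That is the corruption risk the text above describes and the reason such links are made read-only.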

@@ -192,16 +199,23 @@ See `dvc remote add` and `dvc remote modify` for more information.
faster cache link types available than the defaults (`reflink,copy` – see
`cache.type`). Accepts values `true` and `false`.

> These warnings are automatically turned off when `cache.type` is manually
> set.
<admon type="info">

These warnings are automatically turned off when `cache.type` is manually set.

</admon>

- `cache.shared` - permissions for newly created or downloaded cache files and
directories. The only accepted value right now is `group`, which makes DVC use
`664` (rw-rw-r--) for files and `775` (rwxrwxr-x) for directories. This is
useful when [sharing a cache](/doc/user-guide/how-to/share-a-dvc-cache) among
projects. The default permissions for cache files is system dependent. In
Linux and macOS for example, they're determined using
[`os.umask`](https://docs.python.org/3/library/os.html#os.umask).
useful when [sharing a cache] among projects. The default permissions for
cache files are system dependent. On Linux and macOS, for example, they're
determined using [`os.umask`].

[file link types]:
/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache
[sharing a cache]: /doc/user-guide/how-to/share-a-dvc-cache
[`os.umask`]: https://docs.python.org/3/library/os.html#os.umask
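
To make the permission values concrete, here is a small sketch of how a process umask typically yields default modes on Linux and macOS, next to the fixed `group` modes quoted above. It illustrates standard POSIX behavior rather than DVC internals; `default_modes` and the two constants are hypothetical names.

```python
from __future__ import annotations

import stat

GROUP_FILE_MODE = 0o664  # rw-rw-r--, the file mode quoted for `group`
GROUP_DIR_MODE = 0o775   # rwxrwxr-x, the directory mode quoted for `group`


def default_modes(umask: int) -> tuple[int, int]:
    """Return the (file, directory) modes new entries usually get under umask."""
    return 0o666 & ~umask, 0o777 & ~umask


# With the common umask 022, group members lose write access by default.
file_mode, dir_mode = default_modes(0o022)
print(oct(file_mode), oct(dir_mode))

# Render the `group` modes in the familiar `ls -l` notation.
print(stat.filemode(stat.S_IFREG | GROUP_FILE_MODE))
print(stat.filemode(stat.S_IFDIR | GROUP_DIR_MODE))
```

A umask of `002` would produce the same modes as the `group` setting, which is why the default result depends on how each system and user is configured.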

The following parameters allow setting an
[external cache](/doc/user-guide/data-management/managing-external-data#setting-up-an-external-cache)
2 changes: 1 addition & 1 deletion content/docs/sidebar.json
@@ -122,7 +122,7 @@
},
{
"slug": "data-management",
"source": false,
"source": "data-management/index.md",
"children": [
"large-dataset-optimization",
"importing-external-data",
2 changes: 1 addition & 1 deletion content/docs/use-cases/versioning-data-and-models/index.md
@@ -36,7 +36,7 @@ learn how DVC looks and feels firsthand.

As you use DVC, unique versions of your data files and directories are
[cached](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
in a systematic way (preventing file duplication). The working datastore is
in a systematic way (preventing file duplication). The working data store is
separated from your <abbr>workspace</abbr> to keep the project light, but stays
connected via file
[links](/doc/user-guide/data-management/large-dataset-optimization#file-link-types-for-the-dvc-cache)
6 changes: 3 additions & 3 deletions content/docs/user-guide/basic-concepts/dvc-cache.md
@@ -3,7 +3,7 @@ name: 'DVC Cache'
match: ['DVC cache', cache, caches, cached, 'cache directory', caching]
tooltip: >-
The DVC cache is a hidden storage (by default in `.dvc/cache`) for files and
directories tracked by DVC, and their different versions. Learn more about its
structure
[here](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory).
directories tracked by DVC, and their different versions. For efficiency, it
uses a content-addressable
[structure](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory).
---
91 changes: 91 additions & 0 deletions content/docs/user-guide/data-management/index.md
@@ -0,0 +1,91 @@
# Data Management with DVC

DVC helps you manage and share arbitrarily large files, datasets, and ML models
anywhere: mounted drives, network resources (e.g. NAS), external devices, or
remotely (SSH servers, cloud storage, etc.). Once the project is configured, you
can manipulate files normally in your local workspace. DVC tracks, restores, and
synchronizes them across locations.

![]() _Local, external, and remote storage locations_

## The data cache

<abbr>DVC projects</abbr> separate data from code by replacing large files, data
artifacts, ML models, etc. in your <abbr>workspace</abbr> with small
[metafiles]. We call this process _codification_ (of the data). The actual file
contents are cached in an independent data store and linked to your project.

![]() _Separating code from data_

<admon type="info">

In order to [avoid duplicate content], and to support
[versioning features](#data-versioning), files and directories are reorganized
in the cache into a [content-addressable structure].

[avoid duplicate content]:
/doc/user-guide/data-management/large-dataset-optimization

</admon>

DVC expects fast storage access to the <abbr>cache</abbr>, so it's local to the
project by default (found in `.dvc/cache`). It can, however, be moved to an
external location in the file system or network, for example to [share it] among
several projects.

<admon type="tip">

A DVC cache could even be set up in a remote system and accessed through the
internet, but this is typically too slow for working with the data regularly.

</admon>

[metafiles]: /doc/user-guide/project-structure
[share it]: /doc/user-guide/how-to/share-a-dvc-cache
[content-addressable structure]:
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory

## Remote storage

Optionally, DVC supports additional storage locations such as cloud services
(Amazon S3, Google Drive, Azure Blob Storage, etc.), SSH servers, HDFS, and
others. [DVC remotes] are typically used to sync copies of all or some of your
datasets and models, for sharing or backup.

![]() _Distributed collaboration on DVC projects_

<admon type="info">

Remote storage uses the same [structure][content-addressable structure] as the
data cache.

</admon>

[dvc remotes]: /doc/command-reference/remote

## Data Versioning

DVC brings source code management (SCM) to data science. Specifically, the
metafiles in your repo can be handled with standard [Git workflows] (commits,
branching, pull requests, etc.). This way machine learning teams can apply
mature software engineering practices.

[git workflows]: https://www.atlassian.com/git/tutorials/comparing-workflows

<admon icon="book">

Refer to [Versioning Data and Models] to learn more.

[versioning data and models]: /doc/use-cases/versioning-data-and-models

</admon>

<!--
## Cloud versioning

_New in DVC 2.30.0 (see `dvc version`)_

To simplify remote data operations, DVC now supports native versioning of files
and directories on several cloud providers. This means that you can browse your
files normally as you would see them in your local workspace.
-->
74 changes: 46 additions & 28 deletions content/docs/user-guide/project-structure/internal-files.md
@@ -15,24 +15,26 @@ operation.
(credentials, private locations, etc). The local config file can be edited by
hand or with the command `dvc config --local`.

- `.dvc/cache`: Default location of the <abbr>cache</abbr> directory. The cache
stores the project data in a special
[structure](#structure-of-the-cache-directory). The data files and directories
in the <abbr>workspace</abbr> will only contain links to the data files in the
cache (refer to
[Large Dataset Optimization](/doc/user-guide/data-management/large-dataset-optimization).
See `dvc config cache` for related configuration options, including changing
its location.

> Note that DVC includes the cache directory in `.gitignore` during
> initialization. No data tracked by DVC should ever be pushed to the Git
> repository, only the <abbr>DVC files</abbr> that are needed to download or
> reproduce that data.
- `.dvc/cache`: Default location of the <abbr>cache directory</abbr>. The cache
stores the project data in a special content-addressable
[structure](#structure-of-the-cache-directory). The files and directories
visible in the <abbr>workspace</abbr> will typically be [links] to cached
data. See `dvc config cache` for related configuration options, including
changing this default location.

<admon type="info">

DVC includes the cache directory in `.gitignore` during [initialization]. No
data tracked by DVC should ever be pushed to the Git repository, only the
<abbr>DVC files</abbr> that are needed to download or reproduce that data.

[initialization]: /doc/command-reference/init

</admon>

- `.dvc/cache/runs`: Default location of the [run-cache](#run-cache).

- `.dvc/plots`: Directory for
[plot templates](/doc/user-guide/experiment-management/visualizing-plots#plot-templates-data-series-only)
- `.dvc/plots`: Directory for [plot templates]

- `.dvc/tmp`: Directory for miscellaneous temporary files

@@ -69,22 +71,34 @@ operation.
- `.dvc/tmp/exps`: This directory will contain workspace copies used for
temporary or [queued experiments].

[links]: /doc/user-guide/data-management/large-dataset-optimization
[plot templates]:
/doc/user-guide/experiment-management/visualizing-plots#plot-templates-data-series-only
[queued experiments]:
/doc/user-guide/experiment-management/running-experiments#the-experiments-queue

## Structure of the cache directory

The DVC cache is a
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
(by default in `.dvc/cache`), which adds a layer of indirection between code and
data.
The DVC cache is a [content-addressable storage] (found in `.dvc/cache` by
default), which adds a layer of [indirection] between code and data.

[content-addressable storage]:
https://en.wikipedia.org/wiki/Content-addressable_storage
[indirection]: https://en.wikipedia.org/wiki/Indirection

There are two ways in which the data is <abbr>cached</abbr>, depending on
whether it's a single file, or a directory (which may contain multiple files).
<admon type="info">

Note files are renamed, reorganized, and directory trees are flattened in the
cache, which always has exactly one depth level with 2-character directories
(based on hashes of the data contents, as explained next).
This structure is also used by [remote storage].

[remote storage]: /doc/user-guide/data-management#remote-storage

</admon>

There are two ways in which data is <abbr>cached</abbr>, depending on whether
it's a single file or a folder: Files are renamed and reorganized to prevent
duplication. Directory trees are flattened so that the cache always has exactly
one depth level containing 2-character directories (based on hashes of the data
contents, as explained next).

### Files

Expand All @@ -94,10 +108,14 @@ rest become the file name of the cached file. For example, if a data file has a
hash value of `ec1d2935f811b77cc49b031b999cbf17`, its path in the cache will be
`.dvc/cache/ec/1d2935f811b77cc49b031b999cbf17`.

> Note that file hashes are calculated from file contents only. 2 or more files
> with different names but the same contents can exist in the workspace and be
> tracked by DVC, but only one copy is stored in the cache. This helps avoid
> data duplication.
<admon type="info">

File hashes are calculated from file contents only. 2 or more files with
different names but the same contents can exist in the workspace and be tracked
by DVC, but only one copy is stored in the cache. This helps avoid data
duplication.

</admon>
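
As a rough illustration of the mapping described above, the following sketch derives a cache path from a hash value and shows why identical contents share one cache entry. It is not DVC's actual code. It assumes the hash is MD5, which is consistent with the 32-character example value; `hash_contents` and `cache_path_for` are hypothetical helper names.

```python
import hashlib
from pathlib import PurePosixPath


def hash_contents(data: bytes) -> str:
    """Hash file contents only; the file name plays no part (MD5 assumed)."""
    return hashlib.md5(data).hexdigest()


def cache_path_for(hash_value: str, cache_dir: str = ".dvc/cache") -> str:
    """The first 2 characters name the directory; the rest name the file."""
    return str(PurePosixPath(cache_dir) / hash_value[:2] / hash_value[2:])


# The example hash value from the text above.
print(cache_path_for("ec1d2935f811b77cc49b031b999cbf17"))

# Two workspace files with different names but identical contents produce
# the same hash, so they resolve to a single path in the cache.
first = cache_path_for(hash_contents(b"same bytes"))
second = cache_path_for(hash_contents(b"same bytes"))
print(first == second)
```

Because the path depends only on the contents, adding a renamed copy of a tracked file costs no extra cache space.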

### Directories
