diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index 144c0266c2..fcd91959bd 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -122,7 +122,7 @@
     },
     {
       "slug": "data-management",
-      "source": false,
+      "source": "data-management/index.md",
       "children": [
         "large-dataset-optimization",
         {
diff --git a/content/docs/user-guide/data-management/index.md b/content/docs/user-guide/data-management/index.md
new file mode 100644
index 0000000000..b610d1a1fa
--- /dev/null
+++ b/content/docs/user-guide/data-management/index.md
@@ -0,0 +1,188 @@
+# Data Management for Machine Learning

Where and how to store data and ML model files is one of the first decisions
your team will face, but traditional backup strategies do not fit the data
science lifecycle. Large files end up scattered across multiple buckets;
overlapping dataset versions coexist, causing data leakage and inefficient use
of space; and the project's evolution becomes harder to track. What was the
name of the best model? Is it safe to delete `2020-dset_v2.zip`? Can others
reproduce my results?

![Direct access storage](/img/direct_access_storage.png) _The S3 bucket on the
right is shared (and bloated) by several people and projects. You need to know
the exact location of the correct files, and use cloud-specific tools (e.g. the
AWS CLI) to access them directly._

To maintain control and visibility over all your data and models, DVC stores
large files and directories for you in a structured way. It tracks them by
logging their locations and unique descriptions in YAML files. Committing these
to Git along with the ML source code creates reproducible project versions (no
need for special file naming schemes to identify data or model variants). The
project history becomes easy to review, rewind, and repeat.

![DVC-cached storage](/img/dvc_managed_storage.png) _DVC writes `.dvc` files
with YAML content next to large files. A data cache indexes them with `md5`
checksums. Mass storage holds all unique files pushed with DVC for backup or
sharing._

## How it works

Let's consider a simple ML project that looks like this:

```
training.csv
validation.xml
model.bin
src/train.py
```

![]() _The first two data files are very large (multiple gigabytes). The model
file is not as large (several megabytes), but still too large to store
comfortably in Git. The `.py` code file (last) is safe to commit to Git (a few
kilobytes)._

DVC adds unique large files to a hidden cache, organized by
content hashes (similar to an index). As the data changes, its full history can
be preserved this way, while preventing accidental file deletions.

```cli
.dvc/cache
├── 0a/aa77e # training.csv
├── 3f/db533 # validation.xml before
├── 6a/2aa4b # validation.xml now
├── a7/28107 # first model.bin
 ...
```

Now that they're cached safely, DVC-tracked files in your workspace
can be replaced with [file links], so you continue seeing and using them as
usual. File hashes (usually MD5) are written in human-readable YAML [metafiles]
next to the original data.
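For example, the large files above would typically be tracked with `dvc add` (a
minimal sketch; the file names come from the example project, and the command
output is omitted):

```cli
$ dvc add training.csv validation.xml
$ git add *.dvc .gitignore    # commit the small metafiles, not the data itself
$ git commit -m "Track large data files with DVC"
```

Each tracked file gets a small `.dvc` metafile, while its contents move into
the cache (leaving a link behind). The workspace then looks like this: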
```git
 training.csv -> .dvc/cache/0a/aa77e
+ training.csv.dvc
 validation.xml -> .dvc/cache/6a/2aa4b
+ validation.xml.dvc
 model.bin
 src/train.py
```

```yaml
# validation.xml.dvc
md5: 6a2aa4b # Note: actual hashes are longer
path: validation.xml
```

[metafiles]: /doc/user-guide/project-structure
[file links]: /doc/user-guide/data-management/large-dataset-optimization

Data tracked by DVC can be stored in more than one location. You get a project
cache by default, but it's possible to synchronize all or part of it with
[remote storage]. The same content-addressable file structure is used remotely,
unless you enable [cloud versioning], which keeps a directory structure in your
cloud buckets similar to the one in your local project.

[remote storage]: /doc/user-guide/data-management/remote-storage
[cloud versioning]: /doc/user-guide/data-management/cloud-versioning

To keep track of the relevant versions of the data, models, etc. cached by DVC,
the corresponding metafiles should be [versioned with Git] (or any SCM) along
with the rest of the code. This also means that a single file name can
represent different contents over time, keeping your project structure clean
(use branches or tags to organize data versions instead).

[versioned with git]:
  https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control

```cli
$ git checkout dev-branch
$ dvc checkout
$ ls
 training.csv 2 G # old data
 model.bin 2.7 M # old model
 src/train.py 214 K

$ git checkout latest-tag
$ dvc checkout
$ ls
 training.csv 3 G # latest data
 validation.xml 1 G
 model.bin 3.2 M # better model
 src/train.py 354 K
 src/evaluate.py 175 K # more code
```

DVC replaces data assets in the project with code-like YAML [metafiles] (and
links). Codifying data lets you treat it as a first-class citizen in any code
repository.

diff --git a/content/docs/user-guide/data-management/storage-locations.md b/content/docs/user-guide/data-management/storage-locations.md
new file mode 100644
index 0000000000..9f780e7813
--- /dev/null
+++ b/content/docs/user-guide/data-management/storage-locations.md
@@ -0,0 +1,38 @@
+# Storage locations

DVC can manage data anywhere: cloud storage, SSH servers, network resources
(e.g. NAS), mounted drives, local file systems, etc. These locations can be
grouped into three categories.

![Storage locations](/img/storage-locations.png) _Local, external, and remote
storage locations_

Every DVC project starts with two locations. The workspace is the main project
directory, containing your data, models, source code, etc. DVC also creates a
data cache (found locally in `.dvc/cache` by default), which is used as
fast-access storage for DVC operations.

The cache can be moved to an external location in the file system or network,
for example to [share it] among several projects. It could even be set up on a
remote system (accessed over the Internet), but this is typically too slow for
working with data regularly.

[share it]: /doc/user-guide/how-to/share-a-dvc-cache

DVC supports additional storage locations such as cloud services (Amazon S3,
Google Drive, Azure Blob Storage, etc.), SSH servers, and network-attached
storage. These are called [DVC remotes], and they help you share or back up
copies of your data assets.

DVC remotes are similar to Git remotes, but for cached data.
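As a rough sketch of how a remote is typically configured and used (the remote
name and S3 URL below are placeholders, not part of this guide's example
project):

```cli
$ dvc remote add -d myremote s3://mybucket/dvcstore   # register a default remote (placeholder URL)
$ dvc push                                            # upload cached data to the remote
$ dvc pull                                            # download it in another copy of the project
```

Much like `git push` and `git pull` move commits between repositories,
`dvc push` and `dvc pull` move cached data between the local cache and the
remote.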
[dvc remotes]: /doc/command-reference/remote

diff --git a/static/img/direct_access_storage.png b/static/img/direct_access_storage.png
new file mode 100644
index 0000000000..75b57231ae
Binary files /dev/null and b/static/img/direct_access_storage.png differ
diff --git a/static/img/dvc_managed_storage.png b/static/img/dvc_managed_storage.png
new file mode 100644
index 0000000000..66aa85b9d1
Binary files /dev/null and b/static/img/dvc_managed_storage.png differ
diff --git a/static/img/project_versioning.png b/static/img/project_versioning.png
new file mode 100644
index 0000000000..d144c483f3
Binary files /dev/null and b/static/img/project_versioning.png differ
diff --git a/static/img/storage-locations.png b/static/img/storage-locations.png
new file mode 100644
index 0000000000..92fa9c7630
Binary files /dev/null and b/static/img/storage-locations.png differ