diff --git a/content/docs/user-guide/data-management/index.md b/content/docs/user-guide/data-management/index.md index d964885d8a..62f4f69e8d 100644 --- a/content/docs/user-guide/data-management/index.md +++ b/content/docs/user-guide/data-management/index.md @@ -6,53 +6,71 @@ permissions on cloud storage, sync tools and schedules, back up snapshots, etc. and focus on machine learning. You work with data normally in a local workspace. DVC tracks, -restores, and synchronize everything with a few, straightforward commands -(similar to Git) that do not change regardless of the underlying file systems, -transfer protocols, etc. +restores, and synchronizes everything with a few straightforward operations that +do not change regardless of the underlying file systems, transfer protocols, +etc. -![]() _Separating data from code_ +![]() _Separating data from code (codification)_ -To achieve this, DVC relies on data _codification_: replacing large files and -directories with small [metafiles] that describe the assets. Data files are -moved to a separate cache but kept virtually (linked) in the -workspace. This **separates your data from code** (including metafiles). +
- +## Click to learn more about data _codification_ -This also allows you to [version] all project files with Git, a battle-tested -[SCM] tool. +To achieve this, DVC replaces large files and directories with small [metafiles] +that describe the assets. Data files are moved to a separate cache +but kept virtually (linked) in the workspace. This separates your data from code +(including metafiles). - + -DVC operations stay the same because they work [indirectly], by going through -the metafiles and [configuration] of your project to find out where -and how to handle files. This is transparent to you as user, but it's important -to understand the mechanics in general. +This also allows you to [version] project files with Git, a battle-tested [SCM] +tool. -## Workflow and benefits + - +[version]: /doc/user-guide/data-management/data-versioning +[scm]: https://www.atlassian.com/git/tutorials/source-code-management -... +
- +Your experience can stay consistent because DVC works [indirectly], by checking +the [metafiles] and [configuration] of your project to find out +where and how to handle files. This is transparent to you as a user, but it's +important to understand the mechanics in general. [metafiles]: /doc/user-guide/project-structure [indirectly]: https://en.wikipedia.org/wiki/Indirection [configuration]: /doc/command-reference/config -[version]: /doc/user-guide/data-management/data-versioning -[scm]: https://www.atlassian.com/git/tutorials/source-code-management -## Storage locations +## Workflow and benefits + +**Before**: Files are scattered in the cloud; You use low-level operations +specific to each storage (e.g. `aws s3 cp`); Ad hoc file names are used to save +versions; It's easy to lose track of which data produced which results; Everyone +can read and write. + +**After**: Stored objects are organized by DVC and you don't need to touch them +directly; DVC exposes a few commands to manage them; Everything happens +through a code repository that can be controlled with Git; Project versions (Git +commits) guarantee reproducibility of ML processes (e.g. training models with +the same datasets, hyperparameters, features, etc.). + + - +**Benefits**: You always work with project-specific paths; Efficient usage of +storage space (file deduplication); Small repository; [Data versioning]; [Fast +caching]; [GitOps]. + +[data versioning]: /doc/use-cases/versioning-data-and-models +[fast caching]: /doc/use-cases/fast-data-caching-hub +[gitops]: https://www.gitops.tech/ + +## Storage locations DVC can manage data anywhere: cloud storage, SSH servers, network resources (e.g. NAS), mounted drives, local file systems, etc. These locations can be separated into three groups. - - ![Storage locations](/img/storage-locations.png) _Local, external, and remote storage locations_