diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index bcc33aceb7..12d12a9e67 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -123,6 +123,10 @@ "slug": "data-management", "source": "data-management/index.md", "children": [ + { + "label": "Track & Sync Versioned Data", + "slug": "track-sync-data" + }, "large-dataset-optimization", "remote-storage", "cloud-versioning", diff --git a/content/docs/start/data-management/data-versioning.md b/content/docs/start/data-management/data-versioning.md index 687b97496e..ad92aaae92 100644 --- a/content/docs/start/data-management/data-versioning.md +++ b/content/docs/start/data-management/data-versioning.md @@ -172,6 +172,9 @@ set up earlier. The remote storage directory should look like this:    └── a1a2931c8370d3aeedd7183606fd7f ``` +Learn more about +[storage synchronization](/doc/user-guide/data-management/track-sync-data#synchronizing-data). + ## Retrieving diff --git a/content/docs/user-guide/data-management/cloud-versioning.md b/content/docs/user-guide/data-management/cloud-versioning.md index 8278ea3b3d..ed46aea9a7 100644 --- a/content/docs/user-guide/data-management/cloud-versioning.md +++ b/content/docs/user-guide/data-management/cloud-versioning.md @@ -30,19 +30,22 @@ benefits of content-addressable storage. ### Expand for more details on the differences between cloud versioned and content-addressable storage -`dvc remote` storage normally uses -[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) -to organize versioned data. Different versions of files are stored in the remote -according to hash of their data content instead of according to their original -filenames and directory location. This allows DVC to optimize certain remote -storage lookup and data sync operations, and provides data de-duplication at the -file level. 
However, this comes with the drawback of losing human-readable -filenames without the use of the DVC CLI (`dvc get --show-url`) or API -(`dvc.api.get_url()`). +`dvc remote` storage normally uses [content-addressable storage] to organize +versioned data. Different versions of files are stored in the remote according +to a hash of their data contents instead of using their original filenames and +directory location. This allows DVC to optimize certain remote storage lookup +and [data sync operations], and provides data de-duplication at the file level. +However, this comes with the drawback of losing human-readable filenames without +the use of the DVC CLI (`dvc get --show-url`) or API (`dvc.api.get_url()`). When using cloud versioning, DVC does not provide de-duplication, and certain remote storage performance optimizations will be unavailable. +[content-addressable storage]: + /doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory +[data sync operations]: + /doc/user-guide/data-management/track-sync-data#synchronizing-data + ## Supported storage providers diff --git a/content/docs/user-guide/data-management/index.md b/content/docs/user-guide/data-management/index.md index f0fc87c7d3..1ca18c83a2 100644 --- a/content/docs/user-guide/data-management/index.md +++ b/content/docs/user-guide/data-management/index.md @@ -158,10 +158,10 @@ At the same time, it comes with many benefits: - Your repository stays small and easy **collaborate** on (using regular [Git workflows]). - [Data versioning] guarantees ML **reproducibility**. -- Use a **consistent interface** to access and sync data anywhere (via [CLI], +- Use a **consistent interface** to access and [sync data] anywhere (via [CLI], [API], [IDE], or [web]), regardless of the storage platform (S3, GDrive, NAS, etc.). 
-- Data **integrity** based on a Git-based storage; Data **security** through an +- Data **integrity** based on Git-based storage; Data **security** through an authored project history that can be audited. - Advanced features: [Data registries], [ML pipelines], [CI/CD for ML], [productize] your ML models, and more! @@ -171,6 +171,7 @@ At the same time, it comes with many benefits: [git workflows]: https://git-scm.com/book/en/v2/Distributed-Git-Distributed-Workflows [data versioning]: /doc/use-cases/versioning-data-and-models +[sync data]: /doc/user-guide/data-management/track-sync-data#synchronizing-data [cli]: /doc/command-reference [api]: /doc/api-reference [ide]: /doc/vs-code-extension diff --git a/content/docs/user-guide/data-management/remote-storage.md b/content/docs/user-guide/data-management/remote-storage.md index a4457ff5e6..d2604441e1 100644 --- a/content/docs/user-guide/data-management/remote-storage.md +++ b/content/docs/user-guide/data-management/remote-storage.md @@ -20,11 +20,14 @@ wide variety of [storage types](#supported-storage-types). The main uses of remote storage are: -- Synchronize DVC-tracked data (previously cached). +- [Synchronize] DVC-tracked data (previously cached). - Centralize or distribute large file storage for sharing and collaboration. - Back up different versions of your data and models. - Save space in your working environment (by deleting pushed files/directories). 
+[synchronize]: + /doc/user-guide/data-management/track-sync-data#synchronizing-data + ## Configuration You can set up one or more remote storage locations, mainly with the diff --git a/content/docs/user-guide/data-management/track-sync-data.md b/content/docs/user-guide/data-management/track-sync-data.md new file mode 100644 index 0000000000..a4c9e9361c --- /dev/null +++ b/content/docs/user-guide/data-management/track-sync-data.md @@ -0,0 +1,164 @@ +# Track and Sync Versioned Data & Models + +The fundamental workflow of most DVC projects includes the +following **basic operations**. These can be performed directly (as we cover +here) but are sometimes included automatically in advanced workflows, like +[pipelining] and [experiment management]. + +[pipelining]: /doc/user-guide/pipelines +[experiment management]: /doc/user-guide/experiment-management + +## Tracking data + +DVC is [similar to Git] here. To start tracking large files or directories (e.g. +data or machine learning models), "add" them to DVC with the `dvc add` command. +This caches the files and [links them] back to the +workspace (hiding them from Git). A matching `.dvc` file is +created. + +To capture changes to tracked data, `dvc add` them again (`dvc commit` will also +do the trick). This caches the latest file contents and updates `.dvc` metafiles +accordingly. + +[similar to git]: + https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository +[links them]: /doc/user-guide/data-management/large-dataset-optimization + + + +`.dvc` and other [metafiles] can be tracked (and [versioned](#versioning-data)) +with Git. + +[metafiles]: /doc/user-guide/project-structure + + + +If you need to move or rename tracked data, use `dvc move`. To stop tracking it, +use `dvc remove`. To also remove it from the cache, use `dvc gc`. See [more +details]. + +To wrap up, you can get an overview of DVC-tracked assets with +`dvc data status`. 
This will list changes to tracked files and directories as +well as files unknown to DVC (or Git): + +```cli +$ dvc data status +Not in cache: + tmp/ + +DVC committed changes: + added: data.xml + modified: data/features/ + +DVC uncommitted changes: + deleted: model.pkl +``` + +[more details]: /doc/user-guide/how-to/stop-tracking-data + + + +Other related commands: `dvc status`, `dvc list`, `dvc import`, +`dvc import-url`, `dvc unprotect`. + + + +## Synchronizing data + +DVC lets you [codify your data][data versioning] and ML models, configure the +project's storage location(s), and stop worrying about low-level file operations +like copying, moving, renaming, uploading, etc. + +At a minimum, you'll have one data store: the project's cache. +[Data-tracking](#tracking-data) operations already keep it in sync with your +workspace most of the time. + + + +`dvc commit` and `dvc checkout` let you force-sync them if needed, for example +if unexpected errors occur (e.g. cache corruption). + + + +[data versioning]: /doc/use-cases/versioning-data-and-models + +To add storage locations to share and back up your work, you can configure [DVC +remotes] using `dvc remote` commands (more on their [configuration]). Once this +is done, use `dvc push` and `dvc pull` (among others) to transfer data between +the project and remote storage. + +[dvc remotes]: /doc/user-guide/data-management/remote-storage +[configuration]: /doc/user-guide/data-management/remote-storage#configuration + +![Sync ops among locations](/img/sync-ops-locations.png) _Data sync operations +among locations_ + + + +`dvc fetch` transfers files downstream halfway -- from remote storage to the +cache. This can be useful to make sure that some data is available +for checkout later. + + + +A more advanced strategy is to access and synchronize data assets directly from +misc. locations or other DVC projects (e.g. [data registry] pattern). 
See +`dvc list`, `dvc import`/`dvc import-url`, and `dvc update`, as well as the +[Python API]. + +[protected]: /doc/command-reference/unprotect +[data registry]: /doc/use-cases/data-registry +[python api]: /doc/api-reference + +## Versioning data + +Many `dvc` commands give out hints about `git` commands to follow them with. +This helps you complete the [data versioning] side of the operation (if needed). + +![Versioning flow](/img/flow.png) _DVC metafiles represent your data and models +in the Git repo, while large files are stored in the cache (and/or remote +storage) and linked to your workspace._ + +Some common sequences: + +- Check the `dvc data status` (or `dvc status`) before deciding what changes to + track with Git. +- `dvc add` (or `dvc commit`) your data and then `git add` and `git commit` the + resulting DVC metafiles. This registers DVC-tracked files with Git indirectly + (without storing them in the Git repo). +- After you `git push` project versions associated with new or changed data, you + may want to `dvc push` those data updates to a [DVC remote][dvc remotes]. +- `git checkout` to switch project versions (commits, branches, etc.) and then + `dvc checkout` to get the corresponding large files tracked by DVC into your + workspace. +- `git clone` or `git pull` a DVC repository (e.g. to get others' + contributions), and then `dvc pull` the matching data files. + + + +Some of these are so common that DVC provides the `dvc install` helper command +to set up [certain Git hooks] that automate them. + +[certain git hooks]: /doc/command-reference/install#installed-git-hooks + + + +Managing multiple versions of data or models (including their training +parameters and performance metrics) with Git is great, but sometimes requires +navigation aids. DVC provides comparison commands like `dvc diff` (similar to +`git diff`) to help with this. See also `dvc params diff`, `dvc metrics diff`, +and `dvc plots diff`.
+ + + +Another neat feature of some DVC commands is the `--rev` ([revision]) option. +This lets you specify a version of the project to operate from. For example, +`dvc import --rev a17b8fd` can import data associated with the source project +commit `a17b8fd`. Other commands with `--rev`: `dvc gc`, `dvc list`, etc. + + + +[git branches]: + https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging +[tags]: https://git-scm.com/book/en/v2/Git-Basics-Tagging +[revision]: https://git-scm.com/docs/revisions diff --git a/content/docs/user-guide/experiment-management/sharing-experiments.md b/content/docs/user-guide/experiment-management/sharing-experiments.md index edbaf27cad..a71b743138 100644 --- a/content/docs/user-guide/experiment-management/sharing-experiments.md +++ b/content/docs/user-guide/experiment-management/sharing-experiments.md @@ -4,7 +4,7 @@ In a regular Git workflow, DVC repository versions are typically synchronized among team members. And [DVC Experiments] are internally connected to this commit history, so you can similarly share them. -## Basic workflow: store as peristent commits +## Basic workflow: store as persistent commits The most straightforward way to share experiments is to store them as [persistent](/doc/user-guide/experiment-management/persisting-experiments) Git diff --git a/static/img/sync-ops-locations.png b/static/img/sync-ops-locations.png new file mode 100644 index 0000000000..c8573f36e2 Binary files /dev/null and b/static/img/sync-ops-locations.png differ