guide: Basic Operations (Data Mgmt) #4053

Closed
wants to merge 39 commits into from
39 commits
3df841e
guide: add DM/ Basic Ops intro + struct
jorgeorpinel Oct 19, 2022
6e5450e
guide: Tracking data guide and
jorgeorpinel Oct 19, 2022
9a8ce22
guide: Traking updates, intros for Sync and Version (Data Mgmt)
jorgeorpinel Oct 19, 2022
afb38ee
guide: complete Sync inc. figure (Data Mgmt)
jorgeorpinel Oct 19, 2022
9fc77a9
Merge branch 'main' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 20, 2022
314c953
guide" fixes and import/update flow to DM/Sync
jorgeorpinel Oct 20, 2022
a86325e
guide: More Data Versioning info. and
jorgeorpinel Oct 20, 2022
bc07653
guide: complete Versioning info (DM/ Basic Ops)
jorgeorpinel Oct 20, 2022
ac0a555
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 20, 2022
baf97cc
guide: typo fix
jorgeorpinel Oct 20, 2022
766f329
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 20, 2022
fe3208c
ref: simplify remote index (move to guide)
jorgeorpinel Oct 21, 2022
83834cd
ref: link from remote index to DM/ Ops/ Sync and
jorgeorpinel Oct 21, 2022
a45c9d6
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
561ff06
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
39e2965
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
e77f5f1
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
c1ed918
guide: rename DM/ TSV -> TSVD and
jorgeorpinel Oct 22, 2022
69f7de5
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
ec7f2c1
guide: DM/ TSVD - data codification
jorgeorpinel Oct 22, 2022
032cfef
guide: remove comment
jorgeorpinel Oct 22, 2022
903632a
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
5335be2
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 22, 2022
1430354
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 24, 2022
af6fec3
ref: roll back changes to remote index which
jorgeorpinel Oct 24, 2022
664f68a
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Oct 27, 2022
13ba7ae
guide: don't call `checkout` "plumbing"
jorgeorpinel Oct 27, 2022
dba203e
guide: don't call remote storage "additional"
jorgeorpinel Oct 27, 2022
e6a9ebf
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Feb 18, 2023
2916cf3
nav: remove unexistent page (per previous merge)
jorgeorpinel Feb 18, 2023
e2fdc09
typo
jorgeorpinel Feb 18, 2023
6115f72
guide: Basic Ops/ Tracking Data
jorgeorpinel Feb 18, 2023
315cf47
Remove unrelated changes...
jorgeorpinel Feb 18, 2023
272bc07
guide: Data Mgmt/ Basic Ops/ Sync
jorgeorpinel Feb 18, 2023
8fa58bc
fix link
jorgeorpinel Feb 18, 2023
d0ac5b9
guide: Data Mgmt/ Versioning + mention ML models more
jorgeorpinel Feb 20, 2023
5397f46
guide: link from Remote Storage to Basic Ops/ Sync
jorgeorpinel Feb 20, 2023
2c38ee4
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/basic-ops
jorgeorpinel Feb 21, 2023
4a11e04
guide: more links to cache/remote sync
jorgeorpinel Feb 21, 2023
4 changes: 4 additions & 0 deletions content/docs/sidebar.json
@@ -123,6 +123,10 @@
"slug": "data-management",
"source": "data-management/index.md",
"children": [
+{
+"label": "Track & Sync Versioned Data",
+"slug": "track-sync-data"
+},
"large-dataset-optimization",
"remote-storage",
"cloud-versioning",
3 changes: 3 additions & 0 deletions content/docs/start/data-management/data-versioning.md
@@ -172,6 +172,9 @@ set up earlier. The remote storage directory should look like this:
   └── a1a2931c8370d3aeedd7183606fd7f
```

+Learn more about
+[storage synchronization](/doc/user-guide/data-management/track-sync-data#synchronizing-data).
+
</details>

## Retrieving
21 changes: 12 additions & 9 deletions content/docs/user-guide/data-management/cloud-versioning.md
@@ -30,19 +30,22 @@ benefits of content-addressable storage.

### Expand for more details on the differences between cloud versioned and content-addressable storage

-`dvc remote` storage normally uses
-[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
-to organize versioned data. Different versions of files are stored in the remote
-according to hash of their data content instead of according to their original
-filenames and directory location. This allows DVC to optimize certain remote
-storage lookup and data sync operations, and provides data de-duplication at the
-file level. However, this comes with the drawback of losing human-readable
-filenames without the use of the DVC CLI (`dvc get --show-url`) or API
-(`dvc.api.get_url()`).
+`dvc remote` storage normally uses [content-addressable storage] to organize
+versioned data. Different versions of files are stored in the remote according
+to a hash of their data contents instead of using their original filenames and
+directory location. This allows DVC to optimize certain remote storage lookup
+and [data sync operations], and provides data de-duplication at the file level.
+However, this comes with the drawback of losing human-readable filenames without
+the use of the DVC CLI (`dvc get --show-url`) or API (`dvc.api.get_url()`).

When using cloud versioning, DVC does not provide de-duplication, and certain
remote storage performance optimizations will be unavailable.

+[content-addressable storage]:
+/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
+[data sync operations]:
+/doc/user-guide/data-management/track-sync-data#synchronizing-data
Comment on lines +37 to +47
Contributor Author

Most other changes just add links to the new page (mainly to the Sync section).


</details>

## Supported storage providers
5 changes: 3 additions & 2 deletions content/docs/user-guide/data-management/index.md
@@ -158,10 +158,10 @@ At the same time, it comes with many benefits:
- Your <abbr>repository</abbr> stays small and easy to **collaborate** on (using
regular [Git workflows]).
- [Data versioning] guarantees ML **reproducibility**.
-- Use a **consistent interface** to access and sync data anywhere (via [CLI],
+- Use a **consistent interface** to access and [sync data] anywhere (via [CLI],
[API], [IDE], or [web]), regardless of the storage platform (S3, GDrive, NAS,
etc.).
-- Data **integrity** based on a Git-based storage; Data **security** through an
+- Data **integrity** based on Git-based storage; Data **security** through an
authored project history that can be audited.
- Advanced features: [Data registries], [ML pipelines], [CI/CD for ML],
[productize] your ML models, and more!
Expand All @@ -171,6 +171,7 @@ At the same time, it comes with many benefits:
[git workflows]:
https://git-scm.com/book/en/v2/Distributed-Git-Distributed-Workflows
[data versioning]: /doc/use-cases/versioning-data-and-models
+[sync data]: /doc/user-guide/data-management/track-sync-data#synchronizing-data
[cli]: /doc/command-reference
[api]: /doc/api-reference
[ide]: /doc/vs-code-extension
5 changes: 4 additions & 1 deletion content/docs/user-guide/data-management/remote-storage.md
@@ -20,11 +20,14 @@ wide variety of [storage types](#supported-storage-types).

The main uses of remote storage are:

-- Synchronize DVC-tracked data (previously <abbr>cached</abbr>).
+- [Synchronize] DVC-tracked data (previously <abbr>cached</abbr>).
- Centralize or distribute large file storage for sharing and collaboration.
- Back up different versions of your data and models.
- Save space in your working environment (by deleting pushed files/directories).

+[synchronize]:
+/doc/user-guide/data-management/track-sync-data#synchronizing-data
+
## Configuration

You can set up one or more remote storage locations, mainly with the
164 changes: 164 additions & 0 deletions content/docs/user-guide/data-management/track-sync-data.md
@@ -0,0 +1,164 @@
# Track and Sync Versioned Data & Models

The fundamental workflow of most <abbr>DVC projects</abbr> includes the
following **basic operations**. These can be performed directly (as we cover
Comment on lines +1 to +4
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the main file being contributed.

here) but are sometimes included automatically in advanced workflows, like
[pipelining] and [experiment management].

[pipelining]: /doc/user-guide/pipelines
[experiment management]: /doc/user-guide/experiment-management

## Tracking data

DVC is [similar to Git] here. To start tracking large files or directories (e.g.
data or machine learning models), "add" them to DVC with the `dvc add` command.
This <abbr>caches</abbr> the files and [links them] back to the
<abbr>workspace</abbr> (hiding them from Git). A matching `.dvc` file is
created.

To capture changes to tracked data, `dvc add` them again (`dvc commit` will also
do the trick). This caches the latest file contents and updates `.dvc` metafiles
accordingly.
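
To make the idea of caching more concrete, here is a minimal Python sketch of content-addressed tracking. It is an illustration only, not DVC's implementation: the `track` helper, the `.cache` directory name, and the metafile text are assumptions, and the sketch copies files rather than linking them.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def track(path: Path, cache_dir: Path) -> Path:
    """Cache a file by content hash and write a small metafile next to it."""
    digest = hashlib.md5(path.read_bytes()).hexdigest()
    # Objects live under <first 2 hex chars>/<remaining 30 chars>, so identical
    # contents always map to the same cache entry (file-level de-duplication).
    target = cache_dir / digest[:2] / digest[2:]
    if not target.exists():
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copyfile(path, target)
    # Hypothetical metafile text, loosely modeled on a `.dvc` file.
    meta = path.with_name(path.name + ".dvc")
    meta.write_text(f"outs:\n- md5: {digest}\n  path: {path.name}\n")
    return target


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    cache = workspace / ".cache"
    (workspace / "a.txt").write_text("same bytes")
    (workspace / "b.txt").write_text("same bytes")

    first = track(workspace / "a.txt", cache)
    second = track(workspace / "b.txt", cache)

    # Two files with identical contents share a single cached object.
    same_object = first == second
    cached_objects = [p for p in cache.rglob("*") if p.is_file()]
```

Calling `track` again after a file's contents change would store a new object under a different hash and rewrite the metafile, which loosely mirrors what re-adding a tracked file does.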

[similar to git]:
https://git-scm.com/book/en/v2/Git-Basics-Recording-Changes-to-the-Repository
[links them]: /doc/user-guide/data-management/large-dataset-optimization

<admon type="info">

`.dvc` and other [metafiles] can be tracked (and [versioned](#versioning-data))
with Git.

[metafiles]: /doc/user-guide/project-structure

</admon>

If you need to move or rename tracked data, use `dvc move`. To stop tracking it,
use `dvc remove`. To also remove it from the cache, use `dvc gc`. See [more
details].

To wrap up, you can get an overview of DVC-tracked assets with
`dvc data status`. This will list changes to tracked files and directories as
well as files unknown to DVC (or Git):

```cli
$ dvc data status
Not in cache:
        tmp/

DVC committed changes:
        added: data.xml
        modified: data/features/

DVC uncommitted changes:
        deleted: model.pkl
```

[more details]: /doc/user-guide/how-to/stop-tracking-data

<admon type="tip">

Other related commands: `dvc status`, `dvc list`, `dvc import`,
`dvc import-url`, `dvc unprotect`.

</admon>

## Synchronizing data

DVC lets you [codify your data][data versioning] and ML models, configure the
project's storage location(s), and stop worrying about low-level file operations
like copying, moving, renaming, uploading, etc.

At a minimum, you'll have one data store: the project's <abbr>cache</abbr>.
[Data-tracking](#tracking-data) operations already keep it in sync with your
<abbr>workspace</abbr> most of the time.

<admon type="tip">

`dvc commit` and `dvc checkout` let you force-sync them if needed, for example
if unexpected errors occur (e.g. cache corruption).

</admon>

[data versioning]: /doc/use-cases/versioning-data-and-models

To add storage locations to share and back up your work, you can configure [DVC
remotes] using `dvc remote` commands (more on their [configuration]). Once this
is done, use `dvc push` and `dvc pull` (among others) to transfer data between
the project and remote storage.

[dvc remotes]: /doc/user-guide/data-management/remote-storage
[configuration]: /doc/user-guide/data-management/remote-storage#configuration

![Sync ops among locations](/img/sync-ops-locations.png) _Data sync operations
among locations_

<admon type="tip">

`dvc fetch` transfers files downstream halfway -- from remote storage to the
<abbr>cache</abbr>. This can be useful to make sure that some data is available
for checkout later.

</admon>
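
The transfers described above can be pictured as copying whichever objects the destination is still missing. The following Python sketch works under simplifying assumptions: `transfer` is a hypothetical helper, both locations are plain local directories, and every object found is considered (DVC itself works from the project's metafiles and supports many storage types).

```python
import shutil
import tempfile
from pathlib import Path


def transfer(src: Path, dst: Path) -> int:
    """Copy objects that exist in `src` but not in `dst`; return how many."""
    copied = 0
    for obj in src.rglob("*"):
        if obj.is_file():
            target = dst / obj.relative_to(src)
            if not target.exists():
                target.parent.mkdir(parents=True, exist_ok=True)
                shutil.copyfile(obj, target)
                copied += 1
    return copied


with tempfile.TemporaryDirectory() as tmp:
    cache, remote = Path(tmp, "cache"), Path(tmp, "remote")
    (cache / "ab").mkdir(parents=True)
    (cache / "ab" / "cdef").write_text("model weights")

    pushed = transfer(cache, remote)        # like `dvc push`: cache -> remote
    pushed_again = transfer(cache, remote)  # nothing new to upload

    shutil.rmtree(cache)                    # e.g. a fresh clone with an empty cache
    fetched = transfer(remote, cache)       # like `dvc fetch`: remote -> cache
    restored = (cache / "ab" / "cdef").read_text()
```

In these terms, `dvc pull` corresponds to the fetch step followed by a checkout that places the files in the workspace.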

A more advanced strategy is to access and synchronize data assets directly from
various locations or other DVC projects (e.g. using the [data registry]
pattern). See `dvc list`, `dvc import`/`dvc import-url`, and `dvc update`, as
well as the [Python API].

[protected]: /doc/command-reference/unprotect
[data registry]: /doc/use-cases/data-registry
[python api]: /doc/api-reference

## Versioning data

Many `dvc` commands give hints about which `git` commands to follow them with.
This helps you complete the [data versioning] side of the operation (if needed).

![Versioning flow](/img/flow.png) _DVC metafiles represent your data and models
in the Git repo, while large files are stored in the cache (and/or remote
storage) and linked to your workspace._

Some common sequences:

- Check the `dvc data status` (or `dvc status`) before deciding what changes to
track with Git.
- `dvc add` (or `dvc commit`) your data and then `git add` and `git commit` the
resulting DVC metafiles. This registers DVC-tracked files with Git indirectly
(without storing them in the Git repo).
- After you `git push` project versions associated with new or changed data, you
may want to `dvc push` those data updates to a [DVC remote][dvc remotes].
- `git checkout` to switch project versions (commits, branches, etc.) and then
`dvc checkout` to get the corresponding large files tracked by DVC into your
<abbr>workspace</abbr>.
- `git clone` or `git pull` a DVC repository (e.g. to get others'
  contributions), and then `dvc pull` the matching data files.
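
The division of labor behind the sequences above can be sketched in a few lines of Python: the history only records small hash references (the role Git plays with metafiles), while the contents are resolved from a separate store (the role of the cache). Everything here is a toy stand-in, assuming in-memory dictionaries and hypothetical `record` and `checkout` helpers.

```python
import hashlib

cache = {}    # content hash -> bytes (stand-in for the DVC cache)
history = {}  # revision name -> {path: content hash} (stand-in for Git commits)


def record(rev, files):
    """Cache each file's contents and store only the hashes under `rev`."""
    snapshot = {}
    for path, data in files.items():
        digest = hashlib.md5(data).hexdigest()
        cache[digest] = data
        snapshot[path] = digest
    history[rev] = snapshot


def checkout(rev):
    """Rebuild the workspace for `rev` by resolving its hashes in the cache."""
    return {path: cache[digest] for path, digest in history[rev].items()}


record("v1", {"data.csv": b"a,b\n1,2\n"})
record("v2", {"data.csv": b"a,b\n1,2\n3,4\n", "model.pkl": b"weights"})

workspace_v1 = checkout("v1")  # switching versions swaps the referenced data
workspace_v2 = checkout("v2")
```

Switching revisions changes which hashes are referenced, which is why a `git checkout` is usually followed by a `dvc checkout` to bring the matching files into the workspace.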

<admon type="tip">

Some of these are so common that DVC provides the `dvc install` helper command
to set up [certain Git hooks] that automate them.

[certain git hooks]: /doc/command-reference/install#installed-git-hooks

</admon>

Managing multiple versions of data or models (including their training
parameters and performance metrics) with Git is great, but sometimes requires
navigation aids. DVC provides comparison commands like `dvc diff` (similar to
`git diff`) to help with this. See also `dvc params diff`, `dvc metrics diff`,
and `dvc plots diff`.
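
As a rough illustration of what such a comparison involves, the sketch below contrasts two snapshots that map paths to content hashes and groups the differences into added, deleted, and modified entries. The `diff` helper and the short hash values are hypothetical; this is not how `dvc diff` is implemented.

```python
def diff(old, new):
    """Compare two {path: hash} snapshots and group the differences."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "deleted": sorted(old.keys() - new.keys()),
        "modified": sorted(
            path for path in old.keys() & new.keys() if old[path] != new[path]
        ),
    }


# Made-up hashes for two versions of a project.
before = {"data.xml": "a1a2", "model.pkl": "0f3c", "notes.txt": "77aa"}
after = {"data.xml": "b9e1", "notes.txt": "77aa", "features/": "c4d5"}

changes = diff(before, after)
```

Because only hashes are compared, unchanged files such as `notes.txt` drop out of the result without their contents ever being read.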

<admon type="tip">

Another neat feature of some DVC commands is the `--rev` ([revision]) option.
This lets you specify a version of the project to operate from. For example,
`dvc import --rev a17b8fd` can import data associated with the source project
commit `a17b8fd`. Other commands with `--rev`: `dvc gc`, `dvc list`, etc.

</admon>

[git branches]:
https://git-scm.com/book/en/v2/Git-Branching-Basic-Branching-and-Merging
[tags]: https://git-scm.com/book/en/v2/Git-Basics-Tagging
[revision]: https://git-scm.com/docs/revisions
@@ -4,7 +4,7 @@ In a regular Git workflow, <abbr>DVC repository</abbr> versions are typically
synchronized among team members. And [DVC Experiments] are internally connected
to this commit history, so you can similarly share them.

-## Basic workflow: store as peristent commits
+## Basic workflow: store as persistent commits
Contributor Author
@jorgeorpinel, Feb 21, 2023

This is an unrelated fix, oops. Found it when looking for places to link from (but didn't end up linking in this file).


The most straightforward way to share experiments is to store them as
[persistent](/doc/user-guide/experiment-management/persisting-experiments) Git
Binary file added static/img/sync-ops-locations.png