guide: remote storage #4058

Merged
merged 100 commits into from
Jan 24, 2023
Commits
7350938
guide: draft structure of Data Mgmt and
jorgeorpinel Oct 13, 2022
203f6a6
guide: full text for draft intro to DM
jorgeorpinel Oct 14, 2022
90eaa5d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 17, 2022
eb246bb
guide: hide cloud versioning info
jorgeorpinel Oct 17, 2022
a3687ec
guide: clarify Data Mgmt parts and
jorgeorpinel Oct 18, 2022
fad0bad
guide: add figure drafts to Data Mgmt
jorgeorpinel Oct 19, 2022
4e3c3da
guide: SCM->VC (Data Mgmt)
jorgeorpinel Oct 19, 2022
7f02c15
guide: update 2 figs and add 1 more (Data Mgmt)
jorgeorpinel Oct 19, 2022
f41d16e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
3a9a045
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
df40521
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
adc13ee
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 21, 2022
c0b92f1
guide: roll back unrelated changes
jorgeorpinel Oct 21, 2022
636872a
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
c2303c0
guide: mention clouds first (DM) and
jorgeorpinel Oct 22, 2022
62997ab
guide: flatten DM index
jorgeorpinel Oct 22, 2022
fc74c53
guide: udpates to DM/ DV
jorgeorpinel Oct 22, 2022
8c40a03
guide: add DM/ Data Versioning page
jorgeorpinel Oct 22, 2022
1a8ca61
guide: update outdated link
jorgeorpinel Oct 22, 2022
27be87f
guide: revert more unrelatedly chaqnged files
jorgeorpinel Oct 22, 2022
aaee7af
guide: remove unused ref link
jorgeorpinel Oct 22, 2022
dd99f21
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
118e3eb
guide: DM/ Remote Storage (not just Setup) and
jorgeorpinel Oct 22, 2022
24c331a
guide: remove a comment
jorgeorpinel Oct 22, 2022
ff85dcc
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
266a8f7
guide: draft for DM/ Remote Storage content
jorgeorpinel Oct 22, 2022
b04f20a
ref: expand config.remote and link to/from Remotes guide
jorgeorpinel Oct 23, 2022
1c77de4
ref: fix remote config file examples
jorgeorpinel Oct 23, 2022
8e7c320
guide: complete Remote Config section and
jorgeorpinel Oct 23, 2022
9b904f5
guide: complete list of supported storage types
jorgeorpinel Oct 24, 2022
3b5e520
guide: clarify `remote modify` phrase in
jorgeorpinel Oct 24, 2022
73e2f55
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 27, 2022
7fc7fa3
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 27, 2022
ff7e666
Update content/docs/user-guide/data-management/data-versioning.md
Oct 27, 2022
c0026fc
guide: update versioning config
jorgeorpinel Oct 27, 2022
71b599c
guide: don't call remote storage "additional" here
jorgeorpinel Oct 27, 2022
9774855
guide: pull -> download (DM/ RS intro)
jorgeorpinel Oct 27, 2022
e5c6f13
guide: remove "optional" from Remote Storage nav & title
jorgeorpinel Oct 27, 2022
ec1af6d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 28, 2022
2f31bb6
guide: splits and notes around Data Mgmt index page
jorgeorpinel Oct 28, 2022
a84c442
guide: Data Mgmt intro + note updates
jorgeorpinel Oct 29, 2022
ab55389
guide: draft of all contents +
jorgeorpinel Oct 29, 2022
31d5288
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 1, 2022
a13f989
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 2, 2022
601c99e
guide: small impros to Data Mgmt
jorgeorpinel Nov 2, 2022
a8bad84
guide: rewrite Data Mgmt index in before/after form
jorgeorpinel Nov 3, 2022
c8cc17b
guide: add draft figure for Data Mgmt
jorgeorpinel Nov 4, 2022
3cb84cb
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 8, 2022
a13cb0f
guide: simplify/refocus data mgmt index
jorgeorpinel Nov 8, 2022
e3ba70b
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 17, 2022
c29d9ec
work around commented header bug
jorgeorpinel Nov 17, 2022
875fba3
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 23, 2022
831ad1d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 25, 2022
8ddda9c
guide: drop DM/ DV page
jorgeorpinel Nov 25, 2022
28322e5
guide: rewrite DM intro and
jorgeorpinel Nov 25, 2022
179d172
guide: use DM table instead of figure for now
jorgeorpinel Nov 25, 2022
d979a5e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 30, 2022
74bc156
guide: rewrite Data Mgmt story
jorgeorpinel Nov 30, 2022
e138096
guide: add draft figures to Data Mgmt
jorgeorpinel Nov 30, 2022
f904038
guide: simplify Data Mgmt story and benefits
jorgeorpinel Dec 1, 2022
e1772ea
guide: remove unused images (DM)
jorgeorpinel Dec 1, 2022
cc0390e
guide: update Data Mgmt figures (v1)
jorgeorpinel Dec 2, 2022
4ee3223
guide: rewrite text of Data Mgmt index
jorgeorpinel Dec 8, 2022
149599b
Merge branch 'main' of github.com:iterative/dvc.org into guide/data-m…
rogermparent Dec 8, 2022
f2acb66
guide: update Data Mgmt figures
jorgeorpinel Dec 8, 2022
723eb50
guide: iterate on Data Mgmt again
jorgeorpinel Dec 14, 2022
4b67b64
guide: update Data Mgmt figs
jorgeorpinel Dec 14, 2022
9eb7143
guide: more supporting info about Data Mgmt
jorgeorpinel Dec 18, 2022
e598839
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 21, 2022
dd4466e
guide: update figures (much more concrete) and
jorgeorpinel Dec 21, 2022
d637179
guide: edits to How it works (Data Mgmt)
jorgeorpinel Dec 21, 2022
c007817
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
5a0fd57
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
3eb81ff
guide: update Data Mgmt figures
jorgeorpinel Dec 22, 2022
98e73ff
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 23, 2022
67b1717
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 27, 2022
f3af183
guide: emphaisze dataset versions in UG fig 1
jorgeorpinel Dec 27, 2022
206ce77
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 4, 2023
075aaf3
guide: update Data Mgmt figures (with notes),
jorgeorpinel Jan 5, 2023
7377500
guide: more updates to text and figure styles,
jorgeorpinel Jan 5, 2023
baf5b4c
guide: update figures and text (Data Mgmt) ...
jorgeorpinel Jan 9, 2023
fb35df5
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 11, 2023
4475f78
guide: Data Management text (section 1)
jorgeorpinel Jan 11, 2023
20fbaae
guide: Data Management (main text)
jorgeorpinel Jan 11, 2023
1da7b8a
guide: Data Management (secondary text)
jorgeorpinel Jan 12, 2023
61e2865
Merge branch 'guide/data-mgmt-flows' of github.com:iterative/dvc.org …
jorgeorpinel Jan 12, 2023
ed63127
guide: add DVC data mgmt technical diagram &
jorgeorpinel Jan 12, 2023
0109cf3
guide: update Data Mgmt text
jorgeorpinel Jan 18, 2023
77330cc
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 18, 2023
956b03d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 19, 2023
7152ad3
guide: udpate text and 2nd figure (Data Mgmt)
jorgeorpinel Jan 19, 2023
f29da1e
guide: draft 2nd and 3rd figures
jorgeorpinel Jan 19, 2023
8f49a72
guide: rewrite Data Mgmt/ How it works &
jorgeorpinel Jan 20, 2023
f876c17
guide: update drafts of Data Mgmt figures 2, 3
jorgeorpinel Jan 20, 2023
ee3f721
guide: Data Mgmt improvements and
jorgeorpinel Jan 24, 2023
061a918
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 24, 2023
ac50c94
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Jan 24, 2023
d781fdd
guide: separate from Data Mgmt work
jorgeorpinel Jan 24, 2023
a8acb25
guide: remove hidden Storage locations page for now
jorgeorpinel Jan 24, 2023
882170a
guide: small cleanup of Remote storage page
jorgeorpinel Jan 24, 2023
guide: Data Management (main text)
finalized for this version of figures
jorgeorpinel committed Jan 12, 2023
commit 20fbaae42c10cc2bfd3c9fbc08d3464ea764bdc6
52 changes: 30 additions & 22 deletions content/docs/user-guide/data-management/index.md
@@ -9,16 +9,20 @@ your team will face. But as time progresses, unnecessary files may end up
scattered throughout multiple buckets. Overlapping contents cause data leakage
and inefficient storage. The project evolution is not easy to track, so
multiple data versions coexist (error-prone and not secure). What was the name
of the best model? Can others reproduce your results?
of the best model? Can others reproduce your results? _Example:_

![Direct access storage](/img/direct_access_storage.png) _The S3 bucket on the
left is shared by several people and projects; The user on the right needs to
know the exact location of the correct files, and uses cloud-specific tools
(e.g. AWS CLI) to access them directly._
right is shared (and bloated) by several people and projects. You need to know
the exact location of the correct files, and use cloud-specific tools (e.g. AWS
CLI) to access them directly._

DVC captures information about your data sets, which can be versioned with Git.
Existing storage can now be organized efficiently. Click among **test** and
**v1.0** below to see an example:
DVC captures information that describes your data. This allows datasets to exist
in a <abbr>project</abbr> regardless of where and how they're actually stored.
Their storage can be (re)[organized efficiently] without affecting original
projects. _Click on **v1.0** and **test** below for an example:_

[organized efficiently]:
/doc/user-guide/data-management/large-dataset-optimization

<toggle>
<tab title="test">
@@ -33,28 +37,32 @@ Existing storage can now be organized efficiently. Click among **test** and
</tab>
</toggle>

![]() _DVC captures information about your data in a Git repository. Shared
storage (left) contains unique, indexed data objects in this example; You access
data using DVC synchronization tools._
![]() _DVC [metadata] including folder structure is saved in a Git repository.
The shared storage (right) contains unique, indexed data objects, minimizing its
size; You access them using DVC [synchronization] features._

This provides visibility over all your data and helps secure its access. Sharing
storage locations is no longer a problem, and it's easy to migrate with
[multiple storage provider] support.
Every file and directory that matters at a given time is tracked by DVC. And Git
lets you record many such times (project versions). You'll be able to know
when/why any data was included (visibility), guarantee storage integrity, and
secure its access. Sharing data stores is not a problem, and they're easy to
migrate across platforms with [multiple provider support].

[multiple storage provider]:
[metadata]: /doc/user-guide/project-structure/dvc-files#specification
[synchronization]:
/doc/start/data-management/data-versioning#storing-and-sharing
[multiple provider support]:
/doc/command-reference/remote/add#supported-storage-types

To get there, a few key changes to your workflow are required:
Just keep in mind these key changes to your workflow, required by our approach:

1. Data and models must be registered in a code repository (typically on Git).
1. Stored objects are [reorganized] by DVC (not intended for manual handling).
1. Data operations (drop, update, transfer, etc.) happen indirectly -- through
the repo.
1. Relevant data and models are registered in a code repository (typically Git).
1. Data operations (add, remove, move, etc.) happen [indirectly]: DVC checks the
metadata to locate files on both sides.
1. Stored objects managed with DVC are not intended for handling manually.

[reorganized]:
/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory
[indirectly]: https://en.wikipedia.org/wiki/Indirection
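
The indirection described in the list above can be illustrated with a minimal sketch. It is not DVC's implementation: the `add`, `checkout`, and `demo` helpers and the store layout are assumptions made for illustration only. Files are copied into a store under a hash of their contents, a small metadata record (the part that would be committed to Git) keeps the original name and hash, and the workspace copy is later restored by looking up that hash.

```python
import hashlib
import tempfile
from pathlib import Path


def object_path(store: Path, digest: str) -> Path:
    """Hypothetical layout: first two hex characters form a subdirectory."""
    return store / digest[:2] / digest[2:]


def add(workspace_file: Path, store: Path) -> dict:
    """Copy a file into the store under its content hash; return its metadata."""
    data = workspace_file.read_bytes()
    digest = hashlib.md5(data).hexdigest()
    obj = object_path(store, digest)
    obj.parent.mkdir(parents=True, exist_ok=True)
    if not obj.exists():  # identical contents are stored only once
        obj.write_bytes(data)
    return {"path": workspace_file.name, "md5": digest}


def checkout(meta: dict, store: Path, workspace: Path) -> Path:
    """Restore a workspace file by looking up its hash in the store."""
    target = workspace / meta["path"]
    target.write_bytes(object_path(store, meta["md5"]).read_bytes())
    return target


def demo() -> tuple:
    with tempfile.TemporaryDirectory() as tmp:
        workspace, store = Path(tmp) / "workspace", Path(tmp) / "store"
        workspace.mkdir()
        store.mkdir()
        # Two files with identical contents, as in a bloated shared bucket.
        (workspace / "a.csv").write_bytes(b"x,y\n1,2\n")
        (workspace / "b.csv").write_bytes(b"x,y\n1,2\n")
        metas = [add(workspace / name, store) for name in ("a.csv", "b.csv")]
        stored = sum(1 for p in store.rglob("*") if p.is_file())
        # Delete the workspace copy, then recover it through the metadata.
        (workspace / "a.csv").unlink()
        restored = checkout(metas[0], store, workspace).read_bytes()
        return stored, restored
```

In this sketch the two identical files end up as a single stored object, and the deleted file is recovered from the metadata record alone, which is why the stored objects themselves are not meant to be handled manually.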

## Details & Benefits
## More details and benefits

<!--
DVC lets you describe the entire <abbr>project</abbr> in a Git repository, so