guide: remote storage #4058

Merged
merged 100 commits into from
Jan 24, 2023
Changes from 38 commits
Commits
7350938
guide: draft structure of Data Mgmt and
jorgeorpinel Oct 13, 2022
203f6a6
guide: full text for draft intro to DM
jorgeorpinel Oct 14, 2022
90eaa5d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 17, 2022
eb246bb
guide: hide cloud versioning info
jorgeorpinel Oct 17, 2022
a3687ec
guide: clarify Data Mgmt parts and
jorgeorpinel Oct 18, 2022
fad0bad
guide: add figure drafts to Data Mgmt
jorgeorpinel Oct 19, 2022
4e3c3da
guide: SCM->VC (Data Mgmt)
jorgeorpinel Oct 19, 2022
7f02c15
guide: update 2 figs and add 1 more (Data Mgmt)
jorgeorpinel Oct 19, 2022
f41d16e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
3a9a045
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
df40521
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
adc13ee
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 21, 2022
c0b92f1
guide: roll back unrelated changes
jorgeorpinel Oct 21, 2022
636872a
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
c2303c0
guide: mention clouds first (DM) and
jorgeorpinel Oct 22, 2022
62997ab
guide: flatten DM index
jorgeorpinel Oct 22, 2022
fc74c53
guide: udpates to DM/ DV
jorgeorpinel Oct 22, 2022
8c40a03
guide: add DM/ Data Versioning page
jorgeorpinel Oct 22, 2022
1a8ca61
guide: update outdated link
jorgeorpinel Oct 22, 2022
27be87f
guide: revert more unrelatedly chaqnged files
jorgeorpinel Oct 22, 2022
aaee7af
guide: remove unused ref link
jorgeorpinel Oct 22, 2022
dd99f21
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
118e3eb
guide: DM/ Remote Storage (not just Setup) and
jorgeorpinel Oct 22, 2022
24c331a
guide: remove a comment
jorgeorpinel Oct 22, 2022
ff85dcc
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
266a8f7
guide: draft for DM/ Remote Storage content
jorgeorpinel Oct 22, 2022
b04f20a
ref: expand config.remote and link to/from Remotes guide
jorgeorpinel Oct 23, 2022
1c77de4
ref: fix remote config file examples
jorgeorpinel Oct 23, 2022
8e7c320
guide: complete Remote Config section and
jorgeorpinel Oct 23, 2022
9b904f5
guide: complete list of supported storage types
jorgeorpinel Oct 24, 2022
3b5e520
guide: clarify `remote modify` phrase in
jorgeorpinel Oct 24, 2022
73e2f55
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 27, 2022
7fc7fa3
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 27, 2022
ff7e666
Update content/docs/user-guide/data-management/data-versioning.md
Oct 27, 2022
c0026fc
guide: update versioning config
jorgeorpinel Oct 27, 2022
71b599c
guide: don't call remote storage "additional" here
jorgeorpinel Oct 27, 2022
9774855
guide: pull -> download (DM/ RS intro)
jorgeorpinel Oct 27, 2022
e5c6f13
guide: remove "optional" from Remote Storage nav & title
jorgeorpinel Oct 27, 2022
ec1af6d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 28, 2022
2f31bb6
guide: splits and notes around Data Mgmt index page
jorgeorpinel Oct 28, 2022
a84c442
guide: Data Mgmt intro + note updates
jorgeorpinel Oct 29, 2022
ab55389
guide: draft of all contents +
jorgeorpinel Oct 29, 2022
31d5288
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 1, 2022
a13f989
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 2, 2022
601c99e
guide: small impros to Data Mgmt
jorgeorpinel Nov 2, 2022
a8bad84
guide: rewrite Data Mgmt index in before/after form
jorgeorpinel Nov 3, 2022
c8cc17b
guide: add draft figure for Data Mgmt
jorgeorpinel Nov 4, 2022
3cb84cb
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 8, 2022
a13cb0f
guide: simplify/refocus data mgmt index
jorgeorpinel Nov 8, 2022
e3ba70b
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 17, 2022
c29d9ec
work around commented header bug
jorgeorpinel Nov 17, 2022
875fba3
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 23, 2022
831ad1d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 25, 2022
8ddda9c
guide: drop DM/ DV page
jorgeorpinel Nov 25, 2022
28322e5
guide: rewrite DM intro and
jorgeorpinel Nov 25, 2022
179d172
guide: use DM table instead of figure for now
jorgeorpinel Nov 25, 2022
d979a5e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 30, 2022
74bc156
guide: rewrite Data Mgmt story
jorgeorpinel Nov 30, 2022
e138096
guide: add draft figures to Data Mgmt
jorgeorpinel Nov 30, 2022
f904038
guide: simplify Data Mgmt story and benefits
jorgeorpinel Dec 1, 2022
e1772ea
guide: remove unused images (DM)
jorgeorpinel Dec 1, 2022
cc0390e
guide: update Data Mgmt figures (v1)
jorgeorpinel Dec 2, 2022
4ee3223
guide: rewrite text of Data Mgmt index
jorgeorpinel Dec 8, 2022
149599b
Merge branch 'main' of github.com:iterative/dvc.org into guide/data-m…
rogermparent Dec 8, 2022
f2acb66
guide: update Data Mgmt figures
jorgeorpinel Dec 8, 2022
723eb50
guide: iterate on Data Mgmt again
jorgeorpinel Dec 14, 2022
4b67b64
guide: update Data Mgmt figs
jorgeorpinel Dec 14, 2022
9eb7143
guide: more supporting info about Data Mgmt
jorgeorpinel Dec 18, 2022
e598839
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 21, 2022
dd4466e
guide: update figures (much more concrete) and
jorgeorpinel Dec 21, 2022
d637179
guide: edits to How it works (Data Mgmt)
jorgeorpinel Dec 21, 2022
c007817
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
5a0fd57
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
3eb81ff
guide: update Data Mgmt figures
jorgeorpinel Dec 22, 2022
98e73ff
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 23, 2022
67b1717
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 27, 2022
f3af183
guide: emphaisze dataset versions in UG fig 1
jorgeorpinel Dec 27, 2022
206ce77
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 4, 2023
075aaf3
guide: update Data Mgmt figures (with notes),
jorgeorpinel Jan 5, 2023
7377500
guide: more updates to text and figure styles,
jorgeorpinel Jan 5, 2023
baf5b4c
guide: update figures and text (Data Mgmt) ...
jorgeorpinel Jan 9, 2023
fb35df5
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 11, 2023
4475f78
guide: Data Management text (section 1)
jorgeorpinel Jan 11, 2023
20fbaae
guide: Data Management (main text)
jorgeorpinel Jan 11, 2023
1da7b8a
guide: Data Management (secondary text)
jorgeorpinel Jan 12, 2023
61e2865
Merge branch 'guide/data-mgmt-flows' of github.com:iterative/dvc.org …
jorgeorpinel Jan 12, 2023
ed63127
guide: add DVC data mgmt technical diagram &
jorgeorpinel Jan 12, 2023
0109cf3
guide: update Data Mgmt text
jorgeorpinel Jan 18, 2023
77330cc
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 18, 2023
956b03d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 19, 2023
7152ad3
guide: udpate text and 2nd figure (Data Mgmt)
jorgeorpinel Jan 19, 2023
f29da1e
guide: draft 2nd and 3rd figures
jorgeorpinel Jan 19, 2023
8f49a72
guide: rewrite Data Mgmt/ How it works &
jorgeorpinel Jan 20, 2023
f876c17
guide: update drafts of Data Mgmt figures 2, 3
jorgeorpinel Jan 20, 2023
ee3f721
guide: Data Mgmt improvements and
jorgeorpinel Jan 24, 2023
061a918
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 24, 2023
ac50c94
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Jan 24, 2023
d781fdd
guide: separate from Data Mgmt work
jorgeorpinel Jan 24, 2023
a8acb25
guide: remove hidden Storage locations page for now
jorgeorpinel Jan 24, 2023
882170a
guide: small cleanup of Remote storage page
jorgeorpinel Jan 24, 2023
29 changes: 25 additions & 4 deletions content/docs/command-reference/config.md
@@ -122,7 +122,7 @@ within:

### core

- `core.remote` - name of the remote storage to use by default.
- `core.remote` - name of the [remote storage](#remote) to use by default.

- `core.interactive` - whether to always ask for confirmation before reproducing
each [stage](/doc/command-reference/run) in `dvc repro`. (Normally, this
@@ -157,9 +157,30 @@ within:

### remote

All `remote` sections contain a `url` value and can also specify `user`, `port`,
`keyfile`, `timeout`, `ask_password`, and other cloud-specific key/value pairs.
See `dvc remote add` and `dvc remote modify` for more information.
Unlike most other sections, configuration files may have more than one
`'remote'`. All of them require a unique `"name"` and a `url` value. They can
also specify `jobs`, `verify`, and many platform-specific key/value pairs like
`port` and `password`.

<admon icon="book">

See [Remote Storage Configuration] for more details.

[remote storage configuration]:
/doc/user-guide/data-management/remote-storage#configuration

</admon>

For example, the following config file defines a `temp` remote in the local file
system (located in `/tmp/dvcstore`) and marks it as the default (via the
[`core`](#core) section):

```ini
['remote "temp"']
url = /tmp/dvcstore
[core]
remote = temp
```

Comment on lines +163 to +185
Contributor Author

Updated core.remote description, linked to/from Remotes guide, and added a simple example.
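
To illustrate the point that a config file may hold more than one `remote` section, here is a minimal sketch with two of them. The remote names `temp` and `backup`, the bucket path, and the `jobs` and `verify` values are assumptions chosen for this example; they are not part of the change under review.

```ini
# Hypothetical .dvc/config with two remotes (names and values are illustrative)
['remote "temp"']
    url = /tmp/dvcstore
['remote "backup"']
    url = s3://mybucket/path
    # Optional per-remote settings mentioned in the text above
    jobs = 4
    verify = true
[core]
    # Only one remote can be the default
    remote = temp
```

Each section needs its own unique name, while `core.remote` points at whichever one should be used when no remote is given explicitly.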

### cache

36 changes: 20 additions & 16 deletions content/docs/command-reference/remote/add.md
@@ -1,9 +1,13 @@
# remote add

Add a new [data remote](/doc/command-reference/remote).
Register a new [DVC remote](/doc/user-guide/data-management/remote-storage).

> Depending on your storage type, you may also need `dvc remote modify` to
> provide credentials and/or configure other remote parameters.
<admon type="tip">
Comment on lines -3 to +5
Contributor Author

@jorgeorpinel jorgeorpinel Oct 22, 2022

All the remote cmd ref intros are just re-linked to the new doc.

And some block quotes changed to proper admonitions along the way...


Depending on your storage type, you may also need `dvc remote modify` to provide
credentials and/or configure other remote parameters.

</admon>

## Synopsis

@@ -26,9 +30,9 @@ for the first remote):

```ini
['remote "myremote"']
url = /tmp/dvcstore
url = /tmp/dvcstore
[core]
remote = myremote
remote = myremote
```

> 💡 Default remotes are expected by commands that accept a `-r`/`--remote`
@@ -379,10 +383,10 @@ Using an absolute path (recommended):
```cli
$ dvc remote add -d myremote /tmp/dvcstore
$ cat .dvc/config
...
['remote "myremote"']
url = /tmp/dvcstore
...
...
['remote "myremote"']
url = /tmp/dvcstore
...
```

> Note that the absolute path `/tmp/dvcstore` is saved as is.
@@ -393,10 +397,10 @@ directory, but saved **relative to the config file location**:
```cli
$ dvc remote add -d myremote ../dvcstore
$ cat .dvc/config
...
['remote "myremote"']
url = ../../dvcstore
...
...
['remote "myremote"']
url = ../../dvcstore
...
```

> Note that `../dvcstore` has been resolved relative to the `.dvc/` dir,
@@ -423,10 +427,10 @@ The <abbr>project</abbr>'s config file (`.dvc/config`) now looks like this:

```ini
['remote "myremote"']
url = s3://mybucket/path
region = us-east-2
url = s3://mybucket/path
region = us-east-2
[core]
remote = myremote
remote = myremote
```

The list of remotes should now be:
6 changes: 2 additions & 4 deletions content/docs/command-reference/remote/default.md
@@ -1,9 +1,7 @@
# remote default

Set/unset the default [data remote](/doc/command-reference/remote).

> Depending on your remote storage type, you may also need `dvc remote modify`
> to provide credentials and/or configure other remote parameters.
Set/unset the default
[remote storage](/doc/user-guide/data-management/remote-storage).

## Synopsis

16 changes: 9 additions & 7 deletions content/docs/command-reference/remote/index.md
@@ -1,13 +1,15 @@
# remote

A set of commands to set up and manage data remotes:
A set of commands to set up and manage [remote storage]:
[add](/doc/command-reference/remote/add),
[default](/doc/command-reference/remote/default),
[list](/doc/command-reference/remote/list),
[modify](/doc/command-reference/remote/modify),
[remove](/doc/command-reference/remote/remove), and
[rename](/doc/command-reference/remote/rename).

[remote storage]: /doc/user-guide/data-management/remote-storage

## Synopsis

```usage
@@ -101,9 +103,9 @@ The <abbr>project</abbr>'s config file should now look like this:

```ini
['remote "myremote"']
url = /path/to/remote
url = /path/to/remote
[core]
remote = myremote
remote = myremote
```

## Example: List all remotes in the project
@@ -128,12 +130,12 @@ The project's config file should now look something like this:

```ini
['remote "myremote"']
url = /path/to/remote
url = /path/to/remote
[core]
remote = myremote
remote = myremote
['remote "newremote"']
url = s3://mybucket/path
endpointurl = https://object-storage.example.com
url = s3://mybucket/path
endpointurl = https://object-storage.example.com
```

## Example: Change the name of a remote
3 changes: 2 additions & 1 deletion content/docs/command-reference/remote/list.md
@@ -1,6 +1,7 @@
# remote list

List all available [data remotes](/doc/command-reference/remote).
List all available
[DVC remotes](/doc/user-guide/data-management/remote-storage).

## Synopsis

18 changes: 11 additions & 7 deletions content/docs/command-reference/remote/modify.md
@@ -1,10 +1,14 @@
# remote modify

Modify the configuration of a [data remote](/doc/command-reference/remote).
Configure a [DVC remote](/doc/user-guide/data-management/remote-storage).

> This command is commonly needed after `dvc remote add` or
> [default](/doc/command-reference/remote/default) to set up credentials or
> other customizations to each remote storage type.
<admon type="tip">

This command is commonly needed after `dvc remote add` or `dvc remote default`
to set up credentials or for other customizations specific to the
[storage type](#available-parameters-per-storage-type).

</admon>

## Synopsis

@@ -1197,10 +1201,10 @@ Now the project config file should look like this:

```ini
['remote "myremote"']
url = s3://mybucket/path
profile = myuser
url = s3://mybucket/path
profile = myuser
[core]
remote = myremote
remote = myremote
```

## Example: Some Azure authentication methods
11 changes: 8 additions & 3 deletions content/docs/command-reference/remote/remove.md
@@ -1,8 +1,13 @@
# remote remove

Remove a [data remote](/doc/command-reference/remote). This command affects DVC
configuration files only, it does not physically remove data files stored
remotely.
Remove a [DVC remote](/doc/user-guide/data-management/remote-storage).

<admon type="info">

This command affects DVC configuration files only. It does not physically remove
data files stored remotely. See `dvc gc --cloud` for that.

</admon>

## Synopsis

9 changes: 7 additions & 2 deletions content/docs/command-reference/remote/rename.md
@@ -1,7 +1,12 @@
# remote rename

Rename a [data remote](/doc/command-reference/remote). The remote's URL is not
changed by this command.
Rename a [DVC remote](/doc/user-guide/data-management/remote-storage).

<admon type="info">

The remote storage URL is not changed by this command.

</admon>

## Synopsis

4 changes: 3 additions & 1 deletion content/docs/sidebar.json
@@ -122,8 +122,10 @@
},
{
"slug": "data-management",
"source": false,
"source": "data-management/index.md",
"children": [
"data-versioning",
"remote-storage",
"large-dataset-optimization",
"importing-external-data",
"managing-external-data"
70 changes: 70 additions & 0 deletions content/docs/user-guide/data-management/data-versioning.md
@@ -0,0 +1,70 @@
# Data Versioning

DVC enables [version control] for data science. But DVC does not actually
implement versioning directly! Instead, DVC focuses on [codifying your data]:
generating small [metafiles] that you can handle with standard [Git workflows]
(commits, branching, pull requests, etc.).

The resulting projects are neatly organized in the "space dimension", having
only the files and directories needed at the time and without complicated, ad
hoc file names like `2022-10-20_linear-model_v2-Carl`. Project versions live in
the "time dimension" ([Git history]).

![Versioned ML project](/img/versioned-project.png) _Navigate versions with Git
commits_

**Data version control** is the unifying trait across DVC features (data
management and beyond).

<admon icon="book">

Refer to [Versioning Data and Models] to learn more.

[versioning data and models]: /doc/use-cases/versioning-data-and-models

</admon>

[version control]:
https://git-scm.com/book/en/v2/Getting-Started-About-Version-Control
[codifying your data]: /doc/use-cases/versioning-data-and-models
[metafiles]: /doc/user-guide/project-structure
[git workflows]: https://www.atlassian.com/git/tutorials/comparing-workflows
[git history]:
https://git-scm.com/book/en/v2/Git-Basics-Viewing-the-Commit-History

<!--
## Cloud versioning

_New in DVC 2.30.0 (see `dvc version`)_

To simplify remote data operations, DVC now supports native versioning of files
and directories on several cloud providers. This means that you can browse your
files normally as you would see them in your local workspace.
-->

## Project configuration

Besides metafiles, <abbr>DVC projects</abbr> may contain a config file
(`.dvc/config`) that can also be treated as code when it comes to version
control.

<admon icon="book">

See `dvc config` for more information on DVC config.

</admon>

Sometimes it's important to version configuration changes along with
corresponding data updates. Most notably, if you [set up remote storage] and
`dvc push` data for others to `dvc pull` later, you should `git commit` both the
metafile(s) and `.dvc/config` to the repo.

Advanced situations where this may also be necessary:

- When migrating to a [shared cache]
- When changing a `dvc config parsing` option, which impacts how `dvc.yaml` files
  get parsed.
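
As a rough sketch of what such a committed `.dvc/config` might contain, the fragment below combines a default remote with a `parsing` option. The remote name `storage`, the bucket path, and the chosen `parsing.bool` value are assumptions made for illustration only.

```ini
# Hypothetical .dvc/config to commit together with the related metafiles
['remote "storage"']
    url = s3://mybucket/dvcstore
[core]
    remote = storage
[parsing]
    # Changes how boolean values are rendered when dvc.yaml templating
    # unpacks dictionaries, so collaborators need the same setting
    bool = boolean_optional
```

Committing this file in the same Git commit as the updated metafiles keeps the data and the settings needed to retrieve or interpret it in sync.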

[set up remote storage]:
/doc/user-guide/data-management/remote-storage#configuration
[shared cache]: /doc/user-guide/how-to/share-a-dvc-cache
39 changes: 39 additions & 0 deletions content/docs/user-guide/data-management/index.md
@@ -0,0 +1,39 @@
# Data Management with DVC

DVC helps you manage and share arbitrarily large files, datasets, and ML models
anywhere: cloud storage, SSH servers, network resources (e.g. NAS), mounted
drives, local file systems, etc. You manipulate DVC projects normally in your
local workspace; DVC tracks, restores, and synchronizes them across locations.

![Storage locations](/img/storage-locations.png) _Local, external, and remote
storage locations_

Every <abbr>DVC project</abbr> starts with 2 locations. The
<abbr>workspace</abbr> is the main project directory, containing your data,
models, source code, etc. DVC also creates a <abbr>data cache</abbr> (found
locally in `.dvc/cache` by default), which will be used as fast-access storage
for DVC operations.

<admon type="tip">

The cache can be moved to an external location in the file system or network,
for example to [share it] among several projects. It could even be set up in a
remote system (accessed over the Internet), but this is typically too slow for working with
data regularly.

</admon>
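
A minimal sketch of what a relocated cache could look like in the project config, assuming a hypothetical shared directory `/home/shared/dvc-cache`. The `shared` and `type` values are illustrative choices, not requirements.

```ini
# Hypothetical .dvc/config fragment: cache moved outside the project
[cache]
    dir = /home/shared/dvc-cache
    # Assumed settings for a cache used by several people
    shared = group
    type = symlink
```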

[share it]: /doc/user-guide/how-to/share-a-dvc-cache

Optionally, DVC supports additional storage locations such as cloud services
(Amazon S3, Google Drive, Azure Blob Storage, etc.), SSH servers,
network-attached storage, etc. These are called [DVC remotes] and help you
share or back up copies of your data assets.

<admon type="info">

DVC remotes are similar to Git remotes, but for <abbr>cached</abbr> data.

</admon>

[dvc remotes]: /doc/user-guide/data-management/remote-storage