guide: remote storage #4058

Merged
merged 100 commits into from
Jan 24, 2023
Commits
7350938
guide: draft structure of Data Mgmt and
jorgeorpinel Oct 13, 2022
203f6a6
guide: full text for draft intro to DM
jorgeorpinel Oct 14, 2022
90eaa5d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 17, 2022
eb246bb
guide: hide cloud versioning info
jorgeorpinel Oct 17, 2022
a3687ec
guide: clarify Data Mgmt parts and
jorgeorpinel Oct 18, 2022
fad0bad
guide: add figure drafts to Data Mgmt
jorgeorpinel Oct 19, 2022
4e3c3da
guide: SCM->VC (Data Mgmt)
jorgeorpinel Oct 19, 2022
7f02c15
guide: update 2 figs and add 1 more (Data Mgmt)
jorgeorpinel Oct 19, 2022
f41d16e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
3a9a045
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
df40521
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 20, 2022
adc13ee
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 21, 2022
c0b92f1
guide: roll back unrelated changes
jorgeorpinel Oct 21, 2022
636872a
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
c2303c0
guide: mention clouds first (DM) and
jorgeorpinel Oct 22, 2022
62997ab
guide: flatten DM index
jorgeorpinel Oct 22, 2022
fc74c53
guide: udpates to DM/ DV
jorgeorpinel Oct 22, 2022
8c40a03
guide: add DM/ Data Versioning page
jorgeorpinel Oct 22, 2022
1a8ca61
guide: update outdated link
jorgeorpinel Oct 22, 2022
27be87f
guide: revert more unrelatedly chaqnged files
jorgeorpinel Oct 22, 2022
aaee7af
guide: remove unused ref link
jorgeorpinel Oct 22, 2022
dd99f21
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
118e3eb
guide: DM/ Remote Storage (not just Setup) and
jorgeorpinel Oct 22, 2022
24c331a
guide: remove a comment
jorgeorpinel Oct 22, 2022
ff85dcc
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 22, 2022
266a8f7
guide: draft for DM/ Remote Storage content
jorgeorpinel Oct 22, 2022
b04f20a
ref: expand config.remote and link to/from Remotes guide
jorgeorpinel Oct 23, 2022
1c77de4
ref: fix remote config file examples
jorgeorpinel Oct 23, 2022
8e7c320
guide: complete Remote Config section and
jorgeorpinel Oct 23, 2022
9b904f5
guide: complete list of supported storage types
jorgeorpinel Oct 24, 2022
3b5e520
guide: clarify `remote modify` phrase in
jorgeorpinel Oct 24, 2022
73e2f55
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 27, 2022
7fc7fa3
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Oct 27, 2022
ff7e666
Update content/docs/user-guide/data-management/data-versioning.md
Oct 27, 2022
c0026fc
guide: update versioning config
jorgeorpinel Oct 27, 2022
71b599c
guide: don't call remote storage "additional" here
jorgeorpinel Oct 27, 2022
9774855
guide: pull -> download (DM/ RS intro)
jorgeorpinel Oct 27, 2022
e5c6f13
guide: remove "optional" from Remote Storage nav & title
jorgeorpinel Oct 27, 2022
ec1af6d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Oct 28, 2022
2f31bb6
guide: splits and notes around Data Mgmt index page
jorgeorpinel Oct 28, 2022
a84c442
guide: Data Mgmt intro + note updates
jorgeorpinel Oct 29, 2022
ab55389
guide: draft of all contents +
jorgeorpinel Oct 29, 2022
31d5288
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 1, 2022
a13f989
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 2, 2022
601c99e
guide: small impros to Data Mgmt
jorgeorpinel Nov 2, 2022
a8bad84
guide: rewrite Data Mgmt index in before/after form
jorgeorpinel Nov 3, 2022
c8cc17b
guide: add draft figure for Data Mgmt
jorgeorpinel Nov 4, 2022
3cb84cb
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 8, 2022
a13cb0f
guide: simplify/refocus data mgmt index
jorgeorpinel Nov 8, 2022
e3ba70b
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 17, 2022
c29d9ec
work around commented header bug
jorgeorpinel Nov 17, 2022
875fba3
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 23, 2022
831ad1d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 25, 2022
8ddda9c
guide: drop DM/ DV page
jorgeorpinel Nov 25, 2022
28322e5
guide: rewrite DM intro and
jorgeorpinel Nov 25, 2022
179d172
guide: use DM table instead of figure for now
jorgeorpinel Nov 25, 2022
d979a5e
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Nov 30, 2022
74bc156
guide: rewrite Data Mgmt story
jorgeorpinel Nov 30, 2022
e138096
guide: add draft figures to Data Mgmt
jorgeorpinel Nov 30, 2022
f904038
guide: simplify Data Mgmt story and benefits
jorgeorpinel Dec 1, 2022
e1772ea
guide: remove unused images (DM)
jorgeorpinel Dec 1, 2022
cc0390e
guide: update Data Mgmt figures (v1)
jorgeorpinel Dec 2, 2022
4ee3223
guide: rewrite text of Data Mgmt index
jorgeorpinel Dec 8, 2022
149599b
Merge branch 'main' of github.com:iterative/dvc.org into guide/data-m…
rogermparent Dec 8, 2022
f2acb66
guide: update Data Mgmt figures
jorgeorpinel Dec 8, 2022
723eb50
guide: iterate on Data Mgmt again
jorgeorpinel Dec 14, 2022
4b67b64
guide: update Data Mgmt figs
jorgeorpinel Dec 14, 2022
9eb7143
guide: more supporting info about Data Mgmt
jorgeorpinel Dec 18, 2022
e598839
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 21, 2022
dd4466e
guide: update figures (much more concrete) and
jorgeorpinel Dec 21, 2022
d637179
guide: edits to How it works (Data Mgmt)
jorgeorpinel Dec 21, 2022
c007817
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
5a0fd57
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 22, 2022
3eb81ff
guide: update Data Mgmt figures
jorgeorpinel Dec 22, 2022
98e73ff
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 23, 2022
67b1717
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Dec 27, 2022
f3af183
guide: emphaisze dataset versions in UG fig 1
jorgeorpinel Dec 27, 2022
206ce77
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 4, 2023
075aaf3
guide: update Data Mgmt figures (with notes),
jorgeorpinel Jan 5, 2023
7377500
guide: more updates to text and figure styles,
jorgeorpinel Jan 5, 2023
baf5b4c
guide: update figures and text (Data Mgmt) ...
jorgeorpinel Jan 9, 2023
fb35df5
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 11, 2023
4475f78
guide: Data Management text (section 1)
jorgeorpinel Jan 11, 2023
20fbaae
guide: Data Management (main text)
jorgeorpinel Jan 11, 2023
1da7b8a
guide: Data Management (secondary text)
jorgeorpinel Jan 12, 2023
61e2865
Merge branch 'guide/data-mgmt-flows' of github.com:iterative/dvc.org …
jorgeorpinel Jan 12, 2023
ed63127
guide: add DVC data mgmt technical diagram &
jorgeorpinel Jan 12, 2023
0109cf3
guide: update Data Mgmt text
jorgeorpinel Jan 18, 2023
77330cc
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 18, 2023
956b03d
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 19, 2023
7152ad3
guide: udpate text and 2nd figure (Data Mgmt)
jorgeorpinel Jan 19, 2023
f29da1e
guide: draft 2nd and 3rd figures
jorgeorpinel Jan 19, 2023
8f49a72
guide: rewrite Data Mgmt/ How it works &
jorgeorpinel Jan 20, 2023
f876c17
guide: update drafts of Data Mgmt figures 2, 3
jorgeorpinel Jan 20, 2023
ee3f721
guide: Data Mgmt improvements and
jorgeorpinel Jan 24, 2023
061a918
Merge branch 'main' into guide/data-mgmt-flows
jorgeorpinel Jan 24, 2023
ac50c94
Merge branch 'guide/data-mgmt-flows' into guide/data-mgmt/remote-config
jorgeorpinel Jan 24, 2023
d781fdd
guide: separate from Data Mgmt work
jorgeorpinel Jan 24, 2023
a8acb25
guide: remove hidden Storage locations page for now
jorgeorpinel Jan 24, 2023
882170a
guide: small cleanup of Remote storage page
jorgeorpinel Jan 24, 2023
29 changes: 25 additions & 4 deletions content/docs/command-reference/config.md
@@ -125,7 +125,7 @@ within:

### core

- `core.remote` - name of the remote storage to use by default.
- `core.remote` - name of the [remote storage](#remote) to use by default.

- `core.interactive` - whether to always ask for confirmation before reproducing
each [stage](/doc/command-reference/run) in `dvc repro`. (Normally, this
@@ -160,9 +160,30 @@ within:

### remote

All `remote` sections contain a `url` value and can also specify `user`, `port`,
`keyfile`, `timeout`, `ask_password`, and other cloud-specific key/value pairs.
See `dvc remote add` and `dvc remote modify` for more information.
Unlike most other sections, configuration files may have more than one
`'remote'`. All of them require a unique `"name"` and a `url` value. They can
also specify `jobs`, `verify`, and many platform-specific key/value pairs like
`port` and `password`.

<admon icon="book">

See [Remote Storage Configuration] for more details.

[remote storage configuration]:
/doc/user-guide/data-management/remote-storage#configuration

</admon>

For example, the following config file defines a `temp` remote in the local file
system (located in `/tmp/dvcstore`) and marks it as the default (via the
[`core`](#core) section):

```ini
['remote "temp"']
url = /tmp/dvcstore
[core]
remote = temp
```

Review comment on lines +163 to +185 (Contributor Author):

Updated core.remote description, linked to/from Remotes guide, and added a simple example.
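As a rough illustration of the config layout added in this hunk, the sketch below reads the same text with Python's standard `configparser` and looks up the default remote and its URL. It is only a sketch under assumptions: DVC uses its own config loader, and the helper name `default_remote_url` is hypothetical.

```python
import configparser

# Same content as the example config file above.
CONFIG_TEXT = """\
['remote "temp"']
url = /tmp/dvcstore
[core]
remote = temp
"""


def default_remote_url(config_text):
    """Return (name, url) of the remote that the core section marks as default."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(config_text)
    name = parser.get("core", "remote")
    # Remote sections are named 'remote "<name>"', including the quotes.
    section = f"'remote \"{name}\"'"
    return name, parser.get(section, "url")


print(default_remote_url(CONFIG_TEXT))  # ('temp', '/tmp/dvcstore')
```

The section name keeps its quotes because the text between the square brackets is taken literally by `configparser`.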

### cache

36 changes: 20 additions & 16 deletions content/docs/command-reference/remote/add.md
@@ -1,9 +1,13 @@
# remote add

Add a new [data remote](/doc/command-reference/remote).
Register a new [DVC remote](/doc/user-guide/data-management/remote-storage).

> Depending on your storage type, you may also need `dvc remote modify` to
> provide credentials and/or configure other remote parameters.
<admon type="tip">
Review comment on lines -3 to +5 by @jorgeorpinel (Contributor Author), Oct 22, 2022:

All the remote cmd ref intros are just re-linked to the new doc.

And some block quotes changed to proper admonitions along the way...


Depending on your storage type, you may also need `dvc remote modify` to provide
credentials and/or configure other remote parameters.

</admon>

## Synopsis

@@ -26,9 +30,9 @@ for the first remote):

```ini
['remote "myremote"']
url = /tmp/dvcstore
url = /tmp/dvcstore
[core]
remote = myremote
remote = myremote
```

> 💡 Default remotes are expected by commands that accept a `-r`/`--remote`
@@ -379,10 +383,10 @@ Using an absolute path (recommended):
```cli
$ dvc remote add -d myremote /tmp/dvcstore
$ cat .dvc/config
...
['remote "myremote"']
url = /tmp/dvcstore
...
...
['remote "myremote"']
url = /tmp/dvcstore
...
```

> Note that the absolute path `/tmp/dvcstore` is saved as is.
@@ -393,10 +397,10 @@ directory, but saved **relative to the config file location**:
```cli
$ dvc remote add -d myremote ../dvcstore
$ cat .dvc/config
...
['remote "myremote"']
url = ../../dvcstore
...
...
['remote "myremote"']
url = ../../dvcstore
...
```
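The path rewriting in this example (`../dvcstore` typed in the project root, `../../dvcstore` saved in `.dvc/config`) can be reproduced with plain path arithmetic. This is a minimal sketch, not DVC's implementation: the function name `saved_remote_url` and the project location `/home/user/project` are hypothetical.

```python
import posixpath


def saved_remote_url(path_arg, cwd, config_dir):
    """Re-express a local path relative to the directory holding the config file."""
    if posixpath.isabs(path_arg):
        # Absolute paths are kept as is.
        return path_arg
    target = posixpath.normpath(posixpath.join(cwd, path_arg))
    return posixpath.relpath(target, config_dir)


project = "/home/user/project"  # hypothetical project root
config_dir = project + "/.dvc"

print(saved_remote_url("../dvcstore", project, config_dir))    # ../../dvcstore
print(saved_remote_url("/tmp/dvcstore", project, config_dir))  # /tmp/dvcstore
```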

> Note that `../dvcstore` has been resolved relative to the `.dvc/` dir,
@@ -423,10 +427,10 @@ The <abbr>project</abbr>'s config file (`.dvc/config`) now looks like this:

```ini
['remote "myremote"']
url = s3://mybucket/path
region = us-east-2
url = s3://mybucket/path
region = us-east-2
[core]
remote = myremote
remote = myremote
```

The list of remotes should now be:
6 changes: 2 additions & 4 deletions content/docs/command-reference/remote/default.md
@@ -1,9 +1,7 @@
# remote default

Set/unset the default [data remote](/doc/command-reference/remote).

> Depending on your remote storage type, you may also need `dvc remote modify`
> to provide credentials and/or configure other remote parameters.
Set/unset the default
[remote storage](/doc/user-guide/data-management/remote-storage).

## Synopsis

16 changes: 9 additions & 7 deletions content/docs/command-reference/remote/index.md
@@ -1,13 +1,15 @@
# remote

A set of commands to set up and manage data remotes:
A set of commands to set up and manage [remote storage]:
[add](/doc/command-reference/remote/add),
[default](/doc/command-reference/remote/default),
[list](/doc/command-reference/remote/list),
[modify](/doc/command-reference/remote/modify),
[remove](/doc/command-reference/remote/remove), and
[rename](/doc/command-reference/remote/rename).

[remote storage]: /doc/user-guide/data-management/remote-storage

## Synopsis

```usage
@@ -101,9 +103,9 @@ The <abbr>project</abbr>'s config file should now look like this:

```ini
['remote "myremote"']
url = /path/to/remote
url = /path/to/remote
[core]
remote = myremote
remote = myremote
```

## Example: List all remotes in the project
@@ -128,12 +130,12 @@ The project's config file should now look something like this:

```ini
['remote "myremote"']
url = /path/to/remote
url = /path/to/remote
[core]
remote = myremote
remote = myremote
['remote "newremote"']
url = s3://mybucket/path
endpointurl = https://object-storage.example.com
url = s3://mybucket/path
endpointurl = https://object-storage.example.com
```

## Example: Change the name of a remote
3 changes: 2 additions & 1 deletion content/docs/command-reference/remote/list.md
@@ -1,6 +1,7 @@
# remote list

List all available [data remotes](/doc/command-reference/remote).
List all available
[DVC remotes](/doc/user-guide/data-management/remote-storage).

## Synopsis

18 changes: 11 additions & 7 deletions content/docs/command-reference/remote/modify.md
@@ -1,10 +1,14 @@
# remote modify

Modify the configuration of a [data remote](/doc/command-reference/remote).
Configure a [DVC remote](/doc/user-guide/data-management/remote-storage).

> This command is commonly needed after `dvc remote add` or
> [default](/doc/command-reference/remote/default) to set up credentials or
> other customizations to each remote storage type.
<admon type="tip">

This command is commonly needed after `dvc remote add` or `dvc remote default`
to set up credentials or for other customizations specific to the
[storage type](#available-parameters-per-storage-type).

</admon>

## Synopsis

@@ -1272,10 +1276,10 @@ Now the project config file should look like this:

```ini
['remote "myremote"']
url = s3://mybucket/path
profile = myuser
url = s3://mybucket/path
profile = myuser
[core]
remote = myremote
remote = myremote
```

## Example: Some Azure authentication methods
11 changes: 8 additions & 3 deletions content/docs/command-reference/remote/remove.md
@@ -1,8 +1,13 @@
# remote remove

Remove a [data remote](/doc/command-reference/remote). This command affects DVC
configuration files only, it does not physically remove data files stored
remotely.
Remove a [DVC remote](/doc/user-guide/data-management/remote-storage).

<admon type="info">

This command affects DVC configuration files only. It does not physically remove
data files stored remotely. See `dvc gc --cloud` for that.

</admon>

## Synopsis

9 changes: 7 additions & 2 deletions content/docs/command-reference/remote/rename.md
@@ -1,7 +1,12 @@
# remote rename

Rename a [data remote](/doc/command-reference/remote). The remote's URL is not
changed by this command.
Rename a [DVC remote](/doc/user-guide/data-management/remote-storage).

<admon type="info">

The remote storage URL is not changed by this command.

</admon>

## Synopsis

5 changes: 3 additions & 2 deletions content/docs/sidebar.json
@@ -125,9 +125,10 @@
"source": false,
"children": [
"large-dataset-optimization",
"remote-storage",
"cloud-versioning",
"importing-external-data",
"managing-external-data",
"cloud-versioning"
"managing-external-data"
Review comment on lines 126 to +131 (Contributor Author):

Reordered this a little bit.

]
},
{
113 changes: 113 additions & 0 deletions content/docs/user-guide/data-management/remote-storage.md
@@ -0,0 +1,113 @@
# Remote Storage

_DVC remotes_ provide optional/additional storage to back up and share your data
and ML models. For example, you can download data artifacts created by colleagues
without spending time and resources to regenerate them locally. See `dvc push`
and `dvc pull`.

<admon type="info">

DVC remotes are similar to [Git remotes], but for <abbr>cached</abbr> data.

[git remotes]: https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes

</admon>

This is somewhat like GitHub or GitLab providing hosting for source code
repositories. However, DVC does not provide or recommend a specific storage
service. Instead, it adopts a bring-your-own-platform approach, supporting a
wide variety of [storage types](#supported-storage-types).

The main uses of remote storage are:

- Synchronize DVC-tracked data (previously <abbr>cached</abbr>).
- Centralize or distribute large file storage for sharing and collaboration.
- Back up different versions of your data and models.
- Save space in your working environment (by deleting pushed files/directories).

## Configuration

You can set up one or more remote storage locations, mainly with the
`dvc remote add` and `dvc remote modify` commands. These read and write to the
[`remote`] section of the project's configuration file (`.dvc/config`), which
you could edit manually as well.

Typically, you'll first register a DVC remote by adding its name and URL (or
file path), e.g.:

```cli
$ dvc remote add mybucket s3://my-bucket
```

Then, you'll usually need or want to configure the remote's authentication
credentials or other properties. For example:

```cli
$ dvc remote modify --local \
mybucket credentialpath ~/.aws/alt

$ dvc remote modify mybucket connect_timeout 300
```

<admon type="warn">

Make sure to use the `--local` flag when writing secrets to configuration. This
creates a second config file in `.dvc/config.local` that is ignored by Git, so
your secrets do not end up in the repository. See `dvc config` for more info.

This also means each copy of the <abbr>DVC repository</abbr> may have to
re-configure remote storage authentication.

</admon>

<details>

### Click to see the resulting config files.

```ini
# .dvc/config
['remote "mybucket"']
url = s3://my-bucket
connect_timeout = 300
```

```ini
# .dvc/config.local
['remote "mybucket"']
credentialpath = ~/.aws/alt
```

```ini
# .gitignore
.dvc/config.local
```

</details>
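To make the split between the shared and the local config file concrete, the sketch below merges the two files shown above and prints the settings that end up applying to `mybucket`. It is an illustration only, assuming that local values are layered on top of the project config; DVC has its own loader, and the helper name `effective_remote` is hypothetical.

```python
import configparser

# Contents of .dvc/config (committed to Git).
PROJECT_CONFIG = """\
['remote "mybucket"']
url = s3://my-bucket
connect_timeout = 300
"""

# Contents of .dvc/config.local (ignored by Git).
LOCAL_CONFIG = """\
['remote "mybucket"']
credentialpath = ~/.aws/alt
"""


def effective_remote(name, *config_texts):
    """Merge config texts in order; later texts add to or override earlier ones."""
    parser = configparser.ConfigParser(interpolation=None)
    for text in config_texts:
        parser.read_string(text)
    return dict(parser[f"'remote \"{name}\"'"])


settings = effective_remote("mybucket", PROJECT_CONFIG, LOCAL_CONFIG)
print(settings)
```

A fresh clone only receives the first file, which is why the credentials have to be configured again in each copy of the repository.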

Finally, you can `git commit` the changes to share the general configuration of
your remote (`.dvc/config`) via the Git repo.

[`remote`]: /doc/command-reference/config#remote

## Supported storage types

> See more [details](/doc/command-reference/remote/add#supported-storage-types).

### Cloud providers

- Amazon S3 (AWS)
- S3-compatible e.g. MinIO
- Microsoft Azure Blob Storage
- Google Drive
- Google Cloud Storage (GCP)
- Aliyun OSS

### Self-hosted / On-premises

- SSH servers (like `scp`)
- HDFS & WebHDFS
- HTTP
- WebDAV
- Local directories, mounted drives (like `rsync`)
> Includes network resources e.g. network-attached storage (NAS) or other
> external devices
9 changes: 5 additions & 4 deletions content/docs/user-guide/project-structure/internal-files.md
@@ -74,10 +74,8 @@ operation.

## Structure of the cache directory

The DVC cache is a
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
(by default in `.dvc/cache`), which adds a layer of indirection between code and
data.
The DVC cache is a [content-addressable storage] (by default in `.dvc/cache`),
which adds a layer of indirection between code and data.

There are two ways in which the data is <abbr>cached</abbr>, depending on
whether it's a single file, or a directory (which may contain multiple files).
@@ -86,6 +84,9 @@ Note files are renamed, reorganized, and directory trees are flattened in the
cache, which always has exactly one depth level with 2-character directories
(based on hashes of the data contents, as explained next).

[content-addressable storage]:
https://en.wikipedia.org/wiki/Content-addressable_storage
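The layout described above (one level of 2-character directories derived from content hashes) can be sketched as follows. This is a simplified illustration, assuming the directory name is the first two characters of the MD5 hex digest and the file name is the remainder; the function name `cache_relpath` is hypothetical, and details may differ between DVC versions.

```python
import hashlib


def cache_relpath(content: bytes) -> str:
    """Map file contents to a path relative to the cache directory."""
    digest = hashlib.md5(content).hexdigest()  # 32 hexadecimal characters
    # The first two characters name the directory, the rest name the file.
    return f"{digest[:2]}/{digest[2:]}"


print(cache_relpath(b"hello"))  # 5d/41402abc4b2a76b9719d911017c592
```

Because the path depends only on the contents, two files with identical data map to the same cache entry regardless of their names in the workspace.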

### Files

DVC calculates the file hash, a 32 characters long string (usually MD5). The