guide: remote storage (#4058)
* guide: draft structure of Data Mgmt and
some updates around the topic in existing docs

* guide: full text for draft intro to DM

* guide: hide cloud versioning info
per #4042 (review)

* guide: clarify Data Mgmt parts and
add prospective figure titles

* guide: add figure drafts to Data Mgmt

* guide: SCM->VC (Data Mgmt)

* guide: update 2 figs and add 1 more (Data Mgmt)

* guide: roll back unrelated changes
per #4042 (review)

* guide: mention clouds first (DM) and

and update fig. 1
per #4042 (review)

* guide: flatten DM index
per #4042 (review)

* guide: updates to DM/ DV
moved from #4053 (review)

* guide: add DM/ Data Versioning page

per #4042 (comment)

* guide: update outdated link

* guide: revert more unrelatedly changed files

per #4042 (review)

* guide: remove unused ref link

* guide: DM/ Remote Storage (not just Setup) and

and some links from cmd refs
and avoid term "data remote"
and some admons nearby...

* guide: remove a comment

* guide: draft for DM/ Remote Storage content

* ref: expand config.remote and link to/from Remotes guide

* ref: fix remote config file examples

* guide: complete Remote Config section and

and add Project config section to DM/ DV guide

* guide: complete list of supported storage types

* guide: clarify `remote modify` phrase in

in the Remote config section of DM/ Remote Storage

* Update content/docs/user-guide/data-management/data-versioning.md

* guide: update versioning config

per #4058 (review)

* guide: don't call remote storage "additional" here

(in the DM/ Remote Storage guide)
per #4058 (review)

Co-authored-by: Dave Berenbaum <[email protected]>

* guide: pull -> download (DM/ RS intro)

* guide: remove "optional" from Remote Storage nav & title

per #4058 (review)

* guide: splits and notes around Data Mgmt index page

rel. #4042 (comment)

* guide: Data Mgmt intro + note updates

* guide: draft of all contents +

+ remove comments

* guide: small improvements to Data Mgmt

in prep for #4042 (review)

* guide: rewrite Data Mgmt index in before/after form

per #4042 (review)

* guide: add draft figure for Data Mgmt

* guide: simplify/refocus data mgmt index

per #4042 (review)

* work around commented header bug

* guide: drop DM/ DV page

* guide: rewrite DM intro and

- hide benefits (for now)
- remove codification comment block

* guide: use DM table instead of figure for now

* guide: rewrite Data Mgmt story

* guide: add draft figures to Data Mgmt

* guide: simplify Data Mgmt story and benefits

* guide: remove unused images (DM)

* guide: update Data Mgmt figures (v1)

* guide: rewrite text of Data Mgmt index

* guide: update Data Mgmt figures

* guide: iterate on Data Mgmt again

* guide: update Data Mgmt figs

* guide: more supporting info about Data Mgmt

* guide: update figures (much more concrete) and

and matching text updates

* guide: edits to How it works (Data Mgmt)

* guide: update Data Mgmt figures

Rel. #4042 (comment)

* guide: emphasize dataset versions in UG fig 1

Rel. #4042 (comment)

* guide: update Data Mgmt figures (with notes),

expand img captions,
and update text accordingly.

* guide: more updates to text and figure styles,

esp. to the first half
and comment some stuff out (temporary)

* guide: update figures and text (Data Mgmt) ...

Using a tabs toggle for the 2nd fig.

* guide: Data Management text (section 1)

finalized for this version of figures

* guide: Data Management (main text)

finalized for this version of figures

* guide: Data Management (secondary text)

pending diagram and code sample(s)

* guide: add DVC data mgmt technical diagram &

dummy sample CLI blocks

* guide: update Data Mgmt text

* guide: update text and 2nd figure (Data Mgmt)

* guide: draft 2nd and 3rd figures

* guide: rewrite Data Mgmt/ How it works &

and Benefits/ Tradeoffs

Probably still unfinished... Missing more data versioning info? See HTML comments.

* guide: update drafts of Data Mgmt figures 2, 3

* guide: Data Mgmt improvements and

hide the benefits list for now

* guide: separate from Data Mgmt work

Rel. #4042

* guide: remove hidden Storage locations page for now

* guide: small cleanup of Remote storage page

Co-authored-by: Dave Berenbaum <[email protected]>
Co-authored-by: rogermparent <[email protected]>
3 people authored Jan 24, 2023
1 parent 3b4dda5 commit 4d4cbd4
Showing 11 changed files with 205 additions and 50 deletions.
29 changes: 25 additions & 4 deletions content/docs/command-reference/config.md
@@ -125,7 +125,7 @@ within:

### core

- `core.remote` - name of the remote storage to use by default.
- `core.remote` - name of the [remote storage](#remote) to use by default.

- `core.interactive` - whether to always ask for confirmation before reproducing
each [stage](/doc/command-reference/run) in `dvc repro`. (Normally, this
@@ -160,9 +160,30 @@ within:

### remote

All `remote` sections contain a `url` value and can also specify `user`, `port`,
`keyfile`, `timeout`, `ask_password`, and other cloud-specific key/value pairs.
See `dvc remote add` and `dvc remote modify` for more information.
Unlike most other sections, configuration files may have more than one
`'remote'`. All of them require a unique `"name"` and a `url` value. They can
also specify `jobs`, `verify`, and many platform-specific key/value pairs like
`port` and `password`.

<admon icon="book">

See [Remote Storage Configuration] for more details.

[remote storage configuration]:
/doc/user-guide/data-management/remote-storage#configuration

</admon>

For example, the following config file defines a `temp` remote in the local file
system (located in `/tmp/dvcstore`), and marked as default (via [`core`](#core)
section):

```ini
['remote "temp"']
url = /tmp/dvcstore
[core]
remote = temp
```
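
A minimal sketch of how a file like this could be read, using Python's standard `configparser` as a stand-in. This is an assumption for illustration only: DVC's own parser may behave differently, and `load_remotes` is a hypothetical helper, not part of DVC.

```python
import configparser
import re

# Same content as the example config file above.
CONFIG_TEXT = """
['remote "temp"']
url = /tmp/dvcstore
[core]
remote = temp
"""


def load_remotes(text):
    """Return ({remote_name: settings}, default_remote_name_or_None)."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)

    remotes = {}
    for section in parser.sections():
        # Remote sections are named like: 'remote "temp"' (quotes included).
        match = re.fullmatch(r"'remote \"(.+)\"'", section)
        if match:
            remotes[match.group(1)] = dict(parser[section])

    # The default remote, if any, is named in the [core] section.
    default = parser.get("core", "remote", fallback=None)
    return remotes, default


remotes, default = load_remotes(CONFIG_TEXT)
print(default, remotes[default]["url"])
```

Because each remote has its own named section, a file can hold several remotes while `core.remote` picks at most one of them as the default.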

### cache

36 changes: 20 additions & 16 deletions content/docs/command-reference/remote/add.md
@@ -1,9 +1,13 @@
# remote add

Add a new [data remote](/doc/command-reference/remote).
Register a new [DVC remote](/doc/user-guide/data-management/remote-storage).

> Depending on your storage type, you may also need `dvc remote modify` to
> provide credentials and/or configure other remote parameters.
<admon type="tip">

Depending on your storage type, you may also need `dvc remote modify` to provide
credentials and/or configure other remote parameters.

</admon>

## Synopsis

@@ -26,9 +30,9 @@ for the first remote):

```ini
['remote "myremote"']
url = /tmp/dvcstore
url = /tmp/dvcstore
[core]
remote = myremote
remote = myremote
```

> 💡 Default remotes are expected by commands that accept a `-r`/`--remote`
@@ -379,10 +383,10 @@ Using an absolute path (recommended):
```cli
$ dvc remote add -d myremote /tmp/dvcstore
$ cat .dvc/config
...
['remote "myremote"']
url = /tmp/dvcstore
...
...
['remote "myremote"']
url = /tmp/dvcstore
...
```

> Note that the absolute path `/tmp/dvcstore` is saved as is.
@@ -393,10 +397,10 @@ directory, but saved **relative to the config file location**:
```cli
$ dvc remote add -d myremote ../dvcstore
$ cat .dvc/config
...
['remote "myremote"']
url = ../../dvcstore
...
...
['remote "myremote"']
url = ../../dvcstore
...
```
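
The path rewriting seen here can be reproduced with a short sketch. It assumes the command is run from the project root and that the saved value is simply the target path expressed relative to the `.dvc/` directory. `remote_url_as_saved` and the `/home/user/project` location are hypothetical, chosen for illustration.

```python
import posixpath


def remote_url_as_saved(project_root, cwd, path_arg):
    """Return the path as it would appear in .dvc/config (assumed behavior)."""
    if posixpath.isabs(path_arg):
        # Absolute paths are kept as is.
        return path_arg
    # Resolve the argument against the directory the command was run from...
    target = posixpath.normpath(posixpath.join(cwd, path_arg))
    # ...then express it relative to the directory holding the config file.
    config_dir = posixpath.join(project_root, ".dvc")
    return posixpath.relpath(target, start=config_dir)


saved = remote_url_as_saved("/home/user/project", "/home/user/project", "../dvcstore")
print(saved)
```

The extra `../` comes from the config file living one level below the project root, inside `.dvc/`.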

> Note that `../dvcstore` has been resolved relative to the `.dvc/` dir,
@@ -423,10 +427,10 @@ The <abbr>project</abbr>'s config file (`.dvc/config`) now looks like this:

```ini
['remote "myremote"']
url = s3://mybucket/path
region = us-east-2
url = s3://mybucket/path
region = us-east-2
[core]
remote = myremote
remote = myremote
```

The list of remotes should now be:
6 changes: 2 additions & 4 deletions content/docs/command-reference/remote/default.md
@@ -1,9 +1,7 @@
# remote default

Set/unset the default [data remote](/doc/command-reference/remote).

> Depending on your remote storage type, you may also need `dvc remote modify`
> to provide credentials and/or configure other remote parameters.
Set/unset the default
[remote storage](/doc/user-guide/data-management/remote-storage).

## Synopsis

16 changes: 9 additions & 7 deletions content/docs/command-reference/remote/index.md
@@ -1,13 +1,15 @@
# remote

A set of commands to set up and manage data remotes:
A set of commands to set up and manage [remote storage]:
[add](/doc/command-reference/remote/add),
[default](/doc/command-reference/remote/default),
[list](/doc/command-reference/remote/list),
[modify](/doc/command-reference/remote/modify),
[remove](/doc/command-reference/remote/remove), and
[rename](/doc/command-reference/remote/rename).

[remote storage]: /doc/user-guide/data-management/remote-storage

## Synopsis

```usage
@@ -101,9 +103,9 @@ The <abbr>project</abbr>'s config file should now look like this:

```ini
['remote "myremote"']
url = /path/to/remote
url = /path/to/remote
[core]
remote = myremote
remote = myremote
```

## Example: List all remotes in the project
@@ -128,12 +130,12 @@ The project's config file should now look something like this:

```ini
['remote "myremote"']
url = /path/to/remote
url = /path/to/remote
[core]
remote = myremote
remote = myremote
['remote "newremote"']
url = s3://mybucket/path
endpointurl = https://object-storage.example.com
url = s3://mybucket/path
endpointurl = https://object-storage.example.com
```

## Example: Change the name of a remote
3 changes: 2 additions & 1 deletion content/docs/command-reference/remote/list.md
@@ -1,6 +1,7 @@
# remote list

List all available [data remotes](/doc/command-reference/remote).
List all available
[DVC remotes](/doc/user-guide/data-management/remote-storage).

## Synopsis

18 changes: 11 additions & 7 deletions content/docs/command-reference/remote/modify.md
@@ -1,10 +1,14 @@
# remote modify

Modify the configuration of a [data remote](/doc/command-reference/remote).
Configure a [DVC remote](/doc/user-guide/data-management/remote-storage).

> This command is commonly needed after `dvc remote add` or
> [default](/doc/command-reference/remote/default) to set up credentials or
> other customizations to each remote storage type.
<admon type="tip">

This command is commonly needed after `dvc remote add` or `dvc remote default`
to set up credentials or for other customizations specific to the
[storage type](#available-parameters-per-storage-type).

</admon>

## Synopsis

@@ -1272,10 +1276,10 @@ Now the project config file should look like this:

```ini
['remote "myremote"']
url = s3://mybucket/path
profile = myuser
url = s3://mybucket/path
profile = myuser
[core]
remote = myremote
remote = myremote
```

## Example: Some Azure authentication methods
11 changes: 8 additions & 3 deletions content/docs/command-reference/remote/remove.md
@@ -1,8 +1,13 @@
# remote remove

Remove a [data remote](/doc/command-reference/remote). This command affects DVC
configuration files only, it does not physically remove data files stored
remotely.
Remove a [DVC remote](/doc/user-guide/data-management/remote-storage).

<admon type="info">

This command affects DVC configuration files only. It does not physically remove
data files stored remotely. See `dvc gc --cloud` for that.

</admon>

## Synopsis

9 changes: 7 additions & 2 deletions content/docs/command-reference/remote/rename.md
@@ -1,7 +1,12 @@
# remote rename

Rename a [data remote](/doc/command-reference/remote). The remote's URL is not
changed by this command.
Rename a [DVC remote](/doc/user-guide/data-management/remote-storage).

<admon type="info">

The remote storage URL is not changed by this command.

</admon>

## Synopsis

5 changes: 3 additions & 2 deletions content/docs/sidebar.json
@@ -125,9 +125,10 @@
"source": false,
"children": [
"large-dataset-optimization",
"remote-storage",
"cloud-versioning",
"importing-external-data",
"managing-external-data",
"cloud-versioning"
"managing-external-data"
]
},
{
Expand Down
113 changes: 113 additions & 0 deletions content/docs/user-guide/data-management/remote-storage.md
@@ -0,0 +1,113 @@
# Remote Storage

_DVC remotes_ provide optional, additional storage to back up and share your
data and ML models. For example, you can download data artifacts created by
colleagues without spending time and resources to regenerate them locally. See
`dvc push` and `dvc pull`.

<admon type="info">

DVC remotes are similar to [Git remotes], but for <abbr>cached</abbr> data.

[git remotes]: https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes

</admon>

This is somewhat like GitHub or GitLab providing hosting for source code
repositories. However, DVC does not provide or recommend a specific storage
service. Instead, it adopts a bring-your-own-platform approach, supporting a
wide variety of [storage types](#supported-storage-types).

The main uses of remote storage are:

- Synchronize DVC-tracked data (previously <abbr>cached</abbr>).
- Centralize or distribute large file storage for sharing and collaboration.
- Back up different versions of your data and models.
- Save space in your working environment (by deleting pushed files/directories).

## Configuration

You can set up one or more remote storage locations, mainly with the
`dvc remote add` and `dvc remote modify` commands. These read and write to the
[`remote`] section of the project's configuration file (`.dvc/config`), which
you could edit manually as well.

Typically, you'll first register a DVC remote by adding its name and URL (or
file path), e.g.:

```cli
$ dvc remote add mybucket s3://my-bucket
```

Then, you'll usually need to configure the remote's authentication credentials
or other properties. For example:

```cli
$ dvc remote modify --local \
mybucket credentialpath ~/.aws/alt
$ dvc remote modify mybucket connect_timeout 300
```

<admon type="warn">

Make sure to use the `--local` flag when writing secrets to configuration. This
creates a second config file in `.dvc/config.local` that is ignored by Git. This
way your secrets do not get to the repository. See `dvc config` for more info.

This also means each copy of the <abbr>DVC repository</abbr> may have to
re-configure remote storage authentication.

</admon>

<details>

### Click to see the resulting config files.

```ini
# .dvc/config
['remote "mybucket"']
url = s3://my-bucket
connect_timeout = 300
```

```ini
# .dvc/config.local
['remote "mybucket"']
credentialpath = ~/.aws/alt
```

```ini
# .gitignore
.dvc/config.local
```

</details>
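
A rough sketch of how the two files combine into one effective configuration. It uses Python's standard `configparser` as a stand-in and assumes that values from `.dvc/config.local` are layered on top of `.dvc/config`. `effective_config` is a hypothetical helper, and DVC's own loading logic may differ.

```python
import configparser

# Contents of the two config files shown above.
PROJECT_CONFIG = """
['remote "mybucket"']
url = s3://my-bucket
connect_timeout = 300
"""

LOCAL_CONFIG = """
['remote "mybucket"']
credentialpath = ~/.aws/alt
"""

# Section names keep their quotes when read with configparser.
SECTION = "'remote \"mybucket\"'"


def effective_config(*layers):
    """Merge config texts in order; later layers override earlier ones."""
    parser = configparser.ConfigParser(interpolation=None)
    for text in layers:
        parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


merged = effective_config(PROJECT_CONFIG, LOCAL_CONFIG)
for key, value in merged[SECTION].items():
    print(f"{key} = {value}")
```

Only the first layer is meant to be committed, so a fresh clone would see `url` and `connect_timeout` but not `credentialpath`.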

Finally, you can `git commit` the changes to share the general configuration of
your remote (`.dvc/config`) via the Git repo.

[`remote`]: /doc/command-reference/config#remote

## Supported storage types

> See more [details](/doc/command-reference/remote/add#supported-storage-types).
### Cloud providers

- Amazon S3 (AWS)
- S3-compatible e.g. MinIO
- Microsoft Azure Blob Storage
- Google Drive
- Google Cloud Storage (GCP)
- Aliyun OSS

### Self-hosted / On-premises

- SSH servers; Like `scp`
- HDFS & WebHDFS
- HTTP
- WebDAV
- Local directories, mounted drives; Like `rsync`
> Includes network resources e.g. network-attached storage (NAS) or other
> external devices
9 changes: 5 additions & 4 deletions content/docs/user-guide/project-structure/internal-files.md
@@ -74,10 +74,8 @@ operation.

## Structure of the cache directory

The DVC cache is a
[content-addressable storage](https://en.wikipedia.org/wiki/Content-addressable_storage)
(by default in `.dvc/cache`), which adds a layer of indirection between code and
data.
The DVC cache is a [content-addressable storage] (by default in `.dvc/cache`),
which adds a layer of indirection between code and data.

There are two ways in which the data is <abbr>cached</abbr>, depending on
whether it's a single file, or a directory (which may contain multiple files).
@@ -86,6 +84,9 @@ Note files are renamed, reorganized, and directory trees are flattened in the
cache, which always has exactly one depth level with 2-character directories
(based on hashes of the data contents, as explained next).

[content-addressable storage]:
https://en.wikipedia.org/wiki/Content-addressable_storage
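
A minimal sketch of content addressing for a single file. It assumes the first two hex characters of the MD5 digest name the directory and the remaining 30 name the file; that split and the `cache_path` helper are assumptions for illustration, not DVC's actual code.

```python
import hashlib


def cache_path(data: bytes, cache_dir: str = ".dvc/cache") -> str:
    """Map file contents to a one-level, hash-based cache location."""
    digest = hashlib.md5(data).hexdigest()  # 32 hexadecimal characters
    # 2-character directory, then the rest of the digest as the file name.
    return f"{cache_dir}/{digest[:2]}/{digest[2:]}"


path = cache_path(b"hello")
print(path)
```

Since the location depends only on the contents, two files with identical data map to the same cache entry regardless of their names in the workspace.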

### Files

DVC calculates the file hash, a 32 characters long string (usually MD5). The