diff --git a/content/docs/command-reference/config.md b/content/docs/command-reference/config.md index 9658d2df6e..96124f4968 100644 --- a/content/docs/command-reference/config.md +++ b/content/docs/command-reference/config.md @@ -250,9 +250,8 @@ location. A [DVC remote](/doc/command-reference/remote) name is used (instead of the URL) because often it's necessary to configure authentication or other connection settings, and configuring a remote is the way that can be done. -- `cache.local` - name of a _local remote_ to use as external cache (refer to - `dvc remote` for more info. on "local remotes".) This will overwrite the value - in `cache.dir` (see `dvc cache dir`). +- `cache.local` - name of a [local remote] to use as external cache. This will + overwrite the value in `cache.dir` (see `dvc cache dir`). - `cache.s3` - name of an Amazon S3 remote to use as external cache. @@ -265,10 +264,17 @@ connection settings, and configuring a remote is the way that can be done. - `cache.webhdfs` - name of an HDFS remote with WebHDFS enabled to use as external cache. -> ⚠️ Avoid using the same [remote storage](/doc/command-reference/remote) used -> for `dvc push` and `dvc pull` as external cache, because it may cause file -> hash overlaps: the hash of an external output could collide with -> that of a local file with different content. + + + Avoid using the same [remote storage](/doc/command-reference/remote) used for + `dvc push` and `dvc pull` as external cache, because it may cause file hash + overlaps: the hash of an external output could collide with that + of a local file with different content. 
+ + + +[local remote]: + /doc/user-guide/data-management/remote-storage#file-systems-local-remotes ### exp diff --git a/content/docs/command-reference/remote/add.md b/content/docs/command-reference/remote/add.md index 1c9e1a9ebf..c3d5313d3f 100644 --- a/content/docs/command-reference/remote/add.md +++ b/content/docs/command-reference/remote/add.md @@ -44,7 +44,7 @@ A [default remote] is expected by `dvc push`, `dvc pull`, `dvc status`, The remote `name` (required) is used to identify the remote and must be unique. -DVC will determine the [type of remote](#supported-storage-types) based on the +DVC will determine the [storage type](#supported-storage-types) based on the provided `url` (also required), a URL or path for the location. @@ -121,60 +121,15 @@ $ pip install "dvc[s3]" ## Supported storage types -The following are the types of remote storage (protocols) supported: +The following are the supported types of storage protocols and platforms. -
- -### Amazon S3 - -> 💡 Before adding an S3 remote, be sure to -> [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). - -```cli -$ dvc remote add -d myremote s3://mybucket/path -``` - -By default, DVC authenticates using your AWS CLI -[configuration](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) -(if set). This uses the default AWS credentials file. To use a custom -authentication method, use the parameters described in `dvc remote modify`. - -Make sure you have the following permissions enabled: `s3:ListBucket`, -`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`. This enables the S3 API -methods that are performed by DVC (`list_objects_v2` or `list_objects`, -`head_object`, `upload_file`, `download_file`, `delete_object`, `copy`). - -> See `dvc remote modify` for a full list of S3 parameters. - -
- -
- -### S3-compatible storage - -For object storage that supports an S3-compatible API (e.g. -[Minio](https://min.io/), -[DigitalOcean Spaces](https://www.digitalocean.com/products/spaces/), -[IBM Cloud Object Storage](https://www.ibm.com/cloud/object-storage) etc.), -configure the `endpointurl` parameter. For example, let's set up a DigitalOcean -"space" (equivalent to a bucket in S3) called `mystore` that uses the `nyc3` -region: - -```cli -$ dvc remote add -d myremote s3://mystore/path -$ dvc remote modify myremote endpointurl \ - https://nyc3.digitaloceanspaces.com -``` +### Cloud providers -By default, DVC authenticates using your AWS CLI -[configuration](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) -(if set). This uses the default AWS credentials file. To use a custom -authentication method, use the parameters described in `dvc remote modify`. +- [Amazon S3] (AWS) and [S3-compatible] e.g. MinIO -Any other S3 parameter can also be set for S3-compatible storage. Whether -they're effective depends on each storage platform. - -
+[amazon s3]: /doc/user-guide/data-management/remote-storage/amazon-s3 +[s3-compatible]: + /doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon
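
As a quick sketch (the bucket name and endpoint below are placeholders; see the
linked pages for full setup and authentication details), an S3 or S3-compatible
remote is typically added like this:

```cli
$ dvc remote add -d myremote s3://mybucket/path
$ dvc remote modify myremote endpointurl \
                    https://nyc3.digitaloceanspaces.com
```

The second command is only needed for S3-compatible servers (it points DVC at
the non-Amazon endpoint).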
@@ -396,90 +351,3 @@ $ dvc remote add -d myremote \ > See `dvc remote modify` for a full list of WebDAV parameters.
- -
- -### local remote - -A "local remote" is a directory in the machine's file system. Not to be confused -with the `--local` option of `dvc remote` (and other config) commands! - -> While the term may seem contradictory, it doesn't have to be. The "local" part -> refers to the type of location where the storage is: another directory in the -> same file system. "Remote" is how we call storage for DVC -> projects. It's essentially a local backup for data tracked by DVC. - -Using an absolute path (recommended): - -```cli -$ dvc remote add -d myremote /tmp/dvcstore -$ cat .dvc/config -... -['remote "myremote"'] - url = /tmp/dvcstore -... -``` - -> Note that the absolute path `/tmp/dvcstore` is saved as is. - -Using a relative path. It will be resolved against the current working -directory, but saved **relative to the config file location**: - -```cli -$ dvc remote add -d myremote ../dvcstore -$ cat .dvc/config -... -['remote "myremote"'] - url = ../../dvcstore -... -``` - -> Note that `../dvcstore` has been resolved relative to the `.dvc/` dir, -> resulting in `../../dvcstore`. - -
- -## Example: Customize an S3 remote - -Add an Amazon S3 remote as the _default_ (via the `-d` option), and modify its -region. - -> 💡 Before adding an S3 remote, be sure to -> [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). - -```cli -$ dvc remote add -d myremote s3://mybucket/path -Setting 'myremote' as a default remote. - -$ dvc remote modify myremote region us-east-2 -``` - -The project's config file (`.dvc/config`) now looks like this: - -```ini -['remote "myremote"'] - url = s3://mybucket/path - region = us-east-2 -[core] - remote = myremote -``` - -The list of remotes should now be: - -```cli -$ dvc remote list -myremote s3://mybucket/path -``` - -You can overwrite existing remotes using `-f` with `dvc remote add`: - -```cli -$ dvc remote add -f myremote s3://mybucket/another-path -``` - -List remotes again to view the updated remote: - -```cli -$ dvc remote list -myremote s3://mybucket/another-path -``` diff --git a/content/docs/command-reference/remote/index.md b/content/docs/command-reference/remote/index.md index 61ed9230b0..22304fb509 100644 --- a/content/docs/command-reference/remote/index.md +++ b/content/docs/command-reference/remote/index.md @@ -58,16 +58,12 @@ default). Alternatively, the config files can be edited manually. ## Example: Add a default local remote -
+ -### What is a "local remote" ? +Learn more about +[local remotes](/doc/user-guide/data-management/remote-storage#file-systems-local-remotes). -While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the type of location where the storage is: another directory in the -same file system. "Remote" is what we call storage for DVC -projects. It's essentially a local backup for data tracked by DVC. - -
+
We use the `-d` (`--default`) option of `dvc remote add` for this: diff --git a/content/docs/command-reference/remote/list.md b/content/docs/command-reference/remote/list.md index 56be0a920a..3c282f4f3b 100644 --- a/content/docs/command-reference/remote/list.md +++ b/content/docs/command-reference/remote/list.md @@ -40,18 +40,7 @@ and local config files (in that order). ## Examples -For simplicity, let's add a default local remote: - -
- -### What is a "local remote" ? - -While the term may seem contradictory, it doesn't have to be. The "local" part -refers to the type of location where the storage is: another directory in the -same file system. "Remote" is how we call storage for DVC projects. -It's essentially a local backup for data tracked by DVC. - -
+For simplicity, let's add a default [local remote]: ```cli $ dvc remote add -d myremote /path/to/remote @@ -66,3 +55,6 @@ myremote /path/to/remote ``` The list will also include any previously added remotes. + +[local remote]: + /doc/user-guide/data-management/remote-storage#file-systems-local-remotes diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index ce6db3182c..d4895cba78 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -27,9 +27,9 @@ positional arguments: ## Description -The DVC remote's `name` and a valid `option` to modify are required. Remote -options or [config parameters](#available-parameters-per-storage-type) are -specific to the storage type and typically require a `value` as well. +The DVC remote's `name` and a valid `option` to modify are required. Most +[config parameters](#available-parameters-per-storage-type) are specific to the +[storage type](#supported-storage-types). This command updates a [`remote`] section in the [config file] (`.dvc/config`): @@ -93,313 +93,53 @@ $ pip install "dvc[s3]" The following config options are available for all remote types: -- `url` - the remote location can always be modified. This is how DVC determines - what type of remote it is, and thus which other config options can be modified - (see each type in the next section for more details). +- `url` - the remote location (URL or path) can always be modified. See each + type in the next section for valid URL formats. - For example, for an Amazon S3 remote (see more details in the S3 section - below): - - ```cli - $ dvc remote modify myremote url s3://mybucket/new/path - ``` + - Or a _local remote_ (a directory in the file system): + This is how DVC determines what type of remote this is, and thus which other + config options can be modified. 
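
  For example (a sketch; `myremote` and the bucket path are placeholders), you
  could repoint an existing Amazon S3 remote like this:

  ```cli
  $ dvc remote modify myremote url s3://mybucket/new/path
  ```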
- ```cli - $ dvc remote modify localremote url /home/user/dvcstore - ``` + - `jobs` - change the default number of processes for [remote storage](/doc/command-reference/remote) synchronization operations - (see the `--jobs` option of `dvc push`, `dvc pull`, `dvc get`, `dvc import`, - `dvc update`, `dvc add --to-remote`, `dvc gc -c`, etc.). Accepts positive - integers. The default is `4 \* cpu_count()`. + (see the `--jobs` option of `dvc push`, `dvc pull`, `dvc import`, + `dvc update`, `dvc gc -c`, etc.). Accepts positive integers. The default is + `4 \* cpu_count()`. ```cli $ dvc remote modify myremote jobs 8 ``` -- `verify` - upon downloading cache files (`dvc pull`, `dvc fetch`) - DVC will recalculate the file hashes, to check that their contents have not - changed. This may slow down the aforementioned commands. The calculated hash - is compared to the value saved in the corresponding DVC file. - - > Note that this option is enabled on **Google Drive** remotes by default. - - ```cli - $ dvc remote modify myremote verify true - ``` - -## Available parameters per storage type - -The following are the types of remote storage (protocols) and their config -options: - -
- -### Amazon S3 - -- `url` - remote location, in the `s3:///` format: - - ```cli - $ dvc remote modify myremote url s3://mybucket/path - ``` - -- `region` - change S3 remote region: - - ```cli - $ dvc remote modify myremote region us-east-2 - ``` - -- `read_timeout` - set the time in seconds till a timeout exception is thrown - when attempting to read from a connection (60 by default). Let's set it to 5 - minutes for example: - - ```cli - $ dvc remote modify myremote read_timeout 300 - ``` - -- `connect_timeout` - set the time in seconds till a timeout exception is thrown - when attempting to make a connection (60 by default). Let's set it to 5 - minutes for example: - - ```cli - $ dvc remote modify myremote connect_timeout 300 - ``` - - - -The `version_aware` option requires that -[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) -be enabled on the specified S3 bucket. - - - -- `version_aware` - Use - [version-aware](/docs/user-guide/data-management/cloud-versioning#version-aware-remotes) - cloud versioning features for this S3 remote. Files stored in the remote will - retain their original filenames and directory hierarchy, and different - versions of files will be stored as separate versions of the corresponding - object in the remote. - -**Authentication** - -By default, DVC authenticates using your AWS CLI -[configuration](https://docs.aws.amazon.com/cli/latest/userguide/cli-chap-configure.html) -(if set). This uses the default AWS credentials file. Use the following -parameters to customize the authentication method: - -> If any values given to the parameters below contain sensitive user info, add -> them with the `--local` option, so they're written to a Git-ignored config -> file. 
- -- `profile` - credentials profile name to access S3: - - ```cli - $ dvc remote modify --local myremote profile myprofile - ``` - -- `credentialpath` - S3 credentials file path: - - ```cli - $ dvc remote modify --local myremote credentialpath /path/to/creds - ``` - -- `configpath` - path to the - [AWS CLI config file](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html). - The default AWS CLI config file path (e.g. `~/.aws/config`) is used if this - parameter isn't set. - - ```cli - $ dvc remote modify --local myremote configpath /path/to/config - ``` - - > Note that only the S3-specific - > [configuration values](https://docs.aws.amazon.com/cli/latest/topic/s3-config.html#configuration-values) - > are used. - -- `endpointurl` - endpoint URL to access S3: - - ```cli - $ dvc remote modify myremote endpointurl https://myendpoint.com - ``` - -- `access_key_id` - AWS Access Key ID. May be used (along with - `secret_access_key`) instead of `credentialpath`: - - ```cli - $ dvc remote modify --local myremote access_key_id 'mykey' - ``` - -- `secret_access_key` - AWS Secret Access Key. May be used (along with - `access_key_id`) instead of `credentialpath`: - - ```cli - $ dvc remote modify --local myremote \ - secret_access_key 'mysecret' - ``` - -- `session_token` - AWS - [MFA](https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa.html) - session token. May be used (along with `access_key_id` and - `secret_access_key`) instead of `credentialpath` when MFA is required: +- `verify` - set to `true` for `dvc pull` and `dvc fetch` to recalculate file + hashes to check whether their contents have changed (compared to the values + saved in the corresponding metafile). This may slow down the + operations. - ```cli - $ dvc remote modify --local myremote session_token my-session-token - ``` - -- `use_ssl` - whether or not to use SSL. By default, SSL is used. 
- - ```cli - $ dvc remote modify myremote use_ssl false - ``` - -- `ssl_verify` - whether or not to verify SSL certificates, or a path to a - custom CA certificates bundle to do so (implies `true`). The certs in - [AWS CLI config](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html#cli-configure-files-settings) - (if any) are used by default. - - ```cli - $ dvc remote modify myremote ssl_verify false - # or - $ dvc remote modify myremote ssl_verify path/to/ca_bundle.pem - ``` - -**Operational details** - -Make sure you have the following permissions enabled: `s3:ListBucket`, -`s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`. This enables the S3 API -methods that are performed by DVC (`list_objects_v2` or `list_objects`, -`head_object`, `upload_file`, `download_file`, `delete_object`, `copy`). - -- `listobjects` - whether or not to use `list_objects`. By default, - `list_objects_v2` is used. Useful for ceph and other S3 emulators. - - ```cli - $ dvc remote modify myremote listobjects true - ``` - -- `sse` - server-side encryption algorithm to use: `AES256` or `aws:kms`. By - default, no encryption is used. - - ```cli - $ dvc remote modify myremote sse AES256 - ``` - -- `sse_kms_key_id` - identifier of the key to encrypt data uploaded when using - [SSE-KMS](https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html) - (see `sse`). This parameter will be passed directly to AWS S3, so DVC supports - any value that S3 supports, including both key IDs and aliases. - - ```cli - $ dvc remote modify --local myremote sse_kms_key_id 'key-alias' - ``` - -- `sse_customer_key` - key to encrypt data uploaded when using customer-provided - encryption keys - ([SSE-C](https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html)). - instead of `sse`. The value should be a base64-encoded 256 bit key. 
- - ```cli - $ dvc remote modify --local myremote sse_customer_key 'mysecret' - ``` - -- `sse_customer_algorithm` - server-side encryption algorithm to use with - `sse_customer_key`. This parameter will be passed directly to AWS S3, so DVC - supports any value that S3 supports. `AES256` by default. - - ```cli - $ dvc remote modify myremote sse_customer_algorithm 'AES256' - ``` - -- `acl` - set object level access control list (ACL) such as `private`, - `public-read`, etc. By default, no ACL is specified. - - ```cli - $ dvc remote modify myremote acl bucket-owner-full-control - ``` - -- `grant_read`\* - grants `READ` permissions at object level access control list - for specific grantees\*\*. Grantee can read object and its metadata. - - ```cli - $ dvc remote modify myremote grant_read \ - id=aws-canonical-user-id,id=another-aws-canonical-user-id - ``` - -- `grant_read_acp`\* - grants `READ_ACP` permissions at object level access - control list for specific grantees\*\*. Grantee can read the object's ACP. - - ```cli - $ dvc remote modify myremote grant_read_acp \ - id=aws-canonical-user-id,id=another-aws-canonical-user-id - ``` + -- `grant_write_acp`\* - grants `WRITE_ACP` permissions at object level access - control list for specific grantees\*\*. Grantee can modify the object's ACP. + Note that this option is enabled on **Google Drive** remotes by default. - ```cli - $ dvc remote modify myremote grant_write_acp \ - id=aws-canonical-user-id,id=another-aws-canonical-user-id - ``` - -- `grant_full_control`\* - grants `FULL_CONTROL` permissions at object level - access control list for specific grantees\*\*. Equivalent of grant_read + - grant_read_acp + grant_write_acp + ```cli - $ dvc remote modify myremote grant_full_control \ - id=aws-canonical-user-id,id=another-aws-canonical-user-id + $ dvc remote modify myremote verify true ``` - > \* `grant_read`, `grant_read_acp`, `grant_write_acp` and - > `grant_full_control` params are mutually exclusive with `acl`. 
- > - > \*\* default ACL grantees are overwritten. Grantees are AWS accounts - > identifiable by `id` (AWS Canonical User ID), `emailAddress` or `uri` - > (predefined group). - > - > **References** - > - > - [ACL Overview - Permissions](https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html#permissions) - > - [Put Object ACL](https://docs.aws.amazon.com/AmazonS3/latest/API/API_PutObjectAcl.html) - -Note that S3 remotes can also be configured via environment variables (instead -of `dvc remote modify`). These are tried if none of the params above are set. - -Authentication example: - -```cli -$ dvc remote add -d myremote s3://mybucket/path -$ export AWS_ACCESS_KEY_ID='mykey' -$ export AWS_SECRET_ACCESS_KEY='mysecret' -$ dvc push -``` - -For more on the supported env vars, please see the -[boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables) - -
- -
+## Supported storage types -### S3-compatible storage +Each type of storage has different config options you can set. See all the +details in the pages linked below. -- `endpointurl` - URL to connect to the S3-compatible storage server or service - (e.g. [Minio](https://min.io/), - [DigitalOcean Spaces](https://www.digitalocean.com/products/spaces/), - [IBM Cloud Object Storage](https://www.ibm.com/cloud/object-storage) etc.): +### Cloud providers - ```cli - $ dvc remote modify myremote \ - endpointurl https://storage.example.com - ``` +- [Amazon S3] (AWS) and [S3-compatible] e.g. MinIO -Any other S3 parameter (see previous section) can also be set for S3-compatible -storage. Whether they're effective depends on each storage platform. - -
+[amazon s3]: /doc/user-guide/data-management/remote-storage/amazon-s3 +[s3-compatible]: + /doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon
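
For instance, after pointing a remote at an S3-compatible server (names and
URLs below are placeholders), the project config file (`.dvc/config`) ends up
looking something like this:

```ini
['remote "myremote"']
    url = s3://mystore/path
    endpointurl = https://nyc3.digitaloceanspaces.com
[core]
    remote = myremote
```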
@@ -1268,34 +1008,6 @@ by HDFS. Read more about by expanding the WebHDFS section in
-## Example: Customize an S3 remote - -Let's first set up a _default_ S3 remote. - -> 💡 Before adding an S3 remote, be sure to -> [Create a Bucket](https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html). - -```cli -$ dvc remote add -d myremote s3://mybucket/path -Setting 'myremote' as a default remote. -``` - -Modify its access profile: - -```cli -$ dvc remote modify myremote profile myprofile -``` - -Now the project config file should look like this: - -```ini -['remote "myremote"'] - url = s3://mybucket/path - profile = myuser -[core] - remote = myremote -``` - ## Example: Some Azure authentication methods Using a default identity (e.g. credentials set by `az cli`): diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 3659efd1cb..e544fbf7d5 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -124,7 +124,11 @@ "source": false, "children": [ "large-dataset-optimization", - "remote-storage", + { + "slug": "remote-storage", + "source": "remote-storage/index.md", + "children": ["amazon-s3"] + }, "cloud-versioning", "importing-external-data", "managing-external-data" diff --git a/content/docs/start/data-management/data-versioning.md b/content/docs/start/data-management/data-versioning.md index 687b97496e..75e9fb2bcd 100644 --- a/content/docs/start/data-management/data-versioning.md +++ b/content/docs/start/data-management/data-versioning.md @@ -95,7 +95,7 @@ outs: ## Storing and sharing You can upload DVC-tracked data or model files with `dvc push`, so they're -safely stored [remotely]. This also means they can be retrieved on other +safely stored [remotely]. This also means that they can be retrieved on other environments later with `dvc pull`. First, we need to set up a remote storage location: @@ -145,10 +145,12 @@ $ git commit .dvc\config -m "Configure local remote" -> While the term "local remote" may seem contradictory, it doesn't have to be. 
-> The "local" part refers to the type of location: another directory in the file -> system. "Remote" is what we call storage for DVC projects. It's -> essentially a local data backup. + + +Learn more about +[local remotes](/doc/user-guide/data-management/remote-storage#file-systems-local-remotes). + + @@ -214,7 +216,7 @@ $ dvc pull -See `dvc remote` for more information on remote storage. +See [Remote Storage] for more information on remote storage. diff --git a/content/docs/user-guide/data-management/remote-storage.md b/content/docs/user-guide/data-management/remote-storage.md deleted file mode 100644 index a4457ff5e6..0000000000 --- a/content/docs/user-guide/data-management/remote-storage.md +++ /dev/null @@ -1,113 +0,0 @@ -# Remote Storage - -_DVC remotes_ provide optional/additional storage to backup and share your data -and ML model. For example, you can download data artifacts created by colleagues -without spending time and resources to regenerate them locally. See `dvc push` -and `dvc pull`. - - - -DVC remotes are similar to [Git remotes], but for cached data. - -[git remotes]: https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes - - - -This is somewhat like GitHub or GitLab providing hosting for source code -repositories. However, DVC does not provide or recommend a specific storage -service. Instead, it adopts a bring-your-own-platform approach, supporting a -wide variety of [storage types](#supported-storage-types). - -The main uses of remote storage are: - -- Synchronize DVC-tracked data (previously cached). -- Centralize or distribute large file storage for sharing and collaboration. -- Back up different versions of your data and models. -- Save space in your working environment (by deleting pushed files/directories). - -## Configuration - -You can set up one or more remote storage locations, mainly with the -`dvc remote add` and `dvc remote modify` commands. 
These read and write to the -[`remote`] section of the project's configuration file (`.dvc/config`), which -you could edit manually as well. - -Typically, you'll first register a DVC remote by adding its name and URL (or -file path), e.g.: - -```cli -$ dvc remote add mybucket s3://my-bucket -``` - -Then, you'll usually need or want to configure the remote's authentication -credentials or other properties, etc. For example: - -```cli -$ dvc remote modify --local \ - mybucket credentialpath ~/.aws/alt - -$ dvc remote modify mybucket connect_timeout 300 -``` - - - -Make sure to use the `--local` flag when writing secrets to configuration. This -creates a second config file in `.dvc/config.local` that is ignored by Git. This -way your secrets do not get to the repository. See `dvc config` for more info. - -This also means each copy of the DVC repository may have to -re-configure remote storage authentication. - - - -
- -### Click to see the resulting config files. - -```ini -# .dvc/config -['remote "mybucket"'] - url = s3://my-bucket - connect_timeout = 300 -``` - -```ini -# .dvc/config.local -['remote "mybucket"'] - credentialpath = ~/.aws/alt -``` - -```ini -# .gitignore -.dvc/config.local -``` - -
- -Finally, you can `git commit` the changes to share the general configuration of -your remote (`.dvc/config`) via the Git repo. - -[`remote`]: /doc/command-reference/config#remote - -## Supported storage types - -> See more [details](/doc/command-reference/remote/add#supported-storage-types). - -### Cloud providers - -- Amazon S3 (AWS) -- S3-compatible e.g. MinIO -- Microsoft Azure Blob Storage -- Google Drive -- Google Cloud Storage (GCP) -- Aliyun OSS - -### Self-hosted / On-premises - -- SSH servers; Like `scp` -- HDFS & WebHDFS -- HTTP -- WebDAV -- Local directories, mounted drives; Like `rsync` - > Includes network resources e.g. network-attached storage (NAS) or other - > external devices diff --git a/content/docs/user-guide/data-management/remote-storage/amazon-s3.md b/content/docs/user-guide/data-management/remote-storage/amazon-s3.md new file mode 100644 index 0000000000..395982db00 --- /dev/null +++ b/content/docs/user-guide/data-management/remote-storage/amazon-s3.md @@ -0,0 +1,236 @@ +# Amazon S3 and Compatible Servers + + + +Start with `dvc remote add` to define the remote. Set a name and valid [S3] URL: + +```cli +$ dvc remote add -d myremote s3:/// +``` + +- `` - name of an [existing S3 bucket] +- `` - optional path to a [folder key] in your bucket + +Upon `dvc push` (or when needed) DVC will try to authenticate using your [AWS +CLI config]. This reads the default AWS credentials file (if available) or +[env vars](#environment-variables). + +[aws cli config]: + https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html + + + +The AWS user needs the following permissions: `s3:ListBucket`, `s3:GetObject`, +`s3:PutObject`, `s3:DeleteObject`. 
+ + + +[s3]: https://aws.amazon.com/s3/ +[existing s3 bucket]: + https://docs.aws.amazon.com/AmazonS3/latest/gsg/CreatingABucket.html +[folder key]: + https://docs.aws.amazon.com/AmazonS3/latest/userguide/using-folders.html + +To use [custom auth](#custom-authentication) or further configure your DVC +remote, set any supported config param with `dvc remote modify`. + +## Cloud versioning + + + +Requires [S3 Versioning] enabled on the bucket. + + + +```cli +$ dvc remote modify myremote version_aware true +``` + +`version_aware` (`true` or `false`) enables [cloud versioning] features for this +remote. This lets you explore the bucket files under the same structure you see +in your project directory locally. + +[s3 versioning]: + https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html +[cloud versioning]: /docs/user-guide/data-management/cloud-versioning + +## Custom authentication + +If you don't have the AWS CLI configured in your machine or if you want to +change the auth method for some reason. + + + +The `dvc remote modify --local` flag is needed to write sensitive user info to a +Git-ignored config file (`.dvc/config.local`) so that no secrets are leaked +through Git. See `dvc config` for more info. 
+ + + +To use custom [AWS CLI config or credential files][aws-cli-config-files], or to +specify a [profile name], use `configpath`, `credentialpath`, or `profile`: + +```cli +$ dvc remote modify --local myremote \ + configpath 'path/to/config' +# or +$ dvc remote modify --local myremote \ + credentialpath 'path/to/credentials' +# and (optional) +$ dvc remote modify --local myremote profile 'myprofile' +``` + +[aws-cli-config-files]: + https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-files.html +[profile name]: + https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-profiles.html + +Another option is to use an AWS access key ID (`access_key_id`) and secret +access key (`secret_access_key`) pair, and if required, an [MFA] session token +(`session_token`): + +```cli +$ dvc remote modify --local myremote \ + access_key_id 'mysecret' +$ dvc remote modify --local myremote \ + secret_access_key 'mysecret' +$ dvc remote modify --local myremote \ + session_token 'mysecret' +``` + +[mfa]: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_mfa.html + +## S3-compatible servers (non-Amazon) + +Set the `endpointurl` parameter with the URL to connect to the S3-compatible +service (e.g. [MinIO], [IBM Cloud Object Storage], etc.). For example, let's set +up a [DigitalOcean Space] (equivalent to a bucket in S3) called `mystore` found +in the `nyc3` region: + +```cli +$ dvc remote add -d myremote s3://mystore/path +$ dvc remote modify myremote endpointurl \ + https://nyc3.digitaloceanspaces.com +``` + + + +Any other S3 parameter can also be set for S3-compatible storage. Whether +they're effective depends on each storage platform. + + + +[minio]: https://min.io/ +[digitalocean space]: https://www.digitalocean.com/products/spaces +[ibm cloud object storage]: https://www.ibm.com/cloud/object-storage + +## More configuration options + + + +See `dvc remote modify` for more command usage details. 
+ + + +- `region` - specific AWS region + + ```cli + $ dvc remote modify myremote region 'us-east-2' + ``` + +- `read_timeout` - time in seconds until a timeout exception is thrown when + attempting to read from a connection (60 by default) + +- `connect_timeout` - time in seconds until a timeout exception is thrown when + attempting to make a connection (60 by default) + +- `listobjects` (`true` or `false`) - whether to use the `list_objects()` S3 API + method instead of the default `list_objects_v2()`. Useful for Ceph and other + S3 emulators + +- `use_ssl` (`true` or `false`) - whether to use SSL. Used by default. + +- `ssl_verify` - whether to verify SSL certificates (`true` or `false`), or a + path to a custom CA certificates bundle to do so (implies `true`). Any certs + found in the [AWS CLI config file][aws-cli-config-files] (`ca_bundle`) are + used by default. + + ```cli + $ dvc remote modify myremote ssl_verify false + # or + $ dvc remote modify myremote \ + ssl_verify 'path/to/ca_bundle.pem' + ``` + +- `sse` (`AES256` or `aws:kms`) - [server-side encryption] algorithm to use. + None by default + + ```cli + $ dvc remote modify myremote sse 'AES256' + ``` + +- `sse_kms_key_id` - encryption key ID (or alias) when using [SSE-KMS] (see + `sse`) + +- `sse_customer_key` - key to encrypt data uploaded when using customer-provided + keys ([SSE-C]) instead of `sse`. The value should be a base64-encoded 256 bit + key. + +- `sse_customer_algorithm` - algorithm to use with `sse_customer_key`. `AES256` + by default + +- `acl` - object-level access control list ([ACL]) such as `private`, + `public-read`, etc. None by default. Cannot be used with the `grant_` params + below. + + ```cli + $ dvc remote modify myremote \ + acl 'bucket-owner-full-control' + ``` + +- `grant_read` - grant `READ` [permissions] at object-level ACL to specific + [grantees]. Cannot be used with `acl`. 
+
+  ```cli
+  $ dvc remote modify myremote grant_read \
+                      'id=myuser,id=anotheruser'
+  ```
+
+- `grant_read_acp` - grant `READ_ACP` permissions at object-level ACL to
+  specific grantees. Cannot be used with `acl`.
+
+- `grant_write_acp` - grant `WRITE_ACP` permissions at object-level ACL to
+  specific grantees. Cannot be used with `acl`.
+
+- `grant_full_control` - grant `FULL_CONTROL` permissions at object-level ACL to
+  specific grantees. Cannot be used with `acl`.
+
+[server-side encryption]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/serv-side-encryption.html
+[sse-kms]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingKMSEncryption.html
+[sse-c]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/ServerSideEncryptionCustomerKeys.html
+[acl]: https://docs.aws.amazon.com/AmazonS3/latest/dev/acl-overview.html
+[grantees]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html#specifying-grantee
+[permissions]:
+  https://docs.aws.amazon.com/AmazonS3/latest/userguide/acl-overview.html#permissions
+
+## Environment variables
+
+Authentication and other configuration can also be set via [`boto3` env vars].
+These are only tried if none of the config parameters above are set in the
+project. Example:
+
+```cli
+$ dvc remote add -d myremote s3://mybucket
+$ export AWS_ACCESS_KEY_ID='mysecret'
+$ export AWS_SECRET_ACCESS_KEY='mysecret'
+$ dvc push
+```
+
+[`boto3` env vars]:
+  https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables
diff --git a/content/docs/user-guide/data-management/remote-storage/index.md b/content/docs/user-guide/data-management/remote-storage/index.md
new file mode 100644
index 0000000000..61183be785
--- /dev/null
+++ b/content/docs/user-guide/data-management/remote-storage/index.md
@@ -0,0 +1,167 @@
+# Remote Storage
+
+_DVC remotes_ provide optional/additional storage to back up and share your data
+and ML models. 
For example, you can download data artifacts created by
+colleagues without spending time and resources to regenerate them locally. See
+also `dvc push` and `dvc pull`.
+
+
+
+DVC remotes are similar to [Git remotes] (e.g. GitHub or GitLab hosting), but
+for cached data instead of code.
+
+[git remotes]: https://git-scm.com/book/en/v2/Git-Basics-Working-with-Remotes
+
+
+
+DVC does not provide or recommend a specific storage service (unlike code
+repos). You can bring your own platform from a wide variety of
+[supported storage types](#supported-storage-types).
+
+Main uses of remote storage:
+
+- Synchronize large files and directories tracked by DVC.
+- Centralize or distribute data storage for sharing and collaboration.
+- Back up different versions of datasets and models (saving space locally).
+
+## Configuration
+
+You can set up one or more storage locations with `dvc remote` commands. These
+read and write to the [`remote`] section of the project's config file
+(`.dvc/config`), which you could edit manually as well.
+
+For example, let's define a remote storage location on an S3 bucket:
+
+[`remote`]: /doc/command-reference/config#remote
+
+```cli
+$ dvc remote add myremote s3://mybucket
+```
+
+
+
+DVC reads existing configuration you may have locally for major cloud providers
+(AWS, Azure, GCP), so that often all you need to do is `dvc remote add`!
+
+
+
+You may also need to customize authentication or other config with
+`dvc remote modify`:
+
+```cli
+$ dvc remote modify --local \
+    myremote credentialpath ~/.aws/alt
+$ dvc remote modify myremote connect_timeout 300
+```
+
+
+
+The `--local` flag is needed to write sensitive user info to a Git-ignored
+config file (`.dvc/config.local`) so that no secrets are leaked (see
+`dvc config`). This means that each copy of the DVC repository has
+to re-configure these values.
+
+
+
+
+### Click to see the resulting config files.
+
+```ini
+# .dvc/config
+['remote "myremote"']
+    url = s3://mybucket
+    connect_timeout = 300
+```
+
+```ini
+# .dvc/config.local
+['remote "myremote"']
+    credentialpath = ~/.aws/alt
+```
+
+```ini
+# .gitignore
+.dvc/config.local
+```
+
+
+Finally, you can `git commit` the changes to share the remote location with your
+team.
+
+## Supported storage types
+
+
+
+Guides for each storage type are in progress. For storage types that do not link
+to a specific guide, see
+[`dvc remote add`](/doc/command-reference/remote/add#supported-storage-types)
+and
+[`dvc remote modify`](/doc/command-reference/remote/modify#supported-storage-types).
+
+
+
+### Cloud providers
+
+- [Amazon S3] (AWS) and [S3-compatible] storage (e.g. MinIO)
+- Microsoft Azure Blob Storage
+- Google Drive
+- Google Cloud Storage (GCP)
+- Aliyun OSS
+
+[amazon s3]: /doc/user-guide/data-management/remote-storage/amazon-s3
+[s3-compatible]:
+  /doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon
+
+### Self-hosted / On-premises
+
+- SSH servers (like `scp`)
+- HDFS & WebHDFS
+- HTTP
+- WebDAV
+
+## File systems (local remotes)
+
+
+
+Not related to the `--local` option of `dvc remote` and `dvc config`!
+
+
+
+You can also use system directories, mounted drives, network resources (e.g.
+network-attached storage, NAS), and other external devices as storage. We call
+all of these "local remotes".
+
+
+
+Here, the word "local" refers to where the storage is found: typically another
+directory in the same file system. And "remote" is simply what we call storage
+locations for DVC projects.
+
+
+
+Using an absolute path (recommended because it's saved as-is in DVC config):
+
+```cli
+$ dvc remote add -d myremote /tmp/dvcstore
+```
+
+```ini
+# .dvc/config
+['remote "myremote"']
+    url = /tmp/dvcstore
+```
+
+When a relative path is used, it is saved **relative to the config file
+location**, but resolved against the current working directory:
+
+```cli
+$ dvc remote add -d myremote ../dvcstore
+```
+
+```ini
+# .dvc/config
+['remote "myremote"']
+    url = ../../dvcstore
+```
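+
+The same applies to mounted drives and network-attached storage: an absolute
+mount point is stored as-is in the config file (the path below is hypothetical):
+
+```ini
+# .dvc/config
+['remote "nas"']
+    url = /mnt/shares/dvcstore
+```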