guide: Azure and GCP remote pages (#4284)
* ref: start Remote Reference (config)

* Restyled by prettier (#4265)

Co-authored-by: Restyled.io <[email protected]>

* guide: move Remote Storage ref into Data Mgmt

* start: links to new Remotes guide and

and some typo fixes

* guide: finalize S3 storage page and

and remove repeated content from cmd refs (link to guide)

* guide: move "local remotes" to Remotes (index page) and

update admonitions and links

* ref: remove S3 examples

* guide: Azure remote page and start GCS

* guide: finish GCS page and

improvements to the other ones (S3, Azure)

* guide: small link fix in GDrive how-to

* guide: emphasize that remotes use regular cloud storage config

* Update content/docs/user-guide/data-management/remote-storage/amazon-s3.md

* guide: drop `worktree` cloud versioning from Remotes Config

per #4264 (comment)

* guide: move cloud versioning near the top of Remote Config

per #4264 (review)

* fix a link

* typo

* reformat all storage types (Data Mgmt/ Remote Storage)

* guide: move admon about pending Remote guides up

rel. #4284 (review)

* link all remote types (instead of admon)

per #4284 (review)

* Restyled by prettier (#4333)

Co-authored-by: Restyled.io <[email protected]>

* Update content/docs/user-guide/data-management/remote-storage/amazon-s3.md

Co-authored-by: Jorge Orpinel <[email protected]>

---------

Co-authored-by: restyled-io[bot] <32688539+restyled-io[bot]@users.noreply.github.com>
Co-authored-by: Restyled.io <[email protected]>
Co-authored-by: Dave Berenbaum <[email protected]>
4 people authored Feb 27, 2023
1 parent 57cfcc8 commit e6020ba
Showing 9 changed files with 382 additions and 344 deletions.
50 changes: 6 additions & 44 deletions content/docs/command-reference/remote/add.md
@@ -126,32 +126,16 @@ The following are the supported types of storage protocols and platforms.
### Cloud providers

- [Amazon S3] (AWS) and [S3-compatible] e.g. MinIO
- Microsoft [Azure Blob Storage]
- [Google Cloud Storage] (GCP)

[amazon s3]: /doc/user-guide/data-management/remote-storage/amazon-s3
[s3-compatible]:
/doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon

<details>

### Microsoft Azure Blob Storage

```cli
$ dvc remote add -d myremote azure://mycontainer/path
$ dvc remote modify myremote account_name 'myuser'
```

By default, DVC authenticates using an `account_name` and its [default
credential] (if any), which uses environment variables (e.g. set by the `az`
CLI) or a Microsoft application.

[default credential]:
https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential

To use a custom authentication method, use the parameters described in
`dvc remote modify`. See some
[examples](/doc/command-reference/remote/modify#example-some-azure-authentication-methods).
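For instance, a minimal sketch of connection-string authentication (the remote
name, container, and secret below are placeholders):

```cli
$ dvc remote add -d myremote azure://mycontainer/path
$ dvc remote modify --local myremote connection_string 'mysecret'
```

The `--local` option keeps the secret in a Git-ignored config file.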

</details>
[azure blob storage]:
/doc/user-guide/data-management/remote-storage/azure-blob-storage
[google cloud storage]:
/doc/user-guide/data-management/remote-storage/google-cloud-storage

<details>

@@ -189,28 +173,6 @@ modified.

<details>

### Google Cloud Storage

> 💡 Before adding a Google Cloud Storage remote, be sure to
> [create a storage bucket](https://cloud.google.com/storage/docs/creating-buckets).
```cli
$ dvc remote add -d myremote gs://mybucket/path
```

By default, DVC expects that your GCP CLI is already
[configured](https://cloud.google.com/sdk/docs/authorizing). DVC uses the
default GCP key file to access Google Cloud Storage. To override these
defaults, use the parameters described in `dvc remote modify`.

> Make sure to run `gcloud auth application-default login` unless you use
> `GOOGLE_APPLICATION_CREDENTIALS`, a service account, or another way to
> authenticate. See details [here](https://stackoverflow.com/a/53307505/298182).
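As a sketch, a typical first-time setup with default credentials might look
like this (the bucket and remote names are placeholders):

```cli
$ gcloud auth application-default login
$ dvc remote add -d myremote gs://mybucket/path
$ dvc push
```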
</details>

<details>

### Aliyun OSS

First you need to set up OSS storage on Aliyun Cloud. Then, use an S3 style URL
272 changes: 10 additions & 262 deletions content/docs/command-reference/remote/modify.md
@@ -136,210 +136,16 @@ details in the pages linked below.
### Cloud providers

- [Amazon S3] (AWS) and [S3-compatible] e.g. MinIO
- Microsoft [Azure Blob Storage]
- [Google Cloud Storage] (GCP)

[amazon s3]: /doc/user-guide/data-management/remote-storage/amazon-s3
[s3-compatible]:
/doc/user-guide/data-management/remote-storage/amazon-s3#s3-compatible-servers-non-amazon

<details>

### Microsoft Azure Blob Storage

> If any values given to the parameters below contain sensitive user info, add
> them with the `--local` option, so they're written to a Git-ignored config
> file.
- `url` (required) - remote location, in the `azure://<container>/<object>`
format:

```cli
$ dvc remote modify myremote url azure://mycontainer/path
```

Note that if the given container name isn't found in your account, DVC will
attempt to create it.

- `account_name` - storage account name. Required for every authentication
method except `connection_string` (which already includes it).

```cli
$ dvc remote modify myremote account_name 'myaccount'
```

<admon type="tip">

The `version_aware` option requires that
[Blob versioning](https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview)
be enabled on the specified Azure storage account and container.

</admon>

- `version_aware` - Use
[version-aware](/doc/user-guide/data-management/cloud-versioning#version-aware-remotes)
cloud versioning features for this Azure remote. Files stored in the remote
will retain their original filenames and directory hierarchy, and different
versions of files will be stored as separate versions of the corresponding
object in the remote.
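For example, assuming Blob versioning is already enabled on the account and
container, the option is turned on like any other setting:

```cli
$ dvc remote modify myremote version_aware true
```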

**Authentication**

By default, DVC authenticates using an `account_name` and its [default
credential] (if any), which uses environment variables (e.g. set by the `az`
CLI) or a Microsoft application.

[default credential]:
https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential

<details>

#### For Windows users

When using default authentication, you may need to enable some of these
exclusion parameters depending on your setup
([details][azure-default-cred-params]):

[azure-default-cred-params]:
https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.defaultazurecredential?view=azure-python#parameters

```cli
$ dvc remote modify --system myremote
exclude_environment_credential true
$ dvc remote modify --system myremote
exclude_visual_studio_code_credential true
$ dvc remote modify --system myremote
exclude_shared_token_cache_credential true
$ dvc remote modify --system myremote
exclude_managed_identity_credential true
```

</details>

To use a custom authentication method, you can either use this command to
configure the appropriate auth params, use environment variables, or rely on an
Azure config file (in that order). More details below.

> See some [Azure auth examples](#example-some-azure-authentication-methods).
#### Authenticate with DVC config parameters

The following parameters are listed in the order they are used by DVC when
attempting to authenticate with Azure:

1. `connection_string` is used for authentication if given (`account_name` is
ignored).
2. If `tenant_id` and `client_id`, `client_secret` are given, Active Directory
(AD) [service principal] auth is performed.
3. DVC will next try to connect with `account_key` or `sas_token` (in this
order) if either are provided.
4. If `allow_anonymous_login` is set to `True`, then DVC will try to connect
[anonymously].

[service principal]:
https://docs.microsoft.com/en-us/azure/active-directory/develop/howto-create-service-principal-portal
[anonymously]:
https://docs.microsoft.com/en-us/azure/storage/blobs/anonymous-read-access-configure

- `connection_string` - Azure Storage
[connection string](http://azure.microsoft.com/en-us/documentation/articles/storage-configure-connection-string/)
(recommended).

```cli
$ dvc remote modify --local myremote \
connection_string 'mysecret'
```

- `tenant_id` - tenant ID for AD _service principal_ authentication (requires
`client_id` and `client_secret` along with this):

```cli
$ dvc remote modify --local myremote tenant_id 'mytenant'
```

- `client_id` - client ID for _service principal_ authentication (when
`tenant_id` is set):

```cli
$ dvc remote modify --local myremote client_id 'myclient'
```

- `client_secret` - client secret for _service principal_ authentication (when
`tenant_id` is set):

```cli
$ dvc remote modify --local myremote client_secret 'mysecret'
```

- `account_key` - storage account key:

```cli
$ dvc remote modify --local myremote account_key 'mykey'
```

- `sas_token` - shared access signature token:

```cli
$ dvc remote modify --local myremote sas_token 'mysecret'
```

- `allow_anonymous_login` - whether to fall back to anonymous login if no other
auth params are given (besides `account_name`). This will only work with
containers that allow public (anonymous) access:

```cli
$ dvc remote modify myremote allow_anonymous_login true
```

#### Authenticate with environment variables

Azure remotes can also authenticate via env vars (instead of
`dvc remote modify`). These are tried if none of the params above are set.

For Azure connection string:

```cli
$ export AZURE_STORAGE_CONNECTION_STRING='mysecret'
```

For account name and key/token auth:

```cli
$ export AZURE_STORAGE_ACCOUNT='myaccount'
# and
$ export AZURE_STORAGE_KEY='mysecret'
# or
$ export AZURE_STORAGE_SAS_TOKEN='mysecret'
```

For _service principal_ auth (via certificate file):

```cli
$ export AZURE_TENANT_ID='directory-id'
$ export AZURE_CLIENT_ID='client-id'
$ export AZURE_CLIENT_CERTIFICATE_PATH='/path/to/certificate'
```

For simple username/password login:

```cli
$ export AZURE_CLIENT_ID='client-id'
$ export AZURE_USERNAME='myuser'
$ export AZURE_PASSWORD='mysecret'
```

> See the
> [EnvironmentCredential](https://docs.microsoft.com/en-us/python/api/azure-identity/azure.identity.environmentcredential)
> docs for other env vars available.
#### Authenticate with an Azure config file

As a final option (if no params or env vars are set), some of the auth
parameters can propagate from an Azure configuration file (typically managed
with [az config](https://docs.microsoft.com/en-us/cli/azure/config)):
`connection_string`, `account_name`, `account_key`, `sas_token` and
`container_name`. The default directory searched is `~/.azure`, but this can be
customized with the `AZURE_CONFIG_DIR` env var.
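For example, to point at a non-default config location (the path below is a
placeholder):

```cli
$ export AZURE_CONFIG_DIR='/path/to/custom/azure/config'
```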

</details>
[azure blob storage]:
/doc/user-guide/data-management/remote-storage/azure-blob-storage
[google cloud storage]:
/doc/user-guide/data-management/remote-storage/google-cloud-storage

<details>

@@ -470,68 +276,6 @@ more information.

<details>

### Google Cloud Storage

> If any values given to the parameters below contain sensitive user info, add
> them with the `--local` option, so they're written to a Git-ignored config
> file.
- `url` - remote location, in the `gs://<bucket>/<object>` format:

```cli
$ dvc remote modify myremote url gs://mybucket/path
```

- `projectname` - override or provide a project name to use, if a default one is
not set.

```cli
$ dvc remote modify myremote projectname myproject
```

<admon type="tip">

The `version_aware` option requires that
[Object versioning](https://cloud.google.com/storage/docs/object-versioning) be
enabled on the specified bucket.

</admon>

- `version_aware` - Use
[version-aware](/doc/user-guide/data-management/cloud-versioning#version-aware-remotes)
cloud versioning features for this Google Cloud Storage remote. Files stored
in the remote will retain their original filenames and directory hierarchy,
and different versions of files will be stored as separate versions of the
corresponding object in the remote.
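For example, one way to enable Object versioning on the bucket and then turn
on the option (bucket and remote names are placeholders; `gsutil` is assumed
to be installed and authorized):

```cli
$ gsutil versioning set on gs://mybucket
$ dvc remote modify myremote version_aware true
```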

**For service accounts:**

A service account is a Google account associated with your GCP project, and not
a specific user. Please refer to
[Using service accounts](https://cloud.google.com/iam/docs/service-accounts) for
more information.

- `credentialpath` - path to the file that contains the
[service account key](https://cloud.google.com/docs/authentication/getting-started#creating_a_service_account).
Make sure that the service account has read/write access (as needed) to the
file structure in the remote `url`.

```cli
$ dvc remote modify --local myremote \
credentialpath '/home/.../project-XXX.json'
```

Alternatively, the `GOOGLE_APPLICATION_CREDENTIALS` environment variable can be
set:

```cli
$ export GOOGLE_APPLICATION_CREDENTIALS='.../project-XXX.json'
```

</details>

<details>

### Aliyun OSS

> If any values given to the parameters below contain sensitive user info, add
@@ -1007,6 +751,8 @@ by HDFS. Read more about it by expanding the WebHDFS section in
```

</details>
<<<<<<< HEAD
=======

## Example: Some Azure authentication methods

@@ -1046,3 +792,5 @@ $ dvc remote modify --local myremote account_name 'myaccount'
$ dvc remote modify --local myremote sas_token 'mysecret'
$ dvc push
```

>>>>>>> main
6 changes: 5 additions & 1 deletion content/docs/sidebar.json
@@ -127,7 +127,11 @@
{
"slug": "remote-storage",
"source": "remote-storage/index.md",
"children": ["amazon-s3"]
"children": [
"amazon-s3",
"azure-blob-storage",
"google-cloud-storage"
]
},
"cloud-versioning",
"importing-external-data",
