From 02cc14614d7d8da5acaa0d0d5be10c51db8cba2e Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?Peter=20Rowlands=20=28=EB=B3=80=EA=B8=B0=ED=98=B8=29?= Date: Fri, 30 Dec 2022 23:33:44 +0900 Subject: [PATCH] ref: document cloud versioned remotes (#4165) * ref: document cloud versioned remotes * ref: document worktree update * review updates * add dev/experimental admon * add links to/from import-url for cloud versioning * review fixes * Apply suggestions from code review Co-authored-by: Dave Berenbaum --- content/docs/command-reference/import-url.md | 16 +-- .../docs/command-reference/remote/modify.md | 75 ++++++++++++ content/docs/command-reference/update.md | 22 +++- content/docs/sidebar.json | 3 +- .../data-management/cloud-versioning.md | 113 ++++++++++++++++++ 5 files changed, 219 insertions(+), 10 deletions(-) create mode 100644 content/docs/user-guide/data-management/cloud-versioning.md diff --git a/content/docs/command-reference/import-url.md b/content/docs/command-reference/import-url.md index 6f0d632c31..3ef4759b60 100644 --- a/content/docs/command-reference/import-url.md +++ b/content/docs/command-reference/import-url.md @@ -108,13 +108,15 @@ DVC supports several types of external locations (protocols): [ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) is necessary to track if the specified URL changed. -DVC also supports capturing cloud versioning information when importing data -from certain cloud storage providers. When the `--version-aware` option is -provided or when the `url` argument includes a supported cloud versioning ID, -DVC will import the specified version of the given data. When using versioned -storage, DVC will always [pull](/doc/command-reference/pull) the versioned data -from its original source location. Versioned data will also not be -[pushed](/doc/command-reference/push) to remote storage. 
+DVC also supports capturing +[cloud versioning](/doc/user-guide/data-management/cloud-versioning) information +when importing data from certain cloud storage providers. When the +`--version-aware` option is provided or when the `url` argument includes a +supported cloud versioning ID, DVC will import the specified version of the +given data. When using versioned storage, DVC will always +[pull](/doc/command-reference/pull) the versioned data from its original source +location. Versioned data will also not be [pushed](/doc/command-reference/push) +to remote storage. | Type | Description | Versioned `url` format example | | ------- | ---------------------------- | ------------------------------------------------------ | diff --git a/content/docs/command-reference/remote/modify.md b/content/docs/command-reference/remote/modify.md index 6c4f081a32..d96251e104 100644 --- a/content/docs/command-reference/remote/modify.md +++ b/content/docs/command-reference/remote/modify.md @@ -346,6 +346,31 @@ $ dvc push For more on the supported env vars, please see the [boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables) +- `version_aware` - Use + [version-aware](/doc/user-guide/data-management/cloud-versioning#version-aware-remotes) + cloud versioning features for this S3 remote. Files stored in the remote will + retain their original filenames and directory hierarchy, and different + versions of files will be stored as separate versions of the corresponding + object in the remote. + +- `worktree` - Use + [worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes) + cloud versioning features for this S3 remote. Files stored in the remote will + retain their original filenames and directory hierarchy, and different + versions of files will be stored as separate versions of the corresponding + object in cloud storage. 
DVC will also attempt to ensure that the current + version of objects in the remote matches the latest version of files in the DVC + repository. When both `version_aware` and `worktree` are set, `worktree` takes + precedence. + + + +The `version_aware` and `worktree` options require that +[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) +be enabled on the specified S3 bucket. + + +
@@ -548,6 +573,31 @@ can propagate from an Azure configuration file (typically managed with `container_name`. The default directory where it will be searched for is `~/.azure` but this can be customized with the `AZURE_CONFIG_DIR` env var. +- `version_aware` - Use + [version-aware](/doc/user-guide/data-management/cloud-versioning#version-aware-remotes) + cloud versioning features for this Azure remote. Files stored in the remote + will retain their original filenames and directory hierarchy, and different + versions of files will be stored as separate versions of the corresponding + object in the remote. + +- `worktree` - Use + [worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes) + cloud versioning features for this Azure remote. Files stored in the remote + will retain their original filenames and directory hierarchy, and different + versions of files will be stored as separate versions of the corresponding + object in cloud storage. DVC will also attempt to ensure that the current + version of objects in the remote matches the latest version of files in the DVC + repository. When both `version_aware` and `worktree` are set, `worktree` takes + precedence. + + + +The `version_aware` and `worktree` options require that +[Blob versioning](https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview) +be enabled on the specified Azure storage account and container. + + +
@@ -722,6 +772,31 @@ set: $ export GOOGLE_APPLICATION_CREDENTIALS='.../project-XXX.json' ``` +- `version_aware` - Use + [version-aware](/doc/user-guide/data-management/cloud-versioning#version-aware-remotes) + cloud versioning features for this Google Cloud Storage remote. Files stored + in the remote will retain their original filenames and directory hierarchy, + and different versions of files will be stored as separate versions of the + corresponding object in the remote. + +- `worktree` - Use + [worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes) + cloud versioning features for this Google Cloud Storage remote. Files stored + in the remote will retain their original filenames and directory hierarchy, + and different versions of files will be stored as separate versions of the + corresponding object in cloud storage. DVC will also attempt to ensure that + the current version of objects in the remote matches the latest version of files + in the DVC repository. When both `version_aware` and `worktree` are set, + `worktree` takes precedence. + + + +The `version_aware` and `worktree` options require that +[Object versioning](https://cloud.google.com/storage/docs/object-versioning) be +enabled on the specified bucket. + + +
diff --git a/content/docs/command-reference/update.md b/content/docs/command-reference/update.md index 2ca0803c54..0d1a36f610 100644 --- a/content/docs/command-reference/update.md +++ b/content/docs/command-reference/update.md @@ -2,7 +2,9 @@ Update files or directories imported from external DVC repositories or [URLs](/doc/command-reference/import-url#description), and the corresponding -import `.dvc` files. +import `.dvc` files, or update files or directories from a +[worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes) +remote. ## Synopsis @@ -38,6 +40,22 @@ to update an imported artifact to a different revision. $ dvc update --rev master ``` +### Worktree update + +When using a +[worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes) +remote, `dvc update` will update the specified target to match the current +version of the corresponding file or directory from the remote storage. If the +current version of the specified target is a deleted file or an empty directory, +`dvc update` will fail. + + + +Note that the `--rev`, `--no-download` and `--to-remote` flags are not +compatible when updating from a worktree remote. + + + ## Options - `--rev ` - commit hash, branch or tag name, etc. (any @@ -51,7 +69,7 @@ $ dvc update --rev master For stages created with `dvc import-url` and a [cloud-versioned URL](/doc/command-reference/import-url#--version-aware), `--rev` can be used to specify an object version ID to use. By default, the - import will be updated to the latest version from cloud storage. + import will be updated to the current version from cloud storage. - `-R`, `--recursive` - determines the files to update by searching each target directory and its subdirectories for import `.dvc` files to inspect. 
If there diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 39751733d5..b3b9d91495 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -126,7 +126,8 @@ "children": [ "large-dataset-optimization", "importing-external-data", - "managing-external-data" + "managing-external-data", + "cloud-versioning" ] }, { diff --git a/content/docs/user-guide/data-management/cloud-versioning.md b/content/docs/user-guide/data-management/cloud-versioning.md new file mode 100644 index 0000000000..51f8f32b57 --- /dev/null +++ b/content/docs/user-guide/data-management/cloud-versioning.md @@ -0,0 +1,113 @@ +# Cloud Versioning + + + +Cloud versioning features are currently under active development and should be +considered experimental. These features are subject to frequent change, and the +documentation may not always reflect changes available in the latest DVC +release. + + + +When cloud versioning is enabled, DVC will store files in the remote according +to their original directory location and filenames. Different versions of a file +will then be stored as separate versions of the corresponding object in cloud +storage. This is useful for cases where users prefer to retain their original +filenames and directory hierarchy in remote storage (instead of using DVC's +usual +[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) +format). + + + +Note that not all DVC functionality is supported when using cloud versioned +remotes, and using cloud versioning comes with the tradeoff of losing certain +benefits of content-addressable storage. + + + +
+ +### Expand for more details on the differences between cloud versioned and content-addressable storage + +`dvc remote` storage normally uses +[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) +to organize versioned data. Different versions of files are stored in the remote +according to the hash of their data content instead of according to their original +filenames and directory location. This allows DVC to optimize certain remote +storage lookup and data sync operations, and provides data de-duplication at the +file level. However, this comes with the drawback of losing human-readable +filenames without the use of the DVC CLI (`dvc get --show-url`) or API +(`dvc.api.get_url()`). + +When using cloud versioning, DVC does not provide de-duplication, and certain +remote storage performance optimizations will be unavailable. + +
+ +## Supported storage providers + +Cloud versioning features are only available for certain storage providers. +Currently, they are supported on the following `dvc remote` types: + +- Amazon S3 (requires + [S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) + enabled buckets) +- Microsoft Azure Blob Storage (requires + [Blob versioning](https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview) + enabled storage accounts and containers) +- Google Cloud Storage (requires + [Object versioning](https://cloud.google.com/storage/docs/object-versioning) + enabled buckets) + +## Version-aware remotes + +When the `version_aware` option is enabled on a `dvc remote`: + +- `dvc push` will utilize cloud versioning when storing data in the remote. Data + will retain its original directory structure and filenames, and each version + of a file tracked by DVC will be stored as a new version of the corresponding + object in cloud storage. +- `dvc fetch` and `dvc pull` will download the corresponding version of an + object from cloud storage. + + + +Note that when `version_aware` is in use, DVC does not delete current versions +or restore noncurrent versions of objects in cloud storage. So the current +version of an object in cloud storage may not match the version of a file in your DVC repository. + + + +## Worktree remotes + +`worktree` remotes behave similarly to `version_aware` remotes, but with one key +difference. For `worktree` remotes, DVC will also attempt to ensure that the +current version of objects in cloud storage matches the latest versions of files +in your DVC repository. + +So in addition to the command behaviors described for `version_aware` remotes, +when the `worktree` option is enabled on a `dvc remote`: + +- `dvc push` will also ensure that the current version of objects in remote + storage matches the latest versions of files in your DVC repository. 
+ Additionally, DVC will delete the current version of any objects that are + present in cloud storage but do not exist in your current DVC repository + workspace. +- `dvc update` can be used to update a DVC-tracked file or directory in your DVC + repository to match the current version of the corresponding object(s) from + cloud storage. + + + +Note that deleting current versions in cloud storage does not delete any objects +(and does not delete any data). It only means that the current version of a +given object will show that the object does not exist. + + + +## Importing versioned data + +DVC supports importing cloud versioned data from supported storage providers. +Refer to the documentation for `dvc import-url` and `dvc update` for more +information.
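The difference between `version_aware` and `worktree` pushes described in the new guide can be sketched with a toy in-memory model. This is not DVC code and does not call any cloud SDK: `VersionedBucket`, `push` and the upload conditions are hypothetical names and simplifying assumptions, chosen only to mirror the documented behavior (a `version_aware` push never deletes or restores versions, while a `worktree` push makes the current version match the workspace and marks missing files as deleted).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class VersionedBucket:
    """Hypothetical stand-in for a bucket with object versioning enabled."""

    # key -> version history, oldest first; None stands for a delete marker
    versions: Dict[str, List[Optional[bytes]]] = field(default_factory=dict)

    def put(self, key: str, data: bytes) -> None:
        # Every upload becomes a new version; older versions are kept.
        self.versions.setdefault(key, []).append(data)

    def delete_current(self, key: str) -> None:
        # Deleting only adds a marker on top of the history; no data is lost.
        self.versions.setdefault(key, []).append(None)

    def current(self, key: str) -> Optional[bytes]:
        history = self.versions.get(key, [])
        return history[-1] if history else None


def push(bucket: VersionedBucket, workspace: Dict[str, bytes], worktree: bool = False) -> None:
    """Toy push; the upload conditions are assumptions of this sketch."""
    for key, data in workspace.items():
        if worktree:
            # worktree: the *current* version must match the workspace
            needs_upload = bucket.current(key) != data
        else:
            # version_aware: any stored version is enough, nothing is restored
            needs_upload = data not in bucket.versions.get(key, [])
        if needs_upload:
            bucket.put(key, data)
    if worktree:
        # worktree: objects that are absent from the workspace get a delete marker
        for key in list(bucket.versions):
            if key not in workspace and bucket.current(key) is not None:
                bucket.delete_current(key)
```

In this model, pushing a workspace that no longer contains a file leaves that file's earlier versions in the history, which is the sense in which deleting a current version does not delete any data.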