Cloud versioning (initial guide and ref updates) #4165

Merged
merged 7 commits into from
Dec 30, 2022
16 changes: 9 additions & 7 deletions content/docs/command-reference/import-url.md
@@ -108,13 +108,15 @@ DVC supports several types of external locations (protocols):
[ETag](https://en.wikipedia.org/wiki/HTTP_ETag#Strong_and_weak_validation) is
necessary to track if the specified URL changed.

DVC also supports capturing cloud versioning information when importing data
from certain cloud storage providers. When the `--version-aware` option is
provided or when the `url` argument includes a supported cloud versioning ID,
DVC will import the specified version of the given data. When using versioned
storage, DVC will always [pull](/doc/command-reference/pull) the versioned data
from its original source location. Versioned data will also not be
[pushed](/doc/command-reference/push) to remote storage.
DVC also supports capturing
[cloud versioning](/doc/user-guide/data-management/cloud-versioning) information
when importing data from certain cloud storage providers. When the
`--version-aware` option is provided or when the `url` argument includes a
supported cloud versioning ID, DVC will import the specified version of the
given data. When using versioned storage, DVC will always
[pull](/doc/command-reference/pull) the versioned data from its original source
location. Versioned data will also not be [pushed](/doc/command-reference/push)
to remote storage.

| Type | Description | Versioned `url` format example |
| ------- | ---------------------------- | ------------------------------------------------------ |
75 changes: 75 additions & 0 deletions content/docs/command-reference/remote/modify.md
@@ -346,6 +346,31 @@ $ dvc push
For more on the supported env vars, please see the
[boto3 docs](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html#using-environment-variables)

- `version_aware` - Use
[version-aware](/docs/user-guide/data-management/cloud-versioning#version-aware-remotes)
cloud versioning features for this S3 remote. Files stored in the remote will
retain their original filenames and directory hierarchy, and different
versions of files will be stored as separate versions of the corresponding
object in the remote.

- `worktree` - Use
[worktree](/docs/user-guide/data-management/cloud-versioning#worktree-remotes)
cloud versioning features for this S3 remote. Files stored in the remote will
retain their original filenames and directory hierarchy, and different
versions of files will be stored as separate versions of the corresponding
object in cloud storage. DVC will also attempt to ensure that the current
versions of objects in the remote match the latest versions of files in the DVC
repository. When both `version_aware` and `worktree` are set, `worktree` takes
precedence.

<admon type="tip">

The `version_aware` and `worktree` options require that
[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)
be enabled on the specified S3 bucket.

</admon>

</details>

<details>
@@ -548,6 +573,31 @@ can propagate from an Azure configuration file (typically managed with
`container_name`. The default directory where it will be searched for is
`~/.azure` but this can be customized with the `AZURE_CONFIG_DIR` env var.

- `version_aware` - Use
[version-aware](/docs/user-guide/data-management/cloud-versioning#version-aware-remotes)
cloud versioning features for this Azure remote. Files stored in the remote
will retain their original filenames and directory hierarchy, and different
versions of files will be stored as separate versions of the corresponding
object in the remote.

- `worktree` - Use
[worktree](/docs/user-guide/data-management/cloud-versioning#worktree-remotes)
cloud versioning features for this Azure remote. Files stored in the remote
will retain their original filenames and directory hierarchy, and different
versions of files will be stored as separate versions of the corresponding
object in cloud storage. DVC will also attempt to ensure that the current
versions of objects in the remote match the latest versions of files in the DVC
repository. When both `version_aware` and `worktree` are set, `worktree` takes
precedence.

<admon type="tip">

The `version_aware` and `worktree` options require that
[Blob versioning](https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview)
be enabled on the specified Azure storage account and container.

</admon>

</details>

<details>
@@ -722,6 +772,31 @@ set:
$ export GOOGLE_APPLICATION_CREDENTIALS='.../project-XXX.json'
```

- `version_aware` - Use
[version-aware](/docs/user-guide/data-management/cloud-versioning#version-aware-remotes)
cloud versioning features for this Google Cloud Storage remote. Files stored
in the remote will retain their original filenames and directory hierarchy,
and different versions of files will be stored as separate versions of the
corresponding object in the remote.

- `worktree` - Use
[worktree](/docs/user-guide/data-management/cloud-versioning#worktree-remotes)
cloud versioning features for this Google Cloud Storage remote. Files stored
in the remote will retain their original filenames and directory hierarchy,
and different versions of files will be stored as separate versions of the
corresponding object in cloud storage. DVC will also attempt to ensure that
the current versions of objects in the remote match the latest versions of files
in the DVC repository. When both `version_aware` and `worktree` are set,
`worktree` takes precedence.

<admon type="tip">

The `version_aware` and `worktree` options require that
[Object versioning](https://cloud.google.com/storage/docs/object-versioning) be
enabled on the specified bucket.

</admon>

</details>

<details>
22 changes: 20 additions & 2 deletions content/docs/command-reference/update.md
@@ -2,7 +2,9 @@

Update files or directories imported from external <abbr>DVC repositories</abbr>
or [URLs](/doc/command-reference/import-url#description), and the corresponding
import `.dvc` files.
import `.dvc` files, or update files or directories from a
[worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes)
remote.

## Synopsis

@@ -38,6 +40,22 @@ to update an imported artifact to a different revision.
$ dvc update --rev master
```

### Worktree update

When using a
[worktree](/doc/user-guide/data-management/cloud-versioning#worktree-remotes)
remote, `dvc update` will update the specified target to match the current
version of the corresponding file or directory from the remote storage. If the
current version of the specified target is a deleted file or an empty directory,
`dvc update` will fail.

<admon type="warn">

Note that the `--rev`, `--no-download` and `--to-remote` flags are not
compatible with updating from a worktree remote.

</admon>
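
The worktree update rule above can be sketched with a toy model in Python. This is only an illustration under stated assumptions: `worktree_update`, `UpdateError`, and the dict-based bucket are hypothetical names, not part of DVC, and a `None` entry stands in for a deleted current version.

```python
from typing import Dict, List, Optional

# Toy bucket: key -> version history, oldest first.
# None stands for a delete marker (assumption of this sketch).
Bucket = Dict[str, List[Optional[bytes]]]


class UpdateError(Exception):
    """Hypothetical error type used only in this sketch."""


def worktree_update(bucket: Bucket, workspace: Dict[str, bytes], target: str) -> None:
    """Bring `target` in the workspace up to the bucket's current version."""
    history = bucket.get(target, [])
    if not history or history[-1] is None:
        # The current version is missing or deleted: nothing to update to.
        raise UpdateError(f"no current version of {target!r} in the remote")
    workspace[target] = history[-1]


bucket: Bucket = {"data.csv": [b"v1", b"v2"], "gone.csv": [b"v1", None]}
workspace = {"data.csv": b"v1", "gone.csv": b"v1"}

worktree_update(bucket, workspace, "data.csv")  # picks up the current version

failed = False
try:
    worktree_update(bucket, workspace, "gone.csv")  # current version is deleted
except UpdateError as exc:
    failed = True
    print(exc)
```

In this model the failed update leaves the workspace copy of `gone.csv` untouched.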

## Options

- `--rev <commit>` - commit hash, branch or tag name, etc. (any
@@ -51,7 +69,7 @@ $ dvc update --rev master
For stages created with `dvc import-url` and a
[cloud-versioned URL](/doc/command-reference/import-url#--version-aware),
`--rev` can be used to specify an object version ID to use. By default, the
import will be updated to the latest version from cloud storage.
import will be updated to the current version from cloud storage.

- `-R`, `--recursive` - determines the files to update by searching each target
directory and its subdirectories for import `.dvc` files to inspect. If there
3 changes: 2 additions & 1 deletion content/docs/sidebar.json
@@ -126,7 +126,8 @@
"children": [
"large-dataset-optimization",
"importing-external-data",
"managing-external-data"
"managing-external-data",
"cloud-versioning"
]
},
{
113 changes: 113 additions & 0 deletions content/docs/user-guide/data-management/cloud-versioning.md
@@ -0,0 +1,113 @@
# Cloud Versioning

<admon type="warn">

Cloud versioning features are currently under active development and should be
considered experimental. These features are subject to frequent change, and the
documentation may not always reflect changes available in the latest DVC
release.

</admon>

When cloud versioning is enabled, DVC will store files in the remote according
to their original directory location and filenames. Different versions of a file
will then be stored as separate versions of the corresponding object in cloud
storage. This is useful for cases where users prefer to retain their original
filenames and directory hierarchy in remote storage (instead of using DVC's
usual
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
format).

<admon type="warn">

Note that not all DVC functionality is supported when using cloud versioned
remotes, and using cloud versioning comes with the tradeoff of losing certain
benefits of content-addressable storage.
Comment on lines +21 to +25
Contributor

What specific functionality is not available? The hidden section below only states that "DVC does not provide de-duplication, and certain remote storage performance optimizations will be unavailable".

Contributor Author

pretty much anything that isn't explicitly documented as a cloud-versioning-enabled command/feature

But off the top of my head, some of the things that don't work are:

  • dvc import from a DVC repo that uses cloud versioned remote(s)
  • run cache (it works locally but you cannot push/pull it with cloud versioned remotes)
  • gc -c
  • status -c
  • exp sharing (anything that requires exp push/pull)
  • dvc push with revision flags (you cannot dvc push --all-branches/--all-tags/--all-commits to cloud versioned remotes, but dvc pull will work with the flags)

Contributor

It may be worth it to rephrase this and other texts then, to emphasize that A LOT of features don't work with cloud versioned data remotes.


</admon>

<details>

### Expand for more details on the differences between cloud versioned and content-addressable storage

`dvc remote` storage normally uses
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
to organize versioned data. Different versions of files are stored in the remote
according to the hash of their data content instead of according to their original
filenames and directory location. This allows DVC to optimize certain remote
storage lookup and data sync operations, and provides data de-duplication at the
file level. However, this comes with the drawback of losing human-readable
filenames without the use of the DVC CLI (`dvc get --show-url`) or API
(`dvc.api.get_url()`).

When using cloud versioning, DVC does not provide de-duplication, and certain
remote storage performance optimizations will be unavailable.

</details>
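
As a rough illustration of the trade-off described above, the following Python sketch contrasts the two layouts. It is not DVC's implementation: `content_addressed_key` and `cloud_versioned_key` are hypothetical helpers, and the key layout (an MD5 hex digest split after its first two characters) is an assumption based on the cache structure linked above.

```python
import hashlib


def content_addressed_key(data: bytes) -> str:
    """Hypothetical helper: derive a remote key from file content only."""
    digest = hashlib.md5(data).hexdigest()
    # Assumed layout: the first two hex characters form a directory.
    return f"{digest[:2]}/{digest[2:]}"


def cloud_versioned_key(path: str) -> str:
    """Hypothetical helper: a cloud-versioned remote keeps the original path."""
    return path


files = {
    "data/train.csv": b"a,b\n1,2\n",
    "data/copy_of_train.csv": b"a,b\n1,2\n",  # identical content
}

cas_keys = {content_addressed_key(data) for data in files.values()}
versioned_keys = {cloud_versioned_key(path) for path in files}

# Identical content collapses to one content-addressed object (de-duplication),
# while the cloud-versioned layout stores one object per original path.
print(len(cas_keys), len(versioned_keys))
```

The content-addressed keys are not human-readable, which is the drawback noted above; the cloud-versioned keys are readable but identical files are stored twice.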

## Supported storage providers

Cloud versioning features are only available for certain storage providers.
They are currently supported on the following `dvc remote` types:

- Amazon S3 (requires
[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)
enabled buckets)
- Microsoft Azure Blob Storage (requires
[Blob versioning](https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview)
enabled storage accounts and containers)
- Google Cloud Storage (requires
[Object versioning](https://cloud.google.com/storage/docs/object-versioning)
enabled buckets)

## Version-aware remotes

When the `version_aware` option is enabled on a `dvc remote`:

- `dvc push` will utilize cloud versioning when storing data in the remote. Data
will retain its original directory structure and filenames, and each version
of a file tracked by DVC will be stored as a new version of the corresponding
object in cloud storage.
- `dvc fetch` and `dvc pull` will download the corresponding version of an
object from cloud storage.

<admon type="warn">

Note that when `version_aware` is in use, DVC does not delete current versions
or restore noncurrent versions of objects in cloud storage. So the current
version of an object in cloud storage may not match the version of a file in your DVC repository.
Comment on lines +74 to +78
Contributor

Can we re-explain this in simpler terms? I'm not sure I get it but it seems key (and the difference vs. worktree remotes).

BTW do we have a dummy versioned bucket to play with by any chance?

Contributor Author

@pmrowla pmrowla Jan 26, 2023

I'm not really sure there is a simpler way to explain it. The assumption here is that users have at least a basic understanding of how cloud versioning works in S3/azure/GCS.

Deleting a file from a versioned bucket in cloud storage does not delete any data. It sets a flag that says the file is "deleted" (i.e. it becomes hidden by default). So if you delete a file on S3, and then ask S3 for the latest version of that file, it will say the file does not exist. But if you want to access the older versions of that file, you can still get them from S3 (as long as you ask for the specific version you want, and not just "latest/current version of file"). Likewise, by default S3 gives you the most recent (current) version of a file unless you request a specific version. So when you look at the bucket in the S3 console, you would only see the latest/current versions of what is in your bucket.

Also, it's important to note that in versioned storage you can only ever add new versions of a file (you cannot revert to an older version so that the old version becomes the current/latest version). Doing a "revert" operation actually creates a new copy of the older version of the file (and the duplicate costs against your S3 storage quota, even though they are identical).


For worktree, we tell S3 to "delete" files that are not in the DVC repo (so the latest/current version of a bucket looks like a mirror of the DVC repo). Likewise, for files that do exist in both the DVC repo and the bucket, we make sure the "latest/current" version always matches the DVC repo. If you have someone manually editing files in the bucket, or two people pushing from different branches in the DVC repo, each dvc push will overwrite the latest/current version with a new copy of files (that match the repo at the time a user does dvc push). (this will result in duplicated data)

For version_aware, we don't bother deleting anything, we only care about whether or not the version we care about was pushed at some point in time. Meaning the "latest/current" version of files in your bucket may not actually represent the latest version of your DVC repo. As long as the version of a file from our DVC repo was pushed at some point in time (and that pushed version still exists in the bucket) we won't push duplicate data.

Contributor

Yes, there should be a simpler way to explain it, or maybe it's not needed. TBH I don't even know what the point of the admonition is. What's the warning? Esp. if we're assuming the user knows how versioned storage works -- shouldn't it be obvious that DVC can't delete anything (since it can't be done) ?

whether or not the version we care about was pushed at some point in time. Meaning the "latest/current" version of files in your bucket may not actually represent the latest version of your DVC repo

I think this is closer to what we're trying to emphasize here. Maybe something like:

"Because versioned storage does not allow true deletions or directly restoring old versions [link somewhere], there may be situations where the latest data in your DVC repo does not match what you see as latest versions in the bucket."

It doesn't fully explain how those situations happen but I think that's what we're trying to make a note about.

p.s. sorry very late reply cc @dberenbaum


</admon>
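
The storage behavior behind this warning can be modeled with a small in-memory sketch in Python. `ToyVersionedBucket` and its methods are hypothetical names for illustration only; they do not correspond to any cloud provider's API or to DVC internals. The model assumes that writes only ever add versions and that a delete adds a marker that hides the object without removing data.

```python
from itertools import count
from typing import Dict, List, Optional, Tuple

# One stored version: (version_id, data). data is None for a delete marker.
Version = Tuple[str, Optional[bytes]]


class ToyVersionedBucket:
    """Toy in-memory model of a versioned bucket (hypothetical)."""

    def __init__(self) -> None:
        self._history: Dict[str, List[Version]] = {}
        self._ids = count(1)

    def _append(self, key: str, data: Optional[bytes]) -> str:
        version_id = f"v{next(self._ids)}"
        self._history.setdefault(key, []).append((version_id, data))
        return version_id

    def put(self, key: str, data: bytes) -> str:
        """Add a new current version; older versions are kept."""
        return self._append(key, data)

    def delete(self, key: str) -> str:
        """Add a delete marker as the current version; no data is removed."""
        return self._append(key, None)

    def get(self, key: str, version_id: Optional[str] = None) -> Optional[bytes]:
        """Return the current version, or the requested one if given."""
        history = self._history.get(key, [])
        if version_id is None:
            return history[-1][1] if history else None
        return next((data for vid, data in history if vid == version_id), None)


bucket = ToyVersionedBucket()
v1 = bucket.put("data.csv", b"first")
v2 = bucket.put("data.csv", b"second")
bucket.delete("data.csv")

current = bucket.get("data.csv")    # the delete marker hides the object
older = bucket.get("data.csv", v1)  # a specific version is still retrievable
v4 = bucket.put("data.csv", older)  # "reverting" stores a new copy
```

Because a `version_aware` push neither adds delete markers nor re-uploads older content, the version a plain listing shows as current can differ from the version checked out in the DVC repository.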

## Worktree remotes

`worktree` remotes behave similarly to `version_aware` remotes, but with one key
difference. For `worktree` remotes, DVC will also attempt to ensure that the
current versions of objects in cloud storage match the latest versions of files
in your DVC repository.

So in addition to the command behaviors described for `version_aware` remotes,
when the `worktree` option is enabled on a `dvc remote`:

- `dvc push` will also ensure that the current versions of objects in remote
storage match the latest versions of files in your DVC repository.
Additionally, DVC will delete the current version of any objects which were
present in cloud storage but that do not exist in your current DVC repository
workspace.
- `dvc update` can be used to update a DVC-tracked file or directory in your DVC
repository to match the current version of the corresponding object(s) from
cloud storage.

<admon type="info">

Note that deleting current versions in cloud storage does not delete any objects
(and does not delete any data). It only means that the current version of a
given object will show that the object does not exist.

</admon>
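
The difference between the two push behaviors can be sketched with a toy model in Python. Everything here is hypothetical: `push_version_aware`, `push_worktree`, and the dict-based bucket are illustrative and not DVC internals, and the sketch compares file content directly, which is a simplification.

```python
from typing import Dict, List, Optional

# Toy bucket: key -> version history, oldest first.
# None stands for a delete marker (assumption of this sketch).
Bucket = Dict[str, List[Optional[bytes]]]


def push_version_aware(bucket: Bucket, workspace: Dict[str, bytes]) -> None:
    """Upload a file only if that exact content was never pushed before."""
    for key, data in workspace.items():
        history = bucket.setdefault(key, [])
        if data not in history:
            history.append(data)


def push_worktree(bucket: Bucket, workspace: Dict[str, bytes]) -> None:
    """Make the current version of every object mirror the workspace."""
    for key, data in workspace.items():
        history = bucket.setdefault(key, [])
        if not history or history[-1] != data:
            history.append(data)  # may duplicate an older version
    for key, history in bucket.items():
        if key not in workspace and history and history[-1] is not None:
            history.append(None)  # delete marker; older versions remain


def current(bucket: Bucket) -> Dict[str, bytes]:
    """What a plain listing of the bucket would show."""
    return {k: h[-1] for k, h in bucket.items() if h and h[-1] is not None}


old = {"a.txt": b"a1", "b.txt": b"b1"}
new = {"a.txt": b"a2"}

va: Bucket = {}
push_version_aware(va, old)
push_version_aware(va, new)
push_version_aware(va, old)  # content already pushed: nothing is uploaded

wt: Bucket = {}
push_worktree(wt, old)
push_worktree(wt, new)

print(current(va))
print(current(wt))
```

In this model the `worktree` bucket's listing mirrors the last pushed workspace, while the `version_aware` bucket still lists `b.txt` and shows the newer `a.txt` even after the older workspace is pushed again.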

## Importing versioned data

DVC supports importing cloud versioned data from supported storage providers.
Refer to the documentation for `dvc import-url` and `dvc update` for more
information.