Cloud versioning (initial guide and ref updates) #4165
Changes from all commits
7accfe8
e695ade
72f7d11
1aa6109
d28a41e
a6b1fde
8c148d0
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,113 @@ | ||
# Cloud Versioning | ||
|
||
<admon type="warn"> | ||
|
||
Cloud versioning features are currently under active development and should be | ||
considered experimental. These features are subject to frequent change, and the | ||
documentation may not always reflect changes available in the latest DVC | ||
release. | ||
|
||
</admon> | ||
|
||
When cloud versioning is enabled, DVC will store files in the remote according | ||
to their original directory location and filenames. Different versions of a file | ||
will then be stored as separate versions of the corresponding object in cloud | ||
storage. This is useful for cases where users prefer to retain their original | ||
filenames and directory hierarchy in remote storage (instead of using DVC's | ||
usual | ||
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) | ||
format). | ||
|
||
<admon type="warn"> | ||
|
||
Note that not all DVC functionality is supported when using cloud versioned | ||
remotes, and using cloud versioning comes with the tradeoff of losing certain | ||
benefits of content-addressable storage. | ||
|
||
</admon> | ||
|
||
<details> | ||
|
||
### Expand for more details on the differences between cloud versioned and content-addressable storage | ||
|
||
`dvc remote` storage normally uses | ||
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory) | ||
to organize versioned data. Different versions of files are stored in the remote | ||
according to the hash of their data content, not according to their original | ||
filenames and directory location. This allows DVC to optimize certain remote | ||
storage lookup and data sync operations, and provides data de-duplication at the | ||
file level. The drawback is that human-readable filenames are lost in the | ||
remote: they can only be recovered through the DVC CLI (`dvc get --show-url`) or | ||
API (`dvc.api.get_url()`). | ||
|
||
When using cloud versioning, DVC does not provide de-duplication, and certain | ||
remote storage performance optimizations will be unavailable. | ||
|
||
</details> | ||
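As a rough illustration of the difference described above, the following Python sketch contrasts how an object key could be derived in each layout. It assumes an MD5-based layout in which the first two hex characters of the hash form a directory. The function names `content_addressed_key` and `cloud_versioned_key` are hypothetical and are not part of DVC's API.

```python
import hashlib


def content_addressed_key(data: bytes) -> str:
    """Key derived from file content (assumed: MD5 with a 2-character prefix directory)."""
    digest = hashlib.md5(data).hexdigest()
    return f"{digest[:2]}/{digest[2:]}"


def cloud_versioned_key(path: str) -> str:
    """Key is simply the file's original relative path."""
    return path


v1 = b"a,b\n1,2\n"
v2 = b"a,b\n1,2\n3,4\n"

# Content-addressable: each distinct content gets its own key,
# and identical content (v1 pushed twice) maps to the same key.
cas_keys = {content_addressed_key(d) for d in (v1, v2, v1)}

# Cloud versioned: every version lives under the same key;
# the storage provider distinguishes them by version ID.
cv_keys = {cloud_versioned_key("data/train.csv") for _ in (v1, v2, v1)}

print(len(cas_keys), len(cv_keys))
```

In this sketch, de-duplication falls out of the content-addressed keys for free, while the cloud versioned key stays human-readable but carries no information about the content.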
|
||
## Supported storage providers | ||
|
||
Cloud versioning features are only available for certain storage providers. | ||
Currently, they are supported on the following `dvc remote` types: | ||
|
||
- Amazon S3 (requires | ||
[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html) | ||
enabled buckets) | ||
- Microsoft Azure Blob Storage (requires | ||
[Blob versioning](https://learn.microsoft.com/en-us/azure/storage/blobs/versioning-overview) | ||
enabled storage accounts and containers) | ||
- Google Cloud Storage (requires | ||
[Object versioning](https://cloud.google.com/storage/docs/object-versioning) | ||
enabled buckets) | ||
|
||
## Version-aware remotes | ||
|
||
When the `version_aware` option is enabled on a `dvc remote`: | ||
|
||
- `dvc push` will utilize cloud versioning when storing data in the remote. Data | ||
will retain its original directory structure and filenames, and each version | ||
of a file tracked by DVC will be stored as a new version of the corresponding | ||
object in cloud storage. | ||
- `dvc fetch` and `dvc pull` will download the corresponding version of an | ||
object from cloud storage. | ||
|
||
<admon type="warn"> | ||
|
||
Note that when `version_aware` is in use, DVC does not delete current versions | ||
or restore noncurrent versions of objects in cloud storage. So the current | ||
version of an object in cloud storage may not match the version of a file in your DVC repository. | ||
Comment on lines +74 to +78

Can we re-explain this in simpler terms? I'm not sure I get it but it seems key (and the difference vs. worktree remotes).

I'm not really sure there is a simpler way to explain it. The assumption here is that users have at least a basic understanding of how cloud versioning works in S3/Azure/GCS.
Deleting a file from a versioned bucket in cloud storage does not delete any data. It sets a flag that says the file is "deleted" (i.e. it becomes hidden by default). So if you delete a file on S3, and then ask S3 for the latest version of that file, it will say the file does not exist. But if you want to access the older versions of that file, you can still get them from S3 (as long as you ask for the specific version you want, and not just "latest/current version of file").
Likewise, by default S3 gives you the most recent (current) version of a file unless you request a specific version. So when you look at the bucket in the S3 console, you would only see the latest/current versions of what is in your bucket.
Also, it's important to note that in versioned storage you can only ever add new versions of a file (you cannot revert to an older version so that the old version becomes the current/latest version). Doing a "revert" operation actually creates a new copy of the older version of the file (and the duplicate counts against your S3 storage quota, even though they are identical).
For For

Yes, there should be a simpler way to explain it, or maybe it's not needed. TBH I don't even know what the point of the admonition is. What's the warning? Esp. if we're assuming the user knows how versioned storage works, shouldn't it be obvious that DVC can't delete anything (since it can't be done)?
I think this is closer to what we're trying to emphasize here. Maybe something like: "Because versioned storage does not allow true deletions or directly restoring old versions [link somewhere], there may be situations where the latest data in your DVC repo does not match what you see as latest versions in the bucket." It doesn't fully explain how those situations happen but I think that's what we're trying to make a note about.
|
||
|
||
</admon> | ||
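The behavior noted above can be hard to picture, so here is a toy in-memory model of a versioned bucket in Python. It is only a sketch: `VersionedBucket` is a hypothetical class, not a cloud SDK or DVC API, and it assumes that the repository has recorded a version ID for each tracked file, which is then used to request that exact version.

```python
class VersionedBucket:
    """Toy in-memory model of a versioned bucket (hypothetical, not a cloud SDK)."""

    _DELETE_MARKER = object()

    def __init__(self):
        self._history = {}  # key -> list of (version_id, data), oldest first
        self._counter = 0

    def _append(self, key, data):
        self._counter += 1
        version_id = f"v{self._counter}"
        self._history.setdefault(key, []).append((version_id, data))
        return version_id

    def put(self, key, data):
        """Add a new version; existing versions are never overwritten."""
        return self._append(key, data)

    def delete(self, key):
        """'Deleting' only adds a delete marker as the new current version."""
        return self._append(key, self._DELETE_MARKER)

    def get(self, key, version_id=None):
        """Return the current version, or a specific one if version_id is given."""
        history = self._history.get(key, [])
        if version_id is None:
            candidates = history[-1:]
        else:
            candidates = [v for v in history if v[0] == version_id]
        if not candidates or candidates[0][1] is self._DELETE_MARKER:
            raise KeyError(key)
        return candidates[0][1]


bucket = VersionedBucket()
v1 = bucket.put("data/train.csv", b"first")
v2 = bucket.put("data/train.csv", b"second")

# A commit that recorded v1 still gets v1, even though v2 is the current version.
pinned = bucket.get("data/train.csv", version_id=v1)
current = bucket.get("data/train.csv")

# After a delete, the current version is gone, but v1 remains retrievable by ID.
bucket.delete("data/train.csv")
still_there = bucket.get("data/train.csv", version_id=v1)
```

The point of the model is that the bucket's current version (`current`) and the version a given commit refers to (`pinned`) are independent lookups, which is why they can disagree.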
|
||
## Worktree remotes | ||
|
||
`worktree` remotes behave similarly to `version_aware` remotes, but with one key | ||
difference. For `worktree` remotes, DVC will also attempt to ensure that the | ||
current versions of objects in cloud storage match the latest versions of files | ||
in your DVC repository. | ||
|
||
So in addition to the command behaviors described for `version_aware` remotes, | ||
when the `worktree` option is enabled on a `dvc remote`: | ||
|
||
- `dvc push` will also ensure that the current versions of objects in remote | ||
storage match the latest versions of files in your DVC repository. | ||
Additionally, DVC will delete the current version of any object that is | ||
present in cloud storage but does not exist in your current DVC repository | ||
workspace. | ||
- `dvc update` can be used to update a DVC-tracked file or directory in your DVC | ||
repository to match the current version of the corresponding object(s) from | ||
cloud storage. | ||
|
||
<admon type="info"> | ||
|
||
Note that deleting current versions in cloud storage does not delete any objects | ||
(and does not delete any data). It only means that the current version of a | ||
given object will show that the object does not exist. | ||
|
||
</admon> | ||
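A minimal Python sketch of the `worktree` push behavior described above follows. It is not DVC's implementation: `plan_worktree_push` is a hypothetical helper, and both the workspace and the bucket's current versions are modeled as plain dictionaries mapping paths to content.

```python
from __future__ import annotations


def plan_worktree_push(
    workspace: dict[str, bytes], current: dict[str, bytes]
) -> tuple[list[str], list[str]]:
    """Return (paths to upload as new versions, paths whose current version to delete)."""
    # Upload anything that is new or whose content differs from the current version.
    to_upload = sorted(p for p, data in workspace.items() if current.get(p) != data)
    # Delete the current version of anything no longer present in the workspace.
    to_delete = sorted(p for p in current if p not in workspace)
    return to_upload, to_delete


workspace = {"a.csv": b"2", "b.csv": b"1"}
current = {"a.csv": b"1", "b.csv": b"1", "old.csv": b"0"}

to_upload, to_delete = plan_worktree_push(workspace, current)
print(to_upload, to_delete)
```

In a versioned bucket, the "delete" step would only hide `old.csv` as the current version; its earlier versions would remain retrievable by version ID.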
|
||
## Importing versioned data | ||
|
||
DVC supports importing cloud versioned data from supported storage providers. | ||
Refer to the documentation for `dvc import-url` and `dvc update` for more | ||
information. |

What specific functionality is not available? The hidden section below only states that "DVC does not provide de-duplication, and certain remote storage performance optimizations will be unavailable".

Pretty much anything that isn't explicitly documented as a cloud-versioning-enabled command/feature.
But off the top of my head, some of the things that don't work are:
- `dvc import` from a DVC repo that uses cloud versioned remote(s)
- `gc -c`
- `status -c`
- `exp` sharing (anything that requires `exp push/pull`)
- `dvc push` with revision flags (you cannot `dvc push --all-branches/--all-tags/--all-commits` to cloud versioned remotes, but `dvc pull` will work with the flags)

It may be worth it to rephrase this and other texts then, to emphasize that A LOT of features don't work with cloud versioned data remotes.