Cloud versioning (initial guide and ref updates) #4165
Wasn't really sure where to document the new remote configurations since we don't actually document much about how traditional remotes work either, so I added a new page in |
1167a40 to a892829
@pmrowla Let's get this merged as soon as you're back. A couple more ideas to consider:
a892829 to 72f7d11
@dberenbaum I pushed changes for the current batch of review comments, and took care of these two points as well (for the import-url case I just added the relevant links for now). It looks like the heroku build is failing but I don't have the permissions to see why, just let me know whatever needs to be fixed (may be something w/my |
@iterative/websites The build looks okay locally. Could you take a look?
object in the remote. This requires that
[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)
be enabled on the specified S3 bucket.
Minor: should we call this out with an admon?
moved this into a tip
admon for each provider
### Expand for more details on the differences between cloud versioned and content-addressible storage

`dvc remote` storage normally uses
[content-addressible storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
[content-addressible storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
Note that there are 2 drafts in the works (from me) that attempt to better document data versioning (including cloud versioning) and remote configuration, respectively. Thanks for https://dvc.org/doc/user-guide/data-management/cloud-versioning BTW, it's a great start.
<admon type="warn">

Note that not all DVC functionality is supported when using cloud versioned
remotes, and using cloud versioning comes with the tradeoff of losing certain
benefits of content-addressable storage.
What specific functionality is not available? The hidden section below only states that "DVC does not provide de-duplication, and certain remote storage performance optimizations will be unavailable".
Pretty much anything that isn't explicitly documented as a cloud-versioning-enabled command/feature.
But off the top of my head, some of the things that don't work are:
- `dvc import` from a DVC repo that uses cloud versioned remote(s)
- run cache (it works locally but you cannot push/pull it with cloud versioned remotes)
- `gc -c`
- `status -c`
- `exp` sharing (anything that requires `exp push/pull`)
- `dvc push` with revision flags (you cannot `dvc push --all-branches/--all-tags/--all-commits` to cloud versioned remotes, but `dvc pull` will work with the flags)
It may be worth it to rephrase this and other texts then, to emphasize that A LOT of features don't work with cloud versioned data remotes.
<admon type="warn">

Note that when `version_aware` is in use, DVC does not delete current versions
or restore noncurrent versions of objects in cloud storage. So the current
version of an object in cloud storage may not match the version of a file in your DVC repository.
Can we re-explain this in simpler terms? I'm not sure I get it but it seems key (and the difference vs. worktree remotes).
BTW do we have a dummy versioned bucket to play with by any chance?
I'm not really sure there is a simpler way to explain it. The assumption here is that users have at least a basic understanding of how cloud versioning works in S3/Azure/GCS.
Deleting a file from a versioned bucket in cloud storage does not delete any data. It sets a flag that says the file is "deleted" (i.e. it becomes hidden by default). So if you delete a file on S3, and then ask S3 for the latest version of that file, it will say the file does not exist. But if you want to access the older versions of that file, you can still get them from S3 (as long as you ask for the specific version you want, and not just "latest/current version of file"). Likewise, by default S3 gives you the most recent (current) version of a file unless you request a specific version. So when you look at the bucket in the S3 console, you would only see the latest/current versions of what is in your bucket.
Also, it's important to note that in versioned storage you can only ever add new versions of a file (you cannot revert to an older version so that the old version becomes the current/latest version). Doing a "revert" operation actually creates a new copy of the older version of the file (and the duplicate counts against your S3 storage quota, even though the two copies are identical).
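To make the delete-marker and "revert" behavior above concrete, here is a minimal sketch using a toy in-memory model. `VersionedBucket` and its methods are hypothetical names made up for illustration; this is not the S3 API or anything DVC ships, it only mimics the semantics described above.

```python
import itertools


class VersionedBucket:
    """Toy in-memory model of a versioned bucket (hypothetical, not a real S3 client)."""

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, data); data=None is a delete marker
        self._ids = itertools.count(1)

    def put(self, key, data):
        """Add a new current version; older versions are kept."""
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, data))
        return version_id

    def delete(self, key):
        """'Delete' by stacking a delete marker on top; no data is removed."""
        return self.put(key, None)

    def get(self, key, version_id=None):
        """Return the current version, or a specific one if version_id is given."""
        history = self._versions.get(key, [])
        if version_id is None:
            if not history or history[-1][1] is None:
                raise KeyError(f"{key} does not exist (no current version)")
            return history[-1][1]
        for vid, data in history:
            if vid == version_id and data is not None:
                return data
        raise KeyError(f"{key} has no version {version_id}")

    def stored_copies(self, key):
        """Number of data-bearing versions, i.e. what counts against storage."""
        return sum(1 for _, data in self._versions.get(key, []) if data is not None)


bucket = VersionedBucket()
v1 = bucket.put("data.csv", b"first")
v2 = bucket.put("data.csv", b"second")
bucket.delete("data.csv")  # only hides the file behind a delete marker

try:
    bucket.get("data.csv")  # asking for "latest" now fails
    latest_exists = True
except KeyError:
    latest_exists = False

old_data = bucket.get("data.csv", version_id=v1)  # old version is still retrievable
v4 = bucket.put("data.csv", old_data)  # a "revert" is just a new copy of old data
copies = bucket.stored_copies("data.csv")
```

In this model the "reverted" file is stored a second time under a new version ID, which is the duplication mentioned above.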
For `worktree`, we tell S3 to "delete" files that are not in the DVC repo (so the latest/current version of a bucket looks like a mirror of the DVC repo). Likewise, for files that do exist in both the DVC repo and the bucket, we make sure the "latest/current" version always matches the DVC repo. If you have someone manually editing files in the bucket, or two people pushing from different branches in the DVC repo, each `dvc push` will overwrite the latest/current version with a new copy of files (that match the repo at the time a user does `dvc push`). (This will result in duplicated data.)
For `version_aware`, we don't bother deleting anything; we only care about whether or not the version we care about was pushed at some point in time. Meaning the "latest/current" version of files in your bucket may not actually represent the latest version of your DVC repo. As long as the version of a file from our DVC repo was pushed at some point in time (and that pushed version still exists in the bucket) we won't push duplicate data.
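The difference between the two modes can be sketched with a toy model of two people pushing from different branches. Everything here is an assumption for illustration: `VersionedBucket`, `push_version_aware`, and `push_worktree` are hypothetical names, and "was this version pushed before" is approximated by comparing file contents rather than by whatever bookkeeping DVC actually uses.

```python
class VersionedBucket:
    """Toy versioned bucket: each key keeps an append-only history."""

    def __init__(self):
        self.history = {}  # key -> list of data; None stands for a delete marker

    def put(self, key, data):
        self.history.setdefault(key, []).append(data)

    def delete(self, key):
        self.put(key, None)  # stacks a delete marker, removes nothing

    def current(self):
        """What you would see as the latest versions in the bucket."""
        return {k: h[-1] for k, h in self.history.items() if h[-1] is not None}

    def has_version(self, key, data):
        return data in self.history.get(key, [])

    def stored_copies(self):
        return sum(1 for h in self.history.values() for d in h if d is not None)


def push_version_aware(bucket, repo):
    """Upload only content that was never pushed before; never delete."""
    for key, data in repo.items():
        if not bucket.has_version(key, data):
            bucket.put(key, data)


def push_worktree(bucket, repo):
    """Make the bucket's current versions mirror the repo."""
    current = bucket.current()
    for key in current:
        if key not in repo:
            bucket.delete(key)
    for key, data in repo.items():
        if current.get(key) != data:
            bucket.put(key, data)  # new copy, even if an older version matches


branch_a = {"a.txt": b"A1", "b.txt": b"B"}
branch_b = {"a.txt": b"A2"}

va = VersionedBucket()
wt = VersionedBucket()
for repo in (branch_a, branch_b, branch_a):  # alternating pushes from two branches
    push_version_aware(va, repo)
    push_worktree(wt, repo)

va_current, va_copies = va.current(), va.stored_copies()
wt_current, wt_copies = wt.current(), wt.stored_copies()
```

After the last push from `branch_a`, the worktree-style bucket mirrors `branch_a` but holds more stored copies, while the version-aware-style bucket stores each version once and its current `a.txt` is still the one from `branch_b`.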
Yes, there should be a simpler way to explain it, or maybe it's not needed. TBH I don't even know what the point of the admonition is. What's the warning? Esp. if we're assuming the user knows how versioned storage works, shouldn't it be obvious that DVC can't delete anything (since it can't be done)?
> whether or not the version we care about was pushed at some point in time. Meaning the "latest/current" version of files in your bucket may not actually represent the latest version of your DVC repo
I think this is closer to what we're trying to emphasize here. Maybe something like:
"Because versioned storage does not allow true deletions or directly restoring old versions [link somewhere], there may be situations where the latest data in your DVC repo does not match what you see as latest versions in the bucket."
It doesn't fully explain how those situations happen but I think that's what we're trying to make a note about.
p.s. sorry very late reply cc @dberenbaum