Cloud versioning (initial guide and ref updates) #4165

Merged
merged 7 commits into from
Dec 30, 2022

pmrowla
Contributor

@pmrowla pmrowla commented Dec 6, 2022

No description provided.

@pmrowla pmrowla self-assigned this Dec 6, 2022
@shcheklein shcheklein temporarily deployed to dvc-org-ref-cloud-versi-bh5ptm December 6, 2022 10:21 Inactive
@pmrowla
Contributor Author

pmrowla commented Dec 6, 2022

Wasn't really sure where to document the new remote configurations since we don't actually document much about how traditional remotes work either, so I added a new page in user-guide/data-management/ for now

@shcheklein shcheklein temporarily deployed to dvc-org-ref-cloud-versi-bh5ptm December 6, 2022 10:30 Inactive
@github-actions
Contributor

github-actions bot commented Dec 6, 2022

Link Check Report

There were no links to check!

@pmrowla pmrowla force-pushed the ref-cloud-versioning branch from 1167a40 to a892829 Compare December 6, 2022 10:39
@shcheklein shcheklein temporarily deployed to dvc-org-ref-cloud-versi-bh5ptm December 6, 2022 10:39 Inactive
@shcheklein shcheklein temporarily deployed to dvc-org-ref-cloud-versi-0yncqk December 8, 2022 17:01 Inactive
@dberenbaum
Contributor

@pmrowla Let's get this merged as soon as you're back. A couple more ideas to consider:

@pmrowla pmrowla force-pushed the ref-cloud-versioning branch from a892829 to 72f7d11 Compare December 26, 2022 08:49
@shcheklein shcheklein had a problem deploying to dvc-org-ref-cloud-versi-0yncqk December 26, 2022 08:49 Failure
@shcheklein shcheklein had a problem deploying to dvc-org-ref-cloud-versi-0yncqk December 26, 2022 09:04 Failure
@pmrowla
Contributor Author

pmrowla commented Dec 26, 2022

> @pmrowla Let's get this merged as soon as you're back. A couple more ideas to consider:

@dberenbaum I pushed changes for the current batch of review comments, and took care of these two points as well (for the import-url case I just added the relevant links for now).

It looks like the heroku build is failing but I don't have the permissions to see why, just let me know whatever needs to be fixed (may be something w/my <admon> tag usage?).

@dberenbaum
Contributor

@iterative/websites The build looks okay locally. Could you take a look?

Comment on lines 354 to 356
object in the remote. This requires that
[S3 Versioning](https://docs.aws.amazon.com/AmazonS3/latest/userguide/Versioning.html)
be enabled on the specified S3 bucket.
Contributor

Minor: should we call this out with an admon?

Contributor Author

moved this into a tip admon for each provider

@shcheklein shcheklein temporarily deployed to dvc-org-ref-cloud-versi-0yncqk December 30, 2022 02:37 Inactive
### Expand for more details on the differences between cloud versioned and content-addressible storage

`dvc remote` storage normally uses
[content-addressible storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
Contributor

Suggested change
[content-addressible storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
[content-addressable storage](/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)

@shcheklein shcheklein temporarily deployed to dvc-org-ref-cloud-versi-0yncqk December 30, 2022 14:08 Inactive
@dberenbaum dberenbaum merged commit 02cc146 into main Dec 30, 2022
@dberenbaum dberenbaum deleted the ref-cloud-versioning branch December 30, 2022 14:33
@jorgeorpinel jorgeorpinel removed their request for review January 26, 2023 09:20
@jorgeorpinel
Copy link
Contributor

jorgeorpinel commented Jan 26, 2023

> we don't actually document much about how traditional remotes work either, so I added a new page in user-guide/data-management/ for now

Note that there are 2 drafts in the works (from me) that attempt to better document data versioning (including cloud versioning) and remote configuration, respectively.

Thanks for https://dvc.org/doc/user-guide/data-management/cloud-versioning BTW, it's a great start.

@jorgeorpinel jorgeorpinel added C: guide Content of /doc/user-guide C: ref Content of /doc/*-reference labels Jan 26, 2023
@jorgeorpinel jorgeorpinel changed the title ref: document cloud versioned remotes Cloud versioned remotes Jan 26, 2023
@jorgeorpinel jorgeorpinel changed the title Cloud versioned remotes Cloud versioning Jan 26, 2023
@jorgeorpinel jorgeorpinel mentioned this pull request Jan 26, 2023
2 tasks
@jorgeorpinel jorgeorpinel changed the title Cloud versioning Cloud versioning (initial guide and ref updates) Jan 26, 2023
Comment on lines +21 to +25
<admon type="warn">

Note that not all DVC functionality is supported when using cloud versioned
remotes, and using cloud versioning comes with the tradeoff of losing certain
benefits of content-addressable storage.
Contributor

What specific functionality is not available? The hidden section below only states that "DVC does not provide de-duplication, and certain remote storage performance optimizations will be unavailable".

Contributor Author

pretty much anything that isn't explicitly documented as a cloud-versioning-enabled command/feature

But off the top of my head, some of the things that don't work are:

  • dvc import from a DVC repo that uses cloud versioned remote(s)
  • run cache (it works locally but you cannot push/pull it with cloud versioned remotes)
  • gc -c
  • status -c
  • exp sharing (anything that requires exp push/pull)
  • dvc push with revision flags (you cannot dvc push --all-branches/--all-tags/--all-commits to cloud versioned remotes, but dvc pull will work with the flags)

Contributor

It may be worth it to rephrase this and other texts then, to emphasize that A LOT of features don't work with cloud versioned data remotes.

Comment on lines +74 to +78
<admon type="warn">

Note that when `version_aware` is in use, DVC does not delete current versions
or restore noncurrent versions of objects in cloud storage. So the current
version of an object in cloud storage may not match the version of a file in your DVC repository.
Contributor

Can we re-explain this in simpler terms? I'm not sure I get it but it seems key (and the difference vs. worktree remotes).

BTW do we have a dummy versioned bucket to play with by any chance?

Contributor Author

@pmrowla pmrowla Jan 26, 2023

I'm not really sure there is a simpler way to explain it. The assumption here is that users have at least a basic understanding of how cloud versioning works in S3/azure/GCS.

Deleting a file from a versioned bucket in cloud storage does not delete any data. It sets a flag that says the file is "deleted" (i.e. it becomes hidden by default). So if you delete a file on S3, and then ask S3 for the latest version of that file, it will say the file does not exist. But if you want to access the older versions of that file, you can still get them from S3 (as long as you ask for the specific version you want, and not just "latest/current version of file"). Likewise, by default S3 gives you the most recent (current) version of a file unless you request a specific version. So when you look at the bucket in the S3 console, you would only see the latest/current versions of what is in your bucket.

Also, it's important to note that in versioned storage you can only ever add new versions of a file (you cannot revert to an older version so that the old version becomes the current/latest version). Doing a "revert" operation actually creates a new copy of the older version of the file (and the duplicate counts against your S3 storage quota, even though the two copies are identical).
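
To make the delete and revert behaviour above concrete, here is a toy in-memory model of a versioned bucket. It is only a sketch: `ToyVersionedBucket` and its methods are hypothetical names made up for illustration, not the S3, Azure, GCS or DVC API, and real providers differ in many details.

```python
import itertools


class ToyVersionedBucket:
    """In-memory stand-in for a versioned bucket (hypothetical, not a real API)."""

    _DELETED = object()  # sentinel playing the role of a delete marker

    def __init__(self):
        self._versions = {}  # key -> list of (version_id, payload), oldest first
        self._ids = itertools.count(1)

    def _append(self, key, payload):
        version_id = f"v{next(self._ids)}"
        self._versions.setdefault(key, []).append((version_id, payload))
        return version_id

    def put(self, key, data):
        # Every write adds a new version; older versions are kept.
        return self._append(key, data)

    def delete(self, key):
        # Nothing is erased: a delete marker becomes the current version.
        return self._append(key, self._DELETED)

    def get(self, key, version_id=None):
        history = self._versions.get(key, [])
        if version_id is None:
            # "Latest" lookup: fails if the newest entry is a delete marker.
            if not history or history[-1][1] is self._DELETED:
                raise KeyError(key)
            return history[-1][1]
        for vid, payload in history:
            if vid == version_id and payload is not self._DELETED:
                return payload
        raise KeyError((key, version_id))

    def revert(self, key, version_id):
        # A "revert" writes a new copy of the old data as the newest version.
        return self.put(key, self.get(key, version_id))

    def stored_copies(self, key):
        # All payloads still held for this key, including duplicates.
        return [p for _, p in self._versions.get(key, []) if p is not self._DELETED]


bucket = ToyVersionedBucket()
v1 = bucket.put("data.csv", b"old")
v2 = bucket.put("data.csv", b"new")
bucket.delete("data.csv")  # the key is now hidden from "latest" lookups
```

After the `delete`, `bucket.get("data.csv")` raises `KeyError`, while `bucket.get("data.csv", v1)` still returns the old data. Calling `bucket.revert("data.csv", v1)` makes the old content current again, but only by storing a second copy of it.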


For worktree, we tell S3 to "delete" files that are not in the DVC repo (so the latest/current version of a bucket looks like a mirror of the DVC repo). Likewise, for files that do exist in both the DVC repo and the bucket, we make sure the "latest/current" version always matches the DVC repo. If you have someone manually editing files in the bucket, or two people pushing from different branches in the DVC repo, each dvc push will overwrite the latest/current version with a new copy of files (that match the repo at the time a user does dvc push). This results in duplicated data.

For version_aware, we don't bother deleting anything, we only care about whether or not the version we care about was pushed at some point in time. Meaning the "latest/current" version of files in your bucket may not actually represent the latest version of your DVC repo. As long as the version of a file from our DVC repo was pushed at some point in time (and that pushed version still exists in the bucket) we won't push duplicate data.
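
The difference between the two modes can be sketched with a toy model. This is a simplified illustration, not DVC's actual implementation: `ToyBucket`, `push_version_aware` and `push_worktree` are hypothetical names, file contents stand in for version IDs, and the worktree sketch only rewrites a file when the current version differs from the repo.

```python
class ToyBucket:
    """Minimal versioned store: key -> list of payloads, None marks a delete."""

    def __init__(self):
        self.history = {}

    def put(self, key, data):
        self.history.setdefault(key, []).append(data)

    def delete(self, key):
        self.history.setdefault(key, []).append(None)

    def current(self):
        # What a plain listing shows: the latest version of each non-deleted key.
        return {k: v[-1] for k, v in self.history.items() if v[-1] is not None}

    def has_version(self, key, data):
        return data in self.history.get(key, [])


def push_version_aware(bucket, repo):
    # Upload only data that was never pushed before; never delete anything.
    for key, data in repo.items():
        if not bucket.has_version(key, data):
            bucket.put(key, data)


def push_worktree(bucket, repo):
    # Make the bucket's current view mirror the repo.
    current = bucket.current()
    for key, data in repo.items():
        if current.get(key) != data:
            bucket.put(key, data)  # may re-upload data that already exists as an old version
    for key in current:
        if key not in repo:
            bucket.delete(key)  # hides the file; older versions remain stored


commit_a = {"train.csv": b"a", "extra.csv": b"x"}
commit_b = {"train.csv": b"b"}

aware, worktree = ToyBucket(), ToyBucket()
for commit in (commit_a, commit_b, commit_a):  # e.g. pushing while switching branches
    push_version_aware(aware, commit)
    push_worktree(worktree, commit)
```

In this run the last push was `commit_a`. The worktree bucket's current view equals `commit_a`, at the cost of storing `b"a"` twice for `train.csv`. The version-aware bucket stores each payload once, but its current view still shows `b"b"` for `train.csv`, which is the mismatch the warning in the guide is about.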

Contributor

Yes, there should be a simpler way to explain it, or maybe it's not needed. TBH I don't even know what the point of the admonition is. What's the warning? Esp. if we're assuming the user knows how versioned storage works -- shouldn't it be obvious that DVC can't delete anything (since it can't be done)?

> whether or not the version we care about was pushed at some point in time. Meaning the "latest/current" version of files in your bucket may not actually represent the latest version of your DVC repo

I think this is closer to what we're trying to emphasize here. Maybe something like:

"Because versioned storage does not allow true deletions or directly restoring old versions [link somewhere], there may be situations where the latest data in your DVC repo does not match what you see as latest versions in the bucket."

It doesn't fully explain how those situations happen but I think that's what we're trying to make a note about.

p.s. sorry very late reply cc @dberenbaum

Labels
C: guide Content of /doc/user-guide C: ref Content of /doc/*-reference
4 participants