Data sharing scenarios #784
again, see my other comments. There are at least two possibilities - shared cache or shared remote. In case of NAS it's actually beneficial to share cache (and use some regular cloud remote to still do backups).
Sharing cache in the case of a NAS may cause problems when we try to use `dvc gc`. I remember seeing some discussions about this.
Yes, and we even introduced a special flag to pass multiple projects at once to `dvc gc`. GC in DVC is still a big pain, but it does not change the fact I mentioned above.
The option `-p, --projects` of `dvc gc` takes a path to a project (at least this is how I understand the man page; I have never tried it). In the case of NAS-mounted storage, I assume that the collaborating projects are located on different machines, aren't they? So the option `-p, --projects` cannot be used in this case.
There are probably still ways to run GC - for example, clone the projects to a single machine. It's not ideal (nothing about GC is), but GC is a maintenance operation, whereas sharing a cache (with links) instead of sharing a remote directly optimizes the day-to-day workflow.
Also, I think the same problem applies to your other cases, and one of them is about people sharing the same machine, right?
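To make the GC concern above concrete, here is a toy model in Python. It is only a sketch: it does not call DVC, and the function name `collect_garbage`, the hash strings, and the project sets are all hypothetical. It assumes that garbage collection keeps only the cache entries referenced by the projects it is told about, which is why every project sharing the cache has to be visible to it.

```python
def collect_garbage(cache, projects):
    """Return the cache entries that survive when only `projects` are considered.

    `cache` is a set of (hypothetical) content hashes stored in the shared cache;
    `projects` is a list of sets, each holding the hashes one project references.
    """
    in_use = set()
    for refs in projects:
        in_use |= refs
    # Anything not referenced by the given projects is treated as garbage.
    return cache & in_use


# Hypothetical shared cache used by two projects on different machines.
shared_cache = {"a1", "b2", "c3", "d4"}
project_a = {"a1", "b2"}
project_b = {"b2", "c3"}

# GC that only knows about project A removes data project B still needs.
after_single = collect_garbage(shared_cache, [project_a])
missing_for_b = project_b - after_single

# GC that is given both projects keeps everything either of them references.
after_multi = collect_garbage(shared_cache, [project_a, project_b])

print(sorted(after_single))   # ['a1', 'b2']
print(sorted(missing_for_b))  # ['c3']
print(sorted(after_multi))    # ['a1', 'b2', 'c3']
```

In this model, cloning all projects to one machine before running GC corresponds to passing both sets to `collect_garbage`.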
I would first understand the options in terms of organizing data, work out which of them are more general than others, and then try to come up with a couple of sections that explain them in a general way. By general I mean concepts like: is the cache shared or not? Do people use a single machine or not? And so on.
I don't see how my initial concern is resolved or addressed here. Please 🙏, don't resolve them on your own - it makes it extremely hard to do reviews (to check and follow up on previously raised concerns).
To be precise - I don't see much value in three (two?) sections that explain different variations of the mounted remote. And to be even more precise - I haven't seen a mounted shared remote case. The only benefit I see is support for an otherwise unsupported storage type. I would just create a How-to or FAQ entry with a one- or two-paragraph explanation of the options - mount, use rsync/rclone, etc.
I have added this page:
and this interactive example:
that explain the case of a mounted cache, which is more efficient if we share data through a NAS (with the caveat of being careful with the `dvc gc` command).
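As a rough illustration of why a mounted cache can be more efficient, the Python sketch below links a workspace file to a file in a cache directory instead of copying it. This is not DVC's implementation; the directory layout under the temporary folder and the file name `data.bin` are assumptions made for the example (the name `project.cache` follows the naming used elsewhere in this thread), and it assumes a platform where symlinks are allowed.

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as root:
    # Hypothetical layout: a cache on the mounted share and a local workspace.
    cache_dir = os.path.join(root, "nas", "project.cache")
    workspace = os.path.join(root, "workspace")
    os.makedirs(cache_dir)
    os.makedirs(workspace)

    # The data lives once, in the shared cache.
    cached_file = os.path.join(cache_dir, "data.bin")
    with open(cached_file, "wb") as f:
        f.write(b"x" * 1_000_000)

    # The workspace gets a symlink to the cached file rather than a copy.
    linked_file = os.path.join(workspace, "data.bin")
    os.symlink(cached_file, linked_file)

    is_link = os.path.islink(linked_file)
    same_target = os.path.realpath(linked_file) == os.path.realpath(cached_file)
    visible_size = os.path.getsize(linked_file)   # size seen through the link
    link_size = os.lstat(linked_file).st_size     # size of the link itself

    print(is_link, same_target, visible_size, link_size < visible_size)
```

The workspace sees the full file while storing only a small link, which is the day-to-day saving that sharing a cache gives over pulling copies from a remote.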
It does not answer my question unless again I'm missing the whole point of this PR. Could you please elaborate on this:
It is strange to see that it contains only examples and no explanation of what's happening whatsoever. I would expect it to explain remotes much better - that is its primary purpose.
SSH example is too complicated -
It is a modified version of this page: https://dvc.org/doc/use-cases/sharing-data-and-model-files
That one does not have much explanation either, and things are explained mostly by example.
Actually, I don't find it feasible to explain a solution without using at least a few DVC commands, and for those commands to make sense they have to be used in the context of an example. So the description mainly covers the situation, and the solution is shown through the examples. The hope is that once the reader has understood the solution, he can generalize and adapt it to his own case.
I am planning to explain the details of the remotes (and their types) in another section. This section is about data sharing scenarios, so let's just refer to the remote details, but not include them here.
Yes, the Git repository is usually located on GitHub. But this is just an example, an assumption to keep things simple and interactive.
I tried to keep the analogy with Git. In Git a central bare repository is usually named `project.git`. So, a central DVC storage/cache is named `project.cache`.
I don't think we need both; that was the point. I'm confused about why we need both. I think the remote section is enough. There should be some "DVC workflow" section that explains the workflow from a high-level perspective.
I don't think that these UG pages:
https://dvc-org-pr-807.herokuapp.com/doc/user-guide/external-data (from another PR)
can be merged with the pages of this PR. I think they should be separate sections.