-
Notifications
You must be signed in to change notification settings - Fork 394
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
added a BC: workspace document #2197
Conversation
@@ -6,3 +6,46 @@ tooltip: >- | |||
models, etc. Typically, it's also a Git repository. It will contain your DVC | |||
project. | |||
--- | |||
|
|||
<!-- keywords: data science project architecture, machine learning project architecture, machine learning workflow, data science workflow, machine learning file system, data science file system, data science project structure, machine learning project structure, notebook version control --> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note to remove this comment later before merging (but for now it's useful to have the list during review 👍)
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as off-topic.
This comment was marked as off-topic.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please bear with us as we review this first concept as we will probably establish some criteria for all concepts along the way. Some general comments below to begin with:
# Workspace | ||
|
||
A data science project consists of data obtained from many different sources. | ||
Most of the time it needs to convert the the format of this data into a form | ||
that is required by the training models and supplying to data science / machine | ||
learning workflows. Sometimes it needs to be split into multiple files or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK to have some context but it should probably be a short intro paragraph only? Maybe just say something about project structures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in BC documents, we need to make separation of concerns. If our aim is a quick introduction to DVC concepts, we don't need such story material to begin with. This was solely for SEO purposes.
IMHO I can write blog posts / tutorials and tell stories with all kinds of SEO phrases there instead of trying to use these in docs. As you can see it's possible but I don't think it increases the quality of docs.
I can paraphrase into a single paragraph or remove them completely. I'm fine with all.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to make separation of concerns...
We do already have tooltips for shot definitions. But you can't search for "pipeline dvc" on Google and find a tooltip (i.e. we want to have landing pages for our docs).
Another goal of these is to extract explanations out of the cmd refs (e.g. in https://dvc.org/doc/command-reference/dag which shouldn't be where pipelines are explained)
In my mind most concept pages are 2-3 paragraph long in general (we'll see as they come). There could also be some cool diagrams potentially (feel free to contribute sketches, we can fix them up later). They can still have a story, maybe just a very straightforward one? (motivation -> explanation -> uses/implications). WDYT @shcheklein ?
I'm putting link to the last of these highlighted project words.
p.s. I usually "tooltipy" just one instance of the term per page or big section, but I try to make it the first or 2nd one (as long as it doesn't distract e.g. no other links are nearby).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
p.p.s motivation -> explanation -> uses/implications
- I see that's more or less the structure here already 👍
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
contents through DVC commands. | ||
|
||
Files and directories in the workspace can be added to DVC (`dvc add`) or they | ||
can be downloaded from external sources (`dvc get`, `dvc import`, |
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
This comment was marked as resolved.
This comment was marked as resolved.
Sorry, something went wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But actually add
can also download data to the workspace (see --out
and --to-remote
options). Also, import*
commands download AND track data. You may want to rephrase this part accordingly 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sometimes I think there is no command that hasn't got a duplicate, somehow :) I try to mention commands in passing, if we'd consider each and every option to commands, we'll need to duplicate the command reference here IMHO.
We may just delete the commands if you would like.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can also list all possibilities for each functionality, like
In the workspace, you can
- Import files and directories (
dvc add --out
,dvc import-url
...)
but I think this will turn the document into a list of commands and options.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to list every command usage of course, agreed!
My point was that these 3 commands mentioned actually overlap in a way that makes the current text slightly incorrect. In any case, the main use case of add
is not to "add" but to "track", actually. Please check each cmd ref to try to find the right terms when needed 🙂
"Download" is correct for get/import
but add
can also download (and they can all "transfer") so I'd avoid that term probably. And in fact I wouldn't even mention get
here, since it doesn't require a DVC project/workspace. For import
I'd try to use the cmd name as the relevant action (to "import") I guess...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the main use case of add is not to "add" but to "track"
For import I'd try to use the cmd name as the relevant action (to "import")
then again import*
also track the downloaded data 😅 ("adds"). Maybe it should be a single sentence about tracking and put all add
, import
, import-url
in the same parenthesis.
OK I see I neglected this accidentally. Sorry, checking now @iesahin 👍 |
These may be split into multiple files or directories or (as the project | ||
structure needs) have different versions for different requirements, e.g., a | ||
smaller / simplified version might be required in prototyping for faster | ||
feedback and shorter training times. A single workspace to manage all artifacts |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
or have different versions for different requirements...
Let's not go into versioning here, I think. At least not by implying they're all in the workspace because in DVC the workspace only holds one version (the rest are cached and managed via Git, metafiles, etc.
(Mentioned in #2197 (comment))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
versioning needs and managing dependencies make it increasingly difficult
p.s. This is better way to very subtly mention versioning (could even link to the corresponding Use Case doc).
of a project is desirable, although versioning needs and managing dependencies | ||
make it increasingly difficult. | ||
|
||
DVC allows a single directory to contain all your project artifacts. The |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a single directory to contain all your project artifacts
Not exactly. File contents are org'd in the cache with a special file structure (see https://dvc.org/doc/user-guide/project-structure/internal-files#structure-of-the-cache-directory)
The workspace is the directory containing the visible part of your project
That part is correct. And contradicts the previous part 🙂 (because "visible part" implies there's a hidden part which must be in other dirs).
Let's open this p with that sentence.
workspace is the directory containing the _visible_ part of your | ||
<abbr>project</abbr>, e.g., the raw data, source code, model files. You can have |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"e.g., the raw data, source code, model files you're currently using"
<abbr>project</abbr>, e.g., the raw data, source code, model files. You can have | ||
multiple versions of data, models, and other kinds of artifacts within the | ||
workspace and limit your focus to a subset of these. You can record your |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can have multiple versions of data ... within the workspace
Again contradicting 😕
workspace and limit your focus to a subset of these. You can record your | ||
progress in a commit and analyze your data and model history. DVC provides a |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
record your progress in a commit and analyze...
Again probably too many details about versioning. Doesn't really fall within the 'workspace' concept, I think. This can be simpler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need to rename your models for minor changes...
or save tens of different renamed files for training
Those are better mentions of versioning (clear benefits i.e. how much you'd have to suffer without DVC)
save cleaned up data in different directories
That specific example isn't great because that's still pretty common even with DVC (e.g. in our own example-get-started repo we have a prepared/
dir).
progress in a commit and analyze your data and model history. DVC provides a | ||
_machine learning file system_ to manipulate your data and models using its |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This "ML file system" keyword is pretty tricky. No need to force it (just skip it if you can't find a correct way to use it).
I can only think of something like "DVC turns your project into a sort of machine learning file system for..." but not sure.
programs. DVC can keep track of all of these in a single directory called the | ||
workspace. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DVC can keep track of all of these in a single directory called the workspace.
Again contradicting and also, repetitive at this point.
DVC supports all typical operations of a versioned data file system through its | ||
commands. Behind the scene these operations use <abbr>metafiles</abbr> like the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
DVC supports all typical operations of a versioned data file system through its commands.
Maybe open the previous paragraph with that?
p.s. having this I think def. no need for the "ml file system" keyword. But keeping "machine learning" somewhere would be nice.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's try to incorporate the metafile mentions to the main (2nd) paragraph somehow. After simplifying it per my previous comments, there should be enough room in there. that way there's no need for this 4th p.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
More feedback 👍
I guess I should take this over 👍 |
Context change makes me miserable. I can take care of this after the GS project if you have other tasks. Thank you @jorgeorpinel |
For some reason I can't fetch |
This doesn't fix any issue completely.
Added workspace.md doc to
basic-concepts
dir. Expanded this from previous iterations.Added Basic Concepts section to the sidebar and added this document.
tooltip definition is not changed.
This PR closes #1127 and is related to #550.