guide: absorb What is DVC? into other existing docs, et al. #1581

Merged
merged 44 commits into from
Aug 10, 2020
6690907
guide: What is DVC? -> into UG index
jorgeorpinel Jul 15, 2020
ce72b11
how-to: create section with questions from WID / Collab Issues
jorgeorpinel Jul 15, 2020
b466986
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Jul 20, 2020
3f0b0f0
how-to: make subsection of the user-guide, and
jorgeorpinel Jul 20, 2020
5a5901b
guide: hide Best Practices how to for now
jorgeorpinel Jul 20, 2020
a94d9f8
guide: rename how to and best practices title
jorgeorpinel Jul 20, 2020
92ae254
guide: What->Why in index to avoid redundancy with What section
jorgeorpinel Jul 20, 2020
d7762e6
guide: concepts->principles in What is DVC?
jorgeorpinel Jul 20, 2020
82554fe
guide: move troubleshooting inside How To
jorgeorpinel Jul 20, 2020
05a7a7d
guide: collapse What is DVC? into single doc, and
jorgeorpinel Jul 20, 2020
3e46fca
guide: fix redirect test for troubleshooting how to
jorgeorpinel Jul 20, 2020
93cc607
guide: revise What is DVC? up to Core Principles and
jorgeorpinel Jul 20, 2020
8a7c086
guide: finish revising What is DVC?
jorgeorpinel Jul 20, 2020
c30e966
guide: more updates to What is DVC? (per 1.x) and
jorgeorpinel Jul 20, 2020
733593c
guide: review intro and reorg Related Technologies
jorgeorpinel Jul 20, 2020
28226d6
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Jul 22, 2020
7f421dd
guide: add Questions header to best practices (hidden)
jorgeorpinel Jul 22, 2020
d390a6f
guide: hide GAPI PP
jorgeorpinel Jul 22, 2020
c97e93b
guide: revise Git-LFS section of related techs
jorgeorpinel Jul 23, 2020
30d38df
guide: revise all Git* related techs
jorgeorpinel Jul 23, 2020
a450d39
guide: revise remaining related techs
jorgeorpinel Jul 23, 2020
a93f24f
guide: remove img from basic concepts
jorgeorpinel Jul 23, 2020
1a6948e
guide: move troubleshooting back out of How To
jorgeorpinel Jul 23, 2020
3e7b18a
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Aug 3, 2020
a556e6c
cases: move Why DVC? to Use Cases index and
jorgeorpinel Aug 4, 2020
eb5fbf9
guide: move Basic Principles from What is DVC? into Basic Concepts guide
jorgeorpinel Aug 4, 2020
c476f60
guide: remove "User Manual" term from index
jorgeorpinel Aug 4, 2020
3508a19
nav: remove ... from How To entry
jorgeorpinel Aug 4, 2020
2842e3a
tests: finish rolling back troubleshooting guide move
jorgeorpinel Aug 4, 2020
02c2b01
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Aug 8, 2020
7411f53
cases: fix a link to related techs guide
jorgeorpinel Aug 8, 2020
9c75ae8
Merge branch 'master' into guide/what-is-dvc
jorgeorpinel Aug 8, 2020
98ffea3
guide: proper structure in related techs
jorgeorpinel Aug 9, 2020
95521e6
guide: update remote storage core concept in what is dvc
jorgeorpinel Aug 9, 2020
fbd7e96
guide: improve Core Features of What is DVC?
jorgeorpinel Aug 9, 2020
86fbf43
guide: simplify data versioning core feature
jorgeorpinel Aug 9, 2020
de43edd
guide: update What is DVC? intro
jorgeorpinel Aug 9, 2020
bebd665
guide: simplify Core Features in What is DVC?
jorgeorpinel Aug 9, 2020
d8a71f8
guide: features before concepts (index)
jorgeorpinel Aug 9, 2020
722cbb3
guide: review term "features" in basic concepts
jorgeorpinel Aug 9, 2020
e6d5f78
guide: undo starting How To subsection
jorgeorpinel Aug 10, 2020
ca9203a
guide: undo changes to troubleshooting
jorgeorpinel Aug 10, 2020
eebb2e6
guide: a few more copy edits for What is DVC
jorgeorpinel Aug 10, 2020
c720552
guide: remove Basic Concepts page
jorgeorpinel Aug 10, 2020
guide: revise What is DVC? up to Core Principles and create hidden basic-concepts guide
jorgeorpinel committed Jul 20, 2020
commit 93cc607e23981730e24873810219c78e0c662256
4 changes: 4 additions & 0 deletions content/docs/user-guide/basic-concepts.md
@@ -0,0 +1,4 @@
# Basic Concepts of DVC
Member

so, how do we connect this with glossary?

Contributor Author

@jorgeorpinel Jul 23, 2020

I just extracted some info from What is DVC? into this new basic concepts guide, which will kickstart #550, but it's not added to the navigation, so it's hidden for now. I know this doc needs lots of work, and we still need to figure out the glossary connection, but I'm leaving all that for a separate PR.

Member

I would say we don't have that much useful info here to keep it

Contributor Author

@jorgeorpinel Aug 4, 2020

There's a good amount now. And more in #1655

Member

I still don't see much new compared to the glossary in this format. For now it just means two similar places to keep updated for us.

It's also a bit strange to have advanced concepts in basic concepts. The split itself is also quite artificial.

Member

To clarify. My take on #550 is that it's way more than a single page like this. Good example of a basic concept is cache, or run cache - they alone can take a proper page I guess.

Contributor Author

Noted. We can break it up in #1655. Should we move the discussion there and get this one merged? Or is having the guide here blocking the approval? Because I can just remove it and any links to it for now, and re-introduce it in that other PR. Please lmk

Member

I guess, let's remove it for now and simplify the PR

Member

so, are we removing this?

Contributor Author

Oh, I read that wrong. I thought you said "let's leave it for now". OK, removed now.


- **Cache directory**: Directory with all data files on a local hard drive or in
cloud storage, but not in the Git repository. See `dvc cache dir`.
36 changes: 13 additions & 23 deletions content/docs/user-guide/index.md
@@ -1,31 +1,21 @@
# User Guide

Our guides describe the main DVC concepts and features comprehensively,
explaining when and how to use them, as well as connections between them. These
guides don't focus on specific scenarios, but have a general scope – like a
_user manual_. Their topics range from more technical foundations, impacting
more parts of DVC, to more advanced and specific things you can do. We also
include a few guides related to contributing to
[this open-source project](https://github.com/iterative/dvc).
Our guides describe the major concepts and features of DVC comprehensively,
explaining when and how to use them, as well as the relationships between them. We
don't focus on specific scenarios in this section, but rather on a general scope
– think _User Manual_. The topics here range from more foundational, impacting
more parts of DVC, to more technical and advanced things you can do. We also
include a few misc. guides, for example related to
[contributing to DVC](/doc/user-guide/contributing/core) itself.

## Why DVC?

Even with all the success we've seen today in machine learning (ML),
specifically deep learning and its applications in business, the data science
community still lacks good practices for organizing their projects and
collaborating effectively. This is a critical challenge: we need to evolve
towards ML algorithms and methods no longer being tribal knowledge and making
them easy to implement, reuse, and manage.

Today the data science community is still lacking good practices for organizing
their projects and effectively collaborating. ML algorithms and methods are no
longer simple tribal knowledge but are still difficult to implement, manage and
reuse.

> One of the biggest challenges in reusing, and hence the managing of ML
> projects, is its reproducibility.
---
Even with all the success we've seen today in machine learning (ML), especially
with deep learning and its applications in business, the data science community
still lacks good practices for organizing their projects and collaborating
effectively. This is a critical challenge: while ML algorithms and methods are
no longer tribal knowledge, they are still difficult to implement, reuse, and
manage.

Please choose from the navigation sidebar to the left, or click the `Next`
button below ↘
53 changes: 28 additions & 25 deletions content/docs/user-guide/what-is-dvc.md
@@ -1,19 +1,32 @@
# What Is DVC?

Data Version Control, or DVC, is a new type of experiment management software
built on top of Git. DVC reduces the gap between existing tools and data science
needs, allowing users to take advantage of experiment management while reusing
existing skills and intuition.
**Data Version Control** is a new type of data versioning, workflow and
experiment management software, that builds onto [Git](https://git-scm.com/)
(although it can work stand-alone). DVC reduces the gap between established
engineering tool sets and data science needs, allowing users to take advantage
of [new features](#core-features) while reusing existing skills and intuition.

![](/img/reproducibility.png)_DVC codifies data and ML experiments_
![](/img/reproducibility.png) _DVC codifies data and ML experiments_

Leveraging an underlying source code management system eliminates the need to
use 3rd-party services. Data science experiment sharing and collaboration can be
done through regular Git features (commit messages, merges, pull requests, etc)
the same way it works for software engineers.
Data science experiment sharing and collaboration can be done through regular
Git features (commits, branching, pull requests, etc.), the same way it works
for software engineers.

DVC is [open source](https://github.com/iterative/dvc/blob/master/LICENSE)
software!

## Core Principles

- **Workflow**: Set of experiments and relationships among them. Workflow
corresponds to the entire Git repository.

- **Pipeline**: Dependency graph or series of commands to reproduce data
processing results. The commands are connected by their inputs
(<abbr>dependencies</abbr>) and <abbr>outputs</abbr>. Pipelines are defined by
special [stage files](/doc/command-reference/run) (similar to
[Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)).
Refer to [pipeline](/doc/command-reference/pipeline) for more information.

- **Experiment**: Equivalent to a
[Git revision](https://git-scm.com/docs/revisions). Each experiment (extract
new features, change model hyperparameters, data cleaning, add a new data
@@ -26,28 +39,18 @@ the same way it works for software engineers.
[reference](https://git-scm.com/book/en/v2/Git-Internals-Git-References) to an
experiment state.

- **Reproducibility**: Action to reproduce an experiment state. This action
generates output files (or directories) based on a set of input files and
source code. This action usually changes experiment state.
- **Reproducibility**: Action to reproduce an experiment state. This regenerates
output files (or directories) based on a set of input files and source code.
This action usually changes experiment state.

- **Pipeline**: Dependency graph or series of commands to reproduce data
processing results. The commands are connected by their inputs
(<abbr>dependencies</abbr>) and <abbr>outputs</abbr>. Pipelines are defined by
special [stage files](/doc/command-reference/run) (similar to
[Makefiles](https://www.gnu.org/software/make/manual/make.html#Introduction)).
Refer to [pipeline](/doc/command-reference/pipeline) for more information.

- **Workflow**: Set of experiments and relationships among them. Workflow
corresponds to the entire Git repository.
> This is one of the biggest challenges in reusing, and hence managing ML
> projects.
- **Data files**: Cached files (for large files). Data files are stored outside
of the Git repository on a local/shared hard drive or remote storage, but
[DVC-files](/doc/user-guide/dvc-files-and-directories) describing that data
are stored in Git for DVC needs (to maintain pipelines and reproducibility).

- **Cache directory**: Directory with all data files on a local hard drive or in
cloud storage, but not in the Git repository. See `dvc cache dir`.

- **Cloud storage** support: available complement to the core DVC features. This
is how a data scientist transfers large data files or shares a GPU-trained
model with those without GPUs available.
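
For illustration only (not part of this PR's diff): the **Pipeline** principle above describes commands connected by their dependencies and outputs. The Python sketch below shows one way to read that idea, assuming three hypothetical stages and file names. It uses the standard-library `graphlib` module (Python 3.9+) and is not how DVC itself is implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical stages (names and paths are made up for this sketch).
# Each stage lists the files it reads (deps) and the files it writes (outs).
stages = {
    "prepare": {"deps": ["data/raw.csv"], "outs": ["data/clean.csv"]},
    "featurize": {"deps": ["data/clean.csv"], "outs": ["data/features.csv"]},
    "train": {"deps": ["data/features.csv"], "outs": ["model.pkl"]},
}


def execution_order(stages):
    """Order stages so each one runs after the stages producing its inputs."""
    # Map every output file to the stage that produces it.
    producer = {out: name for name, s in stages.items() for out in s["outs"]}
    # A stage depends on another stage when it reads one of that stage's outputs.
    graph = {
        name: {producer[dep] for dep in s["deps"] if dep in producer}
        for name, s in stages.items()
    }
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Files with no producing stage (here `data/raw.csv`) are treated as plain inputs, so they add no edge to the graph.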
@@ -78,4 +81,4 @@ and this approach will not require storing binary files in your Git repository.
The diagram below describes all the DVC commands and relationships between a
local cache and remote storage:

![](/img/flow-large.png)_DVC data management_
![](/img/flow-large.png) _DVC data management_
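
For illustration only (not part of this PR's diff): the **Data files** and **Cache directory** entries above say that large files live in a cache outside the Git repository, while small files describing them are kept in Git. The Python sketch below models that split with a content-addressed cache. The function name `add_to_cache`, the pointer layout, and the directory scheme are assumptions made for the sketch, not DVC's actual code or file format.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def add_to_cache(data_file: Path, cache_dir: Path) -> dict:
    """Copy a file into a content-addressed cache and return a small pointer.

    The pointer stands in for the metadata that would be committed to Git;
    the cached copy stands in for the large file kept outside the repository.
    """
    digest = hashlib.md5(data_file.read_bytes()).hexdigest()
    # Assumed layout for this sketch: <cache>/<first 2 hex chars>/<remaining chars>
    target = cache_dir / digest[:2] / digest[2:]
    target.parent.mkdir(parents=True, exist_ok=True)
    if not target.exists():  # identical content is stored only once
        shutil.copyfile(data_file, target)
    return {"md5": digest, "path": data_file.name}


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    data = workspace / "data.csv"
    data.write_text("a,b\n1,2\n")

    pointer = add_to_cache(data, workspace / "cache")
    cached = workspace / "cache" / pointer["md5"][:2] / pointer["md5"][2:]
    cached_ok = cached.read_text() == "a,b\n1,2\n"
```

Because the cache key is derived from the file's content, adding the same data twice reuses the existing cached copy, and only the small pointer needs to be versioned.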