guide: introduce Basic Concepts guide #1655

Closed
wants to merge 5 commits into from
69 changes: 57 additions & 12 deletions content/docs/user-guide/basic-concepts.md
@@ -4,29 +4,65 @@ DVC streamlines large data files and binary models into a single Git
environment. This approach will not require storing binary files in your Git
repository.

### Data Files
## DVC Project

A directory initialized by running `dvc init`. It contains all the
[DVC files and directories](/doc/user-guide/dvc-files-and-directories),
including the <abbr>cache</abbr>, `dvc.yaml` and `.dvc` files. Any other
files referenced from special DVC files are also considered part of the project
(for example [metrics files](/doc/command-reference/metrics)).

> `dvc destroy` can be used to remove all DVC-specific files from the directory,
> in effect deleting the DVC project.

## DVC repository

A <abbr>DVC project</abbr> initialized in a Git repository. This enables the
versioning features of DVC (recommended). Files tracked by Git are considered
part of the DVC project when they are referenced from special DVC files such as
`dvc.lock`. An example is source code used as a stage
<abbr>dependency</abbr>.

## Data Files

Large files (or directories) that are tracked and <abbr>cached</abbr> by DVC.
Data files are stored outside of the Git repository, on a local/shared hard
drive, and/or remote storage. `.dvc` files describing the data are put into Git
as placeholders, for DVC needs (to maintain pipelines and reproducibility).
Data files are too large to be added to a Git repository. DVC stores them on a
local/shared hard drive, and/or _remote storage_. `dvc.lock` or `.dvc` files
describing the data are put in the <abbr>project</abbr> as placeholders for DVC
needs (to maintain pipelines and reproducibility). These can be committed to Git
instead of the data files themselves.

Examples of data files are raw datasets, extracted features, ML models,
performance data, etc.

> A.k.a. <abbr>data artifacts</abbr> and <abbr>outputs</abbr>

### DVC Cache
## Workspace

The workspace comprises the non-internal <abbr>project</abbr> files, as well as
the currently present set of _data files_ and directories (see `dvc checkout`).
It is similar to the
[working tree](https://git-scm.com/docs/gitglossary#def_working_tree) in Git.

## DVC Cache

A DVC project's <abbr>cache</abbr> is an
[internal directory](/doc/user-guide/dvc-files-and-directories#structure-of-cache-directory)
used to store all data files outside of the Git repository. It can be located
on a local hard drive or in an external location. See `dvc cache dir`.
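
To illustrate how a cache can store data outside of Git, here is a minimal
sketch of content-addressed storage. It assumes files are keyed by an MD5 digest
of their contents and spread over two-character subdirectories. The function
name `cache_path_for` and the exact layout are hypothetical and only meant to
convey the idea, not to reproduce DVC's implementation.

```python
import hashlib
from pathlib import Path


def cache_path_for(data: bytes, cache_dir: str = ".dvc/cache") -> Path:
    """Return a content-addressed path for `data` (illustrative layout only)."""
    digest = hashlib.md5(data).hexdigest()
    # The first two hex characters name a subdirectory and the rest names the
    # file, so identical contents always map to the same location.
    return Path(cache_dir) / digest[:2] / digest[2:]


path = cache_path_for(b"hello")
print(path.as_posix())
```

Because the path depends only on the contents, two copies of the same data
occupy a single entry under this scheme.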

### Processing Stage
## Remote Storage

Storage location external to the DVC project, which is used to share and back up
all or parts of the <abbr>cache</abbr>. See `dvc remote` for more details.

## Processing Stage

An individual process that transforms a data input (<abbr>dependency</abbr>)
into some result (usually a data <abbr>output</abbr>). DVC stages execute
terminal commands to (re)generate their results.

### Data Pipeline
## Data Pipeline

Dependency graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)),
or series of [data processing stages](#stage) to (re)produce certain results.
@@ -35,7 +71,7 @@ defined in special `dvc.yaml` files. Refer to `dvc dag` for more information.

See [Data Pipelines](/doc/start/data-pipelines) for a hands-on explanation.
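
As a rough illustration of how a dependency graph determines the order in which
stages can run, the sketch below sorts a small DAG topologically with Python's
standard `graphlib` module (Python 3.9+). The stage names and the
`execution_order` helper are hypothetical. This is not how DVC itself is
implemented.

```python
from graphlib import TopologicalSorter

# Hypothetical stages, each mapped to the set of stages it depends on.
stages = {
    "prepare": set(),
    "featurize": {"prepare"},
    "train": {"featurize"},
    "evaluate": {"train", "featurize"},
}


def execution_order(graph):
    """Return the stages in an order that respects every dependency."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(stages)
print(order)
```

Every stage appears after all of the stages it depends on, which is the
property a pipeline needs in order to (re)produce its results.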

### Reproducibility
## Reproducibility

Action to reproduce an experiment state. This regenerates output files (or
directories) based on a set of input files and source code. This action usually
@@ -44,9 +80,7 @@ changes experiment state.
> This is one of the biggest challenges in reusing, and hence managing ML
> projects.

## Advanced Concepts

### Experiment
## Experiment

An attempt at a data science task. Each one can be performed in a separate Git
branch or tag, and its states identified by different
@@ -57,7 +91,18 @@ experiment into the <abbr>repository</abbr> history.

> See [Experiments](/doc/start/experiments) for a hands-on explanation.

### Workflow
## Run Cache

DVC's run-cache is an automatic performance feature that stores both the context
and results of past experiment runs. It's located in the `.dvc/cache/runs`
directory.

`dvc run` and `dvc repro` look in the run-cache before executing any stages, to
see if the exact same configuration has been run before (and if so, use the
cached results). The run-cache can be uploaded to and downloaded from remote
storage, along with the rest of the <abbr>cache</abbr>.
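
The lookup described above can be sketched as a small memoization layer. This
toy example assumes a run is identified by its command plus the hashes of its
dependencies. The `run_key` function and `RunCache` class are hypothetical names
introduced for illustration and do not reflect DVC's internal API or on-disk
format.

```python
import hashlib
import json


def run_key(cmd: str, dep_hashes: dict) -> str:
    """Build a stable key from a command and its dependency hashes."""
    payload = json.dumps({"cmd": cmd, "deps": dep_hashes}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


class RunCache:
    """In-memory stand-in for a run-cache (illustrative only)."""

    def __init__(self):
        self._results = {}
        self.executions = 0

    def run(self, cmd, dep_hashes, execute):
        key = run_key(cmd, dep_hashes)
        if key not in self._results:
            # Unseen configuration: execute the stage and remember the result.
            self.executions += 1
            self._results[key] = execute()
        return self._results[key]


cache = RunCache()
first = cache.run("python train.py", {"data.csv": "abc"}, lambda: "model-v1")
second = cache.run("python train.py", {"data.csv": "abc"}, lambda: "model-v2")
print(first, second, cache.executions)
```

The second call reuses the stored result because neither the command nor the
dependency hashes changed. Changing either one produces a new key and triggers
a fresh execution.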

## Workflow

Set of experiments and relationships among them. Corresponds to the entire
<abbr>project</abbr> and may contain several [data pipelines](#data-pipelines).
22 changes: 10 additions & 12 deletions content/docs/user-guide/dvc-files-and-directories.md
@@ -5,26 +5,24 @@ directory (`.dvc/`) with the
[internal directories and files](#internal-directories-and-files) needed for DVC
operation.

Additionally, there are a few special kind of files created by certain
[DVC commands](/doc/command-reference):
Additionally, there are a few special kinds of files that support DVC's
features:

- Files ending with the `.dvc` extension are placeholders to track data files
and directories. A <abbr>DVC project</abbr> usually has one
[`.dvc` file](#dvc-files) per large data file or dataset directory being
tracked.
- The [`dvc.yaml` file](#dvcyaml-file) or _pipeline(s) file_ specifies stages
that form the pipeline(s) of a project, and their connections (_dependency
graph_ or DAG).
and directories. A <abbr>DVC project</abbr> usually has one `.dvc` file per
large data file or dataset directory being tracked.
- `dvc.yaml` files (or _pipelines files_) specify stages that form the
pipeline(s) of a project, and how they connect (_dependency graph_ or DAG).

These typically come with a matching `dvc.lock` file to record the pipeline
state and track its <abbr>data artifacts</abbr>.
These typically have a matching `dvc.lock` file to record the pipeline state
and track its <abbr>data artifacts</abbr>.

Both `.dvc` files and `dvc.yaml` use human-friendly YAML schemas, described
below. We encourage you to get familiar with them so you may create, generate,
and edit them on your own.

All these should be versioned with Git (in Git-enabled
<abbr>repositories</abbr>).
Both the internal directory and these special files should be versioned with Git
(in Git-enabled <abbr>repositories</abbr>).

## .dvc files
