user-guide: Basic Concepts #947

Closed
wants to merge 4 commits into from
11 changes: 6 additions & 5 deletions public/static/docs/command-reference/gc.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,11 +12,12 @@ usage: dvc gc [-h] [-q | -v] [-a] [-T] [-c] [-r <name>]
## Description

This command deletes (garbage collects) data files or directories that may exist
in the cache (or [remote storage](/doc/command-reference/remote) if `-c` is
used) but no longer referenced in [DVC-files](/doc/user-guide/dvc-file-format)
currently in the <abbr>workspace</abbr>. By default, this command only cleans up
the local cache, which is typically located on the same machine as the project
in question. This usually helps to free up disk space.
in the cache (or [remote storage](/doc/command-reference/remote#description) if
`-c` is used) but no longer referenced in
[DVC-files](/doc/user-guide/dvc-file-format) currently in the
<abbr>workspace</abbr>. By default, this command only cleans up the local cache,
which is typically located on the same machine as the project in question. This
usually helps to free up disk space.

There are important things to note when using Git to version the
<abbr>project</abbr>:
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/command-reference/get.md
Original file line number Diff line number Diff line change
Expand Up @@ -97,8 +97,8 @@ of the external Git repo. Instead, the corresponding DVC-file
[train.dvc](https://github.com/iterative/example-get-started/blob/master/train.dvc)
is found, that specifies `model.pkl` in its outputs (`outs`). DVC then
[pulls](/doc/command-reference/pull) the file from the default
[remote](/doc/command-reference/remote) of the external DVC project (found in
its
[remote](/doc/command-reference/remote#description) of the external DVC project
(found in its
[config file](https://github.com/iterative/example-get-started/blob/master/.dvc/config)).

> A recommended use for downloading binary files from DVC repositories, as done
Expand Down
5 changes: 3 additions & 2 deletions public/static/docs/command-reference/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -11,8 +11,9 @@ DVC is a command line tool. The typical DVC workflow goes as follows:
be tracked by DVC after the code is executed.
- Sharing a Git repository with the source code of your ML
[pipeline](/doc/command-reference/pipeline) will not include the project's
<abbr>cache</abbr>. Use [remote storage](/doc/command-reference/remote) and
`dvc push` to share this cache (data tracked by DVC).
<abbr>cache</abbr>. Use
[remote storage](/doc/command-reference/remote#description) and `dvc push` to
share this cache (data tracked by DVC).
- Use `dvc repro` to automatically reproduce your full pipeline, iteratively as
input data or source code change.

Expand Down
2 changes: 1 addition & 1 deletion public/static/docs/command-reference/pipeline/index.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ commands that take an input and produce an <abbr>output</abbr>). A pipeline may
produce intermediate data, and has a final result. Machine Learning (ML)
pipelines typically start with large raw datasets, include intermediate
featurization and training stages, and produce a final model, as well as
accuracy [metrics](/doc/command-reference/metrics).
accuracy [metrics](/doc/command-reference/metrics#description).

In DVC, pipeline stages and commands, their data I/O, interdependencies, and
results (intermediate or final) are specified with `dvc add` and `dvc run`,
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/command-reference/pull.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,8 +24,8 @@ downloading data to and from remote storage. These commands are analogous to
`git pull` and `git push`, respectively.
[Data sharing](/doc/use-cases/sharing-data-and-model-files) across environments
and preserving data versions (input datasets, intermediate results, models,
[metrics](/doc/command-reference/metrics), etc.) remotely (S3, SSH, GCS, etc.)
are the most common use cases for these commands.
[metrics](/doc/command-reference/metrics#description), etc.) remotely (S3, SSH,
GCS, etc.) are the most common use cases for these commands.

The `dvc pull` command allows one to retrieve data from remote storage.
`dvc pull` has the same effect as running `dvc fetch` and `dvc checkout`
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/command-reference/push.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,8 +22,8 @@ downloading data to and from remote storage. These commands are analogous to
`git pull` and `git push`, respectively.
[Data sharing](/doc/use-cases/sharing-data-and-model-files) across environments
and preserving data versions (input datasets, intermediate results, models,
[metrics](/doc/command-reference/metrics), etc.) remotely (S3, SSH, GCS, etc.)
are the most common use cases for these commands.
[metrics](/doc/command-reference/metrics#description), etc.) remotely (S3, SSH,
GCS, etc.) are the most common use cases for these commands.

The `dvc push` command allows one to upload data to remote storage. It doesn't
save any changes in the code or DVC-files. Those should be saved by using
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/get-started/compare-experiments.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,8 +3,8 @@
DVC makes it easy to iterate on your project using Git commits with tags or Git
branches. It provides a way to try different ideas, keep track of them, switch
back and forth. To find the best performing experiment or track the progress,
[project metrics](/doc/command-reference/metrics) are supported in DVC (as
described in one of the previous chapters).
[project metrics](/doc/command-reference/metrics#description) are supported in
DVC (as described in one of the previous chapters).

Let's run the evaluation for the latest `bigrams` experiment we created in
previous chapters. Mostly, it just takes running `dvc repro`:
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/get-started/import-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@

We've seen how to [push](/doc/get-started/store-data) and
[pull](/doc/get-started/retrieve-data) data from/to a <abbr>DVC project</abbr>'s
[remote](/doc/command-reference/remote). But what if we wanted to integrate a
dataset or ML model produced in one project into another one?
[remote](/doc/command-reference/remote#description). But what if we wanted to
integrate a dataset or ML model produced in one project into another one?

One way is to manually download the data (with `wget` or `dvc get`, for example)
and use `dvc add` to track it, but the connection between the projects would be
Expand Down
2 changes: 1 addition & 1 deletion public/static/docs/get-started/metrics.md
Original file line number Diff line number Diff line change
Expand Up @@ -17,7 +17,7 @@ $ dvc run -f evaluate.dvc \

`evaluate.py` calculates AUC value using the test dataset. It reads features
from the `features/test.pkl` file and produces a
[metric](/doc/command-reference/metrics) file (`auc.metric`). Any
[metric](/doc/command-reference/metrics#description) file (`auc.metric`). Any
<abbr>output</abbr> (in this case just a plain text file containing a single
numeric value) can be marked as a metric, for example by using the `-M` option
of `dvc run`.
Expand Down
2 changes: 1 addition & 1 deletion public/static/docs/get-started/store-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -2,7 +2,7 @@

Now that your data files are managed by DVC (see
[Add Files](/doc/get-started/add-files)), you can push them from your repository
to the default [remote](/doc/command-reference/remote) storage\*:
to the default [remote](/doc/command-reference/remote#description) storage\*:

```dvc
$ dvc push
Expand Down
7 changes: 4 additions & 3 deletions public/static/docs/install/linux.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,9 +15,10 @@
$ pip install dvc
```

Depending on the type of the [remote storage](/doc/command-reference/remote) you
plan to use, you might need to install optional dependencies: `[s3]`, `[azure]`,
`[gdrive]`, `[gs]`, `[oss]`, `[ssh]`. Use `[all]` to include them all.
Depending on the type of the
[remote storage](/doc/command-reference/remote#description) you plan to use, you
might need to install optional dependencies: `[s3]`, `[azure]`, `[gdrive]`,
`[gs]`, `[oss]`, `[ssh]`. Use `[all]` to include them all.

> Please restart your terminal or re-source the shell configuration file
> (`.bashrc`, `.zshrc`, etc.) if you get `Command 'dvc' not found` after
Expand Down
7 changes: 4 additions & 3 deletions public/static/docs/install/macos.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,9 +38,10 @@ from the [release page](https://github.com/iterative/dvc/releases/) on GitHub.
$ pip install dvc
```

Depending on the type of the [remote storage](/doc/command-reference/remote) you
plan to use, you might need to install optional dependencies: `[s3]`, `[azure]`,
`[gdrive]`, `[gs]`, `[oss]`, `[ssh]`. Use `[all]` to include them all.
Depending on the type of the
[remote storage](/doc/command-reference/remote#description) you plan to use, you
might need to install optional dependencies: `[s3]`, `[azure]`, `[gdrive]`,
`[gs]`, `[oss]`, `[ssh]`. Use `[all]` to include them all.

<details>

Expand Down
7 changes: 4 additions & 3 deletions public/static/docs/install/windows.md
Original file line number Diff line number Diff line change
Expand Up @@ -44,9 +44,10 @@ You can install with `pip` from a command line terminal like
$ pip install dvc
```

Depending on the type of the [remote storage](/doc/command-reference/remote) you
plan to use, you might need to install optional dependencies: `[s3]`, `[azure]`,
`[gdrive]`, `[gs]`, `[oss]`, `[ssh]`. Use `[all]` to include them all.
Depending on the type of the
[remote storage](/doc/command-reference/remote#description) you plan to use, you
might need to install optional dependencies: `[s3]`, `[azure]`, `[gdrive]`,
`[gs]`, `[oss]`, `[ssh]`. Use `[all]` to include them all.

<details>

Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/tutorials/deep/define-ml-pipeline.md
Original file line number Diff line number Diff line change
Expand Up @@ -366,8 +366,8 @@ pipeline's reproducibility, we use stage file name `Dvcfile`. (This will be
discussed in more detail in the next chapter.)

Note that the <abbr>output</abbr> file `data/eval.txt` was transformed by DVC
into a [metric](/doc/command-reference/metrics) file in accordance with the `-M`
option.
into a [metric](/doc/command-reference/metrics#description) file in accordance
with the `-M` option.

The result of the last three `dvc run` commands execution is three stage files
and a modified .gitignore file. All the changes should be committed with Git:
Expand Down
8 changes: 4 additions & 4 deletions public/static/docs/tutorials/deep/reproducibility.md
Original file line number Diff line number Diff line change
Expand Up @@ -42,7 +42,7 @@ Our NLP model was based on [unigrams](https://en.wikipedia.org/wiki/N-gram)
only. Let's improve the model by adding bigrams. The bigrams model will extract
signals not only from separate words but also from two-word combinations. This
eventually increases the number of features for the model and hopefully improves
the target [metric](/doc/command-reference/metrics).
the target [metric](/doc/command-reference/metrics#description).

Before editing the `code/featurization.py` file, please create and checkout a
new branch `bigrams`.
Expand Down Expand Up @@ -194,8 +194,8 @@ Reproducing 'Dvcfile':
python code/evaluate.py
```

Validate the [metric](/doc/command-reference/metrics) and commit all the
changes.
Validate the [metric](/doc/command-reference/metrics#description) and commit all
the changes.

```dvc
$ cat data/eval.txt
Expand Down Expand Up @@ -272,7 +272,7 @@ Reproducing 'Dvcfile':
python code/evaluate.py
```

Check the target [metric](/doc/command-reference/metrics):
Check the target [metric](/doc/command-reference/metrics#description):

```dvc
$ cat data/eval.txt
Expand Down
6 changes: 3 additions & 3 deletions public/static/docs/tutorials/deep/sharing-data.md
Original file line number Diff line number Diff line change
Expand Up @@ -21,9 +21,9 @@ can be done using the CLI as shown below.
> Note that we are using the `dvc-public` S3 bucket as an example and you don't
> have write access to it, so in order to follow the tutorial you will need to
> either create your own S3 bucket or use other types of
> [remote storage](/doc/command-reference/remote). E.g. you can set up a local
> remote as we did in the [Configure](/doc/get-started/configure) chapter of
> _Get Started_.
> [remote storage](/doc/command-reference/remote#description). E.g. you can set
> up a local remote as we did in the [Configure](/doc/get-started/configure)
> chapter of _Get Started_.

```dvc
$ dvc remote add -d upstream s3://dvc-public/remote/tutorial/nlp
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/tutorials/interactive.md
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,8 @@ Learn basic concepts and features of DVC with interactive lessons:

6. [Importing Data](https://katacoda.com/dvc/courses/basics/importing) <br/>
Download and track data from the
[remote storage](/doc/command-reference/remote) of any DVC project that is
hosted on a Git repository.
[remote storage](/doc/command-reference/remote#description) of any DVC
project that is hosted on a Git repository.

## Simple ML Scenarios

Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/tutorials/pipelines.md
Original file line number Diff line number Diff line change
Expand Up @@ -249,7 +249,7 @@ $ dvc run -d code/train_model.py -d data/matrix-train.pkl \
```

Finally, evaluate the model on the test dataset and get the
[metric](/doc/command-reference/metrics) file:
[metric](/doc/command-reference/metrics#description) file:

```dvc
$ dvc run -d code/evaluate.py -d data/model.pkl \
Expand Down Expand Up @@ -352,7 +352,7 @@ bag_of_words = CountVectorizer(stop_words='english',
```

Reproduce all required stages to get to the target
[metric](/doc/command-reference/metrics) file:
[metric](/doc/command-reference/metrics#description) file:

```dvc
$ dvc repro evaluate.dvc
Expand Down
9 changes: 5 additions & 4 deletions public/static/docs/tutorials/versioning.md
Original file line number Diff line number Diff line change
Expand Up @@ -74,7 +74,7 @@ more.

Now that we're done with preparations, let's add some data and then train the
first model. We'll capture everything with DVC, including the input dataset and
model [metrics](/doc/command-reference/metrics).
model [metrics](/doc/command-reference/metrics#description).

```dvc
$ dvc get https://github.com/iterative/dataset-registry \
Expand Down Expand Up @@ -139,8 +139,9 @@ Next, we train our first model with `train.py`. Because of the small dataset,
this training process should be small enough to run on most computers in a
reasonable amount of time (a few minutes). This command <abbr>outputs</abbr> a
bunch of files, among them `model.h5` and `metrics.csv`, weights of the trained
model, and [metrics](/doc/command-reference/metrics) history. The simplest way
to capture the current version of the model is to use `dvc add` again:
model, and [metrics](/doc/command-reference/metrics#description) history. The
simplest way to capture the current version of the model is to use `dvc add`
again:

```dvc
$ python train.py
Expand Down Expand Up @@ -302,7 +303,7 @@ above (with cats and dogs images) is a good example.
On the other hand, there are files that are the result of running some code. In
our example, `train.py` produces binary files (e.g.
`bottlneck_features_train.npy`), the model file `model.h5`, and the
[metrics](/doc/command-reference/metrics) file `metrics.csv`.
[metrics](/doc/command-reference/metrics#description) file `metrics.csv`.

When you have a script that takes some data as an input and produces other data
<abbr>outputs</abbr>, a better way to capture them is to use `dvc run`:
Expand Down
4 changes: 2 additions & 2 deletions public/static/docs/understanding-dvc/collaboration-issues.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,8 +23,8 @@ formalized. Common questions need to be answered in a unified, principled way.
- How do you track which of your
[hyperparameter](<https://en.wikipedia.org/wiki/Hyperparameter_(machine_learning)>)
changes contributed the most to producing or improving your target
[metric](/doc/command-reference/metrics)? How do you monitor the degree of
each change?
[metric](/doc/command-reference/metrics#description)? How do you monitor the
degree of each change?

### Navigating through experiments

Expand Down
6 changes: 3 additions & 3 deletions public/static/docs/understanding-dvc/related-technologies.md
Original file line number Diff line number Diff line change
Expand Up @@ -98,7 +98,7 @@ Luigi, etc.
workflow for machine learning and reproducible experiments. When a DVC or
Git-annex repository is cloned via `git clone`, data files won't be copied to
the local machine, as file contents are stored in separate
[remotes](/doc/command-reference/remote). With DVC,
[remotes](/doc/command-reference/remote#description). With DVC,
[DVC-files](/doc/user-guide/dvc-file-format), which provide the reproducible
workflow, are always included in the Git repository. Hence, they can be
executed locally with minimal effort.
Expand Down Expand Up @@ -129,8 +129,8 @@ Luigi, etc.

- `git-lfs` was not made with data science scenarios in mind, so it does not
provide related features (e.g. pipelines,
[metrics](/doc/command-reference/metrics)), and GitHub has a limit of 2
GB per repository.
[metrics](/doc/command-reference/metrics#description)), and GitHub has a
limit of 2 GB per repository.

---

Expand Down
13 changes: 7 additions & 6 deletions public/static/docs/use-cases/data-registries.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,9 +27,9 @@ Advantages of using a DVC **data registry**:
(`dvc get` and `dvc import` commands, similar to software package management
systems like `pip`).
- Persistence: The DVC registry-controlled
[remote storage](/doc/command-reference/remote) (e.g. an S3 bucket) improves
data security. There are fewer chances someone can delete or rewrite a model,
for example.
[remote storage](/doc/command-reference/remote#description) (e.g. an S3
bucket) improves data security. There are fewer chances someone can delete or
rewrite a model, for example.
- Storage Optimization: Track data
[shared](/doc/use-cases/sharing-data-and-model-files) by multiple projects
centralized in a single location (with the ability to create distributed
Expand Down Expand Up @@ -78,8 +78,8 @@ $ git commit -m "Track 1.8 GB 10,000 song dataset in music/"

The actual data is stored in the project's <abbr>cache</abbr> and should be
[pushed](/doc/command-reference/push) to one or more
[remote storage](/doc/command-reference/remote) locations, so the registry can
be accessed from other locations or by other people:
[remote storage](/doc/command-reference/remote#description) locations, so the
registry can be accessed from other locations or by other people:

```
$ dvc remote add -d myremote s3://bucket/path
Expand Down Expand Up @@ -225,7 +225,8 @@ $ tree --filelimit=100
```

And let's not forget to `dvc push` data changes to the
[remote storage](/doc/command-reference/remote), so others can obtain them!
[remote storage](/doc/command-reference/remote#description), so others can
obtain them!

```
$ dvc push
Expand Down
6 changes: 3 additions & 3 deletions public/static/docs/use-cases/sharing-data-and-model-files.md
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@
Like Git, DVC allows for a distributed environment and collaboration. We make it
easy to consistently get all your data files and directories into any machine,
along with matching source code. All you need to do is set up
[remote storage](/doc/command-reference/remote) for your <abbr>DVC
[remote storage](/doc/command-reference/remote#description) for your <abbr>DVC
project</abbr>, and push the data there, so others can reach it. Currently DVC
supports Amazon S3, Microsoft Azure Blob Storage, Google Drive, Google Cloud
Storage, SSH, HDFS, and other remote locations. The list is constantly growing.
Expand All @@ -12,8 +12,8 @@ Storage, SSH, HDFS, and other remote locations. The list is constantly growing.
![](/static/img/model-sharing-digram.png)

As an example, let's take a look at how you could set up an S3
[remote storage](/doc/command-reference/remote) for a <abbr>DVC project</abbr>,
and push/pull to/from it.
[remote storage](/doc/command-reference/remote#description) for a <abbr>DVC
project</abbr>, and push/pull to/from it.

## Create an S3 bucket

Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -10,8 +10,8 @@ It's useful when dealing with files that are too large for Git to handle
properly in general. DVC saves information about your data in special
[DVC-files](/doc/user-guide/dvc-file-format), and these metafiles can be used
for versioning. To actually store the data, DVC supports various types of
[remote storage](/doc/command-reference/remote). This allows easily saving and
sharing data alongside code.
[remote storage](/doc/command-reference/remote#description). This allows easily
saving and sharing data alongside code.

![](/static/img/model-versioning-diagram.png)

Expand Down Expand Up @@ -116,7 +116,7 @@ and model files are their latest versions.
![](/static/img/versioning.png)

To share your data with others, you need to set up a
[data storage](/doc/command-reference/remote). See the
[data storage](/doc/command-reference/remote#description). See the
[Sharing Data And Model Files](/doc/use-cases/sharing-data-and-model-files) use
case to get an overview on how to do this.

Expand Down
2 changes: 1 addition & 1 deletion public/static/docs/user-guide/dvc-file-format.md
Original file line number Diff line number Diff line change
Expand Up @@ -82,7 +82,7 @@ An output entry consists of these fields:
- `md5`: MD5 hash for the output
- `cache`: Whether or not dvc should cache the output
- `metric`: Whether or not this file is a
[metric](/doc/command-reference/metrics) file
[metric](/doc/command-reference/metrics#description) file

A metric entry consists of these fields:

Expand Down
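
Every hunk in this diff follows one pattern: a bare link to
`/doc/command-reference/remote` or `/doc/command-reference/metrics` gains a
`#description` anchor. The sketch below shows one way such a change could be
applied mechanically to a Markdown string. It is only an illustration: the
function name `add_description_anchor` and the `PAGES` tuple are hypothetical
and are not taken from the pull request, and the sketch does not re-wrap lines
to the 80-column limit the docs appear to follow.

```python
import re

# Pages whose links gain the "#description" anchor in this diff.
# (Hypothetical constant, introduced only for this illustration.)
PAGES = ("remote", "metrics")

# Matches a Markdown link target such as "(/doc/command-reference/remote)".
# The closing parenthesis must directly follow the page name, so links that
# already carry an anchor, or that point to a sub-page, are left alone.
BARE_LINK = re.compile(r"\(/doc/command-reference/(%s)\)" % "|".join(PAGES))


def add_description_anchor(markdown: str) -> str:
    """Append '#description' to bare links to the pages listed in PAGES."""
    return BARE_LINK.sub(r"(/doc/command-reference/\1#description)", markdown)


old = "Use [remote storage](/doc/command-reference/remote) and `dvc push`."
print(add_description_anchor(old))
# Use [remote storage](/doc/command-reference/remote#description) and `dvc push`.
```

Because the pattern only matches targets without an anchor, running the
function twice gives the same result as running it once. Applying the change
with a script rather than by hand could also help avoid malformed targets.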