Add params #1128

Merged
merged 13 commits into from
Apr 11, 2020
116 changes: 116 additions & 0 deletions content/docs/command-reference/params/diff.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,116 @@
# params diff

Show changes in [project parameters](/doc/command-reference/params) between
commits in the <abbr>DVC repository</abbr>, or between a commit and the
<abbr>workspace</abbr>.

## Synopsis

```usage
usage: dvc params diff [-h] [-q | -v] [--show-json] [a_rev] [b_rev]

positional arguments:
  a_rev       Old Git commit to compare (defaults to HEAD)
  b_rev       New Git commit to compare (defaults to the
              current workspace)
```

## Description

This command provides a quick way to compare parameters from your previous
experiments with the current ones in your pipeline, as long as you're using
params that DVC is aware of (see `--params` in `dvc run`). When run without
arguments, it compares the parameters currently present in the
<abbr>workspace</abbr> (uncommitted changes) with the latest committed
version. Only parameters used in at least one stage are shown; parameters not
used by any stage are ignored.

## Options

- `--show-json` - prints the command's output in easily parsable JSON format,
instead of a human-readable table.

- `-h`, `--help` - prints the usage/help message, and exits.

- `-q`, `--quiet` - do not write anything to standard output. Exit with 0 if no
problems arise, otherwise 1.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

Let's create a simple parameters file and a stage with a params dependency (see
`dvc params` and `dvc run` to learn more):

```dvc
$ cat params.yaml
lr: 0.0041

train:
  epochs: 70
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000
```

Define a pipeline stage with dependencies on parameters:

```dvc
$ dvc run -d users.csv -o model.pkl \
-p lr,train \
python train.py
```

Let's print parameter values that we are tracking in this <abbr>project</abbr>:

```dvc
$ dvc params diff
Path          Param           Old     New
params.yaml   lr              None    0.0041
params.yaml   train.layers    None    9
params.yaml   train.epochs    None    70
```

The command shows the difference between the workspace and the last committed
version of the `params.yaml` file, which does not exist yet. This is why all
`Old` values are `None`.

Note that not all the parameters were printed. `dvc params diff` prints only
changed parameters that were used in one of the stages, and ignores parameters
from the `processing` group, which were not used.
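
The filtering described above can be illustrated with a minimal Python sketch.
This is not DVC's implementation: `diff_params` and the flat dictionaries are
hypothetical, and it only assumes that a missing old value is reported as
`None` and that untracked parameters are skipped.

```python
def diff_params(old, new, tracked):
    """Return (param, old_value, new_value) rows for tracked params that differ."""
    rows = []
    for name in tracked:
        old_value, new_value = old.get(name), new.get(name)
        if old_value != new_value:
            rows.append((name, old_value, new_value))
    return rows


# Hypothetical flattened views of params.yaml in two revisions.
committed = {}  # params.yaml has not been committed yet
workspace = {
    "lr": 0.0041,
    "train.layers": 9,
    "train.epochs": 70,
    "processing.threshold": 0.98,
    "processing.bow_size": 15000,
}
# Only these parameters are used by a stage (`-p lr,train`).
tracked = ["lr", "train.layers", "train.epochs"]

for name, old_value, new_value in diff_params(committed, workspace, tracked):
    # `!s` converts None to text so it can be padded like the other values.
    print(f"params.yaml   {name:<16}{old_value!s:<8}{new_value}")
```

The `processing` group never appears in the rows because it is not in
`tracked`, and comparing a revision with itself yields no rows at all.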

In a project with parameter file history you will see both `Old` and `New`
values:

```dvc
$ dvc params diff
Path          Param           Old      New
params.yaml   lr              0.0041   0.0043
params.yaml   train.layers    9        7
params.yaml   train.epochs    70       110
```

To compare the workspace parameters with a specific commit, tag, or other
revision, specify it as an additional command line argument:

```dvc
$ dvc params diff e12b167
Path          Param           Old      New
params.yaml   lr              0.0038   0.0043
params.yaml   train.epochs    70       110
```

Note that the `train.layers` parameter disappeared because its value did not
change between the current version in the workspace and the specified one.

To see the difference between two specific commits, both need to be specified:

```dvc
$ dvc params diff e12b167 HEAD^
Path          Param           Old      New
params.yaml   lr              0.0038   0.0041
params.yaml   train.layers    10       9
params.yaml   train.epochs    50       70
```
136 changes: 136 additions & 0 deletions content/docs/command-reference/params/index.md
@@ -0,0 +1,136 @@
# params

A set of commands to manage and display project parameters:
[diff](/doc/command-reference/params/diff).

## Synopsis

```usage
usage: dvc params [-h] [-q | -v] {diff} ...

positional arguments:
  COMMAND
    diff      Show changes in params between commits in the
              DVC repository, or between a commit and the workspace.
```

## Description

In order to track parameters and hyperparameters associated with machine
learning experiments, DVC has a special type of <abbr>dependencies</abbr>:
parameters (see the `--params` option of `dvc run`). Parameters are
project-specific values, e.g. `epochs`, `learning-rate`, `batch_size`,
`num_classes`, etc.

In contrast to regular file <abbr>dependencies</abbr>, a parameter is a pair: a
file dependency (the parameters file) and a parameter name inside that file.
The supported parameters file formats are YAML and JSON. The default parameters
file name is `params.yaml`. Parameters are organized in a tree hierarchy inside
the file, and DVC addresses each parameter by its tree path.
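
To make tree-path addressing concrete, here is a minimal Python sketch. It
assumes that a path is a dot-separated list of keys (as in `train.epochs`);
`get_param` and the in-memory `PARAMS` dictionary are hypothetical and are not
part of DVC.

```python
from functools import reduce

# Hypothetical in-memory copy of the params.yaml example used in these docs.
PARAMS = {
    "lr": 0.0041,
    "train": {"epochs": 70, "layers": 9},
    "processing": {"threshold": 0.98, "bow_size": 15000},
}


def get_param(params, path):
    """Follow a dot-separated path through nested dictionaries."""
    return reduce(lambda node, key: node[key], path.split("."), params)


print(get_param(PARAMS, "lr"))            # a top-level parameter
print(get_param(PARAMS, "train.epochs"))  # a parameter inside the `train` group
print(get_param(PARAMS, "train"))         # a whole group
```

A path that does not exist raises `KeyError` in this sketch; how DVC itself
reports a missing parameter is not covered here.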

The parameters concept helps to define stage dependencies more granularly: a
stage is invalidated (and has to be executed again) only when a particular
parameter or set of parameters changes, not whenever the parameters file
changes. This prevents situations where many pipeline stages depend on a
single file and any change in that file invalidates all of these stages.

The supported parameter value types are strings, integers, floats, and arrays.
DVC itself does not ascribe any specific meaning to these parameter values.
Usually these values are defined by users and serve as a way to generalize and
parametrize a machine learning algorithm or data processing code.

The [run](/doc/command-reference/run) command defines parameters, and the
[diff](/doc/command-reference/params/diff) command is available to show changes
in <abbr>DVC project</abbr> parameters.

## Options

- `-h`, `--help` - prints the usage/help message, and exits.

- `-q`, `--quiet` - do not write anything to standard output.

- `-v`, `--verbose` - displays detailed tracing information.

## Examples

First, let's create a simple parameters file:

```dvc
$ cat params.yaml
lr: 0.0041

train:
  epochs: 70
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000
```

Define a pipeline stage with dependencies on the parameters `lr`, `layers` and
`epochs` from the default parameters file `params.yaml`. Full parameter paths
are used to specify the `layers` and `epochs` parameters from the `train` group:

```dvc
$ dvc run -d users.csv -o model.pkl \
-p lr,train.epochs,train.layers \
python train.py
```

> `-p` (`--params`) tells DVC to mark `lr`, `train.epochs` and `train.layers`
> as parameters, where `train.epochs` and `train.layers` are full paths to
> these two params in the YAML file. JSON files use the same parameter
> addressing.

The entire `train` group of parameters can be referenced instead of specifying
each of its parameters separately:

```dvc
$ dvc run -d users.csv -o model.pkl \
-p lr,train \
python train.py
```
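
One way to picture a group reference is as an expansion into the full paths of
its leaf parameters. The Python sketch below illustrates that idea only; the
`expand` helper and the `PARAMS` dictionary are hypothetical, and DVC's own
handling may differ.

```python
# Hypothetical in-memory copy of the params.yaml file shown above.
PARAMS = {
    "lr": 0.0041,
    "train": {"epochs": 70, "layers": 9},
    "processing": {"threshold": 0.98, "bow_size": 15000},
}


def expand(params, names):
    """Map requested names to leaf values, expanding groups into full paths."""
    tracked = {}
    for name in names:
        node = params
        for key in name.split("."):
            node = node[key]
        if isinstance(node, dict):
            # A group: record every parameter below it under its full path.
            children = [f"{name}.{child}" for child in node]
            tracked.update(expand(params, children))
        else:
            tracked[name] = node
    return tracked


print(expand(PARAMS, ["lr", "train"]))
print(expand(PARAMS, ["lr", "train.epochs", "train.layers"]))
```

Under this assumption, `-p lr,train` and `-p lr,train.epochs,train.layers`
end up tracking the same three parameters.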

You can see that each parameter and its value were saved in the DVC-file. These
values will be compared to the values in the parameters file during the next
`dvc repro` to determine whether the dependency on the parameters file is
invalidated:

```yaml
md5: 05d178cfa0d1474b6c5800aa1e1b34ac
cmd: python train.py
deps:
  - md5: 3aec0a6cf36720a1e9b0995a01016242
    path: users.csv
  - path: params.yaml
    params:
      lr: 0.0041
      train.epochs: 70
      train.layers: 9
```

In the examples above, the default parameters file `params.yaml` was used. A
different parameters file can be specified with a file name prefix:

```dvc
$ dvc run -d logs/ -o users.csv \
-p parse_params.yaml:threshold,classes_num \
python train.py
```
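
The `[<filename>:]<params_list>` form can be read as an optional file name
followed by a comma-separated list of parameter names. The following Python
sketch shows one way to split such a value; `parse_params_arg` is a
hypothetical helper written for illustration, not DVC's parser.

```python
DEFAULT_PARAMS_FILE = "params.yaml"


def parse_params_arg(arg):
    """Split a `[<filename>:]<params_list>` value into (filename, [names])."""
    # rpartition leaves the file name empty when there is no ":" prefix.
    filename, separator, names = arg.rpartition(":")
    if not separator:
        filename = DEFAULT_PARAMS_FILE
    return filename, [name.strip() for name in names.split(",") if name.strip()]


print(parse_params_arg("lr,train"))
print(parse_params_arg("parse_params.yaml:threshold,classes_num"))
```

Without a prefix the sketch falls back to `params.yaml`, which matches the
default described above.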

Now let's print parameter values that we are tracking in this
<abbr>project</abbr>:

```dvc
$ dvc params diff
Path          Param           Old     New
params.yaml   lr              None    0.0041
params.yaml   train.layers    None    9
params.yaml   train.epochs    None    70
```

The command shows the difference between the workspace and the last committed
version of the `params.yaml` file, which does not exist yet. This is why all
`Old` values are `None`. See `dvc params diff` to learn more about the `diff`
command.
52 changes: 43 additions & 9 deletions content/docs/command-reference/run.md
@@ -6,10 +6,12 @@ command and execute the command.
## Synopsis

```usage
usage: dvc run [-h] [-q | -v] [-d <path>] [-o <path>] [-O <path>]
[-m <path>] [-M <path>] [-f <filename>] [-c <path>]
[-w <path>] [--no-exec] [-y] [--overwrite-dvcfile]
usage: dvc run [-h] [-q | -v] [-d DEPS] [-o OUTS] [-O OUTS_NO_CACHE]
[-p PARAMS] [-m METRICS] [-M METRICS_NO_CACHE] [-f FILE]
[-c CWD] [-w WDIR] [--no-exec] [-y] [--overwrite-dvcfile]
[--ignore-build-cache] [--remove-outs] [--no-commit]
[--outs-persist OUTS_PERSIST]
[--outs-persist-no-cache OUTS_PERSIST_NO_CACHE]
[--always-changed]
command
Contributor

@jorgeorpinel jorgeorpinel Apr 10, 2020

One last comment: dvc run is getting gargantuan 🐋

Isn't there an issue out there somewhere to split it into 2 commands? May be time to revisit that 😬


@@ -21,10 +23,11 @@ positional arguments:

`dvc run` provides an interface to describe stages: individual commands and the
data input and output that go into creating a result. By specifying a list of
dependencies (`-d` option) and <abbr>outputs</abbr> (`-o`, `-O`, `-m`, or `-M`
options) DVC can later connect each stage by building a dependency graph
([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)). This graph is
used by DVC to restore a full data [pipeline](/doc/command-reference/pipeline).
dependencies (`-d` option), params (`-p` option) and <abbr>outputs</abbr> (`-o`,
`-O`, `-m`, or `-M` options) DVC can later connect each stage by building a
dependency graph ([DAG](https://en.wikipedia.org/wiki/Directed_acyclic_graph)).
This graph is used by DVC to restore a full data
[pipeline](/doc/command-reference/pipeline).

The remaining terminal input provided to `dvc run` after the command options
(`-`/`--` flags) will become the required `command` argument. Please wrap the
@@ -93,6 +96,14 @@ data pipeline (e.g. random numbers, time functions, hardware dependency, etc.)
> Note that a DVC-file without dependencies is considered always changed, so
> `dvc repro` always executes it.

- `-p [<filename>:]<params_list>`, `--params [<filename>:]<params_list>` -
specify a subset of parameters, from a parameters file, that the stage depends
on. The subset is given as a comma-separated list of params:
`-p learning_rate,epochs`. By default, the parameters file is `params.yaml`,
but a different file can be specified with a file name prefix:
`-p parse_params.yaml:threshold`. See `dvc params` to learn more about using
parameters.

- `-o <path>`, `--outs <path>` - specify a file or directory that is the result
of running the `command`. Multiple outputs can be specified:
`-o model.pkl -o output.log`. DVC builds a dependency graph (pipeline) to
@@ -187,9 +198,10 @@ $ mkdir example && cd example
$ git init
$ dvc init
$ mkdir data
$ dvc run -d data -o metric -f metric.dvc "echo '1' >> metric"
$ dvc run -d data -o metric -f metric.dvc \
"echo '{ \"AUC\": 0.86252 }' >> metric"
Running command:
echo '1' >> metric
echo '{ "AUC": 0.86252 }' >> metric
WARNING: 'data' is empty.

To track the changes with git, run:
@@ -218,6 +230,28 @@ $ dvc run -d parsingxml.R -d Posts.xml \
Rscript parsingxml.R Posts.xml Posts.csv
```

Use a subset of hyperparameters from the default params file `params.yaml`. The
parameters should be read by the user's code. DVC can granularly track
dependencies on the defined subset of parameters for `dvc repro`:

```dvc
$ cat params.yaml
seed: 20180226

train:
  lr: 0.0041
  epochs: 75
  layers: 9

processing:
  threshold: 0.98
  bow_size: 15000

$ dvc run -d matrix-train.p -d train_model.py -o model.p \
          -p seed,train.lr,train.epochs \
          python train_model.py matrix-train.p model.p
```
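
Since the parameters have to be read by the user's code, the training script
needs to load the file itself. The Python sketch below shows one possible
approach. To stay within the standard library it assumes a JSON parameters
file (the name `params.json` and the `load_train_params` helper are
hypothetical); reading `params.yaml` would require a YAML parser, which is
not part of the standard library.

```python
import json
import tempfile
from pathlib import Path


def load_train_params(path):
    """Read only the parameters that the training stage depends on."""
    params = json.loads(Path(path).read_text())
    return {
        "seed": params["seed"],
        "lr": params["train"]["lr"],
        "epochs": params["train"]["epochs"],
    }


# Write a throwaway parameters file so that the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    params_file = Path(tmp) / "params.json"  # hypothetical file name
    params_file.write_text(json.dumps({
        "seed": 20180226,
        "train": {"lr": 0.0041, "epochs": 75, "layers": 9},
        "processing": {"threshold": 0.98, "bow_size": 15000},
    }))
    config = load_train_params(params_file)

print(config)
```

Reading only `seed`, `train.lr` and `train.epochs` keeps the script in line
with the subset passed to `-p`, so every value the code uses is also tracked.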

Extract an XML file from an archive to the `data/` folder:

```dvc
11 changes: 11 additions & 0 deletions content/docs/sidebar.json
@@ -284,6 +284,17 @@
"label": "move",
"slug": "move"
},
{
"label": "param",
Member

params?

"slug": "params",
"source": "params/index.md",
"children": [
{
"label": "params diff",
"slug": "diff"
}
]
},
{
"label": "pipeline",
"slug": "pipeline",