diff --git a/content/docs/command-reference/dag.md b/content/docs/command-reference/dag.md index 8c663f788f..5519861ab4 100644 --- a/content/docs/command-reference/dag.md +++ b/content/docs/command-reference/dag.md @@ -25,7 +25,7 @@ the `dvc.yaml` files found in the project. Provide a `target` stage name to show the pipeline up to that point. [directed acyclic graph]: - /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag + /doc/user-guide/pipelines/defining-pipelines#directed-acyclic-graph-dag ### Paginating the output diff --git a/content/docs/command-reference/exp/index.md b/content/docs/command-reference/exp/index.md index c733a2d77e..77ddf44685 100644 --- a/content/docs/command-reference/exp/index.md +++ b/content/docs/command-reference/exp/index.md @@ -49,8 +49,11 @@ science/ machine learning experiments. 📖 See [Experiment Management](/doc/user-guide/experiment-management) for more info. -> ⚠️ Note that DVC assumes that experiments are deterministic (see **Avoiding -> unexpected behavior** in `dvc stage add`). +> ⚠️ Note that DVC assumes that experiments are deterministic (see [Avoiding +> unexpected behavior]). + +[avoiding unexpected behavior]: + /doc/user-guide/project-structure/dvcyaml-files#avoiding-unexpected-behavior ## Options diff --git a/content/docs/command-reference/exp/init.md b/content/docs/command-reference/exp/init.md index 5794a9b50e..dd1d45e183 100644 --- a/content/docs/command-reference/exp/init.md +++ b/content/docs/command-reference/exp/init.md @@ -97,7 +97,7 @@ See the [Pipelines guide] for more on that topic. 
/doc/user-guide/project-structure/dvcyaml-files#stage-commands [checkpoints]: /doc/user-guide/experiment-management/checkpoints [dvc experiments]: /doc/user-guide/experiment-management/experiments-overview -[pipelines guide]: /doc/user-guide/data-pipelines/defining-pipelines +[pipelines guide]: /doc/user-guide/pipelines/defining-pipelines ## Options diff --git a/content/docs/command-reference/move.md b/content/docs/command-reference/move.md index 2fe4e00d46..49c6fe6276 100644 --- a/content/docs/command-reference/move.md +++ b/content/docs/command-reference/move.md @@ -93,7 +93,7 @@ Often the output of a stage is a dependency in another stage, creating a [dependency graph]. In this case, you may want to also update the `path` in the `deps` field of `dvc.yaml`. -[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines +[dependency graph]: /doc/user-guide/pipelines/defining-pipelines diff --git a/content/docs/command-reference/params/index.md b/content/docs/command-reference/params/index.md index b86eb66c0c..81e90b413c 100644 --- a/content/docs/command-reference/params/index.md +++ b/content/docs/command-reference/params/index.md @@ -75,7 +75,7 @@ is outdated upon `dvc repro` (or `dvc status`). 
[hyperparameters]: /doc/user-guide/experiment-management/running-experiments#tuning-hyperparameters [use the same params file]: - /doc/user-guide/data-pipelines/defining-pipelines#parameter-dependencies + /doc/user-guide/pipelines/defining-pipelines#parameter-dependencies [more details]: /doc/user-guide/project-structure/dvcyaml-files#parameters [templating]: /doc/user-guide/project-structure/dvcyaml-files#templating [stage commands]: /doc/user-guide/project-structure/dvcyaml-files#stage-commands diff --git a/content/docs/command-reference/repro.md b/content/docs/command-reference/repro.md index ba4e0b9efa..647f1194bc 100644 --- a/content/docs/command-reference/repro.md +++ b/content/docs/command-reference/repro.md @@ -68,7 +68,7 @@ It stores all the data files, intermediate or final results in the hash values of changed dependencies and outputs in the `dvc.lock` and `.dvc` files. -[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines +[dependency graph]: /doc/user-guide/pipelines/defining-pipelines [always changed]: /doc/command-reference/status#local-workspace-status ### Parallel stage execution @@ -160,10 +160,8 @@ up-to-date and only execute the final stage. option, as all possible targets are already included. - `--no-run-cache` - execute stage command(s) even if they have already been run - with the same dependencies and outputs (see the - [details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful - for example if the stage command/s is/are non-deterministic - ([not recommended](/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior)). + with the same dependencies and outputs (see the [details]). Useful for example + if the stage command/s is/are non-deterministic ([not recommended]). - `--force-downstream` - in cases like `... 
-> A (changed) -> B -> C` it will reproduce `A` first and then `B`, even if `B` was previously executed with the @@ -185,11 +183,8 @@ up-to-date and only execute the final stage. corresponding pipelines, including the target stages themselves. This option has no effect if `targets` are not provided. -- `--pull` - attempts to download outputs of stages found in the - [run-cache](/doc/user-guide/project-structure/internal-files#run-cache) during - reproduction. Uses the - [default remote storage](/doc/command-reference/remote/default). See also - `dvc pull` +- `--pull` - attempts to download outputs of stages found in the [run-cache] + during reproduction. Uses the [default remote storage]. See also `dvc pull` - `-h`, `--help` - prints the usage/help message, and exit. @@ -200,6 +195,12 @@ up-to-date and only execute the final stage. - `-v`, `--verbose` - displays detailed tracing information. +[details]: /doc/user-guide/project-structure/internal-files#run-cache +[not recommended]: + /doc/user-guide/project-structure/dvcyaml-files#avoiding-unexpected-behavior +[run-cache]: /doc/user-guide/project-structure/internal-files#run-cache +[default remote storage]: /doc/command-reference/remote/default + ## Examples > To get hands-on experience with data science and machine learning pipelines, diff --git a/content/docs/command-reference/run.md b/content/docs/command-reference/run.md index 8006801088..1328c24484 100644 --- a/content/docs/command-reference/run.md +++ b/content/docs/command-reference/run.md @@ -107,7 +107,7 @@ Relevant notes: [manual process](/doc/command-reference/move#renaming-stage-outputs) to update `dvc.yaml` and the project's cache accordingly. -[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines +[dependency graph]: /doc/user-guide/pipelines/defining-pipelines ### For displaying and comparing data science experiments @@ -216,10 +216,8 @@ data science experiments. asking for confirmation. 
- `--no-run-cache` - execute the stage command(s) even if they have already been - run with the same dependencies and outputs (see the - [details](/doc/user-guide/project-structure/internal-files#run-cache)). Useful - for example if the stage command/s is/are non-deterministic - ([not recommended](/doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior)). + run with the same dependencies and outputs (see the [details]). Useful for + example if the stage command/s is/are non-deterministic ([not recommended]). - `--no-commit` - do not store the outputs of this execution in the cache (`dvc.yaml` and `dvc.lock` are still created or updated); useful to avoid @@ -231,7 +229,7 @@ data science experiments. when reproducing the pipeline. - `--external` - allow writing outputs outside of the DVC repository. See - [Managing External Data](/doc/user-guide/managing-external-data). + [Managing External Data]. - `--desc ` - user description of the stage (optional). This doesn't affect any DVC operations. @@ -243,6 +241,11 @@ data science experiments. - `-v`, `--verbose` - displays detailed tracing information. +[details]: /doc/user-guide/project-structure/internal-files#run-cache +[not recommended]: + /doc/user-guide/project-structure/dvcyaml-files#avoiding-unexpected-behavior +[managing external data]: /doc/user-guide/managing-external-data + ## Examples Let's create a stage (that counts the number of lines in a `test.txt` file): diff --git a/content/docs/command-reference/stage/add.md b/content/docs/command-reference/stage/add.md index dcfd76b781..797364f05f 100644 --- a/content/docs/command-reference/stage/add.md +++ b/content/docs/command-reference/stage/add.md @@ -46,7 +46,7 @@ graph] and execute them. See the guide on [defining pipeline stages] for more details. 
[defining pipeline stages]: - /doc/user-guide/data-pipelines/defining-pipelines#pipelines + /doc/user-guide/pipelines/defining-pipelines#pipelines @@ -111,7 +111,7 @@ Relevant notes: [manual process](/doc/command-reference/move#renaming-stage-outputs) to update `dvc.yaml` and the project's cache accordingly. -[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines +[dependency graph]: /doc/user-guide/pipelines/defining-pipelines ### For displaying and comparing data science experiments diff --git a/content/docs/command-reference/stage/index.md b/content/docs/command-reference/stage/index.md index 1bb7c73939..dd6598cf27 100644 --- a/content/docs/command-reference/stage/index.md +++ b/content/docs/command-reference/stage/index.md @@ -26,4 +26,4 @@ organize data science projects, or build detailed machine learning pipelines. examine `dvc.yaml` files manually. Learn more about -[defining stages](/doc/user-guide/data-pipelines/defining-pipelines#stages). +[defining stages](/doc/user-guide/pipelines/defining-pipelines#stages). diff --git a/content/docs/start/data-management/pipelines.md b/content/docs/start/data-management/pipelines.md index d61177df46..e5259d0660 100644 --- a/content/docs/start/data-management/pipelines.md +++ b/content/docs/start/data-management/pipelines.md @@ -171,7 +171,7 @@ $ dvc stage add -n featurize \ The `dvc.yaml` file is updated automatically and should include two stages now. -[dag]: /doc/user-guide/data-pipelines/defining-pipelines +[dag]: /doc/user-guide/pipelines/defining-pipelines
diff --git a/content/docs/user-guide/basic-concepts/pipeline.md b/content/docs/user-guide/basic-concepts/pipeline.md index 8c58710ed5..bede9879f9 100644 --- a/content/docs/user-guide/basic-concepts/pipeline.md +++ b/content/docs/user-guide/basic-concepts/pipeline.md @@ -6,6 +6,5 @@ tooltip: >- YAML format ([`dvc.yaml`](/doc/user-guide/project-structure/dvcyaml-files)). This guarantees DVC can reproduce them consistently. DVC also helps automate their execution and caches their results. See [Defining - Pipelines](/doc/user-guide/data-pipelines/defining-pipelines) for more - details. + Pipelines](/doc/user-guide/pipelines/defining-pipelines) for more details. --- diff --git a/content/docs/user-guide/basic-concepts/stage.md b/content/docs/user-guide/basic-concepts/stage.md index 1a5292a367..bf429be5b9 100644 --- a/content/docs/user-guide/basic-concepts/stage.md +++ b/content/docs/user-guide/basic-concepts/stage.md @@ -6,5 +6,5 @@ tooltip: >- some milestone as part of your project's workflow. For example, `python train.py` may generate a machine learning model. DVC stages include data input(s) and resulting output(s), if any. [Learn - more](/doc/user-guide/data-pipelines/defining-pipelines#stages). + more](/doc/user-guide/pipelines/defining-pipelines#stages). --- diff --git a/content/docs/user-guide/experiment-management/running-experiments.md b/content/docs/user-guide/experiment-management/running-experiments.md index 998f1b36bc..c3dd9873c4 100644 --- a/content/docs/user-guide/experiment-management/running-experiments.md +++ b/content/docs/user-guide/experiment-management/running-experiments.md @@ -44,7 +44,7 @@ once. > 📖 `dvc exp run` is an experiment-specific alternative to `dvc repro`. 
[reproduction targets]: /doc/command-reference/repro#options -[dependency graph]: /doc/user-guide/data-pipelines/defining-pipelines +[dependency graph]: /doc/user-guide/pipelines/defining-pipelines ## Tuning (hyper)parameters diff --git a/content/docs/user-guide/pipelines/index.md b/content/docs/user-guide/pipelines/index.md index 5a9f96a823..5cfea8de38 100644 --- a/content/docs/user-guide/pipelines/index.md +++ b/content/docs/user-guide/pipelines/index.md @@ -16,4 +16,4 @@ consistent to reproduce. See [Get Started: Data Pipelines](/doc/start/data-management/pipelines) for a hands-on introduction to this topic. -[define]: /doc/user-guide/data-pipelines/defining-pipelines +[define]: /doc/user-guide/pipelines/defining-pipelines diff --git a/content/docs/user-guide/project-structure/dvcyaml-files.md b/content/docs/user-guide/project-structure/dvcyaml-files.md index 9e766ac0e5..6b2f1bbb43 100644 --- a/content/docs/user-guide/project-structure/dvcyaml-files.md +++ b/content/docs/user-guide/project-structure/dvcyaml-files.md @@ -94,6 +94,29 @@ parametrize `cmd` strings. +
+ +### 💡 Avoiding unexpected behavior + +We don't want to tell anyone how to write their code or what programs to use! +However, please be aware that in order to prevent unexpected results when DVC +reproduces pipeline stages, the underlying code should ideally follow these +rules: + +- Read/write exclusively from/to the specified dependencies and + outputs (including parameters files, metrics, and plots). +- Completely rewrite outputs. Do not append or edit. +- Stop reading and writing files when the `command` exits. + +Also, if your pipeline reproducibility goals include consistent output data, its +code should be +[deterministic](https://en.wikipedia.org/wiki/Deterministic_algorithm) (produce +the same output for any given input): avoid code that increases +[entropy](https://en.wikipedia.org/wiki/Software_entropy) (e.g. random numbers, +time functions, hardware dependencies, etc.). + +
+ ### Parameters Parameters are simple key/value pairs consumed by the `command` diff --git a/content/docs/user-guide/project-structure/internal-files.md b/content/docs/user-guide/project-structure/internal-files.md index f8b705ada2..f3f2776a83 100644 --- a/content/docs/user-guide/project-structure/internal-files.md +++ b/content/docs/user-guide/project-structure/internal-files.md @@ -168,4 +168,4 @@ run-cache to remote storage for sharing and/or as a back up. > [Avoiding unexpected behavior]). [avoiding unexpected behavior]: - /doc/user-guide/data-pipelines/defining-pipelines#avoiding-unexpected-behavior + /doc/user-guide/project-structure/dvcyaml-files#avoiding-unexpected-behavior diff --git a/content/docs/user-guide/related-technologies.md b/content/docs/user-guide/related-technologies.md index b6d0967004..c633e12d12 100644 --- a/content/docs/user-guide/related-technologies.md +++ b/content/docs/user-guide/related-technologies.md @@ -78,7 +78,7 @@ _Luigi_, etc. - See also our sister project, [CML](https://cml.dev/), that helps fill some of these gaps. -[dependency graphs]: /doc/user-guide/data-pipelines/defining-pipelines +[dependency graphs]: /doc/user-guide/pipelines/defining-pipelines ## Experiment management software @@ -133,4 +133,4 @@ _Luigi_, etc. > technical details (Linux). [directed acyclic graph]: - /doc/user-guide/data-pipelines/defining-pipelines#directed-acyclic-graph-dag + /doc/user-guide/pipelines/defining-pipelines#directed-acyclic-graph-dag diff --git a/content/docs/user-guide/what-is-dvc.md b/content/docs/user-guide/what-is-dvc.md index e695443e7a..d005c12168 100644 --- a/content/docs/user-guide/what-is-dvc.md +++ b/content/docs/user-guide/what-is-dvc.md @@ -51,7 +51,7 @@ can version experiments, manage large datasets, and make projects reproducible. 
[free]: https://github.com/iterative/dvc/blob/master/LICENSE [vs code extension]: /doc/vs-code-extension [command line]: /doc/command-reference -[pipelines]: /doc/user-guide/data-pipelines +[pipelines]: /doc/user-guide/pipelines ## DVC does not replace Git!
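
Note on the "Avoiding unexpected behavior" section that this diff moves into `dvcyaml-files.md`: its rules (read and write only declared dependencies and outputs, rewrite outputs completely, stop file I/O when the command exits, prefer deterministic code) can be illustrated with a small stage script. The following is only a sketch under stated assumptions: `run_stage`, `data.txt`, and `sample.json` are hypothetical names that do not appear in the DVC docs, and nothing here calls DVC itself.

```python
import json
import random
import tempfile
from pathlib import Path


def run_stage(dep_path: Path, out_path: Path, seed: int = 0) -> None:
    """Hypothetical stage command body: sample rows from one dependency file."""
    # Read only from the declared dependency.
    rows = dep_path.read_text().splitlines()

    # A local, seeded generator keeps the "random" sample reproducible.
    rng = random.Random(seed)
    sample = sorted(rng.sample(rows, k=min(3, len(rows))))

    # Mode "w" truncates the file, so the output is rewritten completely
    # rather than appended to or edited in place.
    with out_path.open("w") as f:
        json.dump({"seed": seed, "sample": sample}, f, indent=2)
    # The file is closed when the "with" block ends, before the command exits.


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        dep = Path(tmp) / "data.txt"      # hypothetical dependency
        out = Path(tmp) / "sample.json"   # hypothetical output
        dep.write_text("\n".join(f"row{i}" for i in range(10)))

        run_stage(dep, out)
        first = out.read_bytes()
        run_stage(dep, out)
        second = out.read_bytes()
        # Compare the two runs byte for byte.
        print("outputs identical:", first == second)
```

Passing the seed explicitly (for example from a parameters file) rather than using the global `random` state is one way to keep a stage's output stable across reproductions; it is a design choice for this sketch, not something the diff prescribes.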