From 1fd3f9f30e502a463fdd2eb7b36b0c6bab8a543d Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Sun, 9 Aug 2020 22:20:31 -0500
Subject: [PATCH 01/25] Revert "guide: undo starting How To subsection"

This reverts commit e6d5f785caebc2d5ee6ce70527554a955f66c942.

Extracted from https://github.com/iterative/dvc.org/pull/1581#discussion_r467671757
---
 content/docs/sidebar.json | 7 ++-
 .../docs/user-guide/how-to/best-practices.md | 43 +++++++++++++++++++
 .../update-tracked-files.md} | 0
 3 files changed, 49 insertions(+), 1 deletion(-)
 create mode 100644 content/docs/user-guide/how-to/best-practices.md
 rename content/docs/user-guide/{updating-tracked-files.md => how-to/update-tracked-files.md} (100%)

diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index 9cf5e1b783..f1febd1dee 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -102,7 +102,12 @@
           "katacoda": "https://katacoda.com/dvc/courses/examples/dvcignore"
         }
       },
-      "updating-tracked-files",
+      {
+        "label": "How To",
+        "slug": "how-to",
+        "source": false,
+        "children": ["update-tracked-files"]
+      },
       "setup-google-drive-remote",
       "large-dataset-optimization",
       "external-dependencies",
diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
new file mode 100644
index 0000000000..019ebca8ce
--- /dev/null
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -0,0 +1,43 @@
+# Best Practices for DVC Projects
+
+Data scientists, engineers, or managers may already know or can easily find
+answers to some of these questions. However, the variety of answers and
+approaches makes data science collaboration a nightmare. **A systematic approach
+is required.**
+
+## Questions on...
+
+### Source code and data versioning
+
+- How do you avoid discrepancies between
+  [revisions](https://git-scm.com/docs/revisions) of source code and versions of
+  data files, when the data cannot fit into a traditional repository?
+
+### Experiment time log
+
+- How do you track which of your
+  [hyperparameter]()
+  changes contributed the most to producing or improving your target
+  [metric](/doc/command-reference/metrics)? How do you monitor the degree of
+  each change?
+
+### Navigating through experiments
+
+- How do you recover a model from last week without wasting time waiting for the
+  model to retrain?
+
+- How do you quickly switch between a large dataset and a small subset without
+  modifying source code?
+
+### Reproducibility
+
+- How do you run a model's evaluation process again without retraining the model
+  and preprocessing a raw dataset?
+
+### Managing and sharing large data files
+
+- How do you share models trained in a GPU environment with colleagues who don't
+  have access to a GPU?
+
+- How do you share the entire 147 GB of your ML project, with all of its data
+  sources, intermediate data files, and models?
diff --git a/content/docs/user-guide/updating-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md
similarity index 100%
rename from content/docs/user-guide/updating-tracked-files.md
rename to content/docs/user-guide/how-to/update-tracked-files.md

From c66b888c4aa5eb180458bbca1054ae452cdd4882 Mon Sep 17 00:00:00 2001
From: imhardikj
Date: Fri, 21 Aug 2020 19:52:03 +0530
Subject: [PATCH 02/25] resolving conflict

---
 content/docs/user-guide/how-to/update-tracked-files.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/user-guide/how-to/update-tracked-files.md b/content/docs/user-guide/how-to/update-tracked-files.md
index 44ce0f16ea..e74a1d06b6 100644
--- a/content/docs/user-guide/how-to/update-tracked-files.md
+++ b/content/docs/user-guide/how-to/update-tracked-files.md
@@ -9,7 +9,7 @@ corruption when the DVC config option `cache.type` is set to `hardlink` or/and
 link types.)

 > For an example of the cache corruption problem see
-> [issue #599](https://github.com/iterative/dvc/issues/599) in our Github
+> [issue #599](https://github.com/iterative/dvc/issues/599) in our GitHub
 > repository.

 Assume `train.tsv` is tracked by DVC and you want to update it. Here updating

From 8880af5155de62d9907da9570c83d6096e3ed0a5 Mon Sep 17 00:00:00 2001
From: imhardikj
Date: Sun, 23 Aug 2020 02:08:17 +0530
Subject: [PATCH 03/25] update best practices

---
 content/docs/sidebar.json | 2 +-
 content/docs/user-guide/how-to/best-practices.md | 5 +++++
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json
index 260cee776b..bfc26205d7 100644
--- a/content/docs/sidebar.json
+++ b/content/docs/sidebar.json
@@ -102,7 +102,7 @@
         "label": "How To",
         "slug": "how-to",
         "source": false,
-        "children": ["update-tracked-files"]
+        "children": ["best-practices", "update-tracked-files"]
       },
       "setup-google-drive-remote",
       "large-dataset-optimization",
diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
index 019ebca8ce..0d5a4865e5 100644
--- a/content/docs/user-guide/how-to/best-practices.md
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -13,6 +13,11 @@ is required.**
   [revisions](https://git-scm.com/docs/revisions) of source code and versions of
   data files, when the data cannot fit into a traditional repository?

+### Experiments
+
+- How do you document progress of training different types of models on your
+  data files in the same project?
+
 ### Experiment time log

 - How do you track which of your

From 93cb0366ec3dd78f31c4d1c2b9bafcaf186ac2bb Mon Sep 17 00:00:00 2001
From: imhardikj
Date: Mon, 24 Aug 2020 22:13:35 +0530
Subject: [PATCH 04/25] Best practices update

---
 .../docs/user-guide/how-to/best-practices.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
index 0d5a4865e5..98624f9667 100644
--- a/content/docs/user-guide/how-to/best-practices.md
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -1,5 +1,28 @@
 # Best Practices for DVC Projects

+This guide provides general tips and tricks related to DVC, which can be
+utilized while working on a project. Using the practices listed here, you can
+manage your projects with DVC more efficiently.
+
+### Manually editing dvc.yaml or .dvc files
+
+It's safe to edit `dvc.yaml` and `.dvc` files. You can manually change all the
+fields present in these files. However, please keep in mind to not change the
+`md5` or `checksum` fields in `.dvc` files as they contain hash values which DVC
+uses to track the file or directory.
+
+### Using meta in dvc.yaml or .dvc files
+
+DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
+used to add any user specific information. It also supports YAML content.
+
+### Never store credentials in project config
+
+Do not store any user credentials in project config file. This file can be found
+by default in `.dvc/config`.
+
+---
+
 Data scientists, engineers, or managers may already know or can easily find
 answers to some of these questions. However, the variety of answers and
 approaches makes data science collaboration a nightmare. **A systematic approach

From 3a396544cabe39556a5668436156ce7888a1eaf7 Mon Sep 17 00:00:00 2001
From: imhardikj
Date: Wed, 26 Aug 2020 03:34:23 +0530
Subject: [PATCH 05/25] adding best pratices

---
 .../docs/user-guide/how-to/best-practices.md | 53 ++++++++++---------
 1 file changed, 28 insertions(+), 25 deletions(-)

diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
index 98624f9667..107647096b 100644
--- a/content/docs/user-guide/how-to/best-practices.md
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -1,28 +1,5 @@
 # Best Practices for DVC Projects

-This guide provides general tips and tricks related to DVC, which can be
-utilized while working on a project. Using the practices listed here, you can
-manage your projects with DVC more efficiently.
-
-### Manually editing dvc.yaml or .dvc files
-
-It's safe to edit `dvc.yaml` and `.dvc` files. You can manually change all the
-fields present in these files. However, please keep in mind to not change the
-`md5` or `checksum` fields in `.dvc` files as they contain hash values which DVC
-uses to track the file or directory.
-
-### Using meta in dvc.yaml or .dvc files
-
-DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
-used to add any user specific information. It also supports YAML content.
-
-### Never store credentials in project config
-
-Do not store any user credentials in project config file. This file can be found
-by default in `.dvc/config`.
-
----
-
 Data scientists, engineers, or managers may already know or can easily find
 answers to some of these questions. However, the variety of answers and
 approaches makes data science collaboration a nightmare. **A systematic approach
@@ -36,32 +13,54 @@ is required.**
   [revisions](https://git-scm.com/docs/revisions) of source code and versions of
   data files, when the data cannot fit into a traditional repository?

+  DVC replaces all large data files, models, etc. with small
+  [metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
+  files point to the original data, which you can access by checking out the
+  required `revision`.
+
 ### Experiments

 - How do you document progress of training different types of models on your
   data files in the same project?

+  You can make use of Git branches for each of the model and then utilise DVC
+  features while working on that branch.
+
 ### Experiment time log

 - How do you track which of your
   [hyperparameter]()
   changes contributed the most to producing or improving your target
-  [metric](/doc/command-reference/metrics)? How do you monitor the degree of
-  each change?
+  [metric](doc/command-reference/metrics)? How do you monitor the degree of each
+  change?
+
+  Hyperparameters are defined using the the `--params` option of `dvc run` and
+  the default parameters file is `params.yaml`. You can commit different
+  versions of `params.yaml` and then use `dvc metrics` or `dvc plots` to track
+  which parameter contributes most to the change.

 ### Navigating through experiments

 - How do you recover a model from last week without wasting time waiting for the
   model to retrain?

+  First you can checkout the required `revision`, followed by `dvc checkout` to
+  update DVC-tracked files and directories in your workspace.
+
 - How do you quickly switch between a large dataset and a small subset without
   modifying source code?

+  You can change dependencies of relevant stage either by using `dvc run` with
+  `-f` option or by manually editing the stage in `dvc.yaml` file.
+
 ### Reproducibility

 - How do you run a model's evaluation process again without retraining the model
   and preprocessing a raw dataset?

+  DVC provides a way to reproduce pipelines partially. You can use `dvc repro`
+  to execute evaluation stage without reproducing complete pipeline.
+
 ### Managing and sharing large data files

 - How do you share models trained in a GPU environment with colleagues who don't
@@ -69,3 +68,7 @@ is required.**

 - How do you share the entire 147 GB of your ML project, with all of its data
   sources, intermediate data files, and models?
+
+  Cloud or local storage can be used to store the project's data. You can share
+  large data files and models with others if they are stored on
+  [remote storage](doc/command-reference/remote/add#supported-storage-types).

From b2af801cc76c45e72420b219712aa68af5e1fc4c Mon Sep 17 00:00:00 2001
From: imhardikj
Date: Fri, 28 Aug 2020 01:24:45 +0530
Subject: [PATCH 06/25] modifying best pratices

---
 .../docs/user-guide/how-to/best-practices.md | 134 +++++++++++++-----
 .../docs/user-guide/how-to/tips-and-tricks.md | 10 ++
 2 files changed, 108 insertions(+), 36 deletions(-)
 create mode 100644 content/docs/user-guide/how-to/tips-and-tricks.md

diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
index 107647096b..afa6085b1d 100644
--- a/content/docs/user-guide/how-to/best-practices.md
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -1,66 +1,132 @@
 # Best Practices for DVC Projects

-Data scientists, engineers, or managers may already know or can easily find
-answers to some of these questions. However, the variety of answers and
-approaches makes data science collaboration a nightmare. **A systematic approach
-is required.**
+Asking questions on data science collaboration to data scientists, engineers, or
+managers, we'll get a variety of answers. DVC provides a systematic approach
+towards managing and collaborating on data science projects. You can manage your
+projects with DVC more efficiently using the practices listed here:

-## Questions on...
-
-### Source code and data versioning
+- Source code and data versioning

-- How do you avoid discrepancies between
+  You can use DVC to avoid discrepancies between
   [revisions](https://git-scm.com/docs/revisions) of source code and versions of
-  data files, when the data cannot fit into a traditional repository?
-
-  DVC replaces all large data files, models, etc. with small
+  data files, when the data doesn't fit into a traditional repository. DVC
+  replaces all large data files, models, etc. with small
   [metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
   files point to the original data, which you can access by checking out the
   required `revision`.

-### Experiments
+- Experiments
+
+  You can make use of Git branches to document progress of training different
+  types of models on your data files in the same project. Create a branch for
+  each of the model and then utilise DVC features while working on that branch.
+
+- Experiment time log
+
+  [Hyperparameter]()
+  are defined using the the `--params` option of `dvc run` and the default
+  parameters file is `params.yaml`. You can commit different versions of
+  `params.yaml` and then use `dvc metrics` or `dvc plots` to track which of your
+  changes contributed the most in improving target
+  [metric](doc/command-reference/metrics). You can monitor the degree of each
+  change.
+
+- Navigating through experiments
+
+  To recover a model from last week without wasting time required for the model
+  to retrain, first you can checkout the required `revision`. Followed by
+  `dvc checkout` to update DVC-tracked files and directories in your workspace.
+
+- Switching between datasets
+
+  You can quickly switch between a large dataset and a small subset without
+  modifying source code. To achieve this yoe need to change dependencies of
+  relevant stage either by using `dvc run` with `-f` option or by manually
+  editing the stage in `dvc.yaml` file.
+
+- Reproducibility
+
+  You can run a model's evaluation process again without actually retraining the
+  model and preprocessing a raw dataset. DVC provides a way to reproduce
+  pipelines partially. You can use `dvc repro` to execute evaluation stage
+  without reproducing complete pipeline:
+
+  ```dvc
+  $ dvc repro evaluate
+  ```
+
+- Managing and sharing large data files
+
+  Cloud or local storage can be used to store the project's data. You can share
+  the entire 147 GB of your ML project, with all of its data sources,
+  intermediate data files, and models with others if they are stored on
+  [remote storage](doc/command-reference/remote/add#supported-storage-types).
+  Using this you can share models trained in a GPU environment with colleagues
+  who don't have access to a GPU. Have a look at this
+  [example](doc/command-reference/pull#example-download-from-specific-remote-storage)
+  to see how this works.
+
+- Manually editing dvc.yaml or .dvc files

-- How do you document progress of training different types of models on your
-  data files in the same project?
+  It's safe to edit `dvc.yaml` and `.dvc` files. You can manually change all the
+  fields present in these files. However, please keep in mind to not change the
+  `md5` or `checksum` fields in `.dvc` files as they contain hash values which
+  DVC uses to track the file or directory.

-  You can make use of Git branches for each of the model and then utilise DVC
-  features while working on that branch.
+- Never store credentials in project config
+
+  Do not store any user credentials in project config file. This file can be
+  found by default in `.dvc/config`. Use `--local`, `--global`, or `--system`
+  command options with `dvc config` for storing sensitive, or user-specific
+  settings:
+
+  ```dvc
+  $ dvc config --system remote.username [password]
+  ```
+
+- Tracking outputs by Git
+
+  If `outs` are small files in size and you want to track them with Git then you
+  can use `--outs-no-cache` option to define outputs while creating or modifying
+  a stage. DVC will not track will not track outputs in this case:
+
+  ```dvc
+  $ dvc run -n train -d src/train.py -d data/features \
+            ---outs-no-cache model.p \
+            python src/train.py data/features model.pkl
+  ```
+
+---
+
+## Questions on...
+
+### Source code and data versioning
+
+- How do you avoid discrepancies between
+  [revisions](https://git-scm.com/docs/revisions) of source code and versions of
+  data files, when the data cannot fit into a traditional repository?

 ### Experiment time log

 - How do you track which of your
   [hyperparameter]()
   changes contributed the most to producing or improving your target
-  [metric](doc/command-reference/metrics)? How do you monitor the degree of each
-  change?
-
-  Hyperparameters are defined using the the `--params` option of `dvc run` and
-  the default parameters file is `params.yaml`. You can commit different
-  versions of `params.yaml` and then use `dvc metrics` or `dvc plots` to track
-  which parameter contributes most to the change.
+  [metric](/doc/command-reference/metrics)? How do you monitor the degree of
+  each change?

 ### Navigating through experiments

 - How do you recover a model from last week without wasting time waiting for the
   model to retrain?

-  First you can checkout the required `revision`, followed by `dvc checkout` to
-  update DVC-tracked files and directories in your workspace.
-
 - How do you quickly switch between a large dataset and a small subset without
   modifying source code?
-  You can change dependencies of relevant stage either by using `dvc run` with
-  `-f` option or by manually editing the stage in `dvc.yaml` file.
-
 ### Reproducibility

 - How do you run a model's evaluation process again without retraining the model
   and preprocessing a raw dataset?

-  DVC provides a way to reproduce pipelines partially. You can use `dvc repro`
-  to execute evaluation stage without reproducing complete pipeline.
-
 ### Managing and sharing large data files

 - How do you share models trained in a GPU environment with colleagues who don't
@@ -68,7 +134,3 @@ is required.**

 - How do you share the entire 147 GB of your ML project, with all of its data
   sources, intermediate data files, and models?
-
-  Cloud or local storage can be used to store the project's data. You can share
-  large data files and models with others if they are stored on
-  [remote storage](doc/command-reference/remote/add#supported-storage-types).
diff --git a/content/docs/user-guide/how-to/tips-and-tricks.md b/content/docs/user-guide/how-to/tips-and-tricks.md
new file mode 100644
index 0000000000..7670a13bf4
--- /dev/null
+++ b/content/docs/user-guide/how-to/tips-and-tricks.md
@@ -0,0 +1,10 @@
+# Tips and tricks for DVC Projects
+
+This guide provides general tips and tricks related to DVC, which can be
+utilized while working on a project. Using the practices listed here, you can
+manage your projects with DVC more efficiently.
+
+### Using meta in dvc.yaml or .dvc files
+
+DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
+used to add any user specific information. It also supports YAML content.
From 2994cf8584a3bb4231443ab400511a0b75ee2526 Mon Sep 17 00:00:00 2001
From: Jorge Orpinel
Date: Thu, 27 Aug 2020 17:50:25 -0500
Subject: [PATCH 07/25] Update content/docs/user-guide/how-to/best-practices.md

---
 content/docs/user-guide/how-to/best-practices.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
index afa6085b1d..b348711b27 100644
--- a/content/docs/user-guide/how-to/best-practices.md
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -19,7 +19,7 @@ projects with DVC more efficiently using the practices listed here:

   You can make use of Git branches to document progress of training different
   types of models on your data files in the same project. Create a branch for
-  each of the model and then utilise DVC features while working on that branch.
+  each of the models and then utilise DVC features while working on that branch.

 - Experiment time log

From ba02f1767215f7d88cbfe423327ab2ea0789f0dd Mon Sep 17 00:00:00 2001
From: imhardikj
Date: Sun, 30 Aug 2020 02:48:23 +0530
Subject: [PATCH 08/25] updates

---
 .../docs/user-guide/how-to/best-practices.md | 259 +++++++++---------
 .../{how-to => }/tips-and-tricks.md | 9 +-
 2 files changed, 135 insertions(+), 133 deletions(-)
 rename content/docs/user-guide/{how-to => }/tips-and-tricks.md (54%)

diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md
index afa6085b1d..ea88597044 100644
--- a/content/docs/user-guide/how-to/best-practices.md
+++ b/content/docs/user-guide/how-to/best-practices.md
@@ -1,136 +1,131 @@
 # Best Practices for DVC Projects

-Asking questions on data science collaboration to data scientists, engineers, or
-managers, we'll get a variety of answers. DVC provides a systematic approach
-towards managing and collaborating on data science projects. You can manage your
-projects with DVC more efficiently using the practices listed here:
+DVC provides a systematic approach towards managing and collaborating on data
+science projects. You can manage your projects with DVC more efficiently using
+the practices listed here:
+
+## Source code and data versioning
+
+You can use DVC to avoid discrepancies between
+[revisions](https://git-scm.com/docs/revisions) of source code and
+[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC
+replaces all large data files, models, etc. with small
+[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
+files point to the original data, which you can access by first checking out the
+required `revision` using Git followed by `dvc checkout` to update DVC tracked
+data files/dir:
+
+```dvc
+$ git checkout 95485f # Git commit of required data version
+$ dvc checkout
+```
+
+If your dataset consist of multiple files like images, etc. then the best way to
+track whole directory is with single `.dvc` file. You can use `dvc add` with
+relative path to directory:
+
+```dvc
+$ dvc add data/images
+```
+
+## Experiments and tracking parameters
+
+You can use DVC for tuning [parameters](doc/command-reference/params), improving
+target [metrics](doc/command-reference/metrics) and visualizing the changes with
+[plots](doc/command-reference/plots). In the first step tune parameters in
+default `params.yaml` file and reproduce the pipeline:
+
+```dvc
+$ dvc repro # Reproducing pipeline
+$ git add -am "Epoch Experiment"
+```
+
+Commit the new changes in files using Git. Next step is to compare the
+experiments. Use `dvc metrics` to find difference in target metric between two
+commits:
+
+```dvc
+$ dvc metrics diff rev1 rev2
+```
+
+And finally you can plot target metrics using `dvc plots`:
+
+```dvc
+$ dvc plots diff -x recall -y precision rev1 rev2
+```
+
+If you want to recover a model from last week without wasting time required for
+the model to retrain you can use DVC to navigate through your experiments. First
+you can checkout the required `revision` using Git:
+
+```dvc
+$ git checkout baseline-experiment # Git commit, tag or branch
+$ dvc checkout
+```
+
+Followed by `dvc checkout` to update DVC-tracked files and directories in your
+workspace.
+
+## Reproducibility
+
+You can run a model's evaluation process again without actually retraining the
+model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines
+partially. You can use `dvc repro` to execute evaluation stage without
+reproducing complete pipeline:
+
+```dvc
+$ dvc repro evaluate
+```
+
+## Managing and sharing large data files
+
+Cloud or local storage can be used to store the project's data. You can share
+the entire 147 GB of your ML project, with all of its data sources, intermediate
+data files, and models with others if they are stored on
+[remote storage](doc/command-reference/remote/add#supported-storage-types).
+Using this you can share models trained in a GPU environment with colleagues who
+don't have access to a GPU. Have a look at this
+[example](doc/command-reference/pull#example-download-from-specific-remote-storage)
+to see how this works.
+
+## Manually editing dvc.yaml or .dvc files
+
+It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example:
+
+```yaml
+stages:
+  prepare:
+    cmd: python src/prepare.py data/data.xml
+    deps:
+      - data/data.xml
+    params:
+      - prepare.split
+    outs:
+      - data/prepared
+```
+
+You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc`
+files please remember not to change the `md5` or `checksum` fields as they
+contain hash values which DVC uses to track the file or directory.
+
+## Never store credentials in project config
+
+Do not store any user credentials in project config file. This file can be found
+by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command
+options with `dvc config` for storing sensitive, or user-specific settings:
+
+```dvc
+$ dvc config --system remote.username [password]
+```
+
+## Tracking outputs by Git

-- Source code and data versioning
+If your `output` files are small in size and you want to track them with Git
+then you can use `--outs-no-cache` option to define outputs while creating or
+modifying a stage. DVC will not track will not track outputs in this case:

-  You can use DVC to avoid discrepancies between
-  [revisions](https://git-scm.com/docs/revisions) of source code and versions of
-  data files, when the data doesn't fit into a traditional repository. DVC
-  replaces all large data files, models, etc. with small
-  [metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). These
-  files point to the original data, which you can access by checking out the
-  required `revision`.
-
-- Experiments
-
-  You can make use of Git branches to document progress of training different
-  types of models on your data files in the same project. Create a branch for
-  each of the model and then utilise DVC features while working on that branch.
-
-- Experiment time log
-
-  [Hyperparameter]()
-  are defined using the the `--params` option of `dvc run` and the default
-  parameters file is `params.yaml`. You can commit different versions of
-  `params.yaml` and then use `dvc metrics` or `dvc plots` to track which of your
-  changes contributed the most in improving target
-  [metric](doc/command-reference/metrics). You can monitor the degree of each
-  change.
-
-- Navigating through experiments
-
-  To recover a model from last week without wasting time required for the model
-  to retrain, first you can checkout the required `revision`. Followed by
-  `dvc checkout` to update DVC-tracked files and directories in your workspace.
-
-- Switching between datasets
-
-  You can quickly switch between a large dataset and a small subset without
-  modifying source code. To achieve this yoe need to change dependencies of
-  relevant stage either by using `dvc run` with `-f` option or by manually
-  editing the stage in `dvc.yaml` file.
-
-- Reproducibility
-
-  You can run a model's evaluation process again without actually retraining the
-  model and preprocessing a raw dataset. DVC provides a way to reproduce
-  pipelines partially. You can use `dvc repro` to execute evaluation stage
-  without reproducing complete pipeline:
-
-  ```dvc
-  $ dvc repro evaluate
-  ```
-
-- Managing and sharing large data files
-
-  Cloud or local storage can be used to store the project's data. You can share
-  the entire 147 GB of your ML project, with all of its data sources,
-  intermediate data files, and models with others if they are stored on
-  [remote storage](doc/command-reference/remote/add#supported-storage-types).
-  Using this you can share models trained in a GPU environment with colleagues
-  who don't have access to a GPU. Have a look at this
-  [example](doc/command-reference/pull#example-download-from-specific-remote-storage)
-  to see how this works.
-
-- Manually editing dvc.yaml or .dvc files
-
-  It's safe to edit `dvc.yaml` and `.dvc` files. You can manually change all the
-  fields present in these files. However, please keep in mind to not change the
-  `md5` or `checksum` fields in `.dvc` files as they contain hash values which
-  DVC uses to track the file or directory.
-
-- Never store credentials in project config
-
-  Do not store any user credentials in project config file. This file can be
-  found by default in `.dvc/config`. Use `--local`, `--global`, or `--system`
-  command options with `dvc config` for storing sensitive, or user-specific
-  settings:
-
-  ```dvc
-  $ dvc config --system remote.username [password]
-  ```
-
-- Tracking outputs by Git
-
-  If `outs` are small files in size and you want to track them with Git then you
-  can use `--outs-no-cache` option to define outputs while creating or modifying
-  a stage. DVC will not track will not track outputs in this case:
-
-  ```dvc
-  $ dvc run -n train -d src/train.py -d data/features \
-            ---outs-no-cache model.p \
-            python src/train.py data/features model.pkl
-  ```
-
----
-
-## Questions on...
-
-### Source code and data versioning
-
-- How do you avoid discrepancies between
-  [revisions](https://git-scm.com/docs/revisions) of source code and versions of
-  data files, when the data cannot fit into a traditional repository?
-
-### Experiment time log
-
-- How do you track which of your
-  [hyperparameter]()
-  changes contributed the most to producing or improving your target
-  [metric](/doc/command-reference/metrics)? How do you monitor the degree of
-  each change?
-
-### Navigating through experiments
-
-- How do you recover a model from last week without wasting time waiting for the
-  model to retrain?
-
-- How do you quickly switch between a large dataset and a small subset without
-  modifying source code?
-
-### Reproducibility
-
-- How do you run a model's evaluation process again without retraining the model
-  and preprocessing a raw dataset?
-
-### Managing and sharing large data files
-
-- How do you share models trained in a GPU environment with colleagues who don't
-  have access to a GPU?
-
-- How do you share the entire 147 GB of your ML project, with all of its data
-  sources, intermediate data files, and models?
+```dvc
+$ dvc run -n train -d src/train.py -d data/features \
+          ---outs-no-cache model.p \
+          python src/train.py data/features model.pkl
+```
diff --git a/content/docs/user-guide/how-to/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md
similarity index 54%
rename from content/docs/user-guide/how-to/tips-and-tricks.md
rename to content/docs/user-guide/tips-and-tricks.md
index 7670a13bf4..35d7c989ba 100644
--- a/content/docs/user-guide/how-to/tips-and-tricks.md
+++ b/content/docs/user-guide/tips-and-tricks.md
@@ -4,7 +4,14 @@ This guide provides general tips and tricks related to DVC, which can be
 utilized while working on a project. Using the practices listed here, you can
 manage your projects with DVC more efficiently.

-### Using meta in dvc.yaml or .dvc files
+## Using meta in dvc.yaml or .dvc files

 DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be
 used to add any user specific information. It also supports YAML content.
+
+## Switching between datasets
+
+You can quickly switch between a large dataset and a small subset without
+modifying source code. To achieve this yoe need to change dependencies of
+relevant stage either by using `dvc run` with `-f` option or by manually editing
+the stage in `dvc.yaml` file.
From eb6786026405769d4f41bc2da247dc16baf13707 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Sun, 30 Aug 2020 03:36:20 +0530 Subject: [PATCH 09/25] Update best-practices.md --- content/docs/user-guide/how-to/best-practices.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md index ea88597044..37ff3d1b88 100644 --- a/content/docs/user-guide/how-to/best-practices.md +++ b/content/docs/user-guide/how-to/best-practices.md @@ -84,9 +84,9 @@ the entire 147 GB of your ML project, with all of its data sources, intermediate data files, and models with others if they are stored on [remote storage](doc/command-reference/remote/add#supported-storage-types). Using this you can share models trained in a GPU environment with colleagues who -don't have access to a GPU. Have a look at this +don't have access to a GPU. You can check this [example](doc/command-reference/pull#example-download-from-specific-remote-storage) -to see how this works. +to see how to download data from remote storage. ## Manually editing dvc.yaml or .dvc files From fb62cb1c10f2034bd03cb3dd2ca9292ed6840c14 Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Sun, 30 Aug 2020 03:45:38 +0530 Subject: [PATCH 10/25] Update best-practices.md --- content/docs/user-guide/how-to/best-practices.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md index 37ff3d1b88..ea19d3d87e 100644 --- a/content/docs/user-guide/how-to/best-practices.md +++ b/content/docs/user-guide/how-to/best-practices.md @@ -20,8 +20,8 @@ $ git checkout 95485f # Git commit of required data version $ dvc checkout ``` -If your dataset consist of multiple files like images, etc. then the best way to -track whole directory is with single `.dvc` file. 
You can use `dvc add` with +If your dataset consist of multiple files like images, etc., then the best way +to track whole directory is with single `.dvc` file. You can use `dvc add` with relative path to directory: ```dvc From 8121897759c6c4a0a8398ad78f4e2d4498b17e67 Mon Sep 17 00:00:00 2001 From: imhardikj Date: Sat, 12 Sep 2020 22:06:20 +0530 Subject: [PATCH 11/25] removing best practice doc --- content/docs/sidebar.json | 2 +- .../docs/user-guide/how-to/best-practices.md | 131 ------------------ content/docs/user-guide/tips-and-tricks.md | 17 --- 3 files changed, 1 insertion(+), 149 deletions(-) delete mode 100644 content/docs/user-guide/how-to/best-practices.md delete mode 100644 content/docs/user-guide/tips-and-tricks.md diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index bfc26205d7..260cee776b 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -102,7 +102,7 @@ "label": "How To", "slug": "how-to", "source": false, - "children": ["best-practices", "update-tracked-files"] + "children": ["update-tracked-files"] }, "setup-google-drive-remote", "large-dataset-optimization", diff --git a/content/docs/user-guide/how-to/best-practices.md b/content/docs/user-guide/how-to/best-practices.md deleted file mode 100644 index ea19d3d87e..0000000000 --- a/content/docs/user-guide/how-to/best-practices.md +++ /dev/null @@ -1,131 +0,0 @@ -# Best Practices for DVC Projects - -DVC provides a systematic approach towards managing and collaborating on data -science projects. You can manage your projects with DVC more efficiently using -the practices listed here: - -## Source code and data versioning - -You can use DVC to avoid discrepancies between -[revisions](https://git-scm.com/docs/revisions) of source code and -[versions](/doc/use-cases/versioning-data-and-model-files) of data files. DVC -replaces all large data files, models, etc. with small -[metafiles](doc/user-guide/dvc-files-and-directories) (tracked by Git). 
These -files point to the original data, which you can access by first checking out the -required `revision` using Git followed by `dvc checkout` to update DVC tracked -data files/dir: - -```dvc -$ git checkout 95485f # Git commit of required data version -$ dvc checkout -``` - -If your dataset consist of multiple files like images, etc., then the best way -to track whole directory is with single `.dvc` file. You can use `dvc add` with -relative path to directory: - -```dvc -$ dvc add data/images -``` - -## Experiments and tracking parameters - -You can use DVC for tuning [parameters](doc/command-reference/params), improving -target [metrics](doc/command-reference/metrics) and visualizing the changes with -[plots](doc/command-reference/plots). In the first step tune parameters in -default `params.yaml` file and reproduce the pipeline: - -```dvc -$ dvc repro # Reproducing pipeline -$ git add -am "Epoch Experiment" -``` - -Commit the new changes in files using Git. Next step is to compare the -experiments. Use `dvc metrics` to find difference in target metric between two -commits: - -```dvc -$ dvc metrics diff rev1 rev2 -``` - -And finally you can plot target metrics using `dvc plots`: - -```dvc -$ dvc plots diff -x recall -y precision rev1 rev2 -``` - -If you want to recover a model from last week without wasting time required for -the model to retrain you can use DVC to navigate through your experiments. First -you can checkout the required `revision` using Git: - -```dvc -$ git checkout baseline-experiment # Git commit, tag or branch -$ dvc checkout -``` - -Followed by `dvc checkout` to update DVC-tracked files and directories in your -workspace. - -## Reproducibility - -You can run a model's evaluation process again without actually retraining the -model and preprocessing a raw dataset. DVC provides a way to reproduce pipelines -partially. 
You can use `dvc repro` to execute evaluation stage without -reproducing complete pipeline: - -```dvc -$ dvc repro evaluate -``` - -## Managing and sharing large data files - -Cloud or local storage can be used to store the project's data. You can share -the entire 147 GB of your ML project, with all of its data sources, intermediate -data files, and models with others if they are stored on -[remote storage](doc/command-reference/remote/add#supported-storage-types). -Using this you can share models trained in a GPU environment with colleagues who -don't have access to a GPU. You can check this -[example](doc/command-reference/pull#example-download-from-specific-remote-storage) -to see how to download data from remote storage. - -## Manually editing dvc.yaml or .dvc files - -It's safe to edit `dvc.yaml` and `.dvc` files. Here's a `dvc.yaml` example: - -```yaml -stages: - prepare: - cmd: python src/prepare.py data/data.xml - deps: - - data/data.xml - params: - - prepare.split - outs: - - data/prepared -``` - -You can manually edit all the fields present in `dvc.yaml`. However, in `.dvc` -files please remember not to change the `md5` or `checksum` fields as they -contain hash values which DVC uses to track the file or directory. - -## Never store credentials in project config - -Do not store any user credentials in project config file. This file can be found -by default in `.dvc/config`. Use `--local`, `--global`, or `--system` command -options with `dvc config` for storing sensitive, or user-specific settings: - -```dvc -$ dvc config --system remote.username [password] -``` - -## Tracking outputs by Git - -If your `output` files are small in size and you want to track them with Git -then you can use `--outs-no-cache` option to define outputs while creating or -modifying a stage. 
DVC will not track will not track outputs in this case: - -```dvc -$ dvc run -n train -d src/train.py -d data/features \ - ---outs-no-cache model.p \ - python src/train.py data/features model.pkl -``` diff --git a/content/docs/user-guide/tips-and-tricks.md b/content/docs/user-guide/tips-and-tricks.md deleted file mode 100644 index 35d7c989ba..0000000000 --- a/content/docs/user-guide/tips-and-tricks.md +++ /dev/null @@ -1,17 +0,0 @@ -# Tips and tricks for DVC Projects - -This guide provides general tips and tricks related to DVC, which can be -utilized while working on a project. Using the practices listed here, you can -manage your projects with DVC more efficiently. - -## Using meta in dvc.yaml or .dvc files - -DVC provides an optional `meta` field in `dvc.yaml` and `.dvc` file. It can be -used to add any user specific information. It also supports YAML content. - -## Switching between datasets - -You can quickly switch between a large dataset and a small subset without -modifying source code. To achieve this yoe need to change dependencies of -relevant stage either by using `dvc run` with `-f` option or by manually editing -the stage in `dvc.yaml` file. 
From a3a583744e6296fbb7816d12c3df03c7b59e4558 Mon Sep 17 00:00:00 2001 From: imhardikj Date: Fri, 18 Sep 2020 17:58:06 +0530 Subject: [PATCH 12/25] Undo dvc add doc --- content/docs/sidebar.json | 8 +++- .../docs/user-guide/how-to/undo-dvc-add.md | 47 +++++++++++++++++++ 2 files changed, 54 insertions(+), 1 deletion(-) create mode 100644 content/docs/user-guide/how-to/undo-dvc-add.md diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 260cee776b..1c93f2b0b1 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -102,7 +102,13 @@ "label": "How To", "slug": "how-to", "source": false, - "children": ["update-tracked-files"] + "children": [ + { + "label": "Undo dvc add", + "slug": "undo-dvc-add" + }, + "update-tracked-files" + ] }, "setup-google-drive-remote", "large-dataset-optimization", diff --git a/content/docs/user-guide/how-to/undo-dvc-add.md b/content/docs/user-guide/how-to/undo-dvc-add.md new file mode 100644 index 0000000000..94fed2e550 --- /dev/null +++ b/content/docs/user-guide/how-to/undo-dvc-add.md @@ -0,0 +1,47 @@ +# Undo dvc add + +There are situations where you `dvc add` a data file by mistake and want DVC to +stop tracking that file. Follow the steps listed here to undo `dvc add`. + +Let us first add a file into our example project: + +```dvc +$ dvc add data.csv +$ tree +. +├── data.csv +└── data.csv.dvc +``` + +As you can see `dvc add` creates a `.dvc` file to track the added data. Now +let's reverse this action. + +In the first step, if you are using `symlink` or `hardlink` as +[link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +type for DVC cache, you will have to unprotect the tracked file +(see `dvc unprotect`): + +```dvc +$ dvc unprotect data.csv +``` + +Next, remove the corresponding `.dvc` file and `.gitignore` entry using +`dvc remove`: + +```dvc +$ dvc remove data.csv.dvc +``` + +Data file `data.csv` is now no longer being tracked by DVC. 
+ +```dvc +$ git status + Untracked files: + data/data.xml +``` + +You can run `dvc gc` to remove the unused file contents from the cache. + +```dvc +$ dvc gc -w +``` From c030f4f1cc96a3c3744ef4491e5f310cfd0c2297 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Sat, 19 Sep 2020 14:04:21 -0500 Subject: [PATCH 13/25] Update content/docs/user-guide/how-to/undo-dvc-add.md --- content/docs/user-guide/how-to/undo-dvc-add.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/user-guide/how-to/undo-dvc-add.md b/content/docs/user-guide/how-to/undo-dvc-add.md index 94fed2e550..cdf899a630 100644 --- a/content/docs/user-guide/how-to/undo-dvc-add.md +++ b/content/docs/user-guide/how-to/undo-dvc-add.md @@ -3,7 +3,7 @@ There are situations where you `dvc add` a data file by mistake and want DVC to stop tracking that file. Follow the steps listed here to undo `dvc add`. -Let us first add a file into our example project: +Lets first add a data file into our example project: ```dvc $ dvc add data.csv From f0e4c79e25575b25184baf913ead16a67984ee12 Mon Sep 17 00:00:00 2001 From: imhardikj Date: Mon, 21 Sep 2020 01:18:23 +0530 Subject: [PATCH 14/25] updates --- content/docs/sidebar.json | 8 +--- .../user-guide/how-to/undo-adding-data.md | 46 ++++++++++++++++++ .../docs/user-guide/how-to/undo-dvc-add.md | 47 ------------------- 3 files changed, 47 insertions(+), 54 deletions(-) create mode 100644 content/docs/user-guide/how-to/undo-adding-data.md delete mode 100644 content/docs/user-guide/how-to/undo-dvc-add.md diff --git a/content/docs/sidebar.json b/content/docs/sidebar.json index 1c93f2b0b1..59cadea92a 100644 --- a/content/docs/sidebar.json +++ b/content/docs/sidebar.json @@ -102,13 +102,7 @@ "label": "How To", "slug": "how-to", "source": false, - "children": [ - { - "label": "Undo dvc add", - "slug": "undo-dvc-add" - }, - "update-tracked-files" - ] + "children": ["undo-adding-data", "update-tracked-files"] }, "setup-google-drive-remote", 
"large-dataset-optimization", diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md new file mode 100644 index 0000000000..3bc9f9c7d7 --- /dev/null +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -0,0 +1,46 @@ +# Undo Adding Data + +There are situations where you `dvc add` a data file and want DVC to stop +tracking that file. Follow the steps listed here to undo `dvc add`. + +Let's first add a data file into an example project using +`dvc add`, which creates a `.dvc` file to track the added data: + +```dvc +$ dvc add data.csv +$ ls +data.csv data.csv.dvc +``` + +Now let's reverse this action. + +> Note,if you are using `symlink` or `hardlink` as +> [link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +> type for DVC cache, you will have to unprotect the tracked file +> (see `dvc unprotect`): +> +> ```dvc +> $ dvc unprotect data.csv +> ``` + +You'll need to remove the corresponding `.dvc` file and `.gitignore` entry using +`dvc remove`: + +```dvc +$ dvc remove data.csv.dvc +``` + +Data file `data.csv` is now no longer being tracked by DVC. + +```dvc +$ git status + Untracked files: + data.csv +``` + +You can run `dvc gc` with the `-w` option to remove the data from the cache that +isn't referenced in the current workspace: + +```dvc +$ dvc gc -w +``` diff --git a/content/docs/user-guide/how-to/undo-dvc-add.md b/content/docs/user-guide/how-to/undo-dvc-add.md deleted file mode 100644 index cdf899a630..0000000000 --- a/content/docs/user-guide/how-to/undo-dvc-add.md +++ /dev/null @@ -1,47 +0,0 @@ -# Undo dvc add - -There are situations where you `dvc add` a data file by mistake and want DVC to -stop tracking that file. Follow the steps listed here to undo `dvc add`. - -Lets first add a data file into our example project: - -```dvc -$ dvc add data.csv -$ tree -. 
-├── data.csv -└── data.csv.dvc -``` - -As you can see `dvc add` creates a `.dvc` file to track the added data. Now -let's reverse this action. - -In the first step, if you are using `symlink` or `hardlink` as -[link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -type for DVC cache, you will have to unprotect the tracked file -(see `dvc unprotect`): - -```dvc -$ dvc unprotect data.csv -``` - -Next, remove the corresponding `.dvc` file and `.gitignore` entry using -`dvc remove`: - -```dvc -$ dvc remove data.csv.dvc -``` - -Data file `data.csv` is now no longer being tracked by DVC. - -```dvc -$ git status - Untracked files: - data/data.xml -``` - -You can run `dvc gc` to remove the unused file contents from the cache. - -```dvc -$ dvc gc -w -``` From da456cd37335d9c5a0d1142d7225e56dbd97ab6f Mon Sep 17 00:00:00 2001 From: imhardikj Date: Mon, 21 Sep 2020 01:43:27 +0530 Subject: [PATCH 15/25] updates --- content/docs/command-reference/add.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index e2146257b1..d9bd9640ce 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -58,6 +58,8 @@ files that can be easily tracked with Git. It's possible to prevent files or directories from being added by DVC by adding the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file. +You can also undo the action of adding files or directories using `dvc add` by +following this [guide](/docs/user-guide/how-to/undo-adding-data). 
By default, DVC tries to use reflinks (see [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) From 288b6273eb1d2c78a9887973419df0dcc6d4d563 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Sep 2020 20:10:36 -0500 Subject: [PATCH 16/25] Update content/docs/user-guide/how-to/undo-adding-data.md --- content/docs/user-guide/how-to/undo-adding-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 3bc9f9c7d7..51d4e67dc6 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -4,7 +4,7 @@ There are situations where you `dvc add` a data file and want DVC to stop tracking that file. Follow the steps listed here to undo `dvc add`. Let's first add a data file into an example project using -`dvc add`, which creates a `.dvc` file to track the added data: +`dvc add`, which creates a `.dvc` file to track the data: ```dvc $ dvc add data.csv From 7e3ac80da8da961e2b6afd3873dd75bf8cca2348 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Sep 2020 20:11:23 -0500 Subject: [PATCH 17/25] Update content/docs/user-guide/how-to/undo-adding-data.md --- content/docs/user-guide/how-to/undo-adding-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 51d4e67dc6..6e0fe16477 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -17,7 +17,7 @@ Now let's reverse this action. 
> Note,if you are using `symlink` or `hardlink` as > [link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) > type for DVC cache, you will have to unprotect the tracked file -> (see `dvc unprotect`): +> first (see `dvc unprotect`): > > ```dvc > $ dvc unprotect data.csv From 3139491813fd1381a4df3a9c6e7829dd39557eaa Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Tue, 22 Sep 2020 20:11:49 -0500 Subject: [PATCH 18/25] Update content/docs/user-guide/how-to/undo-adding-data.md --- content/docs/user-guide/how-to/undo-adding-data.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 6e0fe16477..29a1042a9f 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -12,7 +12,7 @@ $ ls data.csv data.csv.dvc ``` -Now let's reverse this action. +Now let's reverse this action: > Note,if you are using `symlink` or `hardlink` as > [link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) From dfb98240a0259098ef014c72da9e6a19eeda84ae Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Thu, 24 Sep 2020 19:11:47 +0530 Subject: [PATCH 19/25] updates --- content/docs/command-reference/add.md | 5 +++-- content/docs/user-guide/how-to/undo-adding-data.md | 4 ++-- 2 files changed, 5 insertions(+), 4 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index d9bd9640ce..059c60ac14 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -58,8 +58,9 @@ files that can be easily tracked with Git. It's possible to prevent files or directories from being added by DVC by adding the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file. 
-You can also undo the action of adding files or directories using `dvc add` by -following this [guide](/docs/user-guide/how-to/undo-adding-data). + +You can also [undo 'dvc add'](/docs/user-guide/how-to/undo-adding-data) to stop +tracking added files or directories. By default, DVC tries to use reflinks (see [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 29a1042a9f..7564a426bf 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -38,8 +38,8 @@ $ git status data.csv ``` -You can run `dvc gc` with the `-w` option to remove the data from the cache that -isn't referenced in the current workspace: +You can run `dvc gc` with the `-w` option to remove the data that isn't +referenced in the current workspace from the cache: ```dvc $ dvc gc -w From 33504c962d0ee0c37f4583a88a376fb7e3f47cda Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Thu, 24 Sep 2020 19:17:03 +0530 Subject: [PATCH 20/25] updates --- content/docs/command-reference/add.md | 2 +- content/docs/user-guide/how-to/undo-adding-data.md | 4 ++-- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 059c60ac14..f6959fb53e 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -60,7 +60,7 @@ It's possible to prevent files or directories from being added by DVC by adding the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file. You can also [undo 'dvc add'](/docs/user-guide/how-to/undo-adding-data) to stop -tracking added files or directories. +tracking files or directories added previously. 
By default, DVC tries to use reflinks (see [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 7564a426bf..140ee6592f 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -1,7 +1,7 @@ # Undo Adding Data -There are situations where you `dvc add` a data file and want DVC to stop -tracking that file. Follow the steps listed here to undo `dvc add`. +There are situations where you want to stop tracking data added previously. +Follow the steps listed here to undo `dvc add`. Let's first add a data file into an example project using `dvc add`, which creates a `.dvc` file to track the data: From 8e15350d72b1fb6aff872c2f75916a701d5c4f80 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 24 Sep 2020 17:37:12 -0500 Subject: [PATCH 21/25] Update content/docs/command-reference/add.md --- content/docs/command-reference/add.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index f6959fb53e..35374e3e26 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -59,7 +59,7 @@ files that can be easily tracked with Git. It's possible to prevent files or directories from being added by DVC by adding the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file. -You can also [undo 'dvc add'](/docs/user-guide/how-to/undo-adding-data) to stop +You can also [undo `dvc add`](/docs/user-guide/how-to/undo-adding-data) to stop tracking files or directories added previously. 
By default, DVC tries to use reflinks (see From d5f422d53592d1ffda8554e586630f53cefe42c0 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Thu, 24 Sep 2020 17:37:34 -0500 Subject: [PATCH 22/25] Update content/docs/command-reference/add.md --- content/docs/command-reference/add.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/add.md b/content/docs/command-reference/add.md index 35374e3e26..128376b028 100644 --- a/content/docs/command-reference/add.md +++ b/content/docs/command-reference/add.md @@ -60,7 +60,7 @@ It's possible to prevent files or directories from being added by DVC by adding the corresponding patterns in a [`.dvcignore`](/doc/user-guide/dvcignore) file. You can also [undo `dvc add`](/docs/user-guide/how-to/undo-adding-data) to stop -tracking files or directories added previously. +tracking files or directories. By default, DVC tries to use reflinks (see [File link types](/doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) From e9edbddf044353bd162c1a30b6de7c229ee52ced Mon Sep 17 00:00:00 2001 From: Hardik Jaroli Date: Fri, 25 Sep 2020 17:04:48 +0530 Subject: [PATCH 23/25] updates --- content/docs/user-guide/how-to/undo-adding-data.md | 8 +++----- 1 file changed, 3 insertions(+), 5 deletions(-) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 140ee6592f..317a057218 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -12,9 +12,7 @@ $ ls data.csv data.csv.dvc ``` -Now let's reverse this action: - -> Note,if you are using `symlink` or `hardlink` as +> Note, if you are using `symlink` or `hardlink` as > [link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) > type for DVC cache, you will have to unprotect the tracked file > first (see `dvc unprotect`): @@ -23,8 +21,8 @@ Now let's reverse this action: > $ dvc 
unprotect data.csv > ``` -You'll need to remove the corresponding `.dvc` file and `.gitignore` entry using -`dvc remove`: +Now let's reverse `dvc add`. You'll need to remove the corresponding `.dvc` file +and `.gitignore` entry using `dvc remove`: ```dvc $ dvc remove data.csv.dvc From 5bfd2c870bcaaed1d5afc712d75d55b0da1f76bc Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 28 Sep 2020 00:07:52 -0500 Subject: [PATCH 24/25] Update content/docs/user-guide/how-to/undo-adding-data.md --- content/docs/user-guide/how-to/undo-adding-data.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index 317a057218..bd43e00135 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -21,8 +21,8 @@ data.csv data.csv.dvc > $ dvc unprotect data.csv > ``` -Now let's reverse `dvc add`. You'll need to remove the corresponding `.dvc` file -and `.gitignore` entry using `dvc remove`: +Now let's reverse `dvc add` by removing the corresponding `.dvc` file and +`.gitignore` entry using `dvc remove`: ```dvc $ dvc remove data.csv.dvc From c7f30b725009b52e3d815dce5e0fa6121bce23b8 Mon Sep 17 00:00:00 2001 From: Jorge Orpinel Date: Mon, 28 Sep 2020 00:11:16 -0500 Subject: [PATCH 25/25] Update content/docs/user-guide/how-to/undo-adding-data.md --- content/docs/user-guide/how-to/undo-adding-data.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/content/docs/user-guide/how-to/undo-adding-data.md b/content/docs/user-guide/how-to/undo-adding-data.md index bd43e00135..1d3935cad4 100644 --- a/content/docs/user-guide/how-to/undo-adding-data.md +++ b/content/docs/user-guide/how-to/undo-adding-data.md @@ -13,9 +13,9 @@ data.csv data.csv.dvc ``` > Note, if you are using `symlink` or `hardlink` as -> [link](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) -> type 
for DVC cache, you will have to unprotect the tracked file -> first (see `dvc unprotect`): +> [link type](doc/user-guide/large-dataset-optimization#file-link-types-for-the-dvc-cache) +> for DVC cache, you will have to unprotect the tracked file first +> (see `dvc unprotect`): > > ```dvc > $ dvc unprotect data.csv