From 21572ebab5e12a53ae0cbee8c89d810c8452b726 Mon Sep 17 00:00:00 2001 From: Saugat Pachhai Date: Tue, 25 Feb 2020 20:41:38 +0545 Subject: [PATCH 1/8] cmd-ref: document new gc behavior --- public/static/docs/command-reference/gc.md | 63 ++++++++++++++-------- 1 file changed, 41 insertions(+), 22 deletions(-) diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index 38b33f6eb9..2ad3cac259 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -5,7 +5,8 @@ Remove unused objects from cache or remote storage. ## Synopsis ```usage -usage: dvc gc [-h] [-q | -v] [-a] [-T] [-c] [-r ] +usage: dvc gc [-h] [-q | -v] + [-w] [-a] [-T] [--all-commits] [-c] [-r ] [-f] [-j ] [-p [ [ ...]]] ``` @@ -14,17 +15,25 @@ usage: dvc gc [-h] [-q | -v] [-a] [-T] [-c] [-r ] This command deletes (garbage collects) data files or directories that may exist in the cache (or [remote storage](/doc/command-reference/remote) if `-c` is used) but no longer referenced in [DVC-files](/doc/user-guide/dvc-file-format) -currently in the workspace. By default, this command only cleans up -the local cache, which is typically located on the same machine as the project -in question. This usually helps to free up disk space. +currently in the workspace. To avoid accidentally deleting data, +this command requires the explicit use of [option](#options) flags to determine +its behavior (i.e. what "garbage" to collect). -There are important things to note when using Git to version the -project: +By default, this command won't delete anything at all to make it safe and +explicit. However, you can use different flags to change the behavior. + +Using the `--workspace` or `-w` option, it will only clean up the local cache, +which is typically located on the same machine as the DVC project +in question. This is an aggressive behavior that usually helps to free up disk +space. 
+ +There are important things to note when using Git to version the project: - If the cache/remote holds several versions of the same data, all except the current one will be deleted. -- Use the `--all-branches` or `--all-tags` options to avoid collecting data - referenced in the tips of all branches or all tags, respectively. +- Use the `--all-branches`/`--all-tags`/`--all-commits` options to avoid + collecting data referenced in the tips of all branches or all tags, + respectively. The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used. @@ -36,14 +45,23 @@ restored using `dvc fetch`, as long as they have previously been uploaded with ## Options -- `-a`, `--all-branches` - keep cached objects referenced in all Git branches. - Useful for keeping data for all the latest experiment versions. It's - recommended to consider including this option when using `-c` i.e. - `dvc gc -ac`. +- `-a`, `--all-branches` - keep cached objects referenced in all Git branches as + well as in the workspace (implies `-w`). Useful for keeping data for all the + latest experiment versions. It's recommended to consider including this option + when using `-c` i.e. `dvc gc -ac`. + +- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well + as in the workspace (implies `-w`). Useful if tags are used to track + "checkpoints" of an experiment or project. Note that both options can be + combined, for example using the `-aT` flag. + +- `--all-commits` - the same as `-a` or `-T` above, but applies to Git commits + as well as in the workspace (implies `-w`). Useful for keeping data for all + experiment versions ever used in the history of the project. -- `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags. It's - useful if tags are used to track "checkpoints" of an experiment or project. - Note that both options can be combined, for example using the `-aT` flag. 
+- `-w`, `--workspace` - remove files in local cache that are not referenced in + the workspace. **This behavior is dangerous.** This option is enabled + automatically if `--all-tags` or `--all-branches` are used. - `-p `, `--projects ` - if a single remote or a single cache is shared among different projects (e.g. a configuration like the one described @@ -51,10 +69,10 @@ restored using `dvc fetch`, as long as they have previously been uploaded with specify a list of them (each project is a path) to keep data that is currently referenced from them. -- `-c`, `--cloud` - also remove files in remote storage. _This operation is - dangerous._ It removes datasets, models, other files that are not linked in - the current commit (unless `-a` or `-T` are also used). The default remote is - used unless a specific one is given with `-r`. +- `-c`, `--cloud` - remove files in remote storage in addition to local cache. + **This behavior is dangerous.** It removes datasets, models or other files + that are not linked in the current commit (unless `-a` or `-T` are also used). + The default remote is used unless a specific one is given with `-r`. 
- `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) to collect unused objects from @@ -83,11 +101,12 @@ $ du -sh .dvc/cache/ 7.4G .dvc/cache/ ``` -When you run `dvc gc` it removes all objects from cache that are not referenced -in the workspace (by collecting hash values from the DVC-files): +When you run `dvc gc --workspace`, DVC removes all objects from cache that are +not referenced in the workspace (by collecting hash values from the +DVC-files): ```dvc -$ dvc gc +$ dvc gc --workspace '.dvc/cache/27e30965256ed4d3e71c2bf0c4caad2e' was removed '.dvc/cache/2e006be822767e8ba5d73ebad49ef082' was removed From 50156f1a17f4c6bccebad96e1968a424c0bcd981 Mon Sep 17 00:00:00 2001 From: Saugat Pachhai Date: Tue, 17 Mar 2020 18:27:49 +0545 Subject: [PATCH 2/8] mention about -w as scope specifier for garbage collection --- public/static/docs/command-reference/gc.md | 31 +++++++++------------- 1 file changed, 12 insertions(+), 19 deletions(-) diff --git a/public/static/docs/command-reference/gc.md b/public/static/docs/command-reference/gc.md index 2ad3cac259..a8153ea109 100644 --- a/public/static/docs/command-reference/gc.md +++ b/public/static/docs/command-reference/gc.md @@ -19,21 +19,13 @@ currently in the workspace. To avoid accidentally deleting data, this command requires the explicit use of [option](#options) flags to determine its behavior (i.e. what "garbage" to collect). -By default, this command won't delete anything at all to make it safe and -explicit. However, you can use different flags to change the behavior. - -Using the `--workspace` or `-w` option, it will only clean up the local cache, -which is typically located on the same machine as the DVC project -in question. This is an aggressive behavior that usually helps to free up disk -space. - There are important things to note when using Git to version the project: - If the cache/remote holds several versions of the same data, all except the current one will be deleted. 
-- Use the `--all-branches`/`--all-tags`/`--all-commits` options to avoid - collecting data referenced in the tips of all branches or all tags, - respectively. +- Use the `--workspace/--all-branches`/`--all-tags`/`--all-commits` options to + avoid collecting data referenced in the workspace, tips of all branches or all + tags or all of the commits, respectively. The default remote is used (see `dvc config core.remote`) unless the `--remote` option is used. @@ -45,6 +37,10 @@ restored using `dvc fetch`, as long as they have previously been uploaded with ## Options +- `-w`, `--workspace` - keep cached objects referenced in the current workspace + This option is enabled automatically if + `--all-tags`/`--all-branches`/`--all-commits` are used. + - `-a`, `--all-branches` - keep cached objects referenced in all Git branches as well as in the workspace (implies `-w`). Useful for keeping data for all the latest experiment versions. It's recommended to consider including this option @@ -59,20 +55,17 @@ restored using `dvc fetch`, as long as they have previously been uploaded with as well as in the workspace (implies `-w`). Useful for keeping data for all experiment versions ever used in the history of the project. -- `-w`, `--workspace` - remove files in local cache that are not referenced in - the workspace. **This behavior is dangerous.** This option is enabled - automatically if `--all-tags` or `--all-branches` are used. - - `-p `, `--projects ` - if a single remote or a single cache is shared among different projects (e.g. a configuration like the one described [here](/doc/use-cases/shared-development-server)), this option can be used to specify a list of them (each project is a path) to keep data that is currently referenced from them. -- `-c`, `--cloud` - remove files in remote storage in addition to local cache. 
- **This behavior is dangerous.** It removes datasets, models or other files - that are not linked in the current commit (unless `-a` or `-T` are also used). - The default remote is used unless a specific one is given with `-r`. +- `-c`, `--cloud` - remove files in remote storage in addition to the local + cache. **This behavior is dangerous.** It removes datasets, models or other + files that are not referenced in the current workspace (unless + `-a`/`-T`/`--all-commits` are also used). The default remote is used unless a + specific one is given with `-r`. - `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) to collect unused objects from From e53027159ff10988f947c3281284234e645a2eda Mon Sep 17 00:00:00 2001 From: Ivan Shcheklein Date: Tue, 17 Mar 2020 22:23:17 -0700 Subject: [PATCH 3/8] apply some corrections to the latest gc doc --- content/docs/command-reference/gc.md | 60 +++++++++++++++------------- 1 file changed, 32 insertions(+), 28 deletions(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index a8153ea109..dcc7d73b65 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -1,6 +1,7 @@ # gc -Remove unused objects from cache or remote storage. +Remove unused files and directories from cache or +[remote storage](/doc/command-reference/remote). ## Synopsis @@ -12,39 +13,44 @@ usage: dvc gc [-h] [-q | -v] ## Description -This command deletes (garbage collects) data files or directories that may exist -in the cache (or [remote storage](/doc/command-reference/remote) if `-c` is -used) but no longer referenced in [DVC-files](/doc/user-guide/dvc-file-format) -currently in the workspace. To avoid accidentally deleting data, -this command requires the explicit use of [option](#options) flags to determine -its behavior (i.e. what "garbage" to collect). 
+This command deletes (garbage collects) data files or directories that exist in +DVC cache but are no longer needed. With `--cloud` it also removes data in +[remote storage](/doc/command-reference/remote). -There are important things to note when using Git to version the project: +To avoid accidentally deleting data, it raises an error and doesn't touch any +files if no scope options are provided. It means it's user's responsibility to +explicitly provide the right set of options to specify what data is still in-use +(so that DVC can figure out what files can be safely deleted). -- If the cache/remote holds several versions of the same data, all except the - current one will be deleted. -- Use the `--workspace/--all-branches`/`--all-tags`/`--all-commits` options to - avoid collecting data referenced in the workspace, tips of all branches or all - tags or all of the commits, respectively. +One of the scope options - `--workspace`, `--all-branches`, `--all-tags`, or +`--all-commits`, or any combination of them must be provided. Each of them +corresponds to the current workspace _and_ a set of commits to analyze what +files, directories and what versions are still needed and should be kept (by +analyzing DVC-files in those commits). -The default remote is used (see `dvc config core.remote`) unless the `--remote` -option is used. - -Unless the `--cloud` (`-c`) option is used, `dvc gc` does not remove data files -from any remote. This means that any files collected from the local cache can be +Unless the `--cloud` option is used, `dvc gc` does not remove data files from +any remote. This means that any files collected from the local cache can be restored using `dvc fetch`, as long as they have previously been uploaded with `dvc push`. +### Removing data in remote storage + +If `--cloud` option is provided, command deletes unused data not only in local +DVC cache, but also in remote storage. 
It means it can be dangerous since in +most cases removing data locally and in remote storage is irreversible. + +The default remote is cleaned (see `dvc config core.remote`) unless the +`--remote` option is used. + ## Options -- `-w`, `--workspace` - keep cached objects referenced in the current workspace - This option is enabled automatically if - `--all-tags`/`--all-branches`/`--all-commits` are used. +- `-w`, `--workspace` - keep cached objects _only_ referenced in the current + workspace This option is enabled automatically if `--all-tags`, + `--all-branches`, or `--all-commits` are used. - `-a`, `--all-branches` - keep cached objects referenced in all Git branches as well as in the workspace (implies `-w`). Useful for keeping data for all the - latest experiment versions. It's recommended to consider including this option - when using `-c` i.e. `dvc gc -ac`. + latest experiment versions if branches are used to track those. - `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well as in the workspace (implies `-w`). Useful if tags are used to track @@ -61,11 +67,9 @@ restored using `dvc fetch`, as long as they have previously been uploaded with specify a list of them (each project is a path) to keep data that is currently referenced from them. -- `-c`, `--cloud` - remove files in remote storage in addition to the local - cache. **This behavior is dangerous.** It removes datasets, models or other - files that are not referenced in the current workspace (unless - `-a`/`-T`/`--all-commits` are also used). The default remote is used unless a - specific one is given with `-r`. +- `-c`, `--cloud` - remove files in remote storage in addition to local cache. + **This option is dangerous.** The default remote is used unless a specific one + is given with `-r`. 
- `-r `, `--remote ` - name of the [remote storage](/doc/command-reference/remote) to collect unused objects from From e4f1b3c3e5643b34a9eb1075c1c8f5d141f055bf Mon Sep 17 00:00:00 2001 From: Ivan Shcheklein Date: Wed, 18 Mar 2020 15:51:37 -0700 Subject: [PATCH 4/8] Update content/docs/command-reference/gc.md Co-Authored-By: Jorge Orpinel --- content/docs/command-reference/gc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index dcc7d73b65..7e52fd7e5c 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -22,7 +22,7 @@ files if no scope options are provided. It means it's user's responsibility to explicitly provide the right set of options to specify what data is still in-use (so that DVC can figure out what files can be safely deleted). -One of the scope options - `--workspace`, `--all-branches`, `--all-tags`, or +One of the scope options, `--workspace`, `--all-branches`, `--all-tags`, `--all-commits`, or any combination of them must be provided. Each of them corresponds to the current workspace _and_ a set of commits to analyze what files, directories and what versions are still needed and should be kept (by From 06445b804a4d0dd6f6481796dbbdd2c7bd6b56e7 Mon Sep 17 00:00:00 2001 From: Ivan Shcheklein Date: Wed, 18 Mar 2020 15:52:21 -0700 Subject: [PATCH 5/8] Update content/docs/command-reference/gc.md Co-Authored-By: Jorge Orpinel --- content/docs/command-reference/gc.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 7e52fd7e5c..447d223759 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -49,8 +49,9 @@ The default remote is cleaned (see `dvc config core.remote`) unless the `--all-branches`, or `--all-commits` are used. 
- `-a`, `--all-branches` - keep cached objects referenced in all Git branches as - well as in the workspace (implies `-w`). Useful for keeping data for all the - latest experiment versions if branches are used to track those. + well as in the workspace (implies `-w`). Useful if branches are used to + track + different experiments. - `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well as in the workspace (implies `-w`). Useful if tags are used to track From f90622d1b84e95116deb2d35792952c02f35f7c5 Mon Sep 17 00:00:00 2001 From: Ivan Shcheklein Date: Wed, 18 Mar 2020 15:52:59 -0700 Subject: [PATCH 6/8] Update content/docs/command-reference/gc.md Co-Authored-By: Jorge Orpinel --- content/docs/command-reference/gc.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 447d223759..437bb90b8c 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -54,7 +54,7 @@ The default remote is cleaned (see `dvc config core.remote`) unless the different experiments. - `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well - as in the workspace (implies `-w`). Useful if tags are used to track + as the workspace (implies `-w`). Useful if tags are used to track "checkpoints" of an experiment or project. Note that both options can be combined, for example using the `-aT` flag. 
From e2ea8a7ab7249f8e3ec551bb2b3f7c9ec01b9a98 Mon Sep 17 00:00:00 2001 From: Ivan Shcheklein Date: Wed, 18 Mar 2020 15:53:39 -0700 Subject: [PATCH 7/8] Update content/docs/command-reference/gc.md Co-Authored-By: Jorge Orpinel --- content/docs/command-reference/gc.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 437bb90b8c..1f1ba3a05b 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -59,8 +59,8 @@ The default remote is cleaned (see `dvc config core.remote`) unless the combined, for example using the `-aT` flag. - `--all-commits` - the same as `-a` or `-T` above, but applies to Git commits - as well as in the workspace (implies `-w`). Useful for keeping data for all - experiment versions ever used in the history of the project. + as well as the workspace (implies `-w`). Useful for keeping all the data + used in the entire existing commit history of the project. - `-p `, `--projects ` - if a single remote or a single cache is shared among different projects (e.g. a configuration like the one described From 49e4f53b1eb03fc8716af5d98e79c5649ccd16fe Mon Sep 17 00:00:00 2001 From: Ivan Shcheklein Date: Wed, 18 Mar 2020 19:48:17 -0700 Subject: [PATCH 8/8] address reviews feedback, always use need, mention use case for all-commits --- content/docs/command-reference/gc.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/content/docs/command-reference/gc.md b/content/docs/command-reference/gc.md index 1f1ba3a05b..e0a275608c 100644 --- a/content/docs/command-reference/gc.md +++ b/content/docs/command-reference/gc.md @@ -19,7 +19,7 @@ DVC cache but are no longer needed. With `--cloud` it also removes data in To avoid accidentally deleting data, it raises an error and doesn't touch any files if no scope options are provided. 
It means it's user's responsibility to -explicitly provide the right set of options to specify what data is still in-use +explicitly provide the right set of options to specify what data is still needed (so that DVC can figure out what files can be safely deleted). One of the scope options, `--workspace`, `--all-branches`, `--all-tags`, @@ -44,13 +44,12 @@ The default remote is cleaned (see `dvc config core.remote`) unless the ## Options -- `-w`, `--workspace` - keep cached objects _only_ referenced in the current - workspace This option is enabled automatically if `--all-tags`, +- `-w`, `--workspace` - keep files and directories _only_ referenced in the + current workspace This option is enabled automatically if `--all-tags`, `--all-branches`, or `--all-commits` are used. - `-a`, `--all-branches` - keep cached objects referenced in all Git branches as - well as in the workspace (implies `-w`). Useful if branches are used to - track + well as in the workspace (implies `-w`). Useful if branches are used to track different experiments. - `-T`, `--all-tags` - the same as `-a` above, but applies to Git tags as well @@ -58,9 +57,14 @@ The default remote is cleaned (see `dvc config core.remote`) unless the "checkpoints" of an experiment or project. Note that both options can be combined, for example using the `-aT` flag. -- `--all-commits` - the same as `-a` or `-T` above, but applies to Git commits - as well as the workspace (implies `-w`). Useful for keeping all the data - used in the entire existing commit history of the project. +- `--all-commits` - the same as `-a` or `-T` above, but applies to _all_ Git + commits as well as the workspace (implies `-w`). Useful for keeping all the + data used in the entire existing commit history of the project. 
+ + One of the use cases for this option is to safely delete all temporary data + DVC cached when `dvc run` and/or `dvc repro` were run without committing + changes to DVC-files (thus potentially caching data that is not referenced + from workspace or Git commits). - `-p `, `--projects ` - if a single remote or a single cache is shared among different projects (e.g. a configuration like the one described