Add dry-run option for garbage collection #1511

Closed
drorata opened this issue Jan 18, 2019 · 13 comments
Labels: enhancement (Enhances DVC), feature request (Requesting a new feature), good first issue, help wanted, p2-medium (Medium priority, should be done, but less important)

Comments

@drorata

drorata commented Jan 18, 2019

Garbage collection is, by design, very destructive and irreversible. Wouldn't it make sense to add a --dry-run flag that just lists what would happen if gc were run?

@efiop
Contributor

efiop commented Jan 18, 2019

Hi @drorata !

Great idea! 🙂 We'll look into it. Thanks for the feedback!

@tdeboissiere

At least for learning purposes, I'm also looking into it.

As far as I can tell, it is fairly straightforward to print a message indicating which files in the cache are going to be deleted.

However, this would be of limited use because cache files would be printed as MD5 hashes. The user is more interested in the actual file or directory that each hash corresponds to.
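To make the point concrete, here is a minimal sketch of a dry run that only lists hashes. It is not DVC's implementation: the function names `find_unused` and `collect_garbage` are hypothetical, and it assumes a cache laid out as `<first two hash characters>/<remaining characters>`, with the set of hashes still in use already known.

```python
from pathlib import Path


def find_unused(cache_dir, used_hashes):
    """Return (md5, path) pairs for cache files whose hash is not in used_hashes.

    Assumption: each cache file lives at <cache_dir>/<2 chars>/<rest of hash>.
    """
    unused = []
    for path in sorted(Path(cache_dir).glob("??/*")):
        if path.is_file():
            md5 = path.parent.name + path.name
            if md5 not in used_hashes:
                unused.append((md5, path))
    return unused


def collect_garbage(cache_dir, used_hashes, dry_run=True):
    """Delete unused cache files, or only report them when dry_run is true."""
    unused = find_unused(cache_dir, used_hashes)
    for md5, path in unused:
        if dry_run:
            # Nothing is deleted; the hash is all we can show at this point.
            print(f"would remove {md5}")
        else:
            path.unlink()
    return [md5 for md5, _ in unused]
```

The dry-run output is just a list of hashes, which is exactly the limitation described above: nothing in the cache itself says which file or directory a hash belonged to.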

Given that dvc gc would likely be called after dvc purge, which removes the .dvc files, the .dvc files cannot be used to map a cache file to user-readable output.

Is it possible to recover the user-readable file/directory name even after the .dvc files have been removed?

@efiop
Contributor

efiop commented Jan 23, 2019

@tdeboissiere Nope, that is not possible right now. However, we could try to utilize git history of a project in order to retrieve that information, similar to #1234 .

@drorata
Author

drorata commented Jan 23, 2019

What is dvc purge?

@efiop
Contributor

efiop commented Jan 23, 2019

@drorata @tdeboissiere probably meant dvc remove --purge.

@tdeboissiere

@efiop Exactly.

Using git history is a possible way, but as mentioned by @dmpetrov, this would tie us to a specific SCM, and we would need to have committed the corresponding .dvc file.

In that case, perhaps introducing something like dvc remove --purge_gc, optionally with a --dry flag, to carry out the purge plus the corresponding garbage collection would be a more straightforward solution?

@efiop
Contributor

efiop commented Jan 25, 2019

@tdeboissiere sounds good. Maybe even --drop-cache or something, so it is even more straightforward.

efiop added the p2-medium label Jul 23, 2019

@ghost

ghost commented Oct 2, 2019

@iterative/engineering, is this going to be affected after #2325? (I'm thinking about including this one for Hacktoberfest, but it might not be that clear.)

@shcheklein
Member

I don't think #2325 affects this, but it's also not exactly clear what the interface for this feature should look like. Can we specify the output of dvc gc --dry-run? Essentially, all it can print is a long list of hashes, right? Is that valuable enough?

@pared
Contributor

pared commented Oct 2, 2019

Ideally, I think it should be able to "describe" the removed file, for example: removing model.pkl from branch master, revision 1.0. Though it surely would not be able to describe a file whose commit was somehow removed, for example by squashing.

@dashohoxha
Contributor

As mentioned by @efiop (#1511 (comment)), by digging through the Git history it may be possible to find the relevant information for each cached file (i.e. the name of the corresponding .dvc file, the commit message, the revision id, and maybe the branch and tag name).
But this might be very inefficient for a big Git repo (with hundreds or thousands of commits), unless this information is somehow indexed and saved in a DB.
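The indexing idea could look roughly like the sketch below. It is only an illustration under stated assumptions: `build_index` and `describe` are hypothetical names, the snapshots of `.dvc` file text per revision are assumed to have been collected from Git already, and output hashes are assumed to appear as indented (or list-item) `md5:` entries in that text, while an unindented top-level `md5:` is skipped.

```python
import re
from collections import defaultdict

# Assumption: output hashes are "md5:" entries that are indented or follow "- ".
MD5_LINE = re.compile(
    r"^(?:[ \t]*-[ \t]+|[ \t]+)md5:[ \t]*([0-9a-f]{32}(?:\.dir)?)[ \t]*$",
    re.MULTILINE,
)


def build_index(snapshots):
    """Map each output hash to the (dvc_path, revision) pairs that mention it.

    snapshots: iterable of (revision, dvc_path, dvc_file_text) tuples.
    """
    index = defaultdict(list)
    for rev, dvc_path, text in snapshots:
        for md5 in MD5_LINE.findall(text):
            ref = (dvc_path, rev)
            if ref not in index[md5]:
                index[md5].append(ref)
    return dict(index)


def describe(md5, index):
    """Return a human-readable line for a cache hash, if any reference is known."""
    refs = index.get(md5)
    if not refs:
        return f"{md5}: no known reference"
    where = ", ".join(f"{path} @ {rev}" for path, rev in refs)
    return f"{md5}: {where}"
```

Building the index once and storing it would avoid walking the whole history on every gc call, but hashes whose commits were squashed or never committed would still come back as "no known reference".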

@shcheklein
Member

The problem is that, by definition, GC removes "garbage", and in a lot of cases it won't be possible to find any references even if we were able to analyze the history. What should we do with it? Especially if we change the default behavior (analyze history and keep it), it will be removing only files that are not being referenced.

@efiop
Contributor

efiop commented Oct 19, 2020

Closing as stale. gc --dry-run itself doesn't get many requests these days, and we have #2325 as an umbrella issue.

efiop closed this as completed Oct 19, 2020