Add dry-run option for garbage collection #1511
Comments
Hi @drorata! Great idea! 🙂 We'll look into it. Thanks for the feedback!
At least for learning purposes, I'm also looking into it. As far as I can tell, it is fairly straightforward to print a message to indicate which file in the cache is going to be deleted. However, this would be of limited use because cache files would be printed as an md5 hash. Rather, the user is more interested in the actual file/directory that the md5 corresponds to. Given that Is it possible to recover the user-readable file/directory name even after
@tdeboissiere Nope, that is not possible right now. However, we could try to utilize the git history of a project in order to retrieve that information, similar to #1234.
What is
@drorata @tdeboissiere probably meant
@efiop Exactly. Using git history is a possible way, but as mentioned by @dmpetrov, this would tie us to a specific SCM, and we would need to have committed the corresponding .dvc file. In that case, perhaps introducing something like
@tdeboissiere sounds good. Maybe even
@iterative/engineering, is this going to be affected after #2325? (thinking about including this one for hacktoberfest, but might not be that clear)
I don't think #2325 affects this, but it's also not exactly clear what the interface for this feature should look like. Can we specify the output of the
Ideally, I think it should be able to "describe" the removed file. For example removing
As mentioned by @efiop (#1511 (comment)), by digging through the Git history it may be possible to find the relevant information for each cached file (i.e. the name of the corresponding
The problem is that by definition GC removes "garbage", and in a lot of cases it won't be possible to find any references even if we were able to analyze the history. What should we do with it? Especially, if we change the default behavior (analyze history and keep it), it will be removing only files that are being referenced.
Closing as stale.
Garbage collection is, by design, very destructive and irreversible. Wouldn't it make sense to add a --dry-run flag which would just list what would happen if gc were run?
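To make the request concrete, here is a minimal sketch of what a dry-run mode could look like. This is not DVC's implementation: the function name gc_cache, the used_hashes parameter, and the flat cache layout (one file per hash, with the file name equal to the hash) are all assumptions made for illustration.

```python
from pathlib import Path


def gc_cache(cache_dir, used_hashes, dry_run=True):
    """Collect cache files that are not referenced by used_hashes.

    Hypothetical sketch: assumes a flat cache directory where each
    file name is the hash of its content. With dry_run=True nothing
    is deleted; the candidates are only printed and returned.
    """
    candidates = []
    for path in sorted(Path(cache_dir).iterdir()):
        if not path.is_file():
            continue
        if path.name in used_hashes:
            continue  # still referenced, keep it
        candidates.append(path)
        if dry_run:
            print(f"would remove {path}")
        else:
            path.unlink()  # irreversible: the file is gone after this
    return candidates
```

Because the destructive call sits behind the dry_run check and the default is dry_run=True, a caller has to opt in explicitly before anything is deleted. As noted in the comments, the listing only shows hashes; mapping them back to user-readable names is a separate problem.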