consider introducing dvc hooks #2363

Closed
efiop opened this issue Aug 3, 2019 · 12 comments
Labels
feature request (Requesting a new feature), p3-nice-to-have (It should be done this or next sprint), question (I have a question?), research

Comments

@efiop
Contributor

efiop commented Aug 3, 2019

These would work similarly to git hooks. For example, one could configure a post-repro hook that would run dvc push. The nice thing about these is that they would be tracked by git, so you wouldn't have to re-install your hooks after each clone, as you have to do with git hooks. The scenarios for which this feature could be useful are not yet clear to me, so I would love to hear whether any users have scenarios in mind in which they would find such functionality handy.

Also, as @shcheklein noted in #2359 (comment), if we actually decide to implement this, we'll need to consider using something better than single executables for hooks (e.g. a YAML config like the one pre-commit uses).


@dashohoxha
Contributor

> The scenarios for which this feature could be useful are not yet that clear to me, so I would love to hear if any users have any scenarios in mind in which they would find such functionality handy.

This is a scenario where DVC hooks may be useful: #2330
Let me explain.

DVC hooks can be placed in .dvc/hooks/ and may be named pre-run, post-run, pre-fetch, post-fetch, pre-push, post-push, etc. In general, pre-<command> and post-<command> (maybe not for all the commands).

Before running the fetch command, for example, DVC would check whether the hooks exist. It would then call pre-fetch, passing it all the arguments of the fetch command. The hook would return either DONE or a (possibly modified) list of arguments. If it returns DONE, DVC stops, because this indicates that the hook has already fetched the data and nothing is left to do. If the pre-hook returns a list of arguments, the fetch command itself is called with this list. Finally, post-fetch is called (if available) and receives the return status of the fetch command.
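A minimal sketch of how this dispatch could look, to make the proposal concrete. Nothing here is an existing DVC API: `run_with_hooks`, `parse_pre_hook_reply`, and the convention that a pre-hook prints either `DONE` or one argument per line on stdout are all assumptions made for illustration.

```python
import subprocess
from pathlib import Path

HOOKS_DIR = Path(".dvc") / "hooks"  # location proposed in this comment


def parse_pre_hook_reply(stdout):
    """Return None if the hook printed DONE, else the argument list it printed.

    Assumption: the hook prints either the single line DONE or one argument per line.
    """
    lines = stdout.splitlines()
    return None if lines == ["DONE"] else lines


def run_with_hooks(command, args, run_command, hooks_dir=HOOKS_DIR):
    """Run `command` wrapped by optional pre-/post- hooks (hypothetical protocol)."""
    pre_hook = Path(hooks_dir) / f"pre-{command}"
    if pre_hook.exists():
        result = subprocess.run(
            [str(pre_hook), *args], capture_output=True, text=True, check=True
        )
        new_args = parse_pre_hook_reply(result.stdout)
        if new_args is None:
            # The hook already did the whole job, so skip the built-in command.
            return 0
        args = new_args

    # `run_command` stands in for the real command; it returns an exit status.
    status = run_command(args)

    post_hook = Path(hooks_dir) / f"post-{command}"
    if post_hook.exists():
        # The post-hook receives the command's return status as its only argument.
        subprocess.run([str(post_hook), str(status)], check=False)
    return status
```

With no hooks installed, `run_with_hooks("fetch", args, run_command)` simply calls `run_command(args)` and returns its status, so existing behaviour would be unchanged.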

How is this related to issue #2330? We can imagine a pair of hooks, pre-fetch and pre-push, which do the backup/restore of the data themselves, bypassing/replacing the commands fetch and push. These hooks may use rclone, for example, to do the job. While rclone may be suboptimal compared to the DVC commands push and fetch, it supports more storage providers than DVC does. So this may provide a not-so-good solution in a case where DVC provides no solution at all.

Another example of using hooks may be with the command dvc run, to avoid duplication of the deps/outs. If the command is: dvc run -d <inputfile> -o <outputfile> <command>, then the hook pre-run will be called, which will get the arguments -d <inputfile> -o <outputfile> <command>, and will return the arguments -d <inputfile> -o <outputfile> <command> <inputfile> <outputfile>. Then the command run itself will be called with these (augmented) arguments.
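The argument rewriting that such a pre-run hook would perform can be sketched as below. The function name `augment_run_args` is hypothetical, and the sketch assumes the arguments start with `-d`/`-o` pairs followed by the user's command; other `dvc run` options are not handled.

```python
def augment_run_args(args):
    """Append every leading -d/-o value to the end of the argument list.

    Simplifying assumption: the arguments begin with -d <file> / -o <file>
    pairs, and the first other token starts the user's command.
    """
    extra = []
    i = 0
    while i < len(args) - 1:
        if args[i] in ("-d", "-o"):
            extra.append(args[i + 1])
            i += 2
        else:
            break  # reached the user's command
    return list(args) + extra


augmented = augment_run_args(["-d", "in.csv", "-o", "out.csv", "python", "clean.py"])
print(augmented)
# ['-d', 'in.csv', '-o', 'out.csv', 'python', 'clean.py', 'in.csv', 'out.csv']
```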

Another scenario is the one described here: #2359 (comment)
When you call the command dvc repro it will also find and call the hook post-repro, which will automatically call dvc commit and dvc push.
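As a rough sketch, the body of such a post-repro hook might look like this. It assumes the hook receives the exit status of `dvc repro` (as in the protocol described above); the function name `post_repro` and the injectable `run` parameter are illustrative only.

```python
import subprocess


def post_repro(repro_status, run=subprocess.call):
    """Run `dvc commit` and then `dvc push`, but only if `dvc repro` succeeded.

    `run` takes an argument list and returns an exit status; it is injectable
    so the logic can be exercised without invoking DVC.
    """
    if repro_status != 0:
        # Do not commit or push the results of a failed reproduction.
        return repro_status
    for cmd in (["dvc", "commit"], ["dvc", "push"]):
        status = run(cmd)
        if status != 0:
            return status  # stop at the first failing step
    return 0
```

An executable file named post-repro could then call `post_repro(int(sys.argv[1]))` and exit with the returned status.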

@efiop efiop added p3-nice-to-have It should be done this or next sprint and removed p4 labels Sep 25, 2019
@efiop
Contributor Author

efiop commented Feb 15, 2020

Closing for now, as there doesn't seem to be enough interest. The feature still seems useful, so maybe we'll reopen in the future, once there are more use cases.

@efiop efiop closed this as completed Feb 15, 2020
@evstratbg

+1 for this feature.
Or are there any other ways to get notified when new files or new versions of files become available in a DVC repo?

@efiop
Contributor Author

efiop commented Jul 20, 2020

@evstratbg Could you share more details about your scenario, please? 🙂

@evstratbg

@efiop sure
We currently store very large ML models in DVC. We are starting to switch to k8s and plan to build the models into a separate docker image, because the code changes more often than the models. We discussed several options for building the models from DVC, and it would be very convenient to trigger the build with a curl request from a DVC hook. It may not be the most popular way, but we are still working it out :) If you happen to have recommendations or experience on "how to deploy an application with very large ML models to k8s", I would be happy to learn from it.

@efiop
Contributor Author

efiop commented Jul 20, 2020

@evstratbg Hm, I might be missing something, but have you considered regular git hooks? E.g. when a .dvc (or dvc.yaml/dvc.lock) file is being committed to git, your git pre-commit hook would send a curl request.
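A minimal sketch of such a git pre-commit hook, written in Python rather than as a shell script calling curl. `WEBHOOK_URL` is a placeholder, and the helper names are made up for illustration; only `git diff --cached --name-only` and the standard library are relied on.

```python
import subprocess
import urllib.request
from pathlib import PurePosixPath

WEBHOOK_URL = "https://example.com/build-models"  # hypothetical endpoint


def is_dvc_file(path):
    """True for *.dvc files and for dvc.yaml / dvc.lock."""
    p = PurePosixPath(path)
    return p.suffix == ".dvc" or p.name in ("dvc.yaml", "dvc.lock")


def staged_dvc_files():
    """List the staged paths that are DVC metafiles."""
    out = subprocess.run(
        ["git", "diff", "--cached", "--name-only"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line for line in out.splitlines() if is_dvc_file(line)]


def notify(files, url=WEBHOOK_URL):
    """POST the changed paths, one per line, to the webhook."""
    data = "\n".join(files).encode("utf-8")
    request = urllib.request.Request(url, data=data, method="POST")
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Saved as an executable .git/hooks/pre-commit, the script would call `notify(staged_dvc_files())` only when the list is non-empty. Unlike the proposed DVC hooks, this file is not tracked by git and has to be installed after each clone.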

@casperdcl
Contributor

@WALEX2000

I have a new use-case for this feature. I have a pretty strict deadline so I guess I'll have to find another way, but since this might be useful for someone, here goes:

I am creating a CLI tool that manages a fully local ML pipeline/workflow, to be used on small experimental projects.
When a user adds a data file through dvc add, I wanted to run some python scripts which generate metadata for that file (it's actually html source-code that shows some important metrics for that dataset).
If I had access to hooks I could just do this automatically. Since I don't, I think I'm going to have to use a different approach. Also, git hooks can't be used here because there are no hooks for staging, and I didn't want to wait until the user commits the dataset for the metadata to be generated.

If you're curious, my approach will be to simply add a new CLI command that generates the metadata and shows it in the browser. (I think I'll also be storing the html source code directly in the .dvc yaml file, to force the user to run this command instead of just clicking on an html file. If that's a bad idea, pls let me know.)

@efiop
Contributor Author

efiop commented Apr 20, 2022

@WALEX2000 Have you considered using a git hook to generate those when you git commit the corresponding dvc files?

@WALEX2000

As I said in the previous post, I'd rather generate the metadata as soon as the file is added to DVC, instead of waiting for a commit.
This is because, when someone is editing a dataset (for example, doing feature engineering), I want to have the metadata readily available, so that it can be consulted during that process.
If I force the programmer to commit, then they'd be committing useless changes left and right, only to be able to access the updated metadata.

@johnyaku

A lot of tools fail to "see through" symlinks and we need to dvc unprotect them first.
A post-pull dvc hook would allow us to run dvc unprotect on a list of files that need to be made accessible like this. Unless there is a better way to achieve the same result?
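A sketch of what such a post-pull hook could do, assuming the targets are kept in a plain text list, one path per line. The list format, `read_targets`, and `unprotect_after_pull` are illustrative assumptions; `dvc unprotect <targets>` is the only DVC command used.

```python
import subprocess


def read_targets(text):
    """Parse one path per line, skipping blank lines and # comments."""
    lines = (line.strip() for line in text.splitlines())
    return [line for line in lines if line and not line.startswith("#")]


def unprotect_after_pull(pull_status, targets, run=subprocess.call):
    """Run `dvc unprotect` on the targets if `dvc pull` succeeded."""
    if pull_status != 0 or not targets:
        return pull_status  # nothing to do, or the pull itself failed
    return run(["dvc", "unprotect", *targets])
```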

@SolomidHero

I have really big datasets in terms of the number of files. I never want to push them unarchived to storage, since that is really expensive (even a single push). I'd rather use some dvc pre-add hook to check whether the folder being added is too big, and maybe tar it.
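The check such a pre-add hook would run might be sketched as follows. The threshold `MAX_FILES` and the function names are assumptions for illustration; the sketch uses only the Python standard library and writes an uncompressed tar next to the folder.

```python
import os
import tarfile
from pathlib import Path

MAX_FILES = 10_000  # hypothetical threshold


def count_files(folder):
    """Count regular files under `folder`, recursively."""
    return sum(len(files) for _, _, files in os.walk(folder))


def archive_if_large(folder, max_files=MAX_FILES):
    """Return the path to add to DVC: the folder itself, or a tar of it.

    If the folder holds more than `max_files` files, pack it into
    <folder>.tar next to it and return the archive path instead.
    """
    folder = Path(folder)
    if count_files(folder) <= max_files:
        return folder
    archive = folder.with_name(folder.name + ".tar")
    with tarfile.open(archive, "w") as tar:
        tar.add(folder, arcname=folder.name)
    return archive
```

The hook would then hand the returned path to dvc add instead of the original folder.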


7 participants