Running DVC in production #5924
-
@woop, great question! Could you please clarify a few things to make sure I understand your scenario correctly:
-
Thanks for the fast reply!
-
Thank you for sharing the details! Points 1-3 are clear. Re (4)... when do you clone the git repo at a given SHA into the docker container? When you build the image that needs to be run, or while the container is running (it gets a SHA from somewhere and clones the repo based on it)? For reproducibility between container runs in prod you need to share state between runs, and a git repo is one way to share that state. To get the state in prod you need to clone a recent version of the repo when you run the container (not a repo pinned to some SHA). To update the state you then need to run something like a `git commit`/`git push` of the updated DVC files. It might look unnatural to commit/push from prod, and it can create a mess in your git history. To avoid these issues you might use a separate git branch for prod (similar to a dedicated release branch in traditional development).
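For illustration, here is a minimal sketch of the kind of scheduled prod job described above, as I understand it. The repo URL, branch name, and commit message are placeholders, and the `dvc pull`/`dvc push` steps assume a DVC remote is already configured; this is my own sketch, not something taken from the thread.

```python
# Sketch of a scheduled prod job: clone the latest prod branch, reproduce the
# pipeline, and push the updated DVC files back so the next run sees them.
# Repo URL, branch, and paths are placeholders.
import subprocess

REPO_URL = "git@example.com:org/ml-pipeline.git"  # placeholder
WORKDIR = "/tmp/ml-pipeline"
PROD_BRANCH = "prod"

def sh(*cmd, cwd=None):
    """Run a command and fail loudly so the scheduler sees the error."""
    subprocess.run(cmd, cwd=cwd, check=True)

# 1. Get the *latest* state of the prod branch (not a pinned SHA), so this
#    run sees the DVC hashes committed by the previous run.
sh("git", "clone", "--branch", PROD_BRANCH, "--depth", "1", REPO_URL, WORKDIR)

# 2. Pull cached artifacts and re-run only stages whose dependencies changed.
sh("dvc", "pull", cwd=WORKDIR)
sh("dvc", "repro", cwd=WORKDIR)

# 3. Push new artifacts to the DVC remote and the updated DVC files back to
#    the prod branch, so the next scheduled run can pick up from here.
sh("dvc", "push", cwd=WORKDIR)
sh("git", "add", "-A", cwd=WORKDIR)
sh("git", "commit", "--allow-empty", "-m",
   "prod: update DVC files after scheduled run", cwd=WORKDIR)
sh("git", "push", "origin", PROD_BRANCH, cwd=WORKDIR)
```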
-
We basically have two approaches.
Let's assume we don't pin the cloned repo to a specific git commit and just take the latest one. I guess I need to think about this a bit more. I like the idea of being able to pick up from a very specific commit and completely reproduce the state of the system at that point, but it seems this could be quite difficult to manage and standardize.
-
@woop can you elaborate on running multiple pipelines in parallel? If you have a single production system that is retraining things on a cron schedule (say, daily), it makes sense to me to have a separate "production" branch. Every commit on it would be the latest state of master with the updated DVC files forced on top of it.
-
So the idea is to have endlessly growing branches? Every time the same step runs, the query will change, which will update the input data, which will rerun all the steps. Then it will commit all of this to the repository for each step in each pipeline, every time it runs. We would need to squash those git histories eventually; I think they will become massive.
-
@woop I'm not sure I'm following you on "Then it will commit all of this to the repository for each step in each pipeline every time it runs." Why do you need to commit every step separately? Usually, you would run the whole pipeline and then follow it with a single commit.
-
If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs, and step 3 will have to rerun step 2 and step 1 when it runs. That is because they are still looking at old data hashes. The only way to get around this (it seems) is to commit the latest hashes at every step.
-
This is true only if you are going to run step 2 in a different environment / on a separate machine. Is that your case? In general you can run the whole pipeline in a single environment with `dvc repro`. To some extent, ... Unless I'm missing something :)
-
> This is true only if you are going to run step 2 in a different environment / on a separate machine. Is that your case?

Correct. In my case we are running each step of the pipeline as a separate container (Kubeflow Pipelines, or Airflow with Kubernetes). What this means is that I need to somehow get the DVC files into the next container so that those previous steps don't rerun. One way is to do git commits; another is to have a data management layer that moves them between steps.
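To make the git-commit variant concrete, here is a rough sketch of what a single-step task could look like in this kind of setup. It assumes the repo was already cloned into the container (e.g. by an init step); the stage name, branch, and paths are hypothetical, and the point is only that committing the updated DVC files is what lets the next step's container skip re-running this stage.

```python
# Hypothetical single-step task (e.g. one Kubeflow/Airflow container) for the
# git-commit approach. Stage name, branch, and repo path are placeholders.
import subprocess

REPO = "/workspace/repo"
BRANCH = "prod"

def sh(*cmd):
    subprocess.run(cmd, cwd=REPO, check=True)

def run_stage(stage: str) -> None:
    sh("git", "pull", "origin", BRANCH)   # pick up hashes committed by earlier steps
    sh("dvc", "pull")                     # fetch this stage's cached inputs
    sh("dvc", "repro", stage)             # re-runs only if its deps changed
    sh("dvc", "push")                     # share outputs via the DVC remote
    sh("git", "add", "-A")
    sh("git", "commit", "--allow-empty", "-m", f"update DVC files for {stage}")
    sh("git", "push", "origin", BRANCH)

# e.g. run_stage("train")
```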
-
I think running DVC stages in totally isolated containers is an anti-pattern. We can be creative with workarounds, but having each command run in an isolated context defeats the purpose of a tool whose core is inspecting existing state. If you're not otherwise persisting this state, by having a shared file system or committing changes upstream, you're better off having a single job that runs `dvc repro` on the entire pipeline.
-
I solved this in my project by creating a small script to sync DVC stages between prod and dev. It's something like this:
It copies the stage config, but keeps the asset hashes unchanged where possible. I agree it would be great if DVC supported this out of the box.
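The script itself isn't shown above, so as a rough illustration of the idea (copy the stage config from dev, keep the recorded hashes where the stage definition is unchanged), a sketch might look like the following. This is my own guess at the approach, not the author's script, and it assumes a DVC >= 1.0 layout with `dvc.yaml`/`dvc.lock`; paths are placeholders.

```python
# Rough illustration: sync stage definitions from a dev repo into a prod repo,
# but keep prod's existing lock entries for stages whose definition did not
# change, so those stages are not needlessly re-run.
import shutil
import yaml  # PyYAML

DEV = "dev-repo"    # placeholder paths
PROD = "prod-repo"

def load(path):
    with open(path) as f:
        return yaml.safe_load(f) or {}

dev_yaml = load(f"{DEV}/dvc.yaml")
prod_yaml = load(f"{PROD}/dvc.yaml")
prod_lock = load(f"{PROD}/dvc.lock")

# Keep lock entries (hashes) only for stages whose definition is identical in
# dev and prod; DVC will recompute the rest.
kept = {
    name: entry
    for name, entry in prod_lock.get("stages", {}).items()
    if prod_yaml.get("stages", {}).get(name) == dev_yaml.get("stages", {}).get(name)
}
prod_lock["stages"] = kept

# Overwrite prod's stage config with dev's; write back the pruned lock file.
shutil.copyfile(f"{DEV}/dvc.yaml", f"{PROD}/dvc.yaml")
with open(f"{PROD}/dvc.lock", "w") as f:
    yaml.safe_dump(prod_lock, f, sort_keys=False)
```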
-
I left a few comments in the threads above; here are a few final notes:
Depending on a datetime would make the pipeline non-deterministic, and thus, no: it would always run again no matter what.
It seems the scenario here is more about distributing pipeline execution among multiple environments. So my answer would be: no, DVC isn't only useful in local development, but it is designed to codify and reproduce full pipelines in a single execution environment. The features for running only parts of a pipeline (even single stages) are, in general, meant to save time during development.
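As a concrete, hypothetical illustration of the datetime point: if the first stage depends on a parameters file that gets stamped with the current date, the dependency's hash changes on every scheduled run, so `dvc repro` will always re-run it and everything downstream. The file name and keys below are placeholders.

```python
# Stamping parameters with the current date makes the first stage's dependency
# change every day, so DVC re-runs it (and downstream stages) on every run.
import datetime
import yaml  # PyYAML

params = {
    "query_date": datetime.date.today().isoformat(),  # changes daily
    "table": "events",                                 # placeholder
}

# This file is declared as a dependency of the first stage
# (e.g. via `dvc run -d parameters.yaml ...`), so its hash drives re-execution.
with open("parameters.yaml", "w") as f:
    yaml.safe_dump(params, f)
```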
-
Related discussion: https://discord.com/channels/485586884165107732/872860674529845299/872860676736036874
-
@woop I'm currently facing a similar issue as I am evaluating different tools we want to implement in our workflow. DVC and Kubeflow are among them, but I also have the feeling that something is missing. What did your approach end up looking like? At this point I am not sure DVC and Kubeflow are compatible in the way they operate. Kubeflow defines its own DAG, independent from the DVC DAG, and that's exactly where things get complicated. Kubeflow apparently does not allow us to execute steps independently of each other - unless I am missing something here - so we don't seem to be able to formulate them as DVC stages (or wrap them in DVC stages, for that matter). I don't have real experience with Kubeflow yet, so I might be overlooking something.
-
I see a lot of value in using DVC during the development phase of a DS project, especially the ability to reproduce outputs only when dependencies have changed.
One of the problems we are trying to solve is how to move a data scientist's code back and forth between development and production. Ideally we would want their local development experience to translate easily into production. I've created a toy project with DVC to see if we could use it for developing a multi-step pipeline which does data transfers between each step.
However, there is one thing that is unclear when scheduling this same pipeline in Kubeflow/Airflow. Let's assume that my pipeline is as follows
If I do all of my local development (`dvc run`, `dvc repro`) then everything works. But in a production setting I will have unique inputs to my pipeline. For example, the datetime stamp or other input variables will change. I can integrate this with DVC by having a file called `parameters` as a dependency of the `Get Data` step. So when I run the pipeline on Airflow on different days, the dependencies for step 1 will be different, which means it will get recomputed.
The problem I have is that all of the steps in the graph have their hashes hardcoded based on the local development environment. So even if I rerun this whole pipeline multiple times with the same input parameters, none of the `*.dvc` files in the pipeline will be updated, meaning everything will rerun from scratch. That's because they are running in an isolated production environment and not committing code back to the project repo. So `dvc` loses its value when wrapped in a scheduler. Am I missing something, or is DVC primarily useful in local development only?
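For context, a hypothetical way to wire the `parameters` file in as a dependency of the `Get Data` step might look like this. The stage name, script, and output paths are my placeholders (not from the project described above), and it assumes a DVC version that supports named stages via `dvc run -n`.

```python
# Declare the first stage with the `parameters` file as a dependency, so a
# change in the parameters (e.g. a new datetime stamp) is what triggers a
# re-run of "get_data" and its downstream stages. All names are placeholders.
import subprocess

def dvc_run(name, deps, outs, cmd):
    args = ["dvc", "run", "-n", name]
    for d in deps:
        args += ["-d", d]
    for o in outs:
        args += ["-o", o]
    subprocess.run(args + [cmd], check=True)

dvc_run(
    name="get_data",
    deps=["parameters.yaml", "src/get_data.py"],
    outs=["data/raw"],
    cmd="python src/get_data.py --params parameters.yaml",
)
```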