Running DVC in production #5924

woop · 2019-06-30T05:10:15Z

woop
Jun 30, 2019

I see a lot of value in using DVC during the development phase of a DS project, especially having the ability to reproduce outputs only if dependencies have changed.

One of the problems we are trying to solve is how we can move a data scientists code back and forth from development and production. Ideally we would want their local development experience to be translated easily into production. I've created a toy project with DVC to see if we could use it for developing a multi-step pipeline which does data transfers between each step.

However, there is one thing that is unclear when scheduling this same pipeline in Kubeflow/Airflow. Let's assume that my pipeline is as follows

Get Data
Transform Data
Train Model
Evaluate Model

If I do all of my local development (dvc run, dvc repro) then everything works. But in a production setting I will have unique inputs to my pipeline. For example the datetime stamp or other input variables will change. I can integrate this with DVC by having a file called parameters as a dependency to the Get Data step.

So when I run the pipeline on Airflow on different days, then the dependencies for step 1 will be different, which means it will get recomputed.

The problem that I have is that all of the steps in the graph have their hashes hardcoded based on the local development environment. So even if I rerun this whole pipeline multiple times with the same input parameters, none of the *.dvc files in the pipeline will be updated, meaning everything will rerun from scratch. That's because they are running in an isolated production environment and not committing code back to the project repo. So dvc looses it's value when wrapped in a scheduler.

Am I missing something, or is DVC primarily useful in local development only?

dmpetrov · 2019-06-30T06:07:54Z

dmpetrov
Jun 30, 2019
Maintainer

@woop, great question!

Could you please clarify few things to make sure I understand your scenario correctly:

Do all the steps (Get Data, Transform Data, Train Model, Evaluate Model) work in a single Airflow job/k8s container?
Does Get Data step work outside dvc and updates data files in a dvc repo?
After data files are updated you run dvc repro which does the rest: Transform Data, Train Model, Evaluate Model. Right?
You don't want to add/dvc add, commit/git commit and push/dvc push in production? Is it correct?

0 replies

woop · 2019-06-30T09:03:01Z

woop
Jun 30, 2019
Author

Thanks for the fast reply!

Ideally they would be in multiple containers, but for a start I would be fine running them in a single container.
Yes
Locally I could just do dvc repro evaluate.dvc which would run everything. In production I could do the same (if I was using a single task/container), or what I wanted was to dvc repro/dvc run each step while running multiple containers. I'm not clear how that would help though, because it would seem like each step would recompute all depedencies up to get data.
I am fine with doing any of these, but it seems very unnatural to me to do git commit if I want reproducibility. Especially because our orchestrator is set up in such a way to git clone the project repo into a docker container based on a git SHA. So if we did a git commit then it would not update subsequent steps.

0 replies

dmpetrov · 2019-06-30T10:34:50Z

dmpetrov
Jun 30, 2019
Maintainer

Thank you for sharing details! It is clear with 1-3.

Re (4)... When do you clone git repo based on SHA into a docker container? When you create the docker that needs to be run or when docker it is running (it gets SHA from somewhere and clones the repo based on it).

For reproducibility between container runs in prod you need to share state between runs. Git repo is one of the ways to share the state. To get the state in prod you need to clone a recent version of repo when you run container (not a repo based on some SHA). To update the state you need to run something like git commit && git push && dvc push from prod.

It might look unnatural to commit/push from prod. And it can create a mess in your git history. To avoid these issues you might use a separate Git branch for prod (like in traditional development - master and prod branches).

0 replies

woop · 2019-06-30T12:26:26Z

woop
Jun 30, 2019
Author

Re (4)... When do you clone git repo based on SHA into a docker container? When you create the docker that needs to be run or when docker it is running (it gets SHA from somewhere and clones the repo based on it).

We basically have two approaches.

We build the codebase into the container image and just call it directly.
We call a base image which contains all dependencies with a startup script. This script pulls in the specified git repo and runs the startup script.

Let's assume we don't pin the cloned repo to a specific git commit. Let's just take the latest one with git clone. Wouldn't we create a race condition if we have multiple of these pipelines running at the same time, because they are presumably sharing the "latest" commit? Or is the suggestion that each pipeline type should have its own branch? We'd be creating a git commit for every step in the pipeline for every pipeline run.

I guess I need to think about this a bit more. I like the idea of being able to pick up from a very specific commit and completely reproduce the state of that system, but it seems this could be quite difficult to manage and standardize.

1 reply

jorgeorpinel May 26, 2021

Wouldn't we create a race condition if we have multiple of these pipelines running at the same time, because they are presumably sharing the "latest" commit? Or is the suggestion that each pipeline type should have its own branch

It seems that by "pipeline" here you mean a node dedicated to a single stage of pipeline. I'm not sure it makes sense to split the execution of a single DVC project like that.

Maybe a better way to split a long data pipeline with DVC is making several DVC repos and using dvc import (and dvc update) between them.

shcheklein · 2019-07-02T06:09:15Z

shcheklein
Jul 2, 2019
Maintainer

@woop can you elaborate on running multiple pipelines in parallel? If you have a single production system that is retraining some stuff on cron (let's say daily) it makes sense to me to have a separate "production" branch. Every commit of it should be the latest state of master with updates to the DVC-files forced to the top of it.

0 replies

woop · 2019-07-04T13:12:53Z

woop
Jul 4, 2019
Author

So the idea is to have endlessly growing branches?

Every time the same step runs, the query will change, which will update the input data, which will rerun all the steps. Then it will commit all of this to the repository for each step in each pipeline every time it runs.

Would need to squash those git repos eventually, they will become massive I think.

1 reply

jorgeorpinel May 26, 2021

So the idea is to have endlessly growing branches?

I think the idea was to have a single branch (where training happens regularly and serially) for use in all your prod envs (separate from the one you use in dev). In Ops terms this would be a release branch.

But it wouldn't apply to splitting a pipeline execution by stage per container.

shcheklein · 2019-07-04T18:54:53Z

shcheklein
Jul 4, 2019
Maintainer

@woop I'm not sure I'm following you on "Then it will commit all of this to the repository for each step in each pipeline every time it runs.". Why do you need to commit every step separately? Usually, you would run the whole pipeline and then fo git commit. Am I missing something?

0 replies

woop · 2019-07-05T06:18:16Z

woop
Jul 5, 2019
Author

If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs, and step 3 will have to rerun step 2 and step 1 when it runs. That is because they are still looking at old data hashes. The only way to get around this (it seems) is to commit at every step the latest hashes at each step.

0 replies

shcheklein · 2019-07-05T23:33:48Z

shcheklein
Jul 5, 2019
Maintainer

If step 1 doesn't commit, then step 2 will have to rerun step 1 when it runs

This is true, only if you are going to run step 2 in a different environment/separate machine? Is it your case? In general you can run dvc repro, interrupt it in the middle, and it won't run completed steps again if run dvc repro again.

To some extent, dvc repro is automatically doing "commits" locally by updating DVC-files as it processes the steps.

Unless I'm missing something :)

0 replies

woop · 2019-07-07T00:34:31Z

woop
Jul 7, 2019
Author

This is true, only if you are going to run step 2 in a different environment/separate machine? Is it your case?

Correct. In my case we are running each step in a pipeline as a separate container (Kubeflow Pipelines, or Airflow with Kubernetes). What this means is that I need to somehow get the DVC files into the next container so that those previous steps don't rerun. One way is to do git commits, another is having a data management layer that does this between steps.

0 replies

villasv · 2019-11-14T01:42:56Z

villasv
Nov 14, 2019

I think running DVC stages in totally isolated containers is an anti pattern. We can be creative with the workarounds but having each command run in isolated contexts defeats the purpose of a tool that has in its core the inspection of existing state.

If you’re not persisting this state otherwise, by having a shared file system or committing changes to upstream, you’re better off having a single job that dvc repro the entire pipeline.

0 replies

z0u · 2020-03-31T05:19:01Z

z0u
Mar 31, 2020

I solved this in my project by creating a small script to sync DVC stages between prod and dev. It's something like this:

sync-dvc data/dev/stage-1.dvc data/prod/stage-1.dvc

It copies the stage config, but keeps the asset hashes unchanged where possible. I agree it would be great if DVC supported this out of the box.

1 reply

jorgeorpinel May 26, 2021

copies the stage config, but keeps the asset hashes unchanged where possible

Interesting @z0u. Can yo clarify how does it help to keep previous hashes if other parts of the DVC file change? For example even if you only change the stage command to match your work on dev, the stage would still re-execute on prod (along with any ones downstream if the outputs change).

jorgeorpinel · 2021-05-26T01:03:14Z

jorgeorpinel
May 26, 2021

I left a few comments in threads above and here are a few final notes:

For example the datetime stamp or other input variables will change.

Depending on a datetime would make the pipeline non-deterministic and thus no it would always run again no matter what.

is DVC primarily useful in local development only?

Seems like the scenario was more about distributing pipeline execution among multiple environments. So my answer would be no, but it's designed to codify and reproduce full pipelines in a single execution env. The features to run only parts of a pipeline (even single stages) are indeed meant to save time in the development process, in general.

0 replies

dberenbaum · 2021-08-12T14:19:55Z

dberenbaum
Aug 12, 2021
Collaborator

Related discussion: https://discord.com/channels/485586884165107732/872860674529845299/872860676736036874

5 replies

jorgeorpinel Aug 12, 2021

Thanks @dberenbaum . Summary: using a DVC pipeline on production as a data logger to reproduce errors (data science debugging?). Each logged data version is (probably) a completely different chunks from a data stream (not versions). There needs to be a garbage collection policy to limit cache growth.

dberenbaum Aug 12, 2021
Collaborator

Good points. Data science debugging is a good phrase for it. I think it's part of model monitoring and a common part of a typical workflow.

Also, I agree about garbage collection, but data might need to be kept around for a long time (maybe years) for audit purposes.

dberenbaum Aug 12, 2021
Collaborator

@mnrozhkov What do you think about the above scenario? Basically, the user is running pipelines in production with new data each day (similar to a typical airflow job). He wants to be able to debug by checking out data for a given run and trying to reproduce locally. However, he needs to push the version info from every production run back to Github to make this work.

mnrozhkov Aug 20, 2021

@dberenbaum, thanks for sharing, scenarios above are very interesting! If I understand them correctly, they covers 2 common scenarios:

We use Airflow for scheduled model training (we need to commit/push outputs to Git (cml-pr may help here), prod branch may be useful)
We run a pipeline to get predictions (no need to sync DVC states back with Git, output artifacts and predictions are stored somewhere)

Few thoughts on these:

In both cases Airflow runs pipelines in production with new data, it's ok. It doesn't make the pipeline non-deterministic. We use some `execution_date' param to select data we use in pipelines.
Some source of reproducibility issues may be paths to data. Paths should be identical in local and in production environment. Few workarounds may be useful:
a) get data from a single source, by mounting data directory to Docker container or connection to remote DB/storage,
b) use a temporary directory to downloading data, the project code and running Airflow DAG from this directory (preserve local paths)
I don't really understand scenario with using DVC/Git workflow for debugging purposes. Usually, dvc repro is enough to run end-to-end DVC pipeline with Airflow. And the pipeline behavior is determined by code and params.yaml. So, we may find all metadata and params for DVC pipelines and data used (e.g `execution_date' selector) in Airflow DAG's interface. Seems it's enough to reproduce pipeline locally. So, this scenario seems odd. Hope, we could find a better solution if we would know more information about the use case.

Anyway, these scenarios are common and we are working on some prototypes to better understand need and possible solutions.

dberenbaum Aug 20, 2021
Collaborator

Thanks, @mnrozhkov! I may have misinterpreted the use case, but I thought the scenario was something like this (let's assume a prediction/inference pipeline):

Run airflow-dvc pipeline with latest data.
Pipeline breaks at intermediate stage n.
User wants to debug locally by doing dvc pull and then dvc repro, which ideally would do the following:
a. Pull any raw and intermediate data up to stage n.
b. Pull the stage results up to stage n.
c. Start the pipeline from stage n.

stefan-falk · 2022-10-05T08:53:17Z

stefan-falk
Oct 5, 2022

@woop I'm currently facing a similar issue as I am evaluating different tools we went to implement into our workflow. DVC and Kubeflow are amongst them but I also have the feeling that there is something missing. What did your approach end up to look like?

At this point I am not sure if DVC and Kubeflow are tools which are compatible in the way they operate. Kubeflow defines its own DAG, independent from the DVC DAG and that's exactly where things get complicated. Kubeflow apparently does not allow us to execute steps independently from each other - unless I am missing something here - so we can't seem to be able to formulate them as a DVC stage (or wrap it into a DVC stage for that matter).

I don't really have experience with Kubeflow yet so I might be overlooking something here.

0 replies

Running DVC in production #5924

Replies: 15 comments · 8 replies

dmpetrov Jun 30, 2019 Maintainer

woop Jun 30, 2019 Author

dmpetrov Jun 30, 2019 Maintainer

woop Jun 30, 2019 Author

shcheklein Jul 2, 2019 Maintainer

woop Jul 4, 2019 Author

shcheklein Jul 4, 2019 Maintainer

woop Jul 5, 2019 Author

shcheklein Jul 5, 2019 Maintainer

woop Jul 7, 2019 Author

dberenbaum Aug 12, 2021 Collaborator

dberenbaum Aug 12, 2021 Collaborator

dberenbaum Aug 12, 2021 Collaborator

dberenbaum Aug 20, 2021 Collaborator

Replies: 15 comments 8 replies

dmpetrov
Jun 30, 2019
Maintainer

woop
Jun 30, 2019
Author

dmpetrov
Jun 30, 2019
Maintainer

woop
Jun 30, 2019
Author

shcheklein
Jul 2, 2019
Maintainer

woop
Jul 4, 2019
Author

shcheklein
Jul 4, 2019
Maintainer

woop
Jul 5, 2019
Author

shcheklein
Jul 5, 2019
Maintainer

woop
Jul 7, 2019
Author

dberenbaum
Aug 12, 2021
Collaborator

dberenbaum Aug 12, 2021
Collaborator

dberenbaum Aug 12, 2021
Collaborator

dberenbaum Aug 20, 2021
Collaborator