Support for long-running asynchronous stages #5951
Replies: 17 comments
-
To solve this, we would have to come up with a special marker in the dvc files, so that DVC knows that the job got deployed and that we are waiting for it to complete. Maybe something like …
-
@zo7 What you describe seems like a situation that needs a workflow management tool. Have you tried any of them? I don't have any experience with such tools, but I could find some that might do the job: …
I don't think that adding functionality that is already covered by other tools (reinventing the wheel) will make DVC better.
-
The OP in this issue describes my hurdles well. Currently, the workaround I'm trying is a script which runs on a local machine, submits the job (in my case to an EMR cluster), waits until the job is finished (pinging it every minute), and finally copies the resulting artifacts from S3 to the local machine. However, this fails because my machine idles at some point and the connection is lost... Furthermore, this is a rather inefficient approach for at least two reasons: …
Thanks @efiop for pointing me to this thread. I copied my reply from discord for better visibility.
-
@dashohoxha Why don't you consider dvc a workflow management tool? After all, it maintains DAG(s). Do you have in mind somehow using dvc and airflow together?
-
Yes, exactly.
I don't have any experience with airflow, but I believe that this is exactly the kind of situation it is supposed to solve: running and monitoring tasks/processes on different hosts, and making sure that they follow the dependency constraints (like "don't run this task until that task has finished successfully"). Why don't you give it a try? The time needed to get used to it may pay off. Your simple hacks will never be as good and mature as the general solutions provided by airflow and its community. By the way, it seems that the DAG of airflow is a DAG of processes/tasks and their order (what runs before what). It does not have the concept of input and output dependencies (like DVC).
-
I'm not experienced with tools like airflow/luigi/digdag, but how would these tools work alongside DVC? It looks like if you were to use one of these, you would be executing your tasks in that framework instead, so intermediate states wouldn't be tracked or versioned with DVC. Adding a …
-
I guess …
-
My intuition tells me that mixing dvc and airflow doesn't make sense, but I'm lacking know-how about the latter, so take it with a grain of salt. It sounds to me like trying to use both tools will quickly lead to some confusing, cyclic setup. I think the title of this issue should be something like: "Support for remote asynchronous stages yielding outputs on remote locations". This would reflect the challenge: dvc runs on machine A, tries to execute a stage on some remote machine/cluster B, and the results of that stage are stored on some third remote environment C.
-
Once we support concurrent runs/repros, we may just run this synchronously.
-
I don't understand how @Suor's comment is related to the underlying use case behind this issue. The main deal here is about streamlining … As far as I understand @zo7's concern (and it is also mine), the situation is that one is running … What I'm currently doing in my projects is wrapping the stage which depends on an external resource in a script which checks every x minutes whether the stage is completed or not.
N.B. In my case I'm running the stage on an EMR cluster. Each time, the cluster ID to which I'm submitting my job changes. I mitigate this by reading an environment variable which holds the cluster's ID. This way, the ID is not part of the …
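A minimal sketch of what such a wrapper script could look like; everything below is illustrative, assuming the cluster ID is exposed as an EMR_CLUSTER_ID environment variable and that the bucket paths, step definition, and polling interval are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for a stage that runs on EMR. The cluster ID comes from
# the environment, so it never ends up in the pipeline files.
set -euo pipefail

: "${EMR_CLUSTER_ID:?set EMR_CLUSTER_ID before running this stage}"

# Submit the job and remember the step ID so we can poll it.
STEP_ID=$(aws emr add-steps \
  --cluster-id "$EMR_CLUSTER_ID" \
  --steps 'Type=Spark,Name=my-stage,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/jobs/my_stage.py]' \
  --query 'StepIds[0]' --output text)

# Ping the cluster every few minutes until the step finishes.
while true; do
  STATE=$(aws emr describe-step \
    --cluster-id "$EMR_CLUSTER_ID" --step-id "$STEP_ID" \
    --query 'Step.Status.State' --output text)
  case "$STATE" in
    COMPLETED) break ;;
    FAILED|CANCELLED) echo "EMR step $STEP_ID ended in state $STATE" >&2; exit 1 ;;
    *) sleep 300 ;;
  esac
done

# Copy the artifacts the job wrote to S3, so DVC can pick them up as outputs.
aws s3 cp s3://my-bucket/outputs/my-stage/ outputs/ --recursive
```

The stage's cmd in dvc.yaml would then point at this script, so dvc repro only returns once the artifacts have been copied down and can be hashed as outputs.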
-
The current solution is to make your command launch the job, save the metadata somewhere, and return an error. After that, on next runs (…
-
Adding to @efiop's comment, here's an example of a script with such behavior:

```bash
#!/usr/bin/env bash
# name: example.sh

# If a previous launch is still marked as running, fail so that `dvc repro`
# reports the stage as failed instead of marking it as done.
if [[ -f metadata ]] && grep -qs "running" metadata; then
    echo "Command is already running"
    exit 1
fi

# Launch the long-running job in the background; it updates the marker file
# when it completes (the `sleep 15` stands in for the real work).
{
    echo "running" > metadata
    sleep 15
    echo "done" > metadata
} &
```

You can then do …
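As a purely illustrative follow-up (not part of the example above), if the stage command only succeeds once the marker file reads "done", the pipeline could be driven by retrying the repro until it goes through:

```bash
# Keep re-running the pipeline until the background job has finished and the
# stage completes; the interval is arbitrary.
until dvc repro; do
    sleep 60
done
```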
-
This should be handled by …
-
How does …
-
@dberenbaum Sorry for the poor reasoning.
-
A conversation came up on Discord about what to do when you have a stage in your pipeline that either takes a massive amount of time or needs to be run asynchronously.
For projects that involve large datasets or require a lot of compute on specialized hardware (e.g. training large neural networks on GPUs/TPUs), it's common to have infrastructure where you submit jobs to a cluster or service that handles scheduling and provisioning resources for you. Client-side, however, these tasks exit immediately and don't directly produce any outputs, making them difficult to integrate into a DVC pipeline.
Currently the best workaround for our case is to rework the infrastructure so that dvc repro is run on a remote machine that has the resources you need, making every stage synchronous. This might not be an option for some teams, though. What should be the right solution in this case? Thanks!!
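For reference, a minimal sketch of that workaround, assuming a host alias gpu-box with the repository cloned at ~/project and a DVC remote already configured (all names are placeholders):

```bash
# Execute the pipeline where the compute lives, then push results to the DVC
# remote so they can be pulled from any other machine.
ssh gpu-box 'cd ~/project && dvc pull && dvc repro && dvc push'
```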