Support for long-running asynchronous stages #5951
Replies: 17 comments
-
To solve this, we would have to come up with a special marker in the dvc files, so that DVC knows that the job got deployed and that we are waiting for it to complete. Maybe something like …
-
@zo7 What you describe seems like a situation that needs a workflow management tool. Have you tried any of them? I don't have any experience with such tools, but I could find some that might do the job: …
I don't think that adding functionality that is already covered by other tools (reinventing the wheel) will make DVC better.
-
The OP in this issue describes my hurdles well. Currently, the workaround I'm trying is a script which runs on a local machine, submits the job (in my case to an EMR cluster), waits until the job is finished (pinging it every minute), and finally copies the resulting artifacts from S3 to the local machine. However, this fails because my machine idles at some point and the connection is lost... Furthermore, this is a rather inefficient approach for at least two reasons: …
Thanks @efiop for pointing me to this thread. I copied my reply from discord for better visibility.
-
@dashohoxha Why don't you consider dvc a workflow management tool? After all, it maintains DAG(s). Do you have in mind somehow using dvc and airflow together?
-
Yes, exactly.
I don't have any experience with airflow, but I believe that this is exactly the kind of situation it is supposed to solve: running and monitoring tasks/processes on different hosts, and making sure that they follow the dependency constraints (like "don't run this task until that task has finished successfully"). Why don't you give it a try? The time needed to get used to it may pay off. Your simple hacks will never be as good and mature as the general solutions provided by airflow and its community. By the way, it seems that the DAG of airflow is a DAG of processes/tasks and their order (what runs before what). It does not have the concept of input and output dependencies (like DVC).
-
I'm not experienced with tools like airflow/luigi/digdag, but how would these tools work alongside DVC? It looks like if you were to use one of these, you would be executing your tasks in that framework instead, so intermediate states wouldn't be tracked or versioned with DVC. Adding a …
-
I guess …
-
My intuition tells me that mixing dvc and airflow doesn't make sense, but I'm lacking know-how about the latter, so take it with a grain of salt. It sounds to me like trying to use both tools will quickly lead to some confusing, cyclic setup. I think the title of this issue should be something like: "Support for remote asynchronous stages yielding outputs on remote locations". This would reflect the challenge: dvc runs on machine A, tries to execute a stage on some remote machine/cluster B, and the results of that stage are stored on some third remote environment C.
-
Once we support concurrent runs/repros, we may just run this synchronously.
-
I don't understand how @Suor's comment is related to the underlying use case behind this issue. The main deal here is about streamlining … As far as I understand @zo7's concern (and it is also mine), the situation is that one is running … What I'm currently doing in my projects is wrapping the stage which depends on an external resource in a script which checks every x minutes whether the stage is completed or not.
N.B. In my case I'm running the stage on an EMR cluster. Each time, the cluster ID to which I'm submitting my job changes. I mitigate this by reading an environment variable which holds the cluster's ID. This way, the ID is not part of the …
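A minimal sketch of what such a wrapper script could look like; everything below is illustrative, assuming the cluster ID is exposed as an EMR_CLUSTER_ID environment variable and that the bucket paths, step definition, and polling interval are placeholders:

```bash
#!/usr/bin/env bash
# Hypothetical wrapper for a stage that runs on EMR. The cluster ID comes from
# the environment, so it never ends up in the pipeline files.
set -euo pipefail

: "${EMR_CLUSTER_ID:?set EMR_CLUSTER_ID before running this stage}"

# Submit the job and remember the step ID so we can poll it.
STEP_ID=$(aws emr add-steps \
  --cluster-id "$EMR_CLUSTER_ID" \
  --steps 'Type=Spark,Name=my-stage,ActionOnFailure=CONTINUE,Args=[s3://my-bucket/jobs/my_stage.py]' \
  --query 'StepIds[0]' --output text)

# Ping the cluster every few minutes until the step finishes.
while true; do
  STATE=$(aws emr describe-step \
    --cluster-id "$EMR_CLUSTER_ID" --step-id "$STEP_ID" \
    --query 'Step.Status.State' --output text)
  case "$STATE" in
    COMPLETED) break ;;
    FAILED|CANCELLED) echo "EMR step $STEP_ID ended in state $STATE" >&2; exit 1 ;;
    *) sleep 300 ;;
  esac
done

# Copy the artifacts the job wrote to S3, so DVC can pick them up as outputs.
aws s3 cp s3://my-bucket/outputs/my-stage/ outputs/ --recursive
```

The stage's cmd in dvc.yaml would then point at this script, so dvc repro only returns once the artifacts have been copied down and can be hashed as outputs.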
-
The current solution is to make your command launch the job, save the metadata somewhere, and return an error. After that, on next runs (…
-
Adding to @efiop's comment, here's an example of a script with such behavior:

```bash
#!/usr/bin/env bash
# name: example.sh

# If a previous launch is still marked as running, fail so that `dvc repro`
# reports the stage as failed instead of marking it as done.
if [[ -f metadata ]] && grep -qs "running" metadata; then
    echo "Command is already running"
    exit 1
fi

# Launch the long-running job in the background; it updates the marker file
# when it completes (the `sleep 15` stands in for the real work).
{
    echo "running" > metadata
    sleep 15
    echo "done" > metadata
} &
```

You can then do …
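As a purely illustrative follow-up (not part of the example above), if the stage command only succeeds once the marker file reads "done", the pipeline could be driven by retrying the repro until it goes through:

```bash
# Keep re-running the pipeline until the background job has finished and the
# stage completes; the interval is arbitrary.
until dvc repro; do
    sleep 60
done
```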
-
This should be handled by …
-
How does …
-
@dberenbaum Sorry for the poor reasoning.
-
A conversation came up on Discord about what to do when you have a stage in your pipeline that either takes a massive amount of time or needs to be run asynchronously.
For projects that involve large datasets or require a lot of compute on specialized hardware (e.g. training large neural networks on GPUs/TPUs), it's common to have infrastructure where you submit jobs to a cluster or service that handles scheduling and provisioning resources for you. Client-side, however, these tasks exit immediately and don't directly produce any outputs, making them difficult to integrate into a DVC pipeline.
Currently the best workaround for our case is to rework the infrastructure so that dvc repro is run on a remote machine that has the resources you need, making every stage synchronous. This might not be an option for some teams, though. What should be the right solution in this case? Thanks!!
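For reference, a minimal sketch of that workaround, assuming a host alias gpu-box with the repository cloned at ~/project and a DVC remote already configured (all names are placeholders):

```bash
# Execute the pipeline where the compute lives, then push results to the DVC
# remote so they can be pulled from any other machine.
ssh gpu-box 'cd ~/project && dvc pull && dvc repro && dvc push'
```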