
repro: add scheduler for parallelising execution jobs #755

Open
yukw777 opened this issue Jun 8, 2018 · 54 comments
Labels
A: pipelines Related to the pipelines feature enhancement Enhances DVC p2-medium Medium priority, should be done, but less important research
Comments

@yukw777

yukw777 commented Jun 8, 2018

When I try to run multiple dvc run commands, I get the following error:

$ dvc run ...
Failed to lock before running a command: Cannot perform the cmd since DVC is busy and locked. Please retry the cmd later.

This is inconvenient because I'd love to run multiple experiments together using dvc. Is there any way we can be smarter about locking?

@efiop efiop self-assigned this Jun 8, 2018
@efiop efiop changed the title dvc run lock issue. Allow running multiple dvc run instances at the same time Jun 8, 2018
@efiop
Contributor

efiop commented Jun 8, 2018

Hi @yukw777 !

I agree, it is inconvenient. For now dvc can only run a single instance per repo, because it has to make sure that your dvc files/deps/outputs don't collide with each other and that you don't get race conditions in your project. For the dvc run case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands. I will rename this issue and add it to our TODO list, but I'm not sure whether it is going to make it into the 0.9.8 release. Btw, do you only need multiprocessing for dvc run, or for dvc repro as well?

If you really need to run a few heavy experiments simultaneously right now, I would suggest a rather ugly but simple workaround of just copying your project's directory and running one dvc run per copy. Would that be feasible for you?

Thanks,
Ruslan

@efiop efiop added the enhancement Enhances DVC label Jun 8, 2018
@efiop efiop added this to the 0.9.8 milestone Jun 8, 2018
@efiop efiop added the awaiting response we are waiting for your reply, please respond! :) label Jun 9, 2018
@yukw777
Author

yukw777 commented Jun 11, 2018

Hi Ruslan,

I think it makes sense to add multiprocessing for dvc repro as well. This will allow me to reproduce multiple experiments at once.

Yeah I was kind of doing what you suggested already. Thanks again for your speedy response! Very excited to use 0.9.8.

@efiop efiop removed the awaiting response we are waiting for your reply, please respond! :) label Jun 14, 2018
@efiop efiop modified the milestones: 0.9.8, 0.9.9 Jun 20, 2018
@efiop
Contributor

efiop commented Jun 20, 2018

Moving this to 0.9.9 TODO.

@prihoda
Contributor

prihoda commented Aug 8, 2018

Hi guys, is this still on the table? ;)

@efiop
Contributor

efiop commented Aug 8, 2018

Hi @prihoda !

It is, we just haven't had time to tackle it yet. With this much demand, we will try to move it up the priority list. 🙂

Thank you.

@prihoda
Contributor

prihoda commented Jan 7, 2019

Hi @efiop

For the dvc run case, all we need to do to enable "multiprocessing" is to make sure that dvcfiles, outputs and dependencies don't collide. That should be pretty easy to implement by creating per-file lock files that dvc will take into account when executing commands.

So all that is needed for parallel execution is to make sure that no running stage's output path is the same as another running stage's input path, right? Two running stages having the same input path should not be a problem. So a lock file would only be created for dependencies and only checked for outputs. Can that be solved using zc.lockfile, just like the global lock file?

@efiop
Contributor

efiop commented Jan 7, 2019

@prihoda Yes, I think so :)

@prihoda
Contributor

prihoda commented Jan 7, 2019

Looks like Portalocker provides shared/exclusive locks, which could be used as read/write locks: https://portalocker.readthedocs.io/en/latest/portalocker.html
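For illustration, a minimal sketch (not DVC code) of the per-path read/write locking idea, using portalocker's shared and exclusive flags: a stage would take shared locks on its dependencies and exclusive locks on its outputs, so stages that only read the same file can overlap while writers are serialized. The lock directory and the path_lock helper are made-up names.

import contextlib
import os

import portalocker

LOCK_DIR = ".dvc/tmp/path-locks"  # hypothetical location for per-path lock files

@contextlib.contextmanager
def path_lock(path, exclusive):
    os.makedirs(LOCK_DIR, exist_ok=True)
    lock_file = os.path.join(LOCK_DIR, path.replace(os.sep, "_") + ".lock")
    flags = portalocker.LOCK_EX if exclusive else portalocker.LOCK_SH
    with open(lock_file, "a") as fobj:
        portalocker.lock(fobj, flags)
        try:
            yield
        finally:
            portalocker.unlock(fobj)

# A stage would hold shared locks on its deps and exclusive locks on its outs:
# with path_lock("data/raw.xml", exclusive=False), \
#      path_lock("data/features.pkl", exclusive=True):
#     run_stage_command()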

@AlJohri

AlJohri commented Jan 30, 2019

Once the multiprocessing is added, hopefully it can extend to dvc repro as well, perhaps controlled via the --jobs parameter.

@efiop efiop added the p2-medium Medium priority, should be done, but less important label Mar 20, 2019
@mathemaphysics

I followed the steps verbatim on https://dvc.org/doc/tutorial/define-ml-pipeline and in the previous step. This is the official tutorial as far as I can see, and I'm getting a very, very long wait to run dvc add data/Posts.xml.zip. I've also tried deleting .dvc/lock and .dvc/updater.lock, which show up every single time I look, just to see if it does something.

What's going on?

@shcheklein
Member

This is the official tutorial as far as I can see. And I'm getting a very, very long wait to run dvc add data/Posts.xml.zip

https://discordapp.com/channels/485586884165107732/485596304961962003/575535709490905101 - NFS locking issue as was discussed on Discord

@ghost

ghost commented May 10, 2019

Related ^^ #1918

@mdekstrand

I would find this useful, particularly for dvc repro as well.

I run my experiments on an HPC cluster, and would like to be able to kick off multiple large, non-overlapping subcomponents of the experiment in parallel. Right now the locking prevents that; also, locking keeps me from performing other operations (such as uploading the results of a previous computation to a remote) while a job is working.

My workaround right now is to create Git worktrees. I have a script that creates a worktree, configures DVC to share a cache with the original repository, checks out the DVC outputs, and then starts the real job. When it is finished, I have to manually merge the results back into my project.

It works, but is quite ugly and tedious.
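For reference, a rough sketch along the lines of that script (illustrative only, not the exact script; the paths and the run_in_worktree helper are made up): create a git worktree, point its local DVC config at the main repository's cache, check the data out, and run the job there. Merging the results back is still a manual step.

import subprocess

def run_in_worktree(worktree_path, branch, cache_dir, cmd):
    # Create a separate working copy on its own branch.
    subprocess.run(["git", "worktree", "add", worktree_path, branch], check=True)
    # Share the original repository's DVC cache so data is not duplicated on disk.
    subprocess.run(["dvc", "config", "--local", "cache.dir", cache_dir],
                   cwd=worktree_path, check=True)
    # Materialize the tracked data in the worktree.
    subprocess.run(["dvc", "checkout"], cwd=worktree_path, check=True)
    # Launch the actual experiment inside the worktree.
    subprocess.run(cmd, cwd=worktree_path, check=True)

# Example (paths and branch name are made up):
# run_in_worktree("../exp-a", "exp-a", "/home/me/project/.dvc/cache", ["dvc", "repro"])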

@efiop
Contributor

efiop commented Aug 12, 2019

We've had a discussion about some quick-ish ways to help users with this and came up with two possible partial solutions:

  1. As a quick general solution, we could implement a --no-lock option for run and repro that would shift the responsibility to the user for making sure that their stages don't overlap. This solution seems trivial, except that we would need to revisit the way we acquire the lock for our state db, because currently we take it really early and sit on it while waiting for the command to finish. It seems like it would require a slight adjustment to take the state db lock only when we really need it (self.state.get/save/etc in RemoteBASE).

  2. As @shcheklein suggested, another option might be to implement locking per-pipeline, so you could at least run stages in unrelated pipelines. This one is simple, but not as trivial as 1) and would still require solving the state db lock problem. Also, this approach would only work in specific scenarios.

And the final option is to come up with a general solution, which would still require solving the state db lock issue from 1) and would probably look like per-output lock files.

What do you guys think?

@efiop efiop self-assigned this Aug 13, 2019
@charlesbaynham
Contributor

Nothing particular to add here other than a +1 from me! My workflow usually involves lots and lots of quite small steps, executing on an HPC cluster, so parallelism would be a great help.

@janvainer

I would also benefit very much from parallel execution.

@johnnychen94
Contributor

johnnychen94 commented Dec 1, 2020

Cross-post from #3633 (comment)


We could go one step further based on option 1 in #755 (comment) by letting users specify how jobs/stages are run:

# jobs.yaml
jobs:
  - name: extract_data@*
    limit: 1 # 1 means no concurrency, this can be set as default value
  - name: prepare@*
    limit: 4 # at most 4 concurrent jobs
    env:
      JULIA_NUM_THREADS: 8
  - name: train@*
    limit: 4
    env:
      # if it's a scalar, apply to all jobs
      JULIA_NUM_THREADS: 8
      # if it's a list, iterate one by one
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
        - 3
  - name: evaluate@*
    limit: 3
    env:
      JULIA_NUM_THREADS: 8
      CUDA_VISIBLE_DEVICES:
        - 0
        - 1
        - 2
  - name: summary

extract_data@* follows the same glob syntax as in #4976

I have never used SLURM, so I don't know whether this is applicable to cluster cases. I feel like dvc should delegate distributed computation tasks to other existing tools.
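To make the idea concrete, here is a rough sketch (purely hypothetical, not an existing DVC feature) of how a scheduler could interpret a jobs.yaml like the one above: run each group's stages with the configured concurrency limit, apply scalar env values to every job, and cycle list values across jobs. It assumes that stage globs like prepare@* have already been expanded to concrete stage names (expand_stage_glob below is a made-up helper) and that per-stage dvc repro invocations can safely run concurrently, i.e. the repo-level lock problem discussed in this issue is solved.

import os
import subprocess
from concurrent.futures import ThreadPoolExecutor
from itertools import cycle

def expand_env(env_spec, n_jobs):
    """Turn {VAR: scalar-or-list} into one env dict per job."""
    cycles = {
        key: cycle(val if isinstance(val, list) else [val])
        for key, val in (env_spec or {}).items()
    }
    return [
        {key: str(next(it)) for key, it in cycles.items()}
        for _ in range(n_jobs)
    ]

def run_group(group, stages):
    envs = expand_env(group.get("env"), len(stages))

    def run_one(stage, extra_env):
        # Run a single stage with the group's env applied on top of the current env.
        subprocess.run(
            ["dvc", "repro", "--single-item", stage],
            env={**os.environ, **extra_env},
            check=True,
        )

    # "limit" caps how many stages of this group run at the same time.
    with ThreadPoolExecutor(max_workers=group.get("limit", 1)) as pool:
        futures = [pool.submit(run_one, s, e) for s, e in zip(stages, envs)]
        for fut in futures:
            fut.result()  # propagate failures

# Example usage (reading jobs.yaml would use e.g. yaml.safe_load from PyYAML;
# expand_stage_glob is a made-up helper for the @* expansion):
# import yaml
# config = yaml.safe_load(open("jobs.yaml"))
# for group in config["jobs"]:
#     run_group(group, expand_stage_glob(group["name"]))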

@turian

turian commented Dec 4, 2020

Do I understand correctly that if one step of the pipeline (e.g. retrieving webpages) requires 100 micro-tasks running in parallel, then this step is a bad fit for doing within dvc?

@dmpetrov in #755 (comment) describes 3 methods of parallelization and I assume you need 2 for this. Let me know if there is a workaround.

@efiop
Contributor

efiop commented Dec 10, 2020

@turian The workaround is to simply parallelize it in your code (one stage in the dvc pipeline) rather than creating 100 micro-stages in the dvc pipeline.
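As an illustration of that suggestion, a minimal made-up example of doing the fan-out inside a single stage's own script (the file names urls.txt and pages.json and the requests dependency are assumptions, not anything prescribed by DVC):

import json
from concurrent.futures import ThreadPoolExecutor

import requests  # assumed dependency of this hypothetical stage script

def fetch(url):
    return url, requests.get(url, timeout=30).text

if __name__ == "__main__":
    with open("urls.txt") as fobj:
        urls = [line.strip() for line in fobj if line.strip()]
    # Fetch all pages concurrently inside this one stage.
    with ThreadPoolExecutor(max_workers=16) as pool:
        pages = dict(pool.map(fetch, urls))
    # A single dvc output for the whole stage, e.g. listed under outs in dvc.yaml.
    with open("pages.json", "w") as fobj:
        json.dump(pages, fobj)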

@turian

turian commented Dec 15, 2020

@efiop The issue is that each of the little jobs has its own pipeline: download webpage => extract text without boilerplate => further NLP.

@jorgeorpinel
Contributor

I haven't read the complete conversation, but my 2c here is that while it sounds like an amazing feature, we should probably think about this several times.

We've already had some ongoing performance issues (mainly around file system manipulation and network connectivity) which required pretty sophisticated implementations. These are expensive to develop and maintain. Do we want to also worry about multithreading in this repo?

We can still provide a solution for this, just not necessarily in the DVC product (cc @dberenbaum). There's CML, for example, so you could probably set up a distributed CI system that runs DVC pipelines in parallel (cc @DavidGOrtega?) and adds a bunch of other benefits. If needed, DVC could have small modifications to support this usage, or we could pursue other alternative solutions (e.g. a plugin/extension independent from this repo). And we can document/blog about setting up systems for parallel execution of DVC pipelines.

@jorgeorpinel
Contributor

That said, if a multithreading refactor could speed up regular operations automatically (more generally than just for executing commands), that would definitely be a welcome enhancement.

See user insight in https://groups.google.com/u/1/a/iterative.ai/g/support/c/zvJ4WnfTGKM/m/ZOckCIwIEwAJ

@tweddielin

Just want to check whether parallel execution is still on the roadmap. By parallel execution, I mean executing the stages (jobs) in a repo's defined pipeline (DAG) in parallel, rather than parallelizing multiple pipelines, just like what you can do with make -j4 or luigi --workers=4.

I appreciate what you guys have already done and hope to see more new functionality, but this is the only thing left unfulfilled for me to call dvc a real "git + make" for data and machine learning projects.

@dberenbaum
Collaborator

@tweddielin Thanks for the kind words and your support! It's still something we'd like to do, but a lot of effort right now is focused on data optimization rather than pipeline optimization. Hopefully we will have a chance to return to pipelines once we complete that work.

@dberenbaum
Collaborator

@dmpetrov @efiop Let's discuss the priority of this one as we start to think about planning for next quarter. I think we should probably move it to p2 for now as we are unlikely to work on it for at least the rest of this quarter, but we should keep it in mind as a priority for next quarter.

@dberenbaum dberenbaum added p2-medium Medium priority, should be done, but less important and removed p1-important Important, aka current backlog of things to do labels Feb 18, 2022
@Eisbrenner

Hi, let me add another use case here. I work partly on an HPC cluster featuring SLURM. With SLURM, I can and should define task dependencies directly when queuing jobs.

Assuming a graph like this:

A --> B --> C
  \-> D

I can schedule jobs of the form

#!/bin/bash
#slurm settings

dvc repro dvc.yaml:stageA

with

#!/bin/bash

# queue job A with no dependencies
stdout=$(sbatch <kwargs> job_A.sh)
id_A=${stdout##* } # the job ID is the last word of sbatch's output

# queue job B with id_A as dependency
stdout=$(sbatch <kwargs> --dependency=afterok:$id_A job_B.sh)
id_B=${stdout##* }

# queue job C with id_B as dependency
sbatch <kwargs> --dependency=afterok:$id_B job_C.sh

# queue job D with id_A as dependency
sbatch <kwargs> --dependency=afterok:$id_A job_D.sh

So here I can run B --> C and D in parallel on different nodes or otherwise separate units.

The only thing missing is a way to disable the dvc repo lock.

@osma

osma commented Mar 30, 2022

@tweddielin said above:

I appreciate what you guys have already done and hope to see more new functionality, but this is the only thing left unfulfilled for me to call dvc a real "git + make" for data and machine learning projects.

I second this 100%! DVC is amazing, but the lack of parallel execution of pipeline stages is disappointing. I'm working on a machine with many cores and it would be much more efficient to be able to use something like make -j8.

I've tried some workarounds. It seems to be possible to run individual stages in parallel, using dvc repro -s <stagename>, as long as you don't start them at the exact same moment, because then you will hit the lock contention problem. I even tried automating parallel execution of pending stages outside DVC using the jq tool and GNU Parallel, like this:

dvc status --json | jq -r 'keys | join("\n")' | parallel dvc repro -s

but this fails because all the parallel dvc repro -s commands try to acquire the lock at nearly the same time and usually only one of them will succeed. Since the lock appears to be held only for a short while, adding a retry loop with a timeout could help, as mentioned above in several comments (there's also a closed issue #2031 where this was suggested).
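For example, a rough sketch of such a retry wrapper (hypothetical, not anything DVC provides; it assumes the lock failure shows up as a non-zero exit code whose output contains "is busy and locked", as in the error quoted at the top of this issue):

import subprocess
import sys
import time

def repro_with_retry(stage, attempts=30, wait=5):
    for _ in range(attempts):
        proc = subprocess.run(
            ["dvc", "repro", "--single-item", stage],
            capture_output=True, text=True,
        )
        if proc.returncode == 0:
            return
        if "is busy and locked" not in (proc.stdout + proc.stderr):
            # A real failure, not lock contention: surface it and stop.
            sys.exit(proc.stderr or proc.returncode)
        time.sleep(wait)  # lock contention: back off and try again
    raise RuntimeError(f"gave up waiting for the DVC lock on stage {stage!r}")

if __name__ == "__main__":
    repro_with_retry(sys.argv[1])

Such a wrapper could then replace the bare dvc repro -s in the jq/parallel pipeline above.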

I've also used a lot of foreach statements, and at least for the use cases I can think of, all the iterations are independent of each other. So if it's difficult to schedule parallel execution of the whole pipeline/DAG, at least stages defined using foreach could be executed in parallel.

@itcarroll

itcarroll commented Nov 16, 2022

I am wondering whether the awesome dev team has decided whether the --jobs feature associated with dvc exp run --queue is intended to close this issue. You'd be awesome either way, but maybe more awesome if this issue is still on the table 😉.

Adding a jobs parameter to foreach blocks would be killer.

@dberenbaum
Collaborator

@itcarroll It's not in our short-term plans for the rest of this year, but it's a highly requested feature, so it's still on the table, and there's no intent to close this issue without addressing it more directly.
